Crawling plays a vital role for all websites, regardless of their size. Without having your content crawled, there is no opportunity to achieve visibility on Google’s platforms.
Now, let’s delve into optimizing crawling to ensure that your content receives the recognition it rightfully deserves.
What Is Crawling?
In SEO, crawling is the systematic process by which search engine bots (also known as web crawlers or spiders) discover content on a website. That content may be text, images, videos, or other file types, and it is found exclusively by following links from one page to another.
During the crawling process, search engine bots analyze the content and structure of web pages, including text, images, videos, and other media files. They gather data about the page’s URL, title, headings, meta tags, and other relevant elements.
Crawling is an essential component of search engine optimization (SEO) because it allows search engines to understand the content and structure of a website. By ensuring that your website is easily crawlable, you increase the chances of your web pages being indexed and ranked by search engines, leading to improved visibility and organic traffic to your site.
How Does Web Crawling Work?
A web crawler operates by discovering URLs and downloading the corresponding page content. During this process, the crawler may analyze the content and pass it on to the search engine’s index. Additionally, it extracts links found within the page.
These discovered links are categorized as follows:
- New URLs: These are unknown to the search engine and will be visited for the first time.
- Known URLs with no crawling guidance: These URLs are periodically revisited to check for any changes in their content that require updating the search engine’s index.
- Known URLs with clear guidance: If these URLs have been updated and provide clear instructions, such as through an XML sitemap’s last modification timestamp, they should be recrawled and reindexed.
- Known URLs with no updates: If these URLs have not been updated and provide clear instructions, such as an HTTP 304 Not Modified response header, they should not be recrawled or reindexed.
- Inaccessible URLs: These URLs cannot or should not be followed, such as those behind a login form or links blocked by a “nofollow” robots tag.
- Disallowed URLs: These URLs are explicitly blocked by the robots.txt file, preventing search engine bots from crawling them.
All allowed URLs are added to a list known as the crawl queue, which determines the pages to be visited in the future. However, different URLs may have varying levels of priority. The priority assigned to each page depends not only on its link categorization but also on various factors that determine its relative importance in the eyes of each search engine.
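The categories and the crawl queue described above can be sketched as a small priority queue. This is a toy model, not how any search engine actually works: the `LinkStatus` names, the `BASE_PRIORITY` values, and the `importance` parameter are all illustrative assumptions.

```python
import heapq
from dataclasses import dataclass, field
from enum import Enum


class LinkStatus(Enum):
    NEW = "new"
    KNOWN_NO_GUIDANCE = "known, no crawling guidance"
    KNOWN_UPDATED = "known, updated"
    KNOWN_UNCHANGED = "known, no updates"
    INACCESSIBLE = "inaccessible"
    DISALLOWED = "disallowed"


# Hypothetical base priorities (lower value = crawled sooner).
# Statuses missing from this table are never queued.
BASE_PRIORITY = {
    LinkStatus.KNOWN_UPDATED: 0,
    LinkStatus.NEW: 1,
    LinkStatus.KNOWN_NO_GUIDANCE: 2,
}


@dataclass(order=True)
class QueueItem:
    priority: float
    url: str = field(compare=False)


class CrawlQueue:
    def __init__(self):
        self._heap = []

    def offer(self, url, status, importance=0.0):
        """Queue a URL unless its category says it should not be crawled.

        `importance` (assumed to be between 0 and 1) stands in for the
        engine-specific signals that raise a page's relative priority.
        """
        if status not in BASE_PRIORITY:
            return False
        priority = BASE_PRIORITY[status] - importance
        heapq.heappush(self._heap, QueueItem(priority, url))
        return True

    def next_url(self):
        """Return the highest-priority URL, or None when the queue is empty."""
        return heapq.heappop(self._heap).url if self._heap else None
```

In this sketch, unchanged, inaccessible, and disallowed URLs are rejected by `offer`, while everything else is visited in priority order by repeated calls to `next_url`.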
Different search engines employ their own bots, such as Googlebot, Bingbot, DuckDuckBot, YandexBot, or Yahoo Slurp, each with specific algorithms to decide which pages to crawl and when. Consequently, each search engine’s crawling behavior may vary.
Why Is It Crucial for Your Website to be Crawlable?
If a page on a website is not crawled, it is unlikely to be indexed and will not rank in search results. However, the importance of crawling extends beyond this. Swift crawling is crucial for time-sensitive content. If content is not crawled and given visibility promptly, it quickly becomes irrelevant to users. For instance, last week’s breaking news, a past event, or a sold-out product will fail to engage audiences.
Even in industries where time-to-market is not a critical factor, speedy crawling remains advantageous. When you update an article or implement a significant on-page SEO change, the faster Googlebot crawls it, the sooner you can benefit from the optimization or identify any mistakes and make corrections.
If Googlebot crawls slowly, you cannot iterate quickly and learn from your failures. Consider crawling as the foundation of SEO, as your organic visibility relies entirely on its effective implementation on your website.
How to Measure Crawl Budget and Crawl Efficacy
Contrary to what many believe, Google’s objective is not to crawl and index all the content from every website on the internet. There is no guarantee that every page will be crawled. In reality, a significant portion of pages on most websites has never been visited by Googlebot.
If you encounter the exclusion message “Discovered – currently not indexed” in the Google Search Console page indexing report, it indicates that this issue is affecting your website. However, the absence of this exclusion message does not necessarily mean that you have no crawling issues.
There is a prevalent misunderstanding regarding which metrics hold significance when evaluating crawling processes.
The Misconception Surrounding Crawl Budget
In the world of SEO, professionals often consider the concept of crawl budget, which refers to the number of URLs that Googlebot can and intends to crawl within a specific timeframe for a given website. This concept emphasizes the importance of maximizing crawling. It is further reinforced by the crawl stats report in Google Search Console, which displays the total number of crawl requests.
However, the notion that more crawling is inherently better is misguided. The total number of crawls is a vanity metric. Increasing the number of crawls per day by a factor of 10 does not necessarily lead to faster indexing or reindexing of the content that truly matters to you. It only results in additional server load and increased costs.
Instead of focusing on increasing the overall quantity of crawling, the emphasis should be on quality crawling that generates tangible SEO value.
The Significance of Crawl Efficacy
Optimizing crawling means improving crawl efficacy: the time delay between publishing or significantly updating an SEO-relevant page and Googlebot’s next visit to it. To measure crawl efficacy, extract the created or updated datetime value from the database and compare it with the timestamp of the next Googlebot crawl of that URL in the server log files.
If this approach is not feasible, an alternative is to calculate it using the lastmod date in the XML sitemaps and periodically query the relevant URLs using the Search Console URL Inspection API until it returns the last crawl status.
By quantifying the time delay between publishing and crawling, you can measure the impact of crawl optimizations with a meaningful metric. The shorter the crawl efficacy delay, the sooner new or updated SEO-relevant content appears to your audience on Google surfaces.
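The log-based measurement can be sketched as follows. This is a minimal example under stated assumptions: it expects access logs in the common "combined" format, identifies Googlebot by a plain user-agent substring match (which can be spoofed, so a real setup would verify the requester), and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the timestamp and request path of a combined-format log line.
LOG_PATTERN = re.compile(r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)')


def googlebot_hits(log_lines, path):
    """Return the timestamps at which a Googlebot user agent requested `path`."""
    hits = []
    for line in log_lines:
        if "Googlebot" not in line:  # simplification: user agent only
            continue
        match = LOG_PATTERN.search(line)
        if match and match.group("path") == path:
            hits.append(
                datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            )
    return hits


def crawl_efficacy(published_at, crawl_times):
    """Delay between publication and the first later crawl, or None if none yet."""
    later = [t for t in crawl_times if t >= published_at]
    return min(later) - published_at if later else None


# Example: published_at would come from the CMS database (timezone-aware).
published_at = datetime(2024, 5, 10, 8, 0, tzinfo=timezone.utc)
sample_log = [
    '66.249.66.1 - - [10/May/2024:09:30:00 +0000] "GET /post HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
delay = crawl_efficacy(published_at, googlebot_hits(sample_log, "/post"))
```

Tracking this delay per URL over time, for example as a weekly median, gives a trend line against which each crawl optimization can be judged.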
If your site’s crawl efficacy score indicates that Googlebot is taking too long to visit crucial content, what steps can you take to optimize crawling?
Support from Search Engines Regarding Website Crawling
In recent years, there has been significant discussion about the efforts made by search engines and their partners to enhance crawling processes.
This focus on improving crawling is driven by their vested interests. Streamlining crawling not only grants search engines access to superior content for their results but also contributes to environmental sustainability by reducing greenhouse gas emissions.
Much of the conversation revolves around two APIs that aim to optimize crawling.
The concept behind these APIs is to empower websites to directly send relevant URLs to search engines, triggering a crawl. This approach allows for faster indexing of fresh content and provides an effective means to remove outdated URLs—a functionality that is currently not well-supported by search engines.
In essence, these APIs offer a pathway for websites to have more control over which URLs are crawled and indexed, resulting in improved efficiency and a more up-to-date search index.
Support for IndexNow from Non-Google Search Engines
The primary API in question is called IndexNow, which is supported by search engines like Bing, Yandex, and Seznam. It’s worth noting that Google does not support this API. Additionally, IndexNow is integrated into various SEO tools, CRMs, and CDNs, potentially reducing the development effort required to utilize its functionality.
While it may initially appear to be a convenient solution for SEO purposes, it is essential to exercise caution. Consider whether a significant portion of your target audience relies on the search engines supported by IndexNow. If not, triggering crawls specifically from their bots may offer limited value.
Moreover, it is crucial to evaluate the impact of integrating with IndexNow on server load versus the improvement in crawl efficacy score for those specific search engines. It’s possible that the costs outweigh the benefits, so a careful cost-benefit analysis is necessary.
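For teams that do decide to integrate, an IndexNow submission is a small JSON document sent by HTTP POST. The sketch below builds that document using the field names from the public IndexNow protocol (`host`, `key`, `keyLocation`, `urlList`). The shared endpoint URL, the helper names, and the example key are assumptions to check against the current IndexNow documentation before use.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Assumed shared endpoint; participating engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_payload(urls, key, key_location=None):
    """Build the JSON body for a bulk IndexNow submission."""
    urls = list(urls)
    hosts = {urlparse(u).netloc for u in urls}
    if len(hosts) != 1:
        raise ValueError("All URLs in one submission must share a single host")
    payload = {"host": hosts.pop(), "key": key, "urlList": urls}
    if key_location:
        # Only needed when the key file is not at the site root.
        payload["keyLocation"] = key_location
    return payload


def submit(payload, endpoint=INDEXNOW_ENDPOINT):
    """POST the payload and return the HTTP status code (performs a network call)."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical key; a real key must match a key file hosted on the site.
payload = build_indexnow_payload(
    ["https://www.example.com/new-post", "https://www.example.com/updated-post"],
    key="0123456789abcdef",
)
```

Submitting only URLs that were just published, significantly updated, or removed keeps the extra server load from the resulting crawls proportionate to the benefit.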
Google Support From The Indexing API
The other API under discussion is the Google Indexing API. Google has stated that this API should only be used for crawling pages with job posting or broadcast event markup. However, testing shows this is not strictly true: the API will trigger crawls of other page types as well.
Submitting non-compliant URLs to the Google Indexing API may indeed result in a significant increase in crawling. However, this situation perfectly illustrates why the concept of “crawl budget optimization” and making decisions solely based on the quantity of crawling can be misguided.
For non-compliant URLs, submitting them to the Google Indexing API has no impact on indexing. When you think about it, this logic makes perfect sense. When you submit a URL, Google will quickly crawl the page to check if it contains the specified structured data. If it does, indexing will be expedited. If it doesn’t, Google will simply ignore it.
Therefore, calling the API for non-compliant pages achieves nothing except adding unnecessary server load and wasting development resources without any actual benefits or gains.
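One way to avoid those wasted calls is to check a page for the relevant structured data before submitting it. The sketch below assumes the markup is embedded as JSON-LD and looks for the schema.org types `JobPosting` and `BroadcastEvent` at any nesting depth. The function and class names are hypothetical, and pages that use microdata or RDFa instead would need a different check.

```python
import json
from html.parser import HTMLParser

ELIGIBLE_TYPES = {"JobPosting", "BroadcastEvent"}


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def _types_in(node):
    """Yield every @type value found anywhere inside a parsed JSON-LD structure."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            yield declared
        elif isinstance(declared, list):
            yield from (t for t in declared if isinstance(t, str))
        for value in node.values():
            yield from _types_in(value)
    elif isinstance(node, list):
        for item in node:
            yield from _types_in(item)


def is_indexing_api_eligible(html):
    """True if the page carries JobPosting or BroadcastEvent JSON-LD markup."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the check
        if ELIGIBLE_TYPES & set(_types_in(data)):
            return True
    return False
```

A publishing pipeline could call `is_indexing_api_eligible` on the rendered HTML and only notify the Indexing API when it returns `True`.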
Support Provided by Google within the Google Search Console
Another method supported by Google for crawling is manual submission through Google Search Console.
When URLs are submitted through this method, they are typically crawled, and their indexing status is updated within an hour. However, there is a quota limit of 10 URLs within a 24-hour period, which poses a challenge when dealing with larger-scale scenarios.
Nonetheless, it’s important not to disregard this approach entirely. To address the scalability issue, you can automate the submission of priority URLs by scripting actions that simulate user behavior. This can help accelerate crawling and indexing for those select few URLs.
Lastly, it’s worth noting that, based on my testing so far, clicking the “Validate fix” button on exclusions marked as “Discovered – currently not indexed” does not significantly expedite crawling.
If search engines offer limited assistance in this regard, how can we take matters into our own hands to optimize crawling?
Strategies for Achieving Efficient Website Crawling
Ensure a Speedy and Robust Server Response
Having a highly performant server is crucial for optimal crawling. It needs to effectively handle the crawling demands imposed by Googlebot without negatively affecting server response time or encountering frequent errors.
To ensure your server is up to par, check the host status in Google Search Console and ensure it shows a green status. Additionally, monitor and strive to keep 5xx errors below 1% of the total requests. Furthermore, aim for server response times that trend below 300 milliseconds to maintain optimal performance.
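The two thresholds above are easy to monitor from request data. The sketch below assumes you can export Googlebot requests as `(status_code, response_time_ms)` pairs from your logs or monitoring tool; the function name and the use of a simple average for response time are illustrative choices, not a Search Console calculation.

```python
def server_health(requests, max_error_rate=0.01, max_response_ms=300):
    """Summarize 5xx error rate and average response time for a set of requests.

    `requests` is a list of (status_code, response_time_ms) tuples.
    """
    if not requests:
        raise ValueError("No requests to evaluate")
    total = len(requests)
    server_errors = sum(1 for status, _ in requests if 500 <= status <= 599)
    error_rate = server_errors / total
    avg_response_ms = sum(ms for _, ms in requests) / total
    return {
        "error_rate": error_rate,
        "avg_response_ms": avg_response_ms,
        # Healthy only when both thresholds from the text are met.
        "healthy": error_rate < max_error_rate and avg_response_ms < max_response_ms,
    }


# Example: 199 fast successful responses and one slow 503.
report = server_health([(200, 120)] * 199 + [(503, 920)])
```

Running a check like this per day makes it easier to spot whether response times are trending up as crawl demand grows.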
Eliminate Content That Has No Value
When a website contains a considerable amount of low-quality, outdated, or duplicated content, it not only distracts crawlers from visiting new or recently updated content but also leads to index bloat.
To initiate the cleanup process swiftly, begin by examining the pages report in Google Search Console, specifically looking for the exclusion message ‘Crawled – currently not indexed.’
Within the sample provided, pay attention to folder patterns or any other indicators of issues. Once identified, address these problems by merging similar content using a 301 redirect or deleting irrelevant content with a 404 response as appropriate.
Provide Directives to Googlebot Regarding Content to Avoid Crawling
While using rel=canonical links and noindex tags can effectively maintain a clean Google index for your website, it’s important to be aware that they come with a cost in terms of crawling.
While there are instances where using these directives is necessary, it’s worth considering whether certain pages need to be crawled at all. If not, you can prevent Googlebot from crawling them by using a robots.txt disallow directive.
To identify situations where blocking the crawler might be more beneficial than providing indexing instructions, examine the Google Search Console page indexing report for exclusions related to canonicals or noindex tags.
Additionally, review the sample of URLs marked as ‘Indexed, not submitted in sitemap’ and ‘Discovered – currently not indexed’ in Google Search Console. Identify and block non-SEO relevant routes, such as parameter pages like “?sort=oldest,” functional pages like “shopping cart,” infinite spaces generated by calendar pages, unimportant images, scripts, or style files, and API URLs.
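Before deploying new disallow rules, it helps to test them against a list of URLs that must stay crawlable. The sketch below uses Python's standard-library robots.txt parser with hypothetical paths (`/cart`, `/api/`, `/calendar/`). Note one limitation: parameter rules such as `Disallow: /*?sort=` rely on wildcard matching, which Googlebot supports but this standard-library parser does not evaluate, so only simple prefix rules are checked here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking functional pages, API routes, and calendar pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /api/
Disallow: /calendar/
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# URLs to verify: the first two should be blocked, the last must stay crawlable.
for url in (
    "https://www.example.com/cart/checkout",
    "https://www.example.com/api/v1/products",
    "https://www.example.com/blog/crawling-guide",
):
    print(url, "allowed" if parser.can_fetch("Googlebot", url) else "blocked")
```

A check like this can run in a deployment pipeline so that a rule that accidentally blocks SEO-relevant pages is caught before it goes live.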
Furthermore, assess how your pagination strategy is affecting crawling to ensure it is optimized for efficient crawling.
Instructions to Googlebot Regarding Which Content to Crawl
An optimized XML sitemap is a valuable tool for directing Googlebot to SEO-relevant URLs.
To ensure it is optimized, the XML sitemap should dynamically update with minimal delay. It should also include the last modification date and time, providing search engines with information about when the page was last significantly changed and whether it should be recrawled.
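A sitemap with accurate last modification timestamps can be generated directly from the same data that drives the site. The sketch below assumes a list of `(url, last_modified)` pairs with timezone-aware datetimes, as might come from a CMS database; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Render an XML sitemap from (url, last_modified) pairs.

    `last_modified` must be a timezone-aware datetime reflecting the last
    significant content change, not the time the sitemap was generated.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = (
            last_modified.astimezone(timezone.utc).isoformat(timespec="seconds")
        )
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")


sitemap_xml = build_sitemap(
    [("https://www.example.com/post", datetime(2024, 5, 10, 8, 0, tzinfo=timezone.utc))]
)
```

Regenerating the file whenever content is published or significantly updated, rather than on a slow schedule, keeps the delay between a change and its appearance in the sitemap minimal.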
Facilitate Crawling through Internal Links
As we are aware, crawling relies on the presence of links. XML sitemaps provide a good starting point, and external links are valuable but difficult to build in large numbers while maintaining quality. Internal links, by contrast, are relatively easy to scale and can greatly improve crawl efficacy.
Pay special attention to internal links within your mobile sitewide navigation, breadcrumbs, quick filters, and related content sections. Ensure that these links do not rely on JavaScript, as this can hinder crawling efforts. By optimizing these internal linking elements, you can enhance the crawlability of your website.
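A quick way to audit this is to parse the raw HTML the server returns, without executing any JavaScript, and list the links that are present as plain `<a href>` elements. The sketch below is a simplified approximation of link discovery, not a reproduction of how Googlebot renders pages; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> elements in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without a real destination (missing href, fragments,
        # or javascript: pseudo-links that only work when scripts run).
        if href and not href.startswith(("#", "javascript:")):
            self.links.append(urljoin(self.base_url, href))


def crawlable_links(html, base_url):
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


navigation = """
<nav>
  <a href="/guides/crawling">Crawling guide</a>
  <a onclick="goTo('/deals')">Deals</a>
  <a href="javascript:void(0)">Menu</a>
  <span data-url="/sale">Sale</span>
</nav>
"""
links = crawlable_links(navigation, "https://www.example.com/")
```

In this example only the first navigation item is a plain link; the other three depend on JavaScript and would be candidates for conversion to standard `<a href>` elements.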
Make Sure You Optimize Your Web Crawling
Website crawling holds the utmost importance in SEO. With the inclusion of crawl efficacy as a tangible Key Performance Indicator (KPI), you now have a measurable metric to assess the impact of optimizations. This empowers you to elevate your organic performance to new heights.