
Robots.txt: Introducing a New Meta Tag for LLM and AI Products


Google is initiating a conversation about attributing credit and respecting copyright while utilizing large language models (LLMs) for generative AI products, with a particular emphasis on the robots.txt file.

Nonetheless, I believe that focusing solely on the robots.txt file might not be the most appropriate approach.

New Meta Tag for LLM and AI Products

What is robots.txt?

The robots.txt file is a plain text file that webmasters create to instruct web robots or search engine crawlers on how to interact with their websites. It serves as a set of guidelines for search engine bots, telling them which pages or directories on the site they are allowed to crawl and index and which ones they should avoid.

When search engine bots like Googlebot visit a website, they first look for the robots.txt file in the root directory (e.g., www.example.com/robots.txt). The file contains specific instructions using a standardized syntax, known as the Robots Exclusion Protocol (REP), which helps search engines understand the site’s preferences regarding crawling and indexing.

The basic structure of the robots.txt file typically includes two types of directives:

  • User-agent: This directive specifies which web robot or search engine crawler the following rules apply to. For example, “User-agent: Googlebot” means the rules that follow are for Google’s crawler.
  • Disallow: This directive indicates the specific pages or directories that the web robot should not crawl or index. For instance, “Disallow: /private/” instructs the search engine not to crawl pages in the “/private/” directory.
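
Putting these two directives together, a minimal robots.txt file might look like the following (the bot names and paths are illustrative):

```text
# www.example.com/robots.txt

# Rules for Google's crawler only
User-agent: Googlebot
Disallow: /private/

# Rules for every other crawler
User-agent: *
Disallow: /tmp/
```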

By using the robots.txt file, website owners can control which parts of their site are accessible to search engines, which is useful for managing crawl budgets, protecting sensitive information, and preventing duplicate content issues. However, it’s essential to use the file carefully to avoid accidentally blocking search engine crawlers from important parts of the website, potentially affecting its search engine rankings and visibility.


What are the reasons for not using robots.txt?

Initiating the discussion on respecting publishers’ copyright by focusing on robots.txt is an unsuitable starting point for several reasons.

Some LLMs do not use their own crawlers at all, and those that do often fail to identify themselves.

The responsibility falls on the website operator to identify and block specific crawlers that may exploit and distribute their data for generative AI products. This adds a significant amount of additional and often unnecessary work, especially for smaller publishers.

Furthermore, this approach assumes that the publisher has editing access to their robots.txt file, which is not always the case with hosted solutions.

As the number of crawlers keeps increasing, this approach becomes unsustainable.

Under the recently standardized Robots Exclusion Protocol (RFC 9309), crawlers are only required to parse the first 500 kibibytes of a robots.txt file.

As a result, large publishers may encounter difficulties with their robots.txt file when they have a substantial number of LLM crawlers and/or refined URL patterns to block, in addition to other bots. The file size limitation can pose challenges in managing and accommodating all the necessary rules and directives within the constraints of the robots.txt file.

An ‘all or nothing’ approach is unacceptable

When dealing with larger crawlers like Googlebot and Bingbot, it becomes challenging to differentiate between the data utilized for traditional search engine results pages (where a “citation” to the original source usually exists as an agreement between the publisher and the search engine) and generative AI products.

Blocking Googlebot or Bingbot for their generative AI products also means forfeiting potential visibility in their respective search results. This situation creates an undesirable dilemma for publishers, as they are forced to make an “all or nothing” decision, lacking a more nuanced approach.

Robots.txt primarily deals with managing crawling, while the copyright discussion revolves around how the data is utilized

Because the copyright question concerns the indexation/processing phase rather than crawling, robots.txt is less pertinent to this discussion. It should serve as a last resort if no other solutions are viable, rather than being the starting point.

Robots.txt files function well for regular crawlers and do not require alterations for LLMs. While LLM crawlers should identify themselves, the crucial aspect that demands attention is the indexation/processing of the crawled data. This should be the primary topic of discussion, addressing the handling and usage of data collected by LLM crawlers.


Reinventing the wheel

Fortunately, the web already offers well-established solutions for managing data usage in compliance with copyrights, known as Creative Commons licenses.

Most Creative Commons licenses are suitable for the purposes of LLMs. Here are the seven main licenses, ordered from most to least permissive:

  • CC0: This license allows LLMs to freely distribute, remix, adapt, and build upon the material in any medium or format without any conditions or restrictions.
  • CC BY: LLMs can distribute, remix, adapt, and build upon the material in any medium or format, as long as they give appropriate attribution to the creator. Commercial use is permitted, but credit must be given.
  • CC BY-SA: LLMs can distribute, remix, adapt, and build upon the material, provided they attribute the original creator. Commercial use is allowed, and if LLMs create modified versions, they must license the derivatives under the same terms.
  • CC BY-NC: LLMs can distribute, remix, adapt, and build upon the material for noncommercial purposes only, while still giving attribution to the creator.
  • CC BY-NC-SA: Similar to CC BY-NC, this license allows LLMs to distribute, remix, adapt, and build upon the material for noncommercial purposes, with proper attribution, and any modifications must be licensed under identical terms.
  • CC BY-ND: LLMs can copy and distribute the material in its original form, giving credit to the creator. Commercial use is permitted, but no derivatives or adaptations are allowed.
  • CC BY-NC-ND: LLMs can copy and distribute the material in its original form for noncommercial purposes, while providing attribution to the creator and not creating any derivatives or adaptations of the work.

By using these Creative Commons licenses, LLMs can respect copyright requirements and adhere to proper usage guidelines for the materials they utilize. The ND (no-derivatives) licenses, however, are typically not suitable for LLMs, since generative output is by its nature a derivative of the material the model was trained on.

On the other hand, the licenses that do permit adaptation (CC0, CC BY, CC BY-SA, CC BY-NC, and CC BY-NC-SA) mean that LLMs must carefully consider how they use the data they crawl or obtain and ensure compliance with the publishers’ requirements, including proper attribution when sharing the products derived from the data.

This approach places the responsibility on the “few” LLMs in the world rather than burdening the “many” publishers.

Additionally, CC0, CC BY, and CC BY-SA support the “traditional” usage of the data, such as in search engine results, where attribution is given through links to the original websites. Meanwhile, the noncommercial licenses (CC BY-NC and CC BY-NC-SA) are conducive to research and development on open-source LLMs.


What is a Meta Tag?

A meta tag is a type of HTML tag used to provide metadata or information about a web page. It does not directly affect the content visible to website visitors but instead offers additional information for web browsers and search engines. Meta tags are placed in the head section of an HTML document and provide data like page description, authorship, keywords, character set, and viewport settings.

Some common meta tags include:

  • <meta charset="">: Specifies the character encoding for the HTML document.
  • <meta name="description" content="">: Provides a brief description of the page’s content.
  • <meta name="keywords" content="">: Specifies keywords or phrases relevant to the page’s content.
  • <meta name="author" content="">: Indicates the author of the page.
  • <meta name="viewport" content="">: Sets the viewport properties for responsive design on different devices.
  • <meta name="robots" content="">: Informs search engine crawlers about how to handle the page (index, follow, noindex, nofollow, etc.).
  • <meta http-equiv="refresh" content="">: Automatically refreshes or redirects the page after a specified time.
  • <meta property="og:title" content="">: Used for Open Graph tags to specify the title of a page for social media sharing.
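
For illustration, a typical document head combining several of these tags might look like the following (all values are placeholders):

```html
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <meta name="description" content="A short summary of the page's content.">
  <meta name="robots" content="index, follow">
  <meta property="og:title" content="Example Page Title">
</head>
```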

Meta tags play a significant role in helping search engines understand the content and context of web pages, which can impact their visibility and ranking in search results. They also facilitate proper rendering and display of web pages on various devices and browsers.

The Meta Tag is the solution

After a publisher identifies an appropriate license, the next step is to effectively communicate that license. However, the robots.txt approach appears unsuitable for this purpose.

Blocking a page from crawling by search engines doesn’t necessarily mean it can’t be utilized or isn’t valuable for LLMs. These are distinct use cases.

To address these different use cases more precisely and make the process easier for publishers, I recommend utilizing a meta tag instead.


Meta tags are snippets of code that can be inserted at the page level, within a theme, or even within the content (not technically correct, but HTML’s flexibility allows it as a last resort when publishers have limited code access). A meta tag does not prevent crawling the way a robots.txt disallow rule does; the page still has to be fetched for the tag to be read. Instead, it serves as a means to communicate the usage rights of the published data.

There are existing copyright tags, such as rights-standard (an abandoned proposal) and copyright-meta (which emphasizes the owner’s name rather than the license), but their current implementation on some websites may conflict with the goals we aim to achieve here. A well-designed and standardized meta tag can provide a more effective solution for communicating licensing information and rights for LLMs and publishers.

A new meta tag may be necessary, although existing ones like “rights-standard” could also serve the purpose. For this discussion, I propose the following new meta tag:

<meta name="usage-rights" content="CC-BY-SA" />

Moreover, I recommend supporting this meta tag in HTTP headers, similar to how “noindex” is supported in the X-Robots-Tag header, to help LLM crawlers manage their crawl resources efficiently. By checking the HTTP headers, LLMs can validate the usage rights without parsing the page.

For instance, the header could be used on its own as follows:

X-Robots-Tag: usage-rights: CC-BY-SA

In the example below, the page should not be used for search results, but it can be utilized for commercial LLMs as long as proper credit is given to the source:

X-Robots-Tag: usage-rights: CC-BY, noindex 

This standardized approach allows for better communication of usage rights between publishers and LLMs, enabling more efficient data access and proper attribution.
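
To make the proposal concrete, here is a minimal sketch of how a crawler might read this signal: check the X-Robots-Tag HTTP header first (so no page parsing is needed), then fall back to the page’s meta tags. Note that “usage-rights” is the tag proposed in this article, not an adopted standard, and the helper names below are purely illustrative.

```python
# Sketch of a crawler-side check for the proposed "usage-rights" signal.
# Assumes the header and meta tag formats proposed in this article.
from html.parser import HTMLParser


class _UsageRightsMeta(HTMLParser):
    """Collects the content of <meta name="usage-rights" ...>, if present."""

    def __init__(self):
        super().__init__()
        self.rights = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") == "usage-rights":
                self.rights = attributes.get("content")


def usage_rights(headers, html=""):
    """Return the declared license string (e.g. 'CC-BY-SA') or None."""
    # 1. HTTP header: lets a crawler decide before parsing the body.
    #    The header may carry several comma-separated directives.
    for part in headers.get("X-Robots-Tag", "").split(","):
        name, _, value = part.strip().partition(":")
        if name.strip().lower() == "usage-rights":
            return value.strip()
    # 2. Fall back to the meta tag inside the document itself.
    parser = _UsageRightsMeta()
    parser.feed(html)
    return parser.rights
```

With the combined header from the example above, `usage_rights({"X-Robots-Tag": "usage-rights: CC-BY, noindex"})` would return "CC-BY", telling the crawler the license without the page body having to be fetched at all.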


Not a Completely Reliable Solution

Indeed, there are malicious crawlers and unscrupulous actors developing LLMs and generative AI products.

While the suggested meta tag solution may not entirely prevent content from being used in such ways, the robots.txt file also lacks the capability to do so.

It’s essential to recognize that both approaches rely on the recognition and adherence of the companies utilizing the data for their AI products. Ensuring responsible and ethical usage ultimately lies in the hands of these entities.

Conclusion

This article aims to demonstrate that, in my opinion, utilizing robots.txt for managing data usage in LLMs is not the most suitable approach or starting point when addressing copyright concerns in the era of LLMs and generative AI products.

Instead, the proposed meta tag implementation offers a more effective solution, enabling publishers to specify copyright information at the page level using Creative Commons. Importantly, this approach does not hinder the crawling or indexing of pages for other purposes, such as search engine results. Moreover, it allows for comprehensive copyright declarations to cover various uses, including LLMs, generative AI products, and potential future AI products.
