Google Opens Discussions on Source Attribution and Copyright Considerations over the Use of Large Language Models (LLMs) for Generative AI Services, Most Notably in Regard to the robots.txt File.
However, I don't feel that relying only on the robots.txt file is the proper direction to take.
What is robots.txt?
The robots.txt file is a plain text file, prepared by the webmaster, that instructs web robots or search engine crawlers how to behave on a website. It serves as a set of directions for search engine bots, telling them which pages or directories of the site they may crawl and index and which they should not.
Whenever search engine bots, such as Googlebot, access a website, they first look for the robots.txt file in the root directory, for example www.example.com/robots.txt. The file should contain instructions in a standardized syntax known as the Robots Exclusion Protocol (REP), which tells search engines the site's preferences about crawling and indexing.
The general structure of the robots.txt file usually consists of two kinds of directives, illustrated in the example after this list:
- User-agent: This defines which search engine web robot the ensuing rules are for. For example, “User-agent: Googlebot” simply states that the ensuing rules are for Google’s crawler.
- Disallow: This defines which pages or directories on the website the web robot should not crawl or index. Example: “Disallow: /private/” tells the search engine not to crawl pages in the “/private/” directory.
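For illustration, a minimal robots.txt (with hypothetical directory names) might look like this:

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/

Here Googlebot is asked to skip the /private/ directory, while all other crawlers are asked to skip /tmp/; everything else may be crawled and indexed.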
The robots.txt file lets you hide parts of a website from search engines, which is useful for managing crawl budget, protecting sensitive information, and avoiding duplicate content issues. It must be used with care, however, since accidentally blocking search engine crawlers from important parts of the website can significantly hurt the site's search engine ranking and visibility.
What are the reasons not to use robots.txt?
Opening a discussion about respecting publishers' copyright with robots.txt is a poor starting point for several reasons.
Some LLMs do not utilize crawlers and do not identify themselves.
It falls upon the website operator to independently identify and block the crawlers whose data ends up in generative AI products. This adds a significant amount of additional and often unnecessary work, particularly for smaller publishers.
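For example, a publisher who wants to keep their content out of Common Crawl's dataset (which is widely used for LLM training) first has to know that its crawler identifies itself as CCBot, and then add something like:

User-agent: CCBot
Disallow: /

They would then need to repeat this for every other crawler they manage to identify.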
This approach also assumes that the publisher can edit their robots.txt file, which may not be true with hosted solutions.
This approach also does not scale as the number of crawlers increases.
The newly proposed robots.txt standard also limits the file size that crawlers are required to parse to 500 kibibytes (KiB), so rules beyond that limit may be ignored.
Large publishers will also run into issues with their robots.txt file when they have a significant number of LLM crawlers and/or refined URL patterns to block on top of other bots. The file size limit can make it challenging to manage and accommodate all the rules and directives within the robots.txt file.
An ‘all or nothing’ approach is not acceptable
There is also a practical difficulty where large crawlers such as Googlebot and Bingbot are concerned: distinguishing between the data used for conventional search engine results pages (where, conceptually at any rate, a "citation" of the original source exists in a form of agreement between publisher and search engine) and the data used for generative AI products.
It also means that blocking Googlebot or Bingbot because of their generative AI products forfeits any potential visibility in their search results. This puts publishers in an undesirable position where they have to choose "all or nothing", since no more granular approach exists.
While copyright concerns are more about the usage of the data, robots.txt is mainly about controlling crawling.
This makes robots.txt less applicable, since copyright concerns arise at the indexing/processing step, while robots.txt is mainly focused on crawling. Robots.txt should therefore be an option of last resort, not a starting point.
Robots.txt files work just fine for normal crawlers and do not need changes for LLMs. While it is good for LLM crawlers to identify themselves, the most important topic to discuss here is the indexation and processing of the crawled data: how LLM crawlers handle and use the results of their crawls.
Reinventing the wheel
Fortunately, the web has already developed a mature solution for managing data usage with respect to copyright: Creative Commons licenses.
Most Creative Commons licenses are suitable for the purposes of LLMs. Several are explained below to illustrate the point:
- CC0: The material may be distributed, remixed, adapted, or built upon in any medium or format, with no conditions or restrictions.
- CC BY-NC-ND: LLMs can share and distribute the material in its original form for non-commercial purposes, giving appropriate attribution to the creator. The work cannot be remixed, adapted, or built upon.
- CC BY: LLMs can distribute, remix, adapt, and build upon the material in any medium or format, as long as they give appropriate credit to the creator. Commercial use is allowed provided credit is given.
- CC BY-NC-SA: This license allows LLMs to distribute, remix, adapt, and build upon the material for noncommercial purposes. Proper attribution must be provided, and modified versions must be licensed under the same terms.
- CC BY-ND: This license allows LLMs to copy and distribute the material in unchanged, unmodified form, with full credit given to the creator of the work. Commercial use is allowed, but creating derivatives or adaptations is not permitted.
- CC BY-NC: LLMs are allowed to distribute, remix, adapt, and build upon the material for noncommercial purposes only, provided they give attribution to the creator.
- CC BY-SA: This license allows LLMs to distribute, remix, adapt, and build upon the material, with attribution given to the original creator. Commercial use is allowed; if LLMs create modified versions, they must license the derivatives under the same terms.
These Creative Commons licenses allow LLMs to follow copyright requirements and proper usage guidelines for the materials they use. The last two licenses are generally not appropriate for LLMs.
The first five licenses, on the other hand, mean that LLMs have to be very considerate in how they use the data they crawl or obtain, and must ensure that the usage complies with the publishers' requirements, such as proper attribution, when sharing products derived from the data.
This approach puts the burden on the “few” LLMs in this world and not on the “many” publishers.
Besides, the first three licenses provide for "traditional" use of the data, say in search engine results where attribution is made via links to the original websites, while the fourth and fifth licenses are friendly to research and development in open-source LLMs.
What is a Meta Tag?
A meta tag is an HTML tag that provides information, or metadata, about a web page. It does not directly affect the content visible to website visitors but provides extra information to web browsers and search engines. Meta tags go in the head section of an HTML document and convey information such as the page description, authorship, keywords, character set, and viewport settings.
Some common meta tags include:
- <meta charset="">: Specifies the character encoding for the HTML document.
- <meta name="description" content="">: Provides a brief description of the page's content.
- <meta name="keywords" content="">: Specifies keywords or phrases relevant to the page's content.
- <meta name="author" content="">: Indicates the author of the page.
- <meta name="viewport" content="">: Sets the viewport properties for responsive design on different devices.
- <meta name="robots" content="">: Informs search engine crawlers about how to handle the page (index, follow, noindex, nofollow, etc.).
- <meta http-equiv="refresh" content="">: Automatically refreshes or redirects the page after a specified time.
- <meta property="og:title" content="">: Used for Open Graph tags to specify the title of a page for social media sharing.
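To show how these fit together, a minimal head section using a few of the tags above (with illustrative values) might look like this:

<head>
<meta charset="UTF-8">
<meta name="description" content="A short summary of the page.">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="robots" content="index, follow">
</head>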
Meta tags play a significant role in helping search engines understand the content and context of web pages, which can impact their visibility and ranking in search results. They also facilitate proper rendering and display of web pages on various devices and browsers.
The Meta Tag is the solution
Once a publisher has determined a suitable license, the second part of the process involves a mechanism to express that license. It seems the robots.txt method is not well-suited for that.
Blocking a page from being crawled for search does not necessarily have anything to do with whether it is useful or usable for LLMs; these are different use cases. Instead, I would recommend using a meta tag, which can target these various use cases more precisely and ease the pain for publishers.
Meta tags are pieces of code that can be inserted at the page level, within a theme, or even in the content, something that, while not technically correct, HTML's flexibility allows as a last resort when publishers have limited access to their code. Using meta tags does not impede crawling, just as the meta noindex tag does not. They do, however, provide a method to communicate the usage rights of the published data.
There are already copyright-related meta tags, such as the abandoned rights-standard proposal and the copyright meta tag (which places the emphasis on the owner's name rather than the license), among others; however, their existing use on some sites may conflict with what is being sought here. A well-designed, standardized meta tag can offer a better way to communicate licensing and rights information between LLMs and publishers.
A new meta tag may be required, though the same could equally be achieved with an existing one, such as "rights-standard". For the purposes of this discussion, I will be using the following new meta tag:
<meta name="usage-rights" content="CC-BY-SA" />
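As an illustration of how this proposed (not yet existing) tag would sit alongside familiar meta tags at the page level:

<head>
<meta name="description" content="A short summary of the page.">
<meta name="robots" content="index, follow">
<meta name="usage-rights" content="CC-BY-SA" />
</head>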
Moreover, I recommend supporting this meta tag in HTTP headers, similar to how "noindex" is supported in the X-Robots-Tag header, to assist LLM crawlers in efficiently managing their crawl resources. By checking the HTTP headers, LLMs can easily validate the usage rights.
For instance, the usage in an HTTP header could look like this:
X-Robots-Tag: usage-rights: CC-BY-SA
In the example below, the page should not be used for search results, but it can be utilized for commercial LLMs as long as proper credit is given to the source:
X-Robots-Tag: usage-rights: CC-BY, noindex
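The same combination could equally be expressed with meta tags in the page's head, again using the proposed usage-rights name:

<meta name="robots" content="noindex">
<meta name="usage-rights" content="CC-BY" />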
This standardized approach allows for better communication of usage rights between publishers and LLMs, enabling more efficient data access and proper attribution.
A Fully Reliable Solution?
Yes, there are malicious crawlers, and there are bad actors creating LLMs and generative AI products.
However, the meta tag solution described above does not completely prevent this sort of misuse of content, and neither does the robots.txt file.
Both, in fact, depend on recognition and adherence by the firms that apply this information to their AI products; responsible and ethical usage ultimately rests with them.
Conclusion
This article is intended to show that, for those working on copyright and data usage management for LLMs and generative AI products, robots.txt is a very poor way to proceed, let alone a starting point.
Instead, the meta tag proposal allows publishers to declare copyright information at the page level using Creative Commons licenses. Importantly, this does not interfere with the crawling or indexing of pages for other purposes, for example search engine results, while extending full copyright declarations to other uses such as LLMs, generative AI products, and any future AI products.