How To Control Bard and Vertex AI Training Data Access on Your Websites


Generative AI refers to AI systems and models that create new content such as text, images, or other forms of data. Unlike traditional AI models, which are designed for a specific task and act on predefined rules, generative AI systems are trained on large datasets and produce novel, creative outputs.

Generative models are a specific class within generative AI and include approaches such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). They learn patterns and features from existing data and generate new instances that resemble the data they were trained on.

A GAN, for instance, consists of two networks: a generator, which produces data, and a discriminator, which compares the generated data against real data. Training the two against each other, a process called adversarial training, improves the generator's output over time.

Generative AI can be applied to areas such as natural language processing, computer vision, and artwork generation, among many others. Applications range from generating highly realistic content and simulating scenarios to supporting creative tasks. However, it also enables the creation of misinformation and deepfakes, which has fueled a growing discussion about how AI technologies should be handled responsibly and ethically.

As the generative AI landscape keeps changing, Google has been seeking a balance that works for both web publishers and AI development.

Google introduced Google-Extended as a control that lets web publishers decide whether, and to what extent, their sites help improve Google Bard, the Vertex AI generative APIs, and future AI models.

Google presents the initiative as grounded in its principles for responsible AI development, in line with its long-standing values and its commitment to consumer privacy.


What Is Google-Extended?

Google-Extended functions as a “standalone product token designed for web publishers to control their sites’ contribution to enhancing Bard and Vertex AI generative APIs” along with the associated AI models.

Google-Extended does not have its own HTTP request user-agent string. Crawling is still done with the existing Google user-agent strings, and the Google-Extended token in robots.txt is used purely as a control.

Here’s an illustrative entry for your robots.txt file:

User-agent: Google-Extended
Disallow: /paywall-content/
Allow: /

In this instance:

User-agent: Google-Extended specifies that the rules that follow apply to the Google-Extended token.
Disallow: /paywall-content/ tells Google not to use content in the “paywall-content” directory to improve Bard and Vertex AI generative APIs.
Allow: / permits content in all other site directories to be used for that purpose, including for future AI products.
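
To sanity-check rules like these before publishing them, one option is to run them through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser; the example.com URLs and the is_allowed helper are illustrative assumptions, not part of Google's tooling. Note that Python's parser applies the first matching rule, while Google documents a most-specific-match rule; for this entry both give the same answer.

```python
from urllib.robotparser import RobotFileParser

# The illustrative robots.txt entry from this article.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /paywall-content/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(token: str, url: str) -> bool:
    """Hypothetical helper: True if the rules permit `token` for `url`."""
    return parser.can_fetch(token, url)


# example.com is a placeholder domain used only for this check.
for path in ("/paywall-content/article.html", "/blog/post.html"):
    url = "https://example.com" + path
    print(path, is_allowed("Google-Extended", url))
```

With the entry above, the paywalled path should come back as disallowed and the blog path as allowed. Tokens that have no group of their own, such as Googlebot here, are not restricted by these rules.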

This development emphasizes the delicate balance between advancing AI technology and safeguarding the autonomy and interests of web publishers.


Managing Artificial Intelligence’s Access to Website Content

As AI has a growing impact on many areas of business, managing access to content used for AI training has become a real challenge for web publishers. Google has said it will work more closely with the web and AI communities to explore additional machine-readable ways of giving web publishers more choice and control.

Similarly, publishers who want to keep OpenAI from using their content to train newer models can restrict or block GPTBot, OpenAI's web crawler, through robots.txt.
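
For a site that manages several of these tokens, the robots.txt groups can be generated rather than written by hand. The following is a minimal sketch, assuming a full-site opt-out for each token; the render_opt_out function and its parameters are hypothetical names introduced here for illustration.

```python
def render_opt_out(tokens, disallow_path="/"):
    """Hypothetical helper: build one robots.txt group per token.

    Each group disallows `disallow_path` for that token; "/" covers
    the whole site.
    """
    groups = [
        f"User-agent: {token}\nDisallow: {disallow_path}"
        for token in tokens
    ]
    # Groups are separated by a blank line.
    return "\n\n".join(groups) + "\n"


# Opt out of both Google-Extended and OpenAI's GPTBot site-wide.
robots_txt = render_opt_out(["Google-Extended", "GPTBot"])
print(robots_txt)
```

Passing a narrower path, for example "/paywall-content/", would restrict only that directory instead of the whole site.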