A crucial part of how the internet functions is a small text file called robots.txt, which lets website owners tell crawlers run by search engines like Google which parts of their sites may be accessed. The file has helped maintain order on the web for decades by letting websites choose whether to permit or deny scraping of their content by tech giants. Most sites have allowed Google to scrape their content because of the significant traffic the company sends back. With the onset of the AI wars, however, this long-standing bargain is being disrupted.
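As a rough illustration of the permit-or-deny choice described above, the sketch below parses a made-up robots.txt with Python's standard `urllib.robotparser` module and asks whether a given crawler may fetch a given URL. The rules, the `SomeOtherBot` crawler name, the `build_parser` helper, and the example.com URLs are assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, while every
# other crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot matches its own group, which allows every path.
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # True
# Any other crawler falls back to the "*" group.
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/page"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/news"))  # True
```

The file only states the site's wishes; it works because well-behaved crawlers choose to honor it.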
Unraveling the Bargain
Data scraped from the web has become the foundation for training powerful AI models built by OpenAI, Google, Meta, and others. These models draw on that vast pool of online content to answer user queries directly, which could reduce the traffic sent to websites and undermine the established web bargain. In response, Google has introduced Google-Extended, a control that websites can add to robots.txt to block the company from using their content to train AI models.
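A minimal sketch of what such an opt-out could look like, assuming a site wants to remain crawlable for search while declining AI training use. The robots.txt text, the `is_allowed` helper, and the placeholder URL are illustrative assumptions. The check is a local simulation with Python's `urllib.robotparser` and says nothing about how Google applies the rule internally.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: disallow the Google-Extended token everywhere,
# and leave the site open to all other crawlers, including Googlebot.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the rules permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "https://example.com/article"  # placeholder URL

print(is_allowed(ROBOTS_TXT, "Googlebot", url))        # True: search crawling stays open
print(is_allowed(ROBOTS_TXT, "Google-Extended", url))  # False: AI training use is declined
```

Keeping the two rules in separate groups is what lets a publisher decline training use without giving up search traffic.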
Adoption of Google-Extended
Data from Originality.ai indicates that approximately 10% of the top 1,000 websites had added the Google-Extended snippet as of late March. Notable publications such as The New York Times use it to prevent Google from using their content to train AI models. The move reflects the intensifying battles over AI copyright and the growing reluctance of content creators to contribute to AI training without explicit permission.
Comparison with Other Blockers
Although prominent websites such as CNN, BBC, Yelp, and Business Insider have adopted Google-Extended, it is used less than other AI training-data blockers. OpenAI’s GPTBot, for instance, is blocked by approximately 32% of the top 1,000 websites, and Common Crawl’s CCBot is also blocked more widely than Google-Extended. The gap in adoption raises questions about how blocking access to training data might affect AI-generated search results.
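Adoption figures like these could in principle be produced by checking each site's robots.txt for rules that shut out each AI token. The sketch below shows one way to do that over a tiny, entirely made-up sample; the site names, file contents, and helper functions are hypothetical, and this is not a description of Originality.ai's actual methodology.

```python
from urllib.robotparser import RobotFileParser

AI_TOKENS = ["Google-Extended", "GPTBot", "CCBot"]

# Hypothetical robots.txt files keyed by invented site names (not real data).
SAMPLE_SITES = {
    "site-a.example": "User-agent: GPTBot\nDisallow: /\n",
    "site-b.example": (
        "User-agent: GPTBot\n"
        "User-agent: CCBot\n"
        "User-agent: Google-Extended\n"
        "Disallow: /\n"
    ),
    "site-c.example": "User-agent: *\nAllow: /\n",
    "site-d.example": (
        "User-agent: CCBot\nDisallow: /\n\n"
        "User-agent: GPTBot\nDisallow: /\n"
    ),
}


def blocked_tokens(robots_txt: str) -> set[str]:
    """Return the AI tokens that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token for token in AI_TOKENS if not parser.can_fetch(token, "/")}


def block_rates(sites: dict[str, str]) -> dict[str, float]:
    """Share of sites in the sample that block each AI token."""
    counts = {token: 0 for token in AI_TOKENS}
    for robots_txt in sites.values():
        for token in blocked_tokens(robots_txt):
            counts[token] += 1
    return {token: counts[token] / len(sites) for token in AI_TOKENS}


print(block_rates(SAMPLE_SITES))
# {'Google-Extended': 0.25, 'GPTBot': 0.75, 'CCBot': 0.5}
```

This version only tests the site root, so a site that blocks a token from part of its content would be counted as not blocking it. A real survey would need a stated rule for such cases.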
Implications for the Future
Jonathan Gillham, CEO of Originality.ai, points to a risk in blocking access to training data: relevant content could be excluded from AI-generated search results. As Google experiments with generative AI search through its Search Generative Experience (SGE), its decisions about how this new form of search is rolled out and how it works will significantly influence the future of the web in the AI era.
In conclusion, tools like Google-Extended underscore the shifting relationship between content creators and tech giants in the AI era. The choices that publishers and AI developers make today about model training and content access will shape how the web and artificial intelligence interact in the years ahead.