Earlier this month, the Common Crawl Foundation had the privilege of participating in a groundbreaking workshop hosted by the Internet Architecture Board (IAB) in Washington, DC. The workshop, titled "IAB Workshop on AI-CONTROL," brought together experts from crawling companies, web publishers, AI companies, and “bot defense” companies to discuss the intersection of artificial intelligence and Internet protocols. Thom Vaughan from Common Crawl served on the program committee.
Key Topics of Discussion
While adhering to the Chatham House Rule limits the specifics we can share, we can highlight some of the general themes that were explored:
Opt-out and Opt-in Vocabulary
The workshop attendees discussed various approaches for allowing individuals and organizations to opt out of AI data collection and processing. We agreed that it was important to develop a vocabulary that clearly expresses the preferences of rights holders and authors. This vocabulary should include Creative Commons-style opt-in choices, and it will be important for the eventual EU text and data mining (TDM) opt-out registry.
The Rise of Bot Defenses
Attendees discussed the recent popularity of using “robot defenses” to stop crawling, instead of robots.txt. Providers of these defenses are sometimes treating archive crawlers (like Common Crawl’s CCBot) the same as bots crawling for particular AI companies. This is unfortunate and opaque: robots.txt is usually a public document, and is obeyed by a significant number of crawlers. We discussed some examples of US government websites inadvertently blocking the official 2024 End of Term Archive.
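To illustrate why robots.txt is the more transparent option, here is a minimal sketch of how a site could publicly welcome an archive crawler while turning away a different bot, and how a well-behaved crawler would check those rules before fetching. It uses Python's standard-library `urllib.robotparser`. The robots.txt contents, the `ExampleAIBot` user agent, the `example.com` URL, and the `is_allowed` helper are all hypothetical, made up for this example; this is not a description of how CCBot or any particular crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: it allows CCBot
# and blocks a made-up crawler called ExampleAIBot.
ROBOTS_TXT = """\
User-agent: CCBot
Allow: /

User-agent: ExampleAIBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a hypothetical site.
page = "https://example.com/page"
ccbot_ok = is_allowed(ROBOTS_TXT, "CCBot", page)
aibot_ok = is_allowed(ROBOTS_TXT, "ExampleAIBot", page)

print("CCBot allowed:", ccbot_ok)
print("ExampleAIBot allowed:", aibot_ok)
```

Because the rules live in a public file, anyone can read them and see exactly which crawlers are welcome, which is not the case when blocking happens inside a third-party bot defense.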
Stakeholder Concerns
A wide range of perspectives was shared, and we discussed the balancing act between innovation, privacy, and ethical considerations.
The Future
The discussions at this workshop will undoubtedly influence future recommendations and standards in the realm of Generative AI and Internet protocols. As we enter what many are calling the "dawn of generative AI," the guidance provided by organizations like the IAB will be instrumental in shaping a responsible and innovative future.
While we can't share specific details of the presentations or discussions, we can say that the level of expertise and the depth of conversation were extensive. Our organization was well represented, with our CTO Greg Lindahl contributing valuable insights to the discussions. A conference report will soon be published on the IETF datatracker.
Conclusion
It's clear that the intersection of AI and internet protocols will remain a critical area of focus. Workshops like this one play a vital role in cultivating collaboration and developing thoughtful approaches to emerging challenges.
We look forward to seeing how the ideas exchanged at this workshop will shape future guidelines and best practices in the field. If you want to continue the discussion, you're more than welcome to join us on our Discord server or in our Google Group.
Note: This blog post adheres to the Chatham House Rule. No specific statements or opinions have been attributed to individual participants.