September 30, 2024

IAB Workshop on AI-CONTROL

Note: this post has been marked as obsolete.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Earlier this month, the Common Crawl Foundation had the privilege of participating in a groundbreaking workshop hosted by the Internet Architecture Board (IAB) in Washington, DC. The workshop, titled "IAB Workshop on AI-CONTROL," brought together experts from crawling companies, web publishers, AI companies, and “bot defense” companies to discuss the intersection of artificial intelligence and Internet protocols. Thom Vaughan from Common Crawl served on the program committee.

*Left to right: Thom Vaughan, Paul Ohm, Carl Gahnberg, Jari Arkko, Farzaneh Badii (photo permission graciously approved)*

Key Topics of Discussion

Adhering to the Chatham House Rule limits the specifics we can share, but we can highlight some of the general themes that were explored:

Opt-out and Opt-in Vocabulary

The workshop attendees discussed various approaches for allowing individuals and organizations to opt out of AI data collection and processing. We agreed that it was important to develop a vocabulary that clearly expresses the preferences of rights holders and authors. It should include Creative Commons-style opt-in choices, and it will be important for the eventual EU text and data mining (TDM) opt-out registry.

The Rise of Bot Defenses

Attendees discussed the growing use of “bot defenses” to block crawling, in place of robots.txt.

Providers of these defenses sometimes treat archive crawlers (like Common Crawl’s CCBot) the same as bots crawling for particular AI companies. This is unfortunate, and it is also opaque: a site's bot-defense rules are generally not visible to outsiders, whereas robots.txt is usually a public document and is obeyed by a significant number of crawlers. We discussed some examples of US government websites inadvertently blocking the official 2024 End of Term Archive.
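To illustrate why robots.txt is the more transparent mechanism, here is a minimal sketch in Python of how a site can state different preferences for different crawlers, and how anyone can check them. The robots.txt content, the `ExampleAIBot` name, and the `is_allowed` helper are hypothetical and exist only for this illustration; `CCBot` is the user-agent token mentioned above. The standard library's `urllib.robotparser` does the parsing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: welcome the archive crawler, refuse one AI crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Allow: /

User-agent: ExampleAIBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/reports/2024.html"
    for agent in ("CCBot", "ExampleAIBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Because the file is public, a crawler operator, a site owner, or a third party can all run the same check and see the same answer, which is not the case when a request is silently dropped by a bot-defense layer.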

Stakeholder Concerns

A wide range of perspectives was shared, and we discussed the balancing act between innovation, privacy, and ethical considerations.

The Future

The discussions at this workshop will undoubtedly influence future recommendations and standards in the realm of Generative AI and Internet protocols. As we enter what many are calling the "dawn of generative AI," the guidance provided by organizations like the IAB will be instrumental in shaping a responsible and innovative future.

While we can't share specific details of the presentations or discussions, we can say that the level of expertise was high and the conversations went deep. Our organization was well represented, with our CTO Greg Lindahl contributing valuable insights to the discussions. A report on the workshop will soon be published on the IETF Datatracker.

Conclusion

It's clear that the intersection of AI and Internet protocols will remain a critical area of focus. Workshops like this one play a vital role in fostering collaboration and developing thoughtful approaches to emerging challenges.

We look forward to seeing how the ideas exchanged at this workshop will shape future guidelines and best practices in the field. If you want to continue the discussion, you're more than welcome to join us on our Discord server or in our Google Group.

Note: This blog post adheres to the Chatham House Rule. No specific statements or opinions have been attributed to individual participants.

*Left to right: Thom Vaughan, Washington Monument, Greg Lindahl*

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.
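As a rough illustration of the thresholds described in this erratum, the Python sketch below picks the fetch size limit that would apply to a given crawl. The function names and the regular expression are hypothetical, and the sketch assumes that crawl identifiers follow the `CC-MAIN-YYYY-WW` pattern and that CC-MAIN-2025-13 is the first crawl with the 5 MiB limit. Comparing a payload length against the limit is only a heuristic for spotting possibly truncated content, not an authoritative test.

```python
import re

MIB = 1024 * 1024
OLD_LIMIT = 1 * MIB          # threshold before the March 2025 crawl
NEW_LIMIT = 5 * MIB          # threshold from the March 2025 crawl onwards
FIRST_NEW_CRAWL = (2025, 13)  # assumption: CC-MAIN-2025-13 already uses 5 MiB

# Assumed identifier shape: CC-MAIN-<4-digit year>-<2-digit week>
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit (in bytes) assumed for a crawl ID."""
    match = CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year_week = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so this orders by year, then week.
    return NEW_LIMIT if year_week >= FIRST_NEW_CRAWL else OLD_LIMIT


def possibly_truncated(payload_length: int, crawl_id: str) -> bool:
    """Heuristic: a payload at or above the limit may have been cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)


if __name__ == "__main__":
    for cid in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(cid, truncation_limit_bytes(cid) // MIB, "MiB")
```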
