Last week, members of the Common Crawl Foundation team (Sebastian Nagel, Pedro Ortiz Suarez, and Thom Vaughan) attended the 2025 IIPC General Assembly (GA) and Web Archiving Conference (WAC), hosted by the National Library of Norway in Oslo. As new members of the IIPC, we are thrilled to join a global community of organizations committed to preserving the web for future generations, and to have had the chance to present some of our work to colleagues in the web archiving space.

Common Crawl delivered a range of contributions including poster presentations, lightning talks, and a workshop. These were very well received, and we appreciated the many conversations that followed.

We had the opportunity to (re)connect with representatives from several national libraries, including those of Norway, Sweden, Denmark, France, and the Netherlands, as well as researchers and professionals from industry and academia.

During the General Assembly and Web Archiving Conference, our lightning talks, posters, and workshop covered:
- the rocky road to converting ARC files to the WARC format,
- Asynchronous and Modular Pipelines for Fast WARC Annotation,
- Politely Downloading Millions of WARC Files Without Burning the Servers Down via cc-downloader,
- Crawler Politeness in the Age of GenAI,
- Crawling with HTTP/2, and
- a hands-on workshop on using Common Crawl’s Web Graph releases.

Our team also met with Stephan Oepen from the University of Oslo, and with colleagues from the End of Term Archive project with whom we’ve collaborated on EOT 2024: Ilya Kreymer of Webrecorder, Sawood Alam of the Internet Archive, and Mark Phillips of the University of North Texas Libraries.

We’re looking forward to more discussions with our friends (new and old) from the IIPC in the near future.

Erratum: Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, the limit is 5 MiB.
For more details, see our truncation analysis notebook.
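
As a rough illustration of the thresholds described above, the following sketch maps a crawl ID to the fetch size limit that would have applied. It assumes crawl IDs follow the `CC-MAIN-YYYY-WW` pattern and sort chronologically by year and week. The function names `truncation_limit_bytes` and `may_be_truncated` are hypothetical helpers written for this example, not part of any Common Crawl tool.

```python
import re

MIB = 1024 * 1024

# First crawl with the larger limit, as stated in the erratum above.
LIMIT_CHANGE = (2025, 13)

# Assumption: crawl IDs look like "CC-MAIN-2025-13" (four-digit year, two-digit week).
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit (in bytes) assumed for the given crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so this orders crawls by year, then week.
    return 5 * MIB if (year, week) >= LIMIT_CHANGE else 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic only: a payload at or above the limit was possibly cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)


if __name__ == "__main__":
    for crawl in ("CC-MAIN-2025-08", "CC-MAIN-2025-13"):
        print(crawl, truncation_limit_bytes(crawl) // MIB, "MiB")
```

The length comparison is only a heuristic. Where a record carries a `WARC-Truncated` header, that header is the more reliable signal and should take precedence over this check.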