Table of Contents
- Web Languages Project
- NeurIPS Social with Common Crawl and Wikimedia
- Event Updates
- Open Job Positions
Web Languages Project
We have launched the Web Languages project, a volunteer effort with the goal of improving our crawling by making a human-curated list of important non-English websites.
Common Crawl recognizes many languages in its datasets, and we can see that we don't have enough data in languages like Hindi (which has 500+ million speakers!), smaller countries’ languages like Hungarian, and regional languages like Catalan. We are interested in languages from all over the world. By contributing, you can help improve the coverage of underrepresented languages, making a meaningful impact on their visibility and accessibility.
For more details about the project please see our Web Languages GitHub repo, and join our Discord for further discussion and questions.
NeurIPS Social with Common Crawl and Wikimedia
Common Crawl and Wikimedia will host an in-person social event at NeurIPS, Nonprofits Bridging Tech and Social Impact. If you will be at NeurIPS this December in Vancouver, join us to explore the intersections between nonprofit organizations and the tech community. This session, held on December 11 at 7:30pm, will feature representatives from the Wikimedia Foundation and Common Crawl Foundation, offering an opportunity to connect with nonprofits committed to using technology for social missions.
The event will begin with presentations from both organizations, highlighting their goals, projects and research (e.g., Wikipedia, Common Crawl datasets), and challenges facing the open commons community. Following the presentations, the session will transition into roundtable discussions focused on current initiatives and an open Q&A.
Event Updates
We’ve been busy attending numerous events this fall. In late September, we had the privilege of participating in a groundbreaking workshop on AI-CONTROL hosted by the Internet Architecture Board (IAB) in Washington, DC. This event brought together experts from crawling companies, web publishers, AI companies, and “bot defense” companies to discuss the intersection of artificial intelligence and Internet protocols. For more details, see our blog post.
We attended the IETF 121 meeting in Dublin, where there was further discussion of the initial results from the recent AI-CONTROL workshop. Here are some notes from the chairs, Mark Nottingham and Suresh Krishnan.
In October, we had the honor of briefing the White House Office of Science and Technology Policy (OSTP) on the role of the Common Crawl Foundation as critical infrastructure in the artificial intelligence ecosystem and how we can support U.S. federal efforts in advancing responsible AI use and research. For more details about the event and its follow-ups, see our blog post.
In November, we had the opportunity to present at two events, sharing insights into the work of the Common Crawl Foundation and the impact of open web data on research and industry. The first presentation took place at the Turing Institute, as part of the NLP Special Interest Group. The second event was held at University College London, co-hosted with Valyu. For more on these discussions, see our blog post.
Open Job Positions
We now have a Jobs page on our website. Learn about our open roles and how to get in touch with us if you are interested in joining our collaborative team, where your contributions will help shape the future of web data.