Mozilla, EleutherAI launch toolkits to help AI builders create open datasets

Easy-to-follow guides on how to transcribe audio files into text using privacy-friendly tools and how to convert different documents into a single format.
The majority of popular AI models rely on data crawled from the web, frequently without the explicit permission of copyright holders. This lack of clarity has led to lawsuits and a trend toward secrecy in dataset practices, stifling transparency and accountability and limiting innovation to those who can afford it.
In response, a growing community of developers is working to show that building better alternatives is possible. As part of a year-long partnership around open and openly licensed datasets, Mozilla and EleutherAI are launching two new toolkits to help developers build ethically sourced datasets, a step towards a more open and ethical AI ecosystem.
These toolkits help developers get started with creating open datasets. The code and demos will be available on the Mozilla.ai Blueprints hub, a platform that helps developers prototype with open-source AI using out-of-the-box workflows.
Toolkit 1: Transcribing Audio Files with Open-Source Whisper Models
This Blueprint guides developers through transcribing audio using open-source Whisper models via Speaches, a self-hosted server similar to the OpenAI Whisper API. Designed for local use, this privacy-focused setup offers a secure alternative to commercial APIs, making it ideal for handling sensitive or private audio data. Inspired by real-world use cases, the toolkit features an easy-to-follow setup using either Docker or the CLI.
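As a rough illustration of what talking to such a self-hosted server might look like, here is a minimal Python sketch that uploads one audio file using only the standard library. It is not taken from the Blueprint: the server address, the OpenAI-style endpoint path, the model identifier, and the shape of the JSON response are all assumptions, so check the Speaches documentation for the values that apply to your setup.

```python
import json
import urllib.request
import uuid
from pathlib import Path

# Assumptions, not from the announcement: server address, endpoint path, model id.
BASE_URL = "http://localhost:8000"
ENDPOINT = "/v1/audio/transcriptions"
MODEL = "Systran/faster-whisper-small"  # hypothetical model identifier


def build_multipart(fields, file_name, file_bytes, boundary):
    """Encode form fields plus one file as multipart/form-data bytes."""
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts)


def transcribe(audio_path, base_url=BASE_URL, model=MODEL):
    """Send a local audio file to the server and return the transcribed text."""
    path = Path(audio_path)
    boundary = uuid.uuid4().hex
    body = build_multipart({"model": model}, path.name, path.read_bytes(), boundary)
    request = urllib.request.Request(
        base_url + ENDPOINT,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumes an OpenAI-style JSON reply of the form {"text": "..."}.
        return json.loads(response.read())["text"]
```

With a server running locally, calling transcribe("interview.wav") (a hypothetical file name) would return the transcript as a string, and the audio never leaves the machine.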
Toolkit 2: Converting Unstructured Documents into Markdown Format
This toolkit helps developers convert diverse document formats (PDFs, DOCX, HTML, etc.) into Markdown using Docling, a command-line tool with powerful Optical Character Recognition and image-handling capabilities. Ideal for building open-text datasets for use in downstream applications, this toolkit emphasizes accessibility and versatility, including batch-processing capabilities.
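To make the batch-processing idea concrete, here is a small Python sketch that walks a folder and writes one Markdown file per document. It is an illustration rather than the toolkit's own code: it uses Docling's Python API instead of the command line, and the helper names, the list of file suffixes, and the folder layout are assumptions made for the example.

```python
from pathlib import Path

# File types mentioned in the announcement; extend as needed (assumption).
SUFFIXES = {".pdf", ".docx", ".html"}


def docling_to_markdown(path):
    """Convert one document to Markdown with Docling (needs `pip install docling`)."""
    # Imported lazily so the rest of the module works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(str(path))
    return result.document.export_to_markdown()


def convert_folder(input_dir, output_dir, to_markdown=docling_to_markdown):
    """Convert every supported file in input_dir and write <name>.md to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(input_dir).iterdir()):
        if source.suffix.lower() not in SUFFIXES:
            continue  # skip file types this sketch does not handle
        target = out / (source.stem + ".md")
        target.write_text(to_markdown(source), encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter in as a parameter keeps the folder-walking logic testable without Docling. For larger batches it would likely be better to create one DocumentConverter and reuse it across files.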
Mozilla and EleutherAI's partnership included an AI dataset convening, which brought together 30 leading scholars and practitioners from prominent open-source AI startups, nonprofit AI labs, and civil society organizations to discuss emerging practices for a new focus within the open LLM community. The convening culminated in the publication of the research paper "Towards Best Practices for Open Datasets for LLM Training." The new toolkits are a final milestone in this partnership and a resource to help builders put the previously shared best practices into action.
"As AI development continues to move at warp speed, we must ask ourselves: how can we responsibly curate and govern data so that the AI ecosystem becomes more equitable and transparent?" says Ayah Bdeir, Mozilla Foundation Senior Advisor, AI Strategy. "Open source AI success depends on the community sharing its expertise, and our partnership with EleutherAI is part of our commitment to support incredible builders who are iterating and experimenting on the front lines of open source AI."
Currently, the threat of litigation is often cited as a reason for minimizing dataset transparency, which in turn hinders innovation. Open-access data is the antidote. Building a future of responsibly curated, openly licensed datasets requires collaboration across legal, technical, and policy fields, along with investments in standards and digitization. In short, open-access data can address many AI challenges, but creating it is difficult. The toolkits from EleutherAI and Mozilla are a crucial step in making this process easier.
"Openness and transparency is the future of AI. By putting practical tools into the hands of developers, we're helping build high-quality, openly licensed datasets that form the foundation for more trustworthy, transparent, and interpretable AI systems," says Stella Biderman, Executive Director, EleutherAI.
The post Mozilla, EleutherAI launch toolkits to help AI builders create open datasets appeared first on The Mozilla Blog.