Open-source AI is everywhere right now. The problem is, no one agrees on what it actually is. Now we may finally have an answer. The Open Source Initiative (OSI), the self-appointed arbiter of what it means to be open source, has released a new definition, which it hopes will help lawmakers develop regulations to protect consumers from AI risks.
Though OSI has published much about what constitutes open-source technology in other fields, this marks its first attempt to define the term for AI models. It asked a 70-person group of researchers, lawyers, policymakers, and activists, as well as representatives from big tech companies like Meta, Google, and Amazon, to come up with the working definition.
According to the group, an open-source AI system can be used for any purpose without securing permission, and researchers should be able to inspect its components and study how the system works.
It should also be possible to modify the system for any purpose—including to change its output—and to share it with others to use, with or without modifications, for any purpose. In addition, the standard attempts to define a level of transparency for a given model’s training data, source code, and weights.
The previous lack of an open-source standard presented a problem. Although we know that the decisions of OpenAI and Anthropic to keep their models, data sets, and algorithms secret make their AI closed source, some experts argue that Meta's and Google's freely accessible models, which anyone can inspect and adapt, aren't truly open source either, because licenses restrict what users can do with the models and because the training data sets aren't made public. Meta, Google, and OpenAI were contacted for their response to the new definition but did not reply before publication.
“Companies have been known to misuse the term when marketing their models,” says Avijit Ghosh, an applied policy researcher at Hugging Face, a platform for building and sharing AI models. Describing models as open source may cause them to be perceived as more trustworthy, even if researchers aren’t able to independently investigate whether they really are open source.
Ayah Bdeir, a senior advisor to Mozilla and a participant in OSI’s process, says certain parts of the open-source definition were relatively easy to agree upon, including the need to reveal model weights (the parameters that help determine how an AI model generates an output). Other parts of the deliberations were more contentious, particularly the question of how public training data should be.
The lack of transparency about where training data comes from has led to innumerable lawsuits against big AI companies, from makers of large language models like OpenAI to music generators like Suno, which do not disclose much about their training sets beyond saying they contain “publicly accessible information.” In response, some advocates say that open-source models should disclose all their training sets, a standard that Bdeir says would be difficult to enforce because of issues like copyright and data ownership.
Ultimately, the new definition requires that open-source models provide information about the training data to the extent that "a skilled person can recreate a substantially equivalent system using the same or similar data." It's not a blanket requirement to share all training data sets, but it goes further than what many proprietary or even ostensibly open-source models do today. It's a compromise.
“Insisting on an ideologically pristine kind of gold standard that actually will not effectively be met by anybody ends up backfiring,” Bdeir says. She adds that OSI is planning some sort of enforcement mechanism, which will flag models that are described as open source but do not meet its definition. It also plans to release a list of AI models that do meet the new definition. Though none are confirmed, the models Bdeir told MIT Technology Review are expected to make the list are relatively small names, including Pythia by Eleuther, OLMo by Ai2, and models by the open-source collective LLM360.