In 2016, hoping to spur advancements in facial recognition, Microsoft released the largest face database in the world. Called MS-Celeb-1M, it contained 10 million images of 100,000 celebrities’ faces. “Celebrity” was loosely defined, though.
Three years later, researchers Adam Harvey and Jules LaPlace scoured the data set and found many ordinary individuals, like journalists, artists, activists, and academics, who maintain an online presence for their professional lives. None had given consent to be included, and yet their faces had found their way into the database and beyond; research using the collection of faces was conducted by companies including Facebook, IBM, Baidu, and SenseTime, one of China's largest facial recognition companies, which sells its technology to the Chinese police.
Shortly after Harvey and LaPlace’s investigation, and after receiving criticism from journalists, Microsoft removed the data set, stating simply: “The research challenge is over.” But the privacy concerns it created linger in an internet forever-land. And this case is hardly the only one.
Scraping the web for images and text was once considered an inventive strategy for collecting real-world data. Now laws like GDPR (Europe’s data protection regulation) and rising public concern about data privacy and surveillance have made the practice legally risky and unseemly. As a result, AI researchers have increasingly retracted the data sets they created this way.
But a new study shows that this has done little to keep the problematic data from proliferating and being used. The authors selected three of the most commonly cited data sets containing faces or people, two of which had been retracted; they traced the ways each had been copied, used, and repurposed in close to 1,000 papers.
In the case of MS-Celeb-1M, copies still exist on third-party sites and in derivative data sets built atop the original. Open-source models pre-trained on the data remain readily available as well. The data set and its derivatives were also cited in hundreds of papers published between six and 18 months after retraction.
DukeMTMC, a data set containing images of people walking on Duke University’s campus and retracted in the same month as MS-Celeb-1M, similarly persists in derivative data sets and hundreds of paper citations.
The list of places where the data lingers is “more expansive than we would’ve initially thought,” says Kenny Peng, a sophomore at Princeton and a coauthor of the study. And even that, he says, is probably an underestimate, because citations in research papers don’t always account for the ways the data might be used commercially.
Gone wild
Part of the problem, according to the Princeton paper, is that those who put together data sets quickly lose control of their creations.
Data sets released for one purpose can quickly be co-opted for others that were never intended or imagined by the original creators. MS-Celeb-1M, for example, was meant to improve facial recognition of celebrities but has since been used for more general facial recognition and facial feature analysis, the authors found. It has also been relabeled or reprocessed in derivative data sets like Racial Faces in the Wild, which groups its images by race, opening the door to controversial applications.
The researchers’ analysis also suggests that Labeled Faces in the Wild (LFW), a data set introduced in 2007 and the first to use face images scraped from the internet, has morphed multiple times through nearly 15 years of use. Whereas it began as a resource for evaluating research-only facial recognition models, it’s now used almost exclusively to evaluate systems meant for use in the real world. This is despite a warning label on the data set’s website that cautions against such use.
More recently, the data set was repurposed in a derivative called SMFRD, which added face masks to each of the images to advance facial recognition during the pandemic. The authors note that this could raise new ethical challenges. Privacy advocates have criticized such applications for fueling surveillance, for example—and especially for enabling government identification of masked protesters.
“This is a really important paper, because people’s eyes have not generally been open to the complexities, and potential harms and risks, of data sets,” says Margaret Mitchell, an AI ethics researcher and a leader in responsible data practices, who was not involved in the study.
For a long time, the culture within the AI community has been to assume that data exists to be used, she adds. This paper shows how that can lead to problems down the line. “It’s really important to think through the various values that a data set encodes, as well as the values that having a data set available encodes,” she says.
A fix
The study authors provide several recommendations for the AI community moving forward. First, creators should communicate more clearly about the intended use of their data sets, both through licenses and through detailed documentation. They should also place harder limits on access to their data, perhaps by requiring researchers to sign terms of agreement or asking them to fill out an application, especially if they intend to construct a derivative data set.
Second, research conferences should establish norms about how data should be collected, labeled, and used, and they should create incentives for responsible data set creation. NeurIPS, the largest AI research conference, already includes a checklist of best practices and ethical guidelines.
Mitchell suggests taking it even further. As part of the BigScience project, a collaboration among AI researchers to develop an AI model that can parse and generate natural language under a rigorous standard of ethics, she’s been experimenting with the idea of creating data set stewardship organizations—teams of people that not only handle the curation, maintenance, and use of the data but also work with lawyers, activists, and the general public to make sure it complies with legal standards, is collected only with consent, and can be removed if someone chooses to withdraw personal information. Such stewardship organizations wouldn’t be necessary for all data sets—but certainly for scraped data that could contain biometric or personally identifiable information or intellectual property.
“Data set collection and monitoring isn’t a one-off task for one or two people,” she says. “If you’re doing this responsibly, it breaks down into a ton of different tasks that require deep thinking, deep expertise, and a variety of different people.”
In recent years, the field has increasingly moved toward the belief that more carefully curated data sets will be key to overcoming many of the industry’s technical and ethical challenges. It’s now clear that constructing more responsible data sets isn’t nearly enough. Those working in AI must also make a long-term commitment to maintaining them and using them ethically.