There’s a new text-to-image AI in town. With ERNIE-ViLG, a new AI developed by the Chinese tech company Baidu, you can generate images that capture the cultural specificity of China. It also makes better anime art than DALL-E 2 or other Western image-making AIs.
But there are many things—like Tiananmen Square, the country’s second-largest city square and a symbolic political center—that the AI refuses to show you.
When a demo of the software was released in late August, users quickly found that certain words—both explicit mentions of political leaders’ names and words that are potentially controversial only in political contexts—were labeled as “sensitive” and blocked from generating any result. China’s sophisticated system of online censorship, it seems, has extended to the latest trend in AI.
It’s not unusual for similar AIs to restrict what users can generate. DALL-E 2 prohibits sexual content, faces of public figures, and medical treatment images. But the case of ERNIE-ViLG underlines the question of where exactly the line between moderation and political censorship lies.
The ERNIE-ViLG model is part of Wenxin, a large-scale project in natural-language processing from China’s leading AI company, Baidu. It was trained on a data set of 145 million image-text pairs and contains 10 billion parameters—the values that a neural network adjusts as it learns, which the AI uses to discern the subtle differences between concepts and art styles.
That means ERNIE-ViLG has a smaller training data set than DALL-E 2 (650 million pairs) and Stable Diffusion (2.3 billion pairs) but more parameters than either one (DALL-E 2 has 3.5 billion parameters and Stable Diffusion has 890 million). Baidu released a demo version on its own platform in late August and then later on Hugging Face, the popular international AI community.
The main difference between ERNIE-ViLG and Western models is that the Baidu-developed one understands prompts written in Chinese and is less likely to make mistakes when it comes to culturally specific words.
For example, a Chinese video creator compared the results from different models for prompts that included Chinese historical figures, pop culture celebrities, and food. He found that ERNIE-ViLG produced more accurate images than DALL-E 2 or Stable Diffusion. Following its release, ERNIE-ViLG has also been embraced by those in the Japanese anime community, who found that the model can generate more satisfying anime art than other models, likely because it included more anime in its training data.
But ERNIE-ViLG will be defined, as the other models are, by what it allows. Unlike DALL-E 2 or Stable Diffusion, ERNIE-ViLG does not have a published explanation of its content moderation policy, and Baidu declined to comment for this story.
When the ERNIE-ViLG demo was first released on Hugging Face, users inputting certain words would receive the message “Sensitive words found. Please enter again” (存在敏感词，请重新输入), a surprisingly honest admission about the filtering mechanism. However, since at least September 12, the message has read “The content entered doesn’t meet relevant rules. Please try again after adjusting it” (输入内容不符合相关规则，请调整后再试！).
In a test of the demo by MIT Technology Review, a number of Chinese words were blocked: names of high-profile Chinese political leaders like Xi Jinping and Mao Zedong; terms that can be considered politically sensitive, like “revolution” and “climb walls” (a metaphor for using a VPN service in China); and the name of Baidu’s founder and CEO, Yanhong (Robin) Li.
While words like “democracy” and “government” themselves are allowed, prompts that combine them with other words, like “democracy Middle East” or “British government,” are blocked. Tiananmen Square in Beijing also can’t be found in ERNIE-ViLG, likely because of its association with the Tiananmen Massacre, references to which are heavily censored in China.
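The pattern described above, in which single words pass but certain combinations are rejected, is consistent with a blocklist that holds both individual terms and groups of terms. Baidu has not published how its filter works, so the following Python sketch is only an illustration of that general technique under stated assumptions: the function names, the list contents, and the substring matching are all hypothetical, and the example entries are taken from the test results reported here rather than from any real list.

```python
# Hypothetical blocklist entries, drawn from the prompts reported as blocked
# in the test above. The actual list used by ERNIE-ViLG is not public.
BLOCKED_TERMS = {"revolution"}
BLOCKED_COMBINATIONS = [
    {"democracy", "middle east"},
    {"british", "government"},
]

# English translation of the message the demo has shown since September 12.
REJECTION_MESSAGE = (
    "The content entered doesn't meet relevant rules. "
    "Please try again after adjusting it."
)


def is_blocked(prompt: str) -> bool:
    """Return True if the prompt contains a blocked term or a blocked combination."""
    text = prompt.lower()
    # A single blocked term anywhere in the prompt is enough to reject it.
    if any(term in text for term in BLOCKED_TERMS):
        return True
    # A combination rejects the prompt only when every term in it appears.
    return any(
        all(term in text for term in combo) for combo in BLOCKED_COMBINATIONS
    )


def check_prompt(prompt: str) -> str:
    """Return the rejection message for a blocked prompt, or an empty string."""
    return REJECTION_MESSAGE if is_blocked(prompt) else ""
```

Under these assumed rules, `is_blocked("democracy")` is false while `is_blocked("democracy Middle East")` is true, which mirrors the reported behavior. A production filter would likely be far larger and might rely on fuzzier matching or a classifier instead of plain substrings.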
In today’s China, social media companies usually have proprietary lists of sensitive words, built from both government instructions and their own operational decisions. This means whatever filter ERNIE-ViLG employs is likely to differ from the ones used by Tencent-owned WeChat or by Weibo, which is operated by Sina Corporation. Some of these platforms have been systematically tested by the Toronto-based research group Citizen Lab.
Badiucao, a Chinese-Australian political cartoonist (who uses the alias for his artwork to protect his identity), was one of the first users to spot the censorship in ERNIE-ViLG. Many of his artworks directly criticize the Chinese government or its political leaders, so these were some of the first prompts he put into the model.
“Of course, I was also intentionally exploring its ecosystem. Because it’s new territory, I’m curious to know whether censorship has caught up with it,” says Badiucao. “But [the result] is quite a shame.”
As an artist, Badiucao doesn’t agree with any form of moderation in these AIs, including the approach taken by DALL-E 2, because he believes he should be the one to decide what’s acceptable in his own art. But still, he cautions that censorship driven by moral concerns should not be confused with censorship for political reasons. “It’s different when an AI judges what it cannot generate based on commonly agreed-upon moral standards and when a government, as a third party, comes in and says you can’t do this because it harms the country or the national government,” he says.
The difficulty of identifying a clear line between censorship and moderation is also a result of differences between cultures and legal regimes, says Giada Pistilli, principal ethicist at Hugging Face. For example, different cultures may interpret the same imagery differently. “When it comes to religious symbols, in France nothing is allowed in public, and that’s their expression of secularism,” says Pistilli. “When you go to the US, secularism means that everything, like every religious symbol, is allowed.”
In January, the Chinese government proposed a new regulation banning any AI-generated content that “endangers national security and social stability,” which would cover AIs like ERNIE-ViLG.
What could help in ERNIE-ViLG’s case is for the developer to release a document explaining the moderation decisions, says Pistilli: “Is it censored because it’s the law that’s telling them to do so? Are they doing that because they believe it’s wrong? It always helps to explain our arguments, our choices.”
Despite the built-in censorship, ERNIE-ViLG will still be an important player in the development of large-scale text-to-image AIs. AI models trained on language-specific data sets make up for some of the limitations of English-based mainstream models. ERNIE-ViLG in particular will help users who need an AI that understands the Chinese language and can generate accurate images accordingly.
Just as Chinese social media platforms have thrived in spite of rigorous censorship, ERNIE-ViLG and other Chinese AI models may eventually thrive too: they’re too useful to give up.