Web scraping is a tool, not a crime

As a reporter who can code, I can easily collect information from websites and social media accounts to find stories. All I need to do is write a few lines of code that go into the ether, open up websites, and download the data that is already publicly available on them. This process is called scraping.
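To give a sense of how little code that takes, here is a minimal sketch, assuming Python and nothing beyond its standard library. The sample page is invented for illustration, and the names LinkCollector, scrape_links, and fetch are hypothetical, not taken from any real scraper or from my own reporting tools.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the text and target of every link on a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # (text, href) pairs found so far
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def scrape_links(html):
    """Return a list of (text, href) pairs for the links in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download a publicly available page (hypothetical helper, not called here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Invented stand-in for a public page, so the sketch runs without a network.
SAMPLE_PAGE = """
<html><body>
  <a href="/posts/1">First public post</a>
  <a href="/posts/2">Second public post</a>
</body></html>
"""

if __name__ == "__main__":
    for text, href in scrape_links(SAMPLE_PAGE):
        print(text, href)
```

A real scraper would pass the result of fetch() for a public page into scrape_links() instead of the sample string. The parsing step stays the same, and it only ever sees what a browser would already show an ordinary visitor.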

But there’s a calculation I run in my head whenever I begin pursuing a story that requires scraping: “Is this story worth going to prison for?”

I’m not talking about hacking into the walled-off databases of the CIA. I’m talking about using a script to gather information that I can access as an everyday Internet consumer, like public Instagram posts or tweets that use a certain hashtag.

My worry is not unfounded. A vaguely written US law called the Computer Fraud and Abuse Act makes accessing this kind of information in programmatic ways a potential crime. The decades-old law was introduced after lawmakers saw the 1983 movie WarGames and decided the US needed an anti-hacking law that forbids anyone from using a computer “without authorization or exceeding authorized access.”

While the law may have been well-intentioned and has been used to prosecute people who download things from their work systems that they’re not supposed to, it also catches a lot of other people in its widely cast net, including academics, researchers, and journalists.

What does “exceeding authorized access” mean in an age of social media? Does an employee who has access to a database of research journals for work and uses them for private purposes exceed authorized access? Does a reporter like me who gathers information using automated processes and her own Facebook account commit a crime?

Until now, interpretations of the law have ping-ponged from court case to court case, relying on various judges to give us a better definition of what exactly it means to exceed one’s authorized access to information. But soon the US Supreme Court will rule on the law for the first time, in the case Van Buren v. United States. Nathan Van Buren, a police officer, had access to confidential databases for work and sold information he looked up there to a third party. The court heard oral arguments on November 30, 2020, and could announce its decision any day.

From unfair pricing on Amazon to hate speech on Facebook, many corporate misdeeds can be traced through the platforms on which we conduct large parts of our lives. And the vast digital footprint that human beings produce online, much of which is publicly available, can help us patch data holes and investigate areas that might be otherwise hard to understand.

As the artist and technology expert Mimi Onuoha pointed out in her poignant piece The Library of Missing Datasets:

“That which we ignore reveals more than what we give our attention to. It’s in these things that we find cultural and colloquial hints of what is deemed important. Spots that we’ve left blank reveal our hidden social biases and indifferences.”

Data collection is expensive and cumbersome, but it’s also an important tool for discovering and revealing systemic injustices. What data we deem important enough to collect is a matter often left to powerful entities—governments and corporations—that don’t always keep society’s most vulnerable people in mind.

If Chinese government officials won’t publish information on the camps where Muslim minorities are being detained, then perhaps researchers can use information from Google Maps to approximate the scope of this issue. If perpetrators won’t admit to war crimes but post about them on social media, prosecutors and human rights researchers can still build cases against them.

Should companies like Facebook have the legal recourse to shut down academic research? Should there be an exemption when web scraping is the only way to gather data that helps researchers, academics, and journalists diagnose the ills of our society?

Twitter may have modeled a way forward. Reckoning with its role in the spread of misinformation around the 2016 US election, the company decided to create special access to data specifically for academics and researchers. While the company still frowns upon scraping, this step signals that it recognizes how important the data is.

Perhaps lawmakers can, too.

Lam Thuy Vo is a senior reporter at BuzzFeed News, where she has reported stories on misinformation, hatred online, and platform-related accountability. Her book Mining Social Media was published by No Starch Press in late 2019.