Several years ago, as the U.S. Census Bureau began to prepare for its 2020 count, it was confronted with an existential problem.
A growing body of academic research was providing evidence that machine learning systems, combined with the availability of large commercial datasets about Americans, were making it possible to personally identify people from information in confidential datasets—like the Census.
The bureau, which relies on Americans willingly sharing their private information under the assurance they won’t be personally identifiable, decided to conduct its own test. In 2016, it found that by combining a relatively small fraction of the statistics it published after the 2010 Census with commercial datasets available at the time, anyone could undermine the Census’s current privacy system and reconstruct the name, location, and key demographic characteristics of about 52 million people.
“If they’d used more statistics, it could have been worse. If they’d used more rich commercial datasets, it could have been worse,” said Cynthia Dwork, a computer science professor at Harvard University.
That discovery kicked off one of the most consequential changes in the Census Bureau’s history….