The End of Online Anonymity

June 10th, 2019

Since 2015, I've been writing about the impact that machine learning will have on our society. One of the most concerning possibilities, in my mind, was always the potential abuse of these technologies by malicious actors to manipulate or scam people, either through subtle means or by impersonating those they trust.

Today, this concern is very much mainstream: "fake news" has become a buzzword and a kind of modern-day boogeyman. Even so, I think most people aren't overly worried. We know that there already are malicious actors creating sketchy content and putting it out there, but most of it seems obviously fake if you examine it closely. We all assume that we will always be smart enough to tell real from fake, and carry on.

Media manipulation is nothing new. Attempts to control public discourse and influence the masses predate the internet, TV, newspapers and the printing press. What's about to change is that now, with machine learning, it will become possible to turn electricity into millions of voices relentlessly spreading your gospel to every corner of the internet. At this point in time, it seems most of the fake content out there is not generated using machine learning; it's created by human beings using puppet accounts. For the most part, someone still has to turn the crank. That limits how much content can be created and how many sources it can come from.

Personally, I'm not just worried about manipulative articles being passed off as news. I'm also worried about the impact that networks of malicious bots will have on online communities. We're still fairly far from the point where we can automatically generate news articles that appear convincing upon close inspection, but what about online comment threads? How difficult is it to build a bot that can write convincing one- or two-sentence comments?

Yesterday, I stumbled upon a link to a subreddit populated by bots based on OpenAI's GPT-2 text generation model. The result is certainly funny, but it also leaves me feeling uncomfortable. Yes, much of the content is obviously fake, but many of the comments are actually believable. If you feel unimpressed, you should keep in mind that this is an individual's side project that repurposed an existing neural network. As it is, the GPT-2 model simply generates text and completes a sentence. It's an impressive and amusing tech demo, but not something you can easily control. In order to weaponize GPT-2, a malicious actor would need to add some kind of guidance system: a way to condition the text output of the model so as to spread a specific message.

The solution to the fake content problem may seem obvious: we can fight fire with fire, and build machine learning tools to detect machine-generated content. Tools like this are already in the works. Grover, for instance, boasts 92% accuracy in detecting fake content. The sad reality, however, is that this is an arms race, and it's not at all clear that we can win it. As technology improves, fake content will become harder and harder to tell apart from real content. Manual content verification won't be able to keep up with the volume, and automated filtering systems will fail.

In my opinion, there is only one effective way to stop fake content, and that is to verify that everyone who posts content is in fact human. You could ask people to upload pictures of themselves, but we're already at the point where we can produce realistic images of imaginary people using GANs. Any counter-measure of this form will inevitably be defeated. Ultimately, an unfortunate possibility is that online platforms will begin requiring a verified government ID in order to register. We could even end up living in a dystopian world where a kind of "e-passport", a cryptographically signed government ID, is attached to your every internet connection and tracked everywhere online. That is very sad to think about.

The rise of bots could render many online communities simply uninhabitable. Large social media websites may have some hope of policing content, but smaller independent players likely won't have the resources. We are moving towards a model where the internet is dominated by a few centralized content providers and their walled gardens, and generated content may unfortunately make it even harder for grassroots online communities to survive and grow.

I don't want to see online anonymity taken away. I hope there is a way to build a new web, a new kind of social media that uses a hash graph to implement a decentralized web of trust, something that can allow content verification without forcing everyone to sacrifice their right to remain anonymous online. I certainly think it's a problem worth thinking about, and I hope to see more research in that direction. Unless we can come up with a technological solution, a regulatory solution may be imposed on us, and it will inevitably favor the big players at the expense of the small.
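
To make the web-of-trust idea slightly more concrete, here is a minimal sketch in Python of just the trust-propagation part. It is an illustration under stated assumptions, not a design: the function name `trusted_identities`, the `max_hops` limit and the sample pseudonyms are all hypothetical. Plain strings stand in for public keys, there are no cryptographic signatures, and no hash graph is involved. An identity counts as trusted if a chain of vouches, at most `max_hops` long, connects it to someone you already trust.

```python
from collections import deque


def trusted_identities(vouches, seeds, max_hops=2):
    """Return the pseudonyms reachable from `seeds` in at most `max_hops` vouches.

    `vouches` maps a pseudonym to the set of pseudonyms it vouches for.
    In a real system each pseudonym would be a public key and each vouch
    a signed statement; both are simplified away here (assumption).
    """
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not extend trust past the hop limit
        for vouched in vouches.get(current, ()):
            if vouched not in distance:
                distance[vouched] = distance[current] + 1
                queue.append(vouched)
    return set(distance)


# Hypothetical example: a chain of humans, and two bots vouching for each other.
vouches = {
    "alice": {"bob"},
    "bob": {"carol"},
    "carol": {"dave"},
    "bot1": {"bot2"},
    "bot2": {"bot1"},
}
trusted = trusted_identities(vouches, seeds=["alice"], max_hops=2)
```

In this toy example, the two bots can vouch for each other as much as they like, but nobody reachable from `alice` vouches for them, so they never enter the trusted set. Nobody had to reveal a real name either: only pseudonyms and vouches are involved. The hard problems, such as stolen or sold accounts and humans who vouch carelessly, are exactly what this sketch leaves out.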

EDIT 2022-02-06: I want to make it crystal clear that I am advocating against centralized online IDs, not in favor of them. This post is meant to read as a warning, not an endorsement.