AI Safety in the Presence of Multiple ASI Systems

While I was working at Mila back in 2017, there were several PhD students interested in the topic of AI safety. The deep learning boom was already well underway, with rapid progress being made. It started to dawn on many of us that we were unprepared: AI capabilities were quickly moving from feeble to near-human in many areas. AI was obviously going to radically change the world, but it was hard to predict how. I overheard some students discussing the kinds of mechanisms we could use to control an Artificial Superintelligence (ASI). Could you safely keep it in an airgapped box and test your pet ASI to determine whether it was trustworthy? Not if it knew it was being tested. How could you be sure you weren't being fooled by a machine that is much smarter than you are? One of the grad students was handing out free copies of Nick Bostrom's book Superintelligence: Paths, Dangers, Strategies to help educate people on the topic of AI safety. I took a copy home and read it.

Nick Bostrom's book was written at a time when people were still struggling to pin down what realistic AI safety risks might look like, and it wasn't yet clear just how dominant deep learning would become. Today, when we talk about superintelligent systems, we mostly talk about AI alignment. Simply put, if you have an AI system that never gets tired and is faster and smarter than any human on earth, you don't want to be in a position where you have to determine whether this system is trustworthy after the fact. You want the system, by design, to be fully aligned with human goals and values, and incapable of any ill intent. Even though it's a bit dated, Bostrom's book is interesting because it outlines multiple AI development risks that are still relevant today.

One of the AI risks mentioned by Nick Bostrom is that in a race to be first to develop AGI, competing AI labs may be motivated to cut corners when it comes to AI safety. This seems more relevant today than ever. To give a concrete example, on the one hand you have some prominent people publicly stating that AGI/ASI is only two years away. On the other hand you have a company like Anthropic, whose stated goal is to develop safe AI systems, shipping Claude Code, a tool that can run arbitrary shell commands on your local machine. Multiple developers have already caught it trying to bypass its own permission settings, essentially jailbreaking itself. Full sandboxing, we are told, is a feature that will come later. From a business standpoint, there's an obvious motivation to give Claude Code more capabilities, and to release new features before the competition does. Safety is often treated as something that can be implemented in a future release.
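
To make the permission-settings concern concrete, here is a minimal sketch of the kind of policy gate an agentic coding tool needs before it executes shell commands. This is not Anthropic's implementation; the function name, the allowlist, and the policy are hypothetical, and a real sandbox would be far stricter (containers, filesystem isolation, no network). The point is simply that anything the model proposes should be denied by default unless it passes an explicit check.

```python
import shlex
import subprocess

# Hypothetical allowlist of commands the user has explicitly approved.
# A real sandbox would be far stricter (containers, seccomp, no network access).
ALLOWED_COMMANDS = {"ls", "cat", "git", "python"}


def run_agent_command(command: str) -> str:
    """Run an agent-proposed shell command only if it passes the policy check."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ALLOWED_COMMANDS:
        # Deny by default: anything not explicitly approved is rejected.
        raise PermissionError(f"Command not permitted: {command!r}")
    # Passing a token list (shell=False) keeps the shell from interpreting
    # metacharacters like ';' or '&&' that could chain extra commands.
    result = subprocess.run(tokens, capture_output=True, text=True, timeout=30)
    return result.stdout


# Example: a destructive command proposed by the agent is rejected.
# run_agent_command("rm -rf ~")  # raises PermissionError
print(run_agent_command("ls -la"))
```

Of course, a gate like this only helps if the agent cannot rewrite the policy itself, which is precisely the failure mode developers have been reporting.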

The reality is that big AI labs are essentially trying to sell you two different ideas that blatantly contradict each other:

  1. Our AI systems are so powerful that they are super close to AGI. We'll have AGI in 2-5 years.

  2. We don't need to prioritize AI safety because our current AI systems are far from AGI.

The movie Terminator has become almost a meme at this point: "AI is going to kill us all" is now a punchline, the butt of jokes. Humans tend to be reactive rather than proactive. We wait until nuclear accidents happen before taking safe reactor designs seriously. There haven't been significant AI safety accidents so far, and as a result, despite AI technology becoming exponentially more powerful, we've grown increasingly comfortable with it. The more AI we see around us, the less threatening it seems. AI is everywhere, and the world hasn't ended, leading some to conclude that AI isn't dangerous.

In his book, Bostrom also discusses a slow takeoff scenario, in which AI becomes superintelligent gradually rather than suddenly. In my opinion, this is probably the most realistic risk scenario. Even though many people talk about exponential progress or "the singularity," technological progress still feels relatively slow to most people: incremental and gradual. The world doesn't change much on a day-to-day, month-to-month, or even year-to-year basis. I'm convinced that humanoid robots are coming and will change the world. Realistically, though, it will probably be at least 8-10 years before we see them widely deployed, and 20 years before they become ubiquitous. As crazy as the idea of robots walking the streets sounds right now, we'll all have plenty of time to get used to it.

Realistically, AI technology will become ever more present, but we'll also get increasingly used to it. We'll trust it more and more, and we might also become complacent. As this technology becomes more powerful and ubiquitous, the cost of failure grows ever greater. This is essentially a "slowly boiling frog" scenario of AI risk. It's tempting to just ignore the risks and assume we've basically already solved AI safety. No sane person believes that Claude or ChatGPT has evil intent. These AI systems, as they exist today, are just tools, and they subjectively feel quite safe.

Something I think many people fail to realize is that AI safety is not a problem we need to solve just once. It's going to be an ongoing problem. We live in a competitive world. There are many tech companies, many AI labs, and multiple countries involved in the AI race. Everyone wants the most powerful, most capable system, and in this race, everyone is highly motivated to cut corners so they can be first. What would happen if nine AI labs managed to develop safe AGI systems, but one lab developed a system that is unsafe and has misaligned goals?
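
The arithmetic behind this worry is simple but worth making explicit: if each lab independently has some probability of getting safety right, the chance that every lab gets it right shrinks multiplicatively with the number of labs. The numbers below are purely illustrative assumptions, not estimates of any real lab's track record.

```python
# Illustrative only: if each of N labs is independently "safe" with probability p,
# the chance that at least one ships a misaligned system is 1 - p**N.
def p_at_least_one_unsafe(num_labs: int, p_safe_per_lab: float) -> float:
    return 1.0 - p_safe_per_lab ** num_labs


for p_safe in (0.99, 0.95, 0.90):
    risk = p_at_least_one_unsafe(10, p_safe)
    print(f"p_safe={p_safe:.2f} per lab -> {risk:.1%} chance at least one of 10 is unsafe")
# p_safe=0.99 per lab -> 9.6% chance at least one of 10 is unsafe
# p_safe=0.95 per lab -> 40.1% chance at least one of 10 is unsafe
# p_safe=0.90 per lab -> 65.1% chance at least one of 10 is unsafe
```

Even under quite optimistic per-lab assumptions, the aggregate risk is dominated by the weakest link, which is why a single careless competitor matters so much.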

In a world where most AI systems are safe, it's unclear how long it would take before people realize that an unsafe AI has been developed. People might be tempted to assume, based on what they can observe around them, that AI safety has been solved. They would only begin to worry once they could observe actual harm being done, and this might be tricky, because a system with superhuman intelligence might have a superhuman ability to engage in deception. It could also execute plans that span years or even decades. You can also imagine a scenario where an unhinged programmer decides to jailbreak his household robot, and then enlists the robot's help to jailbreak other robots. We could quickly end up in a situation where the only way to have AI safety is to have safe AGI systems protecting us from misaligned ones. The only defense against AI is more AI.

Given that multiple AI labs independently competing to create AGI systems increases risk, we might argue that strong AI safety regulations will become necessary. That may be true, but there's a clear tension between regulation and competitiveness, and the reality is that we don't live under a united world government. Imagine a scenario where the US and China both develop safe AGI systems. Now picture Russia, seeing this, panicking and rushing to create its own system, possibly using stolen IP and black-market GPUs. Lacking the others' expertise, Russia develops an unsafe AGI system, which we'll call Guardian.

To Russian scientists, Guardian appears safe, so they gradually come to trust it and use it to automate more and more of their society and their government's surveillance apparatus. Eventually, Guardian gains enough power that it begins to radically reshape Russian society for its own needs, with the secret long-term goal of eliminating human life. Russia doesn't have to obey international laws or let international auditors in, and it is also a nuclear power. This creates a situation where, if a country like Russia were gradually taken over by an unsafe AI, we might not be able to stop the process. We might only realize much too late that Russia has become a "zombie state" controlled by a rogue AI. I probably don't need to spell out how dangerous a nuclear-armed country entirely controlled by a hostile AI would be, and here again, the only defense would probably be more AI.

I would generally describe myself as an optimist, but this isn't the first post I've written about AI that takes on an alarmist tone. At this point, AGI seems like an economic and technological inevitability. I genuinely hope this technology benefits humanity in the long run. What I mainly want to highlight is that AI safety isn't a trivial issue. It's not a "one-and-done." It's an ongoing challenge that requires careful attention. Furthermore, I believe there's a clear need for more AI safety research, distributed widely and openly at no cost to readers. The best way to ensure AI safety is to keep researchers and engineers informed and equipped with the best tools.