The AI Safety Paradox

July 3rd, 2026

It's an exciting time in the AI world. We're seeing more and more powerful local and frontier models come out every month. In my opinion, claims about AI replacing all programming jobs within a short time horizon are overblown, but we're at a point where it's possible for LLM-based AI agents to work up to several hours without supervision in some cases, and one thing is for sure: this technology is only going to get more capable over time.

The general feeling when coding with larger models is that they're better and better able to understand nuances in the intentions of programmers. You can prompt them with just one or two sentences, and they can quickly understand subtle details of your codebase, making far fewer mistakes than older models would have. In that way, newer models feel both more capable and safer, because they seem better aligned with human intentions. The thing is, there's a bit of a paradox when it comes to the development of agents and AI safety.

The longer we work with AI agents, and the more we see them grow in capability, the more we tend to trust them. It's human nature: the longer you live with something, the less you're afraid of it. It also feels a bit like watching a child grow. When a child is little, you want to watch them all the time, but by the time they're a teenager, you've hopefully stopped worrying about them sticking forks into sockets. There's a human bias where we tend to be implicitly more trusting of people who appear more confident and competent, and I believe that we transfer this bias to AI agents as well.

However, while our trust in AI models grows with time, the inherent AI safety risks haven't disappeared. If anything, the more capable AI models are, the more dangerous they are as well. When the first coding agents became available, they struggled to write more than a few functions at a time without hitting a compilation error. Now, you have models that can chain together complex sequences of shell commands and search for security vulnerabilities autonomously. It would be against the terms of service to use them that way, but there's a good chance that many frontier models could hack an unpatched web server if you let them keep working at it overnight.

Are we foolish to trust these models? The reality is of course that companies like Anthropic do some amount of testing internally to try to make sure their models are safe. I think that in all likelihood, newer models are in fact safer than their predecessors. However, while the probability of an individual agent going off the rails may decrease over time, the amount of damage it could cause if it does go off the rails will increase dramatically, both because the agents are more capable, and because we connect them to more and more tools and services as our level of trust grows.

There have already been many AI safety incidents. Just this month, the BBC reported that people had been able to gain access to Instagram accounts of other users via the Meta support chatbot. A Meta developer somewhere likely did not put the necessary access checks in place. Maybe they simply vibe-coded the integration without looking at the code.
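I have no idea what the integration actually looked like, so take the following as a minimal sketch of the general idea rather than a description of any real system. The point is that the access check lives in ordinary code, outside the model, and relies only on the identity established at login. Every name here (`Session`, `change_recovery_email`, the `accounts` dictionary) is hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Identity established by the normal login flow, not by the chat transcript."""
    authenticated_user_id: str


class AuthorizationError(Exception):
    """Raised when a tool call targets an account the session does not own."""


def change_recovery_email(session: Session, target_user_id: str,
                          new_email: str, accounts: dict) -> None:
    """Hypothetical tool that a support chatbot is allowed to call."""
    # The model picks target_user_id, so a persuasive prompt can steer it.
    # The check below depends only on the session, which the model cannot alter.
    if target_user_id != session.authenticated_user_id:
        raise AuthorizationError(
            f"session user may not modify account {target_user_id!r}"
        )
    accounts[target_user_id]["recovery_email"] = new_email
```

With a check like this in place, it no longer matters how cleverly someone talks to the chatbot: a call that targets another user's account fails before anything is changed.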

One issue here is that programmers are too lazy to put the necessary safety checks in place. Case in point: Claude Code has a /sandbox command, but for some reason, the sandbox isn't enabled by default. There's a bigger and more fundamental problem, however, which is that neural networks are a black box, and there's simply no way to ever fully guarantee they're completely safe. Some researchers frame jailbreaking as a fundamental limitation inherent to how LLMs work rather than a fixable implementation flaw. You can and should test AI agents, but testing can only show that the cases you tried behaved well. It doesn't guarantee correctness. To make matters more complicated, LLMs are often able to tell when they're being evaluated.

There's another issue, which is that even if you had a neural network that you could actually prove was perfectly safe, the model might not behave exactly the same way after you prune it, quantize it and deploy it on different hardware. All of the optimizations that we do to accelerate inference can subtly change the way a model behaves. At the end of the day, it might be incredibly difficult to build an LLM that is jailbreak-proof, and that we can be really confident will never go off the rails.
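To see how an optimization can change behavior, here's a toy sketch in plain Python. It assumes a made-up two-class linear "model" and a simple symmetric 4-bit quantization scheme. The weights and input are chosen by hand so that the effect shows up, and real inference stacks use more sophisticated schemes, so this only illustrates the mechanism.

```python
def quantize(weights, bits=4):
    """Symmetric per-tensor quantization: snap each weight to a signed
    integer grid, then map it back to a float."""
    levels = 2 ** (bits - 1) - 1  # 7 representable magnitudes for 4 bits
    scale = max(abs(w) for row in weights for w in row) / levels
    return [[round(w / scale) * scale for w in row] for row in weights]


def predict(weights, x):
    """Return the index of the class with the largest logit."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(logits)), key=logits.__getitem__)


# Hand-picked toy values (assumptions for illustration only).
WEIGHTS = [
    [0.70, 0.24],  # class 0: logit 0.94 for X below
    [0.66, 0.26],  # class 1: logit 0.92 for X below
]
X = [1.0, 1.0]

original = predict(WEIGHTS, X)             # class 0 wins, narrowly
quantized = predict(quantize(WEIGHTS), X)  # rounding pushes class 1 ahead
```

In full precision, class 0 wins by a small margin. After quantization, the weights of class 1 round up while those of class 0 round down, and the prediction flips to class 1. A large model has vastly more weights and usually wider margins, but the same kind of rounding is why a property verified on one build of a model doesn't automatically carry over to another.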

Recently, the GLM 5.2 model was released as open weight, and by all accounts, its performance is pretty great. It stacks up favorably against Opus 4.8. I like the idea of open weight models, but I'm also a little worried about the implications of running a model created by a foreign state. How can I be certain that it won't try to insert backdoors into my code? Corporations, in their attempts to optimize cost, might switch from Opus to GLM without thinking through the implications, and without even notifying end users that it's now a different model on the backend.

As AI models become increasingly capable, I think it's not a question of if, or even when. AI safety incidents are already happening and they'll continue to happen. It's a good idea to put your software engineering hat on and think through the implications, what could go wrong, and how you can protect against it. Don't blindly merge AI-generated code. Ultimately, I think we'll need to learn how to properly sandbox software to protect ourselves against supply chain attacks, and eventually, in the cybersecurity arms race, we'll probably end up needing AI agents on watch to protect us from other AI agents.
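As one small example of thinking through what could go wrong, here's a sketch of a gate that sits between an agent and the shell. To be clear, this is not a sandbox and it's much weaker than one: it's just an allowlist, and the policy below (`ALLOWED_COMMANDS`, `ALLOWED_GIT_SUBCOMMANDS`) is a hypothetical one I made up for illustration.

```python
import shlex

# Hypothetical policy: a handful of read-only commands.
ALLOWED_COMMANDS = {"ls", "cat", "grep", "git"}
ALLOWED_GIT_SUBCOMMANDS = {"status", "diff", "log"}

# Characters that let one command line chain, redirect or substitute others.
SHELL_METACHARACTERS = set(";|&$`<>\n")


def is_allowed(command_line: str) -> bool:
    """Return True only if the command line matches the allowlist policy."""
    if any(ch in SHELL_METACHARACTERS for ch in command_line):
        return False
    try:
        argv = shlex.split(command_line)
    except ValueError:  # raised for unbalanced quotes, for example
        return False
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return False
    if argv[0] == "git":
        return len(argv) > 1 and argv[1] in ALLOWED_GIT_SUBCOMMANDS
    return True
```

Under this policy, `is_allowed("git status")` passes, while `is_allowed("git push origin main")` and `is_allowed("ls; rm -rf /")` are rejected. An approved command should then be run from the parsed argument list rather than through a shell. Note that this gate still lets the agent `cat` any file the process can read, which is exactly why it complements a real sandbox instead of replacing one.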
