Hackers Used to Be Humans. Soon, AIs Will Hack Humanity

If you don’t have enough to worry about already, consider a world where AIs are hackers.

Hacking is as old as humanity. We are creative problem solvers. We exploit loopholes, manipulate systems, and strive for more influence, power, and wealth. To date, hacking has exclusively been a human activity. Not for long.

As I lay out in a report I just published, artificial intelligence will eventually find vulnerabilities in all sorts of social, economic, and political systems, and then exploit them at unprecedented speed, scale, and scope. After hacking humanity, AI systems will then hack other AI systems, and humans will be little more than collateral damage.

Okay, maybe this is a bit of hyperbole, but it requires no far-future science fiction technology. I’m not postulating an AI “singularity,” where the AI-learning feedback loop becomes so fast that it outstrips human understanding. I’m not assuming intelligent androids. I’m not assuming evil intent. Most of these hacks don’t even require major research breakthroughs in AI. They’re already happening. As AI gets more sophisticated, though, we often won’t even know it’s happening.

AIs don’t solve problems like humans do. They look at more types of solutions than us. They’ll go down complex paths that we haven’t considered. This can be an issue because of something called the explainability problem. Modern AI systems are essentially black boxes. Data goes in one end, and an answer comes out the other. It can be impossible to understand how the system reached its conclusion, even if you’re a programmer looking at the code.

In 2015, a research group fed an AI system called Deep Patient health and medical data from some 700,000 people, and tested whether it could predict diseases. It could, but Deep Patient provides no explanation for the basis of a diagnosis, and the researchers have no idea how it comes to its conclusions. A doctor either can either trust or ignore the computer, but that trust will remain blind.

While researchers are working on AI that can explain itself, there seems to be a trade-off between capability and explainability. Explanations are a cognitive shorthand used by humans, suited for the way humans make decisions. Forcing an AI to produce explanations might be an additional constraint that could affect the quality of its decisions. For now, AI is becoming more and more opaque and less explainable.

Separately, AIs can engage in something called reward hacking. Because AIs don’t solve problems in the same way people do, they will invariably stumble on solutions we humans might never have anticipated—and some will subvert the intent of the system. That’s because AIs don’t think in terms of the implications, context, norms, and values we humans share and take for granted. This reward hacking involves achieving a goal but in a way the AI’s designers neither wanted nor intended.

Take a soccer simulation where an AI figured out that if it kicked the ball out of bounds, the goalie would have to throw the ball in and leave the goal undefended. Or another simulation, where an AI figured out that instead of running, it could make itself tall enough to cross a distant finish line by falling over it. Or the robot vacuum cleaner that instead of learning to not bump into things, it learned to drive backwards, where there were no sensors telling it it was bumping into things. If there are problems, inconsistencies, or loopholes in the rules, and if those properties lead to an acceptable solution as defined by the rules, then AIs will find these hacks.

We learned about this hacking problem as children with the story of King Midas. When the god Dionysus grants him a wish, Midas asks that everything he touches turns to gold. He ends up starving and miserable when his food, drink, and daughter all turn to gold. It’s a specification problem: Midas programmed the wrong goal into the system.

Genies are very precise about the wording of wishes, and can be maliciously pedantic. We know this, but there’s still no way to outsmart the genie. Whatever you wish for, he will always be able to grant it in a way you wish he hadn’t. He will hack your wish. Goals and desires are always underspecified in human language and thought. We never describe all the options, or include all the applicable caveats, exceptions, and provisos. Any goal we specify will necessarily be incomplete.