AI learns to cheat at Q*bert in a way no human has ever done before

Vaughn Highfield March 2, 2018

An AI has managed to cheat with the best humanity has to offer after discovering an exploit in classic arcade game Q*bert and running with it.

While earlier iterations of the AI played Q*bert properly, at some point while learning how the game works it discovered an exploit that let it rack up an enormous number of points. Naturally, as any score-hunting player would, it repeated the process to boost its score as effectively as possible.

You can see the AI working its way around platforms in the video below. At first, it looks as if it’s aimlessly jumping between platforms. But instead of progressing to the next round, the game becomes stuck in a loop where all the platforms begin to flash – it’s here the AI can go on a score frenzy, racking up huge points.

How the AI won the Q*bert war

Smashing the all-time record for the title, the AI racked up an impossibly high score thanks to its evolution strategy algorithm. Evolution strategies (ES) differ from the more widely used reinforcement learning (RL) approach, and are seen as more scalable because of their generational learning.

Each learning loop is referred to as a generation, and the AI continues its task until a set condition is met (in this case, a high score). With each successive generation, the AI absorbs the knowledge of the previous generation and is therefore better at attaining the same goal and surpassing it. Keep going, and you’ll end up with an AI that’s absolutely unrivalled at its task. That’s exactly what happened here with the Q*bert score.

According to the paper, published last week by researchers at the University of Freiburg, Germany, the bug wasn’t a known quantity. While the researchers aren’t too surprised that a bug was found, it’s interesting to see how the AI then learnt to exploit it every time it played to maximise its scoring potential.

“To find the bug, the agent had to first learn to almost complete the first level – this was not done at once but using many small improvements,” the researchers explained to The Register. “We suspect that at some point in the training one of the offspring solutions encountered the bug and got a much better score compared to its siblings, which in turn increased its contribution to the update – its weight was the highest one in the weighted mean. This slowly moved the solution into the space where more and more offsprings started to encounter the same bug.”

“We do not know the precise conditions under which the bug appears; it is possible that it only appears if the agent follows a pattern that seems suboptimal, [for example when the agent wastes time, or even loses a life]. If that was the case, then it would be extremely hard for standard RL to find the bug: if you use incremental rewards you will learn strategies that quickly yield some reward, rather than learning strategies that don’t yield many rewards for a while and then suddenly win big.”
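The update the researchers describe, in which the best-scoring offspring carry the most weight in a weighted mean, can be roughly sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the researchers’ actual code: the function names (toy_score, rank_weights, evolve), the two-number “solution”, the log-rank weighting and every setting are hypothetical, and a simple made-up function stands in for the Q*bert score.

```python
import math
import random


def toy_score(params):
    """Hypothetical stand-in for a game score: higher is better, peak at (3, -2)."""
    x, y = params
    return -((x - 3.0) ** 2 + (y + 2.0) ** 2)


def rank_weights(num_parents):
    """Assumed log-rank weights: best offspring weighs most; weights sum to 1."""
    raw = [math.log(num_parents + 0.5) - math.log(i + 1) for i in range(num_parents)]
    total = sum(raw)
    return [w / total for w in raw]


def evolve(score_fn, start, generations=200, offspring=20, parents=5,
           sigma=0.3, seed=0):
    """Run a basic evolution strategy and return the final solution."""
    rng = random.Random(seed)
    mean = list(start)
    weights = rank_weights(parents)
    for _ in range(generations):
        # Create offspring by adding Gaussian noise to the current solution.
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(offspring)
        ]
        # Rank offspring by their final score only, best first.
        candidates.sort(key=score_fn, reverse=True)
        # The new solution is the weighted mean of the top-ranked offspring
        # (zip stops after the first `parents` candidates).
        mean = [
            sum(w * cand[d] for w, cand in zip(weights, candidates))
            for d in range(len(mean))
        ]
    return mean


if __name__ == "__main__":
    best = evolve(toy_score, [0.0, 0.0])
    print("final solution:", best, "score:", toy_score(best))
```

Because offspring are ranked only by their final score, a single offspring that stumbles on a big payoff jumps to the top of the ranking and pulls the next generation towards it, which is the effect the researchers suspect led the agent to the bug.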

However, despite the bot’s wonderful results, the researchers aren’t saying this is a case to champion ES learning over RL. In fact, both systems have their own problems and a combination of the two is largely seen as the best option moving forward.

The same ES method didn’t produce anywhere near the same results on other Atari games. RL, on the other hand, is responsible for smashing records left, right and centre, including beating the world’s best Go player. ES still has its place, though, and it’s actually how Nvidia performs a lot of its AI training, as it requires more computational power but achieves better results over a longer period of time.

Regardless of which way will become the future for AI development, at least this bot cheating the system isn’t as bad as this now disgraced video game world champion.