Instead of worrying about AI bringing about Skynet and the end of humanity, Google wants to find ways to stop artificial intelligence from hacking its reward system.
That’s just one of “five practical research problems” proposed by scientists at Google, OpenAI, Stanford and Berkeley in a paper called “Concrete Problems in AI Safety” (pdf). Others included “safe exploration” issues, or how to stop a curious cleaning robot from sticking a wet mop in an electrical socket, and “avoiding negative side effects” such as a robot breaking granny’s vase when cleaning in a rush.
The problems may seem a bit silly, when compared to an AI-induced doomsday, but Google researcher Chris Olah wrote, “These are all forward thinking, long-term research questions – minor issues today, but important to address for future systems.”
A particularly interesting portion of the paper was devoted to avoiding reward hacking, or how to stop AI from gaming its reward function. “Imagine that an agent discovers a buffer overflow in its reward function: it may then use this to get extremely high reward in an unintended way.” Examples included a cleaning robot clamping its eyes shut to avoid seeing messes that need cleaned or creating messes intentionally so it can earn more rewards. Thankfully there was no mention of robots killing off humans to stop messes to begin with and gaining additional reward for keeping a place spotless.
There were numerous ways an AI agent could try to “game” the reward system. For example, modern reinforcement agents “already do discover and exploit bugs in their environments, such as glitches that allow them to win video games.”
The researchers added:
Once an agent begins hacking its reward function and finds an easy way to get high reward, it won’t be inclined to stop, which could lead to additional challenges in agents that operate on a long timescale.
While describing the pursuit of reward hacks that can “lead to coherent but unanticipated behavior” which “has the potential for harmful impacts in real-world systems,” the researchers gave six broad examples of how the problem could occur. They added, “The proliferation of reward hacking instances across so many different domains suggests that reward hacking may be a deep and general problem, and one that we believe is likely to become more common as agents and environments increase in complexity.”
Today those problems can be corrected, but it might become more difficult as AI agents get more complicated reward functions and work for longer periods of time. The paper suggests that one solution to AI trying to hack its reward function might involve “trip wires;” if triggered, a human would be alerted and could stop the AI. Then again, the AI might “see through” the trip wire and “intentionally avoid it while taking less obvious harmful actions.”
Big red button method
Since AI agents are “unlikely to behave optimally all the time,” Google DeepMind and University of Oxford researchers previously proposed (pdf) a “big red button” method; if a human is supervising an AI agent and catches it continuing “a harmful sequence of actions,” then the human hits the whammy button to stop the harmful action. The AI might attempt to disable the red button so it is not interrupted and still receives its reward; the research paper looks at ways to stop AI from learning how to stop a human from interrupting its actions.
Housecleaning robot is an OpenAI technical goal
Earlier this week, the Elon Musk-backed OpenAI announced that building a reliable housecleaning robot is one of its technical goals. OpenAI doesn’t intend to build actual cleaning robots, but to develop general learning algorithms that will help it build better agents that are more capable according to OpenAI’s metric.
Concrete Problems in AI Safety
Besides avoiding negative side effects – the broken vase scenario – avoiding reward hacking, and ensuring safe exploration – the curious cleaning bot sticking a wet mop into an electrical socket scenario, the researchers behind “Concrete Problems in AI Safety” believe other problems need to be addressed; those include scalable oversight and ensuring AI systems behave robustly in environments that are different from where they were trained.
The researchers concluded:
With the realistic possibility of machine learning-based systems controlling industrial processes, health-related systems, and other mission-critical technology, small-scale accidents seem like a very concrete threat, and are critical to prevent both intrinsically and because such accidents could cause a justified loss of trust in automated systems. The risk of larger accidents is more difficult to gauge, but we believe it is worthwhile and prudent to develop a principled and forward-looking approach to safety that continues to remain relevant as autonomous systems become more powerful.