It is easy to be fooled if you do not understand how a model works

Benjamin Heinzerling writes in The Gradient that the “Clever Hans effect” is alive and well in natural language processing (NLP) deep learning models:

Of course, the problem of learners solving a task by learning the “wrong” thing has been known for a long time and is known as the Clever Hans effect, after the eponymous horse which appeared to be able to perform simple intellectual tasks, but in reality relied on involuntary cues given by its handler. Since the 1960s, versions of the tank anecdote tell of a neural network trained by the military to recognize tanks in images, but actually learning to recognize different levels of brightness due to one type of tank appearing only in bright photos and another type only in darker ones.

Less anecdotal, Viktoria Krakovna has collected a depressingly long list of agents following the letter, but not the spirit of their reward function, with such gems as a video game agent learning to die at the end of the first level, since repeating that easy level gives a higher score than dying early in the harder second level. Two more recent, but already infamous cases are an image classifier claimed to be able to distinguish faces of criminals from those of law-abiding citizens, but actually recognizing smiles and a supposed “sexual orientation detector” which can be better explained as a detector of glasses, beards and eyeshadow.

NLP’s Clever Hans Moment has Arrived

BERT is a fantastic NLP model, but it isn’t displaying deep understanding of the material. For certain tasks, at least, it is exploiting statistical correlations in the training data, and it spots them better than you can. And that makes it hard to see what it’s actually doing.
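To make that concrete, here is a toy sketch (mine, not from the article) of how a text classifier can hit perfect accuracy by keying on an artifact of the dataset rather than the meaning of the sentences. The data and cue word here are made up for illustration; real annotation artifacts work the same way, just less obviously.

```python
# Toy illustration: a classifier can "solve" a task by latching onto a
# spurious cue instead of understanding the text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Synthetic data: every positive example happens to contain the word "not" --
# an accident of how the examples were written, not a real signal.
train_texts = [
    "the cat is not on the mat",      # label 1
    "she did not open the door",      # label 1
    "he is not going to the store",   # label 1
    "the cat is on the mat",          # label 0
    "she opened the door",            # label 0
    "he is going to the store",       # label 0
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)
clf = LogisticRegression().fit(X, train_labels)

# Perfect score on data that shares the artifact...
print(clf.score(X, train_labels))  # 1.0

# ...but the single word "not" carries the weight, so anything containing it
# gets labeled positive, regardless of what the sentence means.
test = vectorizer.transform(["the dog is not happy", "the dog is happy"])
print(clf.predict(test))  # [1 0] -- the first is positive purely because of "not"
```

A bag-of-words logistic regression makes the failure easy to see; with a model like BERT the shortcut is buried in millions of parameters, which is exactly why it fools people.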

Reminds me of one of my favorite quotes, from Richard Feynman: “The first principle is that you must not fool yourself – and you are the easiest person to fool.”