Beneficial Intelligence

Biased Data

Sten Vesterli Season 2 Episode 11

In this episode of Beneficial Intelligence, I discuss biased data. Machine learning depends on large data sets, and unless you take care, ML algorithms will perpetuate any bias in the data they learn from.

The famous ImageNet database contains 14 million labeled images. However, roughly 6% of them carry the wrong label. The labels are provided by human workers who are paid very little per image, so they work very fast. Unfortunately, as Nobel Prize winner Daniel Kahneman has shown, when humans work fast, they rely on their fast System 1 thinking, which is very prone to bias. Thus, a woman in hospital scrubs is likely to be labeled "nurse," while a man in the same clothes is likely to be labeled "doctor."
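One way to catch this kind of labeling bias before you train on it is to audit how labels correlate with a sensitive attribute. The sketch below is a minimal illustration, not ImageNet's actual tooling: the labels.csv file, its columns, and the 80% threshold are all assumptions for the example.

```python
import pandas as pd

# Hypothetical audit: does the "nurse"/"doctor" label correlate with
# the perceived gender recorded by a second, independent annotator?
# (A labels.csv with columns "label" and "perceived_gender" is assumed.)
df = pd.read_csv("labels.csv")

audit = df[df["label"].isin(["nurse", "doctor"])]
table = pd.crosstab(audit["label"], audit["perceived_gender"], normalize="index")
print(table)

# If "nurse" is overwhelmingly applied to images of women and "doctor"
# to images of men, the labels encode annotator bias, not medicine.
for label, row in table.iterrows():
    if row.max() > 0.8:  # arbitrary threshold; tune for your data
        print(f"Warning: '{label}' is {row.max():.0%} one gender - review these labels")
```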

Google Translate showed its bias when translating from Hungarian. Hungarian has only a gender-neutral third-person pronoun, but English forces a choice between "he" and "she." The originally gender-neutral phrases came out as "she does the dishes" and "he reads" in English.
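You can run the same kind of audit on translation output. The sketch below is a hedged illustration with hard-coded sample data: in practice you would collect the English output your translation system produces for gender-neutral Hungarian input, then count which pronoun it chose for each activity.

```python
import re
from collections import Counter

# Hypothetical sample: English output for gender-neutral Hungarian input.
# In practice, gather these pairs from the actual translation system.
translations = {
    "washing dishes": "she does the dishes",
    "reading": "he reads",
    "cooking": "she cooks",
    "driving": "he drives",
}

# Extract the pronoun the system chose for each activity.
pronoun_by_activity = {}
for activity, english in translations.items():
    match = re.search(r"\b(he|she)\b", english, re.IGNORECASE)
    pronoun_by_activity[activity] = match.group(1).lower() if match else None

print(pronoun_by_activity)
print(Counter(pronoun_by_activity.values()))
# A consistent skew - "she" for housework, "he" for intellectual tasks -
# means the model has absorbed stereotypes from its training corpus.
```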

As CIO or CTO, you need to make someone explicitly responsible for the quality of the data you use to train your machine learning algorithms. If you don't have a Chief Data Officer, perhaps you have a Data Protection Officer who could reasonably be given this purview. But you cannot foist the responsibility on individual development teams under deadline pressure. It is your responsibility to ensure that every machine learning system learns from clean, unbiased data.

Beneficial Intelligence is a weekly podcast with stories and pragmatic advice for CIOs, CTOs, and other IT leaders. To get in touch, please contact me at sten@vesterli.com.