Bias in Data
In recent years, the successful application of machine learning algorithms has led businesses to adopt the technology for purposes such as advertising, recommendation, and image recognition. These algorithms and models are also becoming trusted partners in high-stakes domains such as healthcare, crime prediction (social justice), and employment monitoring and hiring. Because ML/DL models are grounded in statistics and mathematics, people trust algorithms more than ever before. But what is the most important ingredient of an AI model? It is not the math or the statistics; it is the data. Supervised models are created by feeding in input-output pairs and learning the relationship between them, so a model depends heavily on the data used to train it.
In part 1, I gave an introduction and shared examples of data bias and its consequences.
What leads to data bias?
Let’s recall American statistician Andrew Gelman’s view of data: “The most important aspect of a statistical analysis is not what you do with the data, it is what data you use.” To feed our neural networks or models, we collect data that is either human-produced or generated by online systems. This raises an obvious question: what can cause biased data? I will start with human-produced data and its effect on algorithms and models.
First, data from this kind of source can carry activity/response bias, societal bias, or labeling bias. Activity/response bias occurs when we collect data from social networks such as Twitter, Facebook, or Instagram: the data does not represent the whole population, because only some people use these platforms and express their thoughts on them. For example, if we use Twitter data on reactions to COVID-19 in the US and see that Twitter users are taking all kinds of precautions, our model will conclude that people are taking the illness seriously and that the virus’s spread will slow quickly. However, only about 21% of people in the US actively used Twitter in 2019. Our Twitter data therefore does not represent the whole population, and the model will be trained on biased data.
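A tiny simulation makes the point concrete. The numbers below are invented for illustration (only the 21% platform-usage figure comes from the text): if platform users behave differently from non-users, an estimate computed from platform data alone will miss the true population rate.

```python
import random

random.seed(0)

# Hypothetical population of 1,000 people; only 21% are platform users.
# Assume (purely for illustration) that users take precautions far more
# often than non-users do.
population = []
for _ in range(1000):
    is_user = random.random() < 0.21
    takes_precautions = random.random() < (0.9 if is_user else 0.5)
    population.append((is_user, takes_precautions))

# Estimate the precaution rate from platform data only vs. everyone.
users = [p for p in population if p[0]]
platform_rate = sum(p[1] for p in users) / len(users)
true_rate = sum(p[1] for p in population) / len(population)

print(f"rate estimated from platform users: {platform_rate:.2f}")
print(f"true population rate:               {true_rate:.2f}")
```

Because the sample over-represents the cautious minority, the platform-based estimate comes out noticeably higher than the population truth; a model trained only on platform data inherits exactly this gap.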
Now let’s consider Twitter data again, this time from the perspective of societal bias, what we call “bias in data representation”. Here we may have a suitable amount of data for each group, but the sentiment toward the groups differs; not all of them are described positively. Continuing the previous example: by 2019, Twitter reported 68 million active users in the US, which is still a vast amount of data. If their tweets contain racist or sexist terms and the model is trained on that data, the machine will conclude that being racist or sexist is normal and will most likely amplify this unethical behavior in production. We may get 95% accuracy in our predictions, but the model will not be ethical and will discriminate against groups of people.
Additionally, data labeling is one of the root causes of biased models. ImageNet, one of the most significant datasets in the history of AI and computer vision, included the problematic category “Person,” which was removed later, although the same category had circulated on the internet for years. The dataset was first introduced in 2009 and was intended for object recognition. To run some experiments, however, researchers included images of people, which were later found to be classified inappropriately: an overweight young boy was labeled a “loser,” and a man holding a beer was labeled an “alcoholic” or a “bad person”. This shows how heavily the data depended on human annotations, and those annotations were clearly biased.
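One cheap defense against labeling bias is to audit the label distribution per group before training. The sketch below is a minimal, hypothetical example (the group names and labels are made up): a large gap in negative-label rates between groups is a red flag worth investigating, not proof of bias by itself.

```python
from collections import Counter, defaultdict

# Hypothetical annotated examples: (group attribute, assigned label).
annotations = [
    ("group_a", "negative"), ("group_a", "negative"), ("group_a", "positive"),
    ("group_b", "positive"), ("group_b", "positive"), ("group_b", "negative"),
    ("group_a", "negative"), ("group_b", "positive"),
]

# Count labels per group.
counts = defaultdict(Counter)
for group, label in annotations:
    counts[group][label] += 1

# Report the negative-label rate for each group; a large gap between
# groups suggests the annotations themselves may be biased.
for group, c in sorted(counts.items()):
    total = sum(c.values())
    print(f"{group}: negative label rate = {c['negative'] / total:.2f}")
```

In this toy data, group_a receives a negative label three times out of four, while group_b receives one in four; in a real pipeline that asymmetry should trigger a review of the annotation guidelines before any model is trained on the labels.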
In conclusion, there are two main causes of data bias: either the collected data does not represent the whole population (it is not randomized, or it omits variables that matter for modeling), or the human-produced data carries bias against certain groups of people.
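The first cause, non-representative data, can be checked with a simple comparison of group shares in the training sample against a known reference distribution (for example, census figures). The sketch below uses invented numbers and group names purely for illustration.

```python
# Hypothetical reference (e.g. census) proportions and sample counts.
reference = {"group_a": 0.50, "group_b": 0.30, "group_c": 0.20}
sample_counts = {"group_a": 700, "group_b": 250, "group_c": 50}

total = sum(sample_counts.values())
for group, ref_share in reference.items():
    sample_share = sample_counts[group] / total
    gap = sample_share - ref_share
    # Flag any group whose sample share differs from the reference by
    # more than 10 percentage points (an arbitrary threshold).
    flag = "  <-- check representation" if abs(gap) > 0.10 else ""
    print(f"{group}: sample {sample_share:.2f} vs reference {ref_share:.2f}{flag}")
```

Here group_a is over-represented (0.70 vs 0.50) and group_c is under-represented (0.05 vs 0.20), so a model trained on this sample would see far too little of group_c.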
Other parts of the series:
- Introduction to biasness in AI — part 1
- Bias in Model and Modelers — part 3
- How to deal with data bias? — part 4
I would appreciate it if you shared your opinions about the article.
References
Masoud Mansoury, Himan Abdollahpouri, Mykola Pechenizkiy, Bamshad Mobasher, Robin Burke. Feedback Loop and Bias Amplification in Recommender Systems, 2020.
Alex Beutel, Jilin Chen, Zhe Zhao, Ed H. Chi. Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations, 2017.
Kate Crawford and Trevor Paglen. The Politics of Images in Machine Learning Training Sets, 2019.
Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women, 2018.
Rich Caruana, Paul Koch, Yin Lou, Marc Sturm, Johannes Gehrke, Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission, 2015.
Laurence Hart. What data will you feed your artificial intelligence?, February 2018.
Adrian Benton, Margaret Mitchell, Dirk Hovy. Multi-Task Learning for Mental Health using Social Media Text, 2017.
H. Tankovska. Twitter: number of monetizable daily active U.S. users 2017–2020, 2021.
Prabhakar Krishnamurthy. Understanding Data Bias: Types and sources of data bias, 2019.
Brian Hu Zhang, Blake Lemoine, Margaret Mitchell. Mitigating Unwanted Biases with Adversarial Learning, 2018.
Margaret Mitchell. Bias in the Vision and Language of Artificial Intelligence, 2021.
Julia Angwin, Jeff Larson, Surya Mattu, Lauren Kirchner, ProPublica. Machine Bias, 2016.
Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, Lucy Vasserman. Measuring and Mitigating Unintended Bias in Text Classification, 2017.
Jordan Weissmann. Amazon Created a Hiring Tool Using A.I. It Immediately Started Discriminating Against Women, 2018.