Bias in AI: Part 2

Photo by Markus Spiske on Unsplash

Bias in Data

In part 1, I introduced data bias and shared examples of its consequences.

What leads to data bias?

First, data collected from certain kinds of sources can carry activity/response bias, societal bias, or labeling bias. Activity/response bias occurs when we collect data from social networks such as Twitter, Facebook, or Instagram, which do not represent the whole population, because only some people use these platforms and express their thoughts there. For example, suppose we use Twitter data on reactions to the COVID-19 virus in the US and observe that Twitter users are taking all kinds of precautions against it. Our model would conclude that people are taking the illness seriously and that the virus's spread will slow down quickly. However, we did not consider that only 21% of people in the US actively used Twitter in 2019 [8]. This means our Twitter data does not represent the whole population, and the model will be trained on biased data.
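To make the arithmetic of this selection bias concrete, here is a minimal sketch. All numbers are hypothetical except the 21% Twitter-usage share quoted above: I simply assume Twitter users take protective measures far more often than non-users, and compare the true population rate with what a Twitter-only dataset would suggest.

```python
# A minimal sketch of activity/response (selection) bias.
# The protection rates below are made-up numbers for illustration.
twitter_share = 0.21       # share of the US population active on Twitter [8]
p_protect_twitter = 0.80   # assumed rate of taking precautions among Twitter users
p_protect_offline = 0.40   # assumed rate among everyone else

# The true population rate is a weighted average over both groups.
true_rate = twitter_share * p_protect_twitter + (1 - twitter_share) * p_protect_offline

# A model trained only on Twitter data sees the Twitter-only rate instead.
biased_estimate = p_protect_twitter

print(f"true population rate:  {true_rate:.2f}")       # 0.48
print(f"Twitter-only estimate: {biased_estimate:.2f}")  # 0.80
```

Under these assumptions the Twitter-only estimate overstates the true rate by more than 30 percentage points, which is exactly the kind of gap a model inherits when its training sample is not representative.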

Now let's consider Twitter data again, this time from the perspective of societal bias, or what is called "bias in data representation" [11]. Here we may have a suitable amount of data for each group, but the sentiment expressed toward those groups is not the same; not all of them are described positively. Continuing the previous example: by 2019, Twitter reported 68 million monetizable daily active users in the US, still a vast amount of data [8]. If tweets contain racist or sexist language and the model is trained on this data, the machine will learn that being racist or sexist is normal and will most likely amplify this unethical behavior in production. We may reach 95% accuracy in our predictions, yet the model will still discriminate against groups of people.
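One practical way to surface this kind of representation bias before training is a simple corpus audit: for each group, measure how often mentions of that group co-occur with negative language. The sketch below uses entirely hypothetical tweets, group names, and a toy negative-word list; a real audit would use proper tokenization and a curated lexicon.

```python
# Toy audit of sentiment by group in a text corpus.
# All tweets, group terms, and negative words below are hypothetical placeholders.
from collections import Counter

tweets = [
    "group_a people are wonderful neighbours",
    "group_b people are terrible drivers",
    "group_b fans ruined the evening awful crowd",
    "group_a community organised a lovely event",
]
group_terms = ["group_a", "group_b"]
negative_words = {"terrible", "awful", "ruined"}

negative_mentions = Counter()
total_mentions = Counter()
for tweet in tweets:
    words = set(tweet.split())
    for group in group_terms:
        if group in words:
            total_mentions[group] += 1
            if words & negative_words:  # any negative word in the same tweet
                negative_mentions[group] += 1

for group in group_terms:
    rate = negative_mentions[group] / total_mentions[group]
    print(f"{group}: {rate:.0%} of mentions carry negative words")
```

In this toy corpus every mention of `group_b` co-occurs with negative language while `group_a` never does; a model trained on such text would learn that asymmetry even though both groups are equally represented by volume.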

Additionally, data labeling is one of the root causes of biased models. ImageNet, one of the most significant datasets in the history of AI and computer vision, included the problematic category "Person," which was later removed, although the category had circulated on the internet for years [3]. The dataset was first introduced in 2009 and was intended for object recognition. However, for some experiments, researchers included images of people, which were later found to be classified inappropriately. For example, an overweight young boy was labeled a "loser," and a man holding a beer was labeled an "alcoholic" or a "bad person" [3]. This shows that the data depended heavily on the annotations people made, and those annotations were clearly biased.
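A common warning sign for this kind of labeling bias is annotator disagreement: objective labels ("person") tend to get unanimous votes, while subjective, judgmental labels ("loser," "alcoholic") split the annotators. The sketch below, with hypothetical image IDs and annotators, flags any label that falls short of full agreement for human review.

```python
# Toy annotation audit: flag labels with low inter-annotator agreement.
# Image IDs and the three annotators' votes below are hypothetical.
from collections import Counter

annotations = {
    "img_001": ["person", "person", "person"],     # objective: full agreement
    "img_002": ["loser", "person", "student"],     # subjective: disagreement
    "img_003": ["alcoholic", "person", "person"],  # subjective: disagreement
}

for image_id, labels in annotations.items():
    majority_label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    flag = "  <-- review this label" if agreement < 1.0 else ""
    print(f"{image_id}: '{majority_label}' (agreement {agreement:.0%}){flag}")
```

Low agreement does not prove a label is biased, but it is a cheap signal that the label encodes opinion rather than an observable property of the image, which is exactly the failure mode behind the ImageNet "Person" categories.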

In conclusion, we can identify two main reasons for data bias. First, the collected data does not represent the whole population: it is not randomized, or it omits variables that matter for modeling. Second, human-produced data carries bias toward particular groups of people.

Other parts of the series:

I would appreciate it if you shared your opinions on this article.

[1] Masoud Mansoury, Himan Abdollahpouri, Mykola Pechenizkiy, Bamshad Mobasher, Robin Burke. Feedback Loop and Bias Amplification in Recommender Systems, 2020.

[2] Alex Beutel, Jilin Chen, Zhe Zhao, Ed H. Chi. Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations, 2017.

[3] Kate Crawford and Trevor Paglen. The Politics of Images in Machine Learning Training Sets, 2019.

[4] Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women, 2018.

[5] Rich Caruana, Paul Koch, Yin Lou, Marc Sturm, Johannes Gehrke, Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission, 2015.

[6] Laurence Hart. What data will you feed your artificial intelligence?, February 2018.

[7] Adrian Benton, Margaret Mitchell, Dirk Hovy. Multi-Task Learning for Mental Health using Social Media Text, 2017.

[8] H. Tankovska. Twitter: number of monetizable daily active U.S. users 2017–2020, 2021.

[9] Prabhakar Krishnamurthy. Understanding Data Bias: Types and sources of data bias, 2019.

[10] Brian Hu Zhang, Blake Lemoine, Margaret Mitchell. Mitigating Unwanted Biases with Adversarial Learning, 2018.

[11] Margaret Mitchell. Bias in the Vision and Language of Artificial Intelligence, 2021.

[12] Julia Angwin, Jeff Larson, Surya Mattu, Lauren Kirchner, ProPublica. Machine Bias, 2016.

[13] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, Lucy Vasserman. Measuring and Mitigating Unintended Bias in Text Classification, 2017.

[14] Jordan Weissmann. Amazon Created a Hiring Tool Using A.I. It Immediately Started Discriminating Against Women, 2018.


