This is the third part of my series on biased data, in which I mostly cover biased models and the role of data analysts. In the previous part we discussed mainly human-produced data, but that is not the only data source in the world of big data. We also collect data from websites, using clicks and mouse movements as valuable training data for our models. Although this can still be seen as human-produced data, in this kind of data generation it is largely our machines that direct which pages we visit and what we click on.
What leads to having a biased model?
Recommendation algorithms are a popular example in this category. Many people acknowledge that these algorithms push popularity bias into the model, caused by a feedback loop. How does it happen? There are always a few items that become popular through the clicks of a certain proportion of people. These popular items are then recommended to everyone, while other products that could be interesting for particular consumers are ignored. People react to the recommended trending items, and these reactions are recorded and later fed back to the algorithm as new training data.
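The loop described above is easy to see in a toy simulation. The sketch below is not from any real recommender: it is a minimal model under assumed parameters (10 items, 100 users, a 90% chance that a user clicks one of the recommended items rather than exploring). Even though every item starts out roughly equal, the click-record-retrain cycle concentrates almost all clicks on the few items that happened to be popular first.

```python
import random

random.seed(0)

N_ITEMS = 10
N_USERS = 100
N_ROUNDS = 20
TOP_K = 3  # the recommender only surfaces the 3 currently most-clicked items

# one initial batch of organic, unbiased clicks
clicks = [0] * N_ITEMS
for _ in range(N_USERS):
    clicks[random.randrange(N_ITEMS)] += 1

for _ in range(N_ROUNDS):
    # "recommend" the current top-K items to every user
    top_items = sorted(range(N_ITEMS), key=lambda i: clicks[i], reverse=True)[:TOP_K]
    for _ in range(N_USERS):
        if random.random() < 0.9:
            # users mostly click what they are shown...
            clicks[random.choice(top_items)] += 1
        else:
            # ...and only rarely explore on their own
            clicks[random.randrange(N_ITEMS)] += 1

top_share = sum(sorted(clicks, reverse=True)[:TOP_K]) / sum(clicks)
print(f"share of all clicks held by the top {TOP_K} items: {top_share:.0%}")
```

Each round the recommendations are rebuilt from data that the previous recommendations produced, so the popular items' lead only grows: that is the feedback loop in miniature.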
We know that recommender systems and online advertising systems use collaborative filtering (CF) techniques inside a feedback loop. Mansoury et al. (2020) discuss how CF amplifies bias over iterations, leading to problems such as a shifting representation of users' taste and homogenization. A shifting taste representation can end with users' true preferences being pushed out of the system's recommendations, and with poor performance. Homogenization happens when one group dominates a minority group: the recommendations then ignore the minority and are driven by the majority. That is why I call this model bias: the problem here is not actually the data, but a system that uses the data in an inappropriate way and thereby introduces bias into the model.
Power in the hands of modelers!
In this part, I would like to discuss briefly how the modeler's approach can cause bias in the models. We should admit that, while a model is being built, all the power lies in the hands of the people who create it, the analysts. One simple act by the creator can end in unwanted results. For example, omitted variables are one of the common roots of biased models that can be caused by data analysts.
In Caruana et al. (2015), researchers try to identify pneumonia patients with a high risk of death, in order to decide whether they should be admitted to the hospital or treated as outpatients. Their model ended up classifying pneumonia patients who also have asthma into the low-risk category. This result concerned both the doctors and the researchers. They investigated the model and the data until they found that one scenario was not captured: patients with both asthma and pneumonia are admitted directly to the ICU (intensive-care unit), and the aggressive care they receive there lowers their observed death rate. In the data this information was missing; a variable that was significantly important for the model had been omitted.
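A small synthetic example can show how an omitted variable produces exactly this reversal. The numbers below are invented for illustration (they are not from the paper): asthma is assumed to double the underlying death risk, but every asthma patient goes to the ICU, which is assumed to cut observed mortality sharply. A model that never sees the ICU variable sees only the final death rates, in which asthma patients look *safer*.

```python
import random

random.seed(1)

# synthetic pneumonia patients; all risk numbers are assumptions for illustration
patients = []
for _ in range(10_000):
    asthma = random.random() < 0.15
    icu = asthma                              # policy: asthma + pneumonia -> straight to ICU
    base_risk = 0.20 if asthma else 0.10      # true underlying death risk
    risk = base_risk * (0.2 if icu else 1.0)  # ICU care cuts observed mortality
    died = random.random() < risk
    patients.append((asthma, died))

def death_rate(group):
    return sum(died for _, died in group) / len(group)

asthma_rate = death_rate([p for p in patients if p[0]])
other_rate = death_rate([p for p in patients if not p[0]])
print(f"observed death rate with asthma:    {asthma_rate:.1%}")
print(f"observed death rate without asthma: {other_rate:.1%}")
```

Because the ICU variable is omitted, any model trained on this table, no matter how good, would learn that asthma lowers pneumonia death risk, which is the opposite of the truth.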
Also, sometimes a data scientist starts to build a model in order to prove some hypothesis. In this case, the modeler is biased from the beginning; they collect data and design the model so that it will be suitable for proving their prior idea. In such cases, data scientists sometimes remove samples or variables that do not let them get the desired result. Most of the time, this kind of model suffers from biased results, as it is designed to have a bias.
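To see how dangerous this sample-dropping habit is, consider a deliberately extreme sketch (the groups and thresholds here are invented for illustration): two groups are drawn from the same distribution, so there is no real difference between them, yet dropping the "inconvenient" half of one group manufactures a large gap that would "confirm" the modeler's hypothesis.

```python
import random
from statistics import mean

random.seed(2)

# two groups drawn from the SAME distribution: no real difference exists
group_a = [random.gauss(50, 10) for _ in range(200)]
group_b = [random.gauss(50, 10) for _ in range(200)]

print(f"honest means:         A={mean(group_a):.1f}  B={mean(group_b):.1f}")

# a modeler set on "proving" A > B simply drops the samples of B
# that do not support the hypothesis
trimmed_b = [x for x in group_b if x < mean(group_b)]
print(f"after cherry-picking: A={mean(group_a):.1f}  B={mean(trimmed_b):.1f}")
```

Any downstream model or test fed `trimmed_b` instead of `group_b` will report a confident difference that was put there by hand, not found in the data.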
I would like to hear your opinions about my article.
Masoud Mansoury, Himan Abdollahpouri, Mykola Pechenizkiy, Bamshad Mobasher, Robin Burke. Feedback Loop and Bias Amplification in Recommender Systems, 2020.
Alex Beutel, Jilin Chen, Zhe Zhao, Ed H. Chi. Data Decisions and Theoretical Implications when Adversarially Learning Fair Representations, 2017.
Kate Crawford and Trevor Paglen. The Politics of Images in Machine Learning Training Sets, 2019.
Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women, 2018.
Rich Caruana, Paul Koch, Yin Lou, Marc Sturm, Johannes Gehrke, Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission, 2015.
Laurence Hart. What data will you feed your artificial intelligence?, February 2018.
Adrian Benton, Margaret Mitchell, Dirk Hovy. Multi-Task Learning for Mental Health using Social Media Text, 2017.
H. Tankovska. Twitter: number of monetizable daily active U.S. users 2017–2020, 2021.
Prabhakar Krishnamurthy. Understanding Data Bias: Types and sources of data bias, 2019.
Brian Hu Zhang, Blake Lemoine, Margaret Mitchell. Mitigating Unwanted Biases with Adversarial Learning, 2018.
Margaret Mitchell. Bias in the Vision and Language of Artificial Intelligence, 2021.
Julia Angwin, Jeff Larson, Surya Mattu, Lauren Kirchner, ProPublica. Machine Bias, 2016.
Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, Lucy Vasserman. Measuring and Mitigating Unintended Bias in Text Classification, 2017.
Jordan Weissmann. Amazon Created a Hiring Tool Using A.I. It Immediately Started Discriminating Against Women, 2018.