{"id":81477,"date":"2021-11-29T07:30:00","date_gmt":"2021-11-29T06:30:00","guid":{"rendered":"https:\/\/www.hiig.de\/?p=81477"},"modified":"2023-03-28T14:02:54","modified_gmt":"2023-03-28T12:02:54","slug":"bias-in-natural-language-processing","status":"publish","type":"post","link":"https:\/\/www.hiig.de\/en\/bias-in-natural-language-processing\/","title":{"rendered":"How to identify bias in Natural Language Processing"},"content":{"rendered":"\n<p><strong>Why do translation programmes or chatbots on our computers often contain discriminatory tendencies towards gender or race? Here is an easy guide to understand how bias in natural language processing works. We explain why sexist technologies like search engines are not just an unfortunate coincidence.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is bias in translation programmes?<\/strong><\/h2>\n\n\n\n<p>Have you ever used machine translation for translating a sentence to Estonian? In some languages, like Estonian, pronouns and nouns do not indicate gender. When translating to English, the software has to make a choice. Which word becomes male and which female? However, often it is a choice grounded in stereotypes. Is this just a coincidence?<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"450\" src=\"https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8.png\" alt=\"\" class=\"wp-image-81478\" srcset=\"https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8.png 800w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-60x34.png 60w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-768x432.png 768w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-180x101.png 180w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-400x225.png 400w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-200x112.png 200w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-50x28.png 50w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-550x309.png 550w, https:\/\/www.hiig.de\/wp-content\/uploads\/2021\/11\/Blog-Bias-in-Language-Processing-Webseite-\u2013-8-600x338.png 600w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><figcaption><span style=\"color:#646466\" class=\"has-inline-color\"><strong>Figure 1<\/strong>: Screenshot of a translation from English to Estonian (and vice versa) by Google Translate. There is no grammatical distinction for gender in Estonian but for English it is necessary, the programme has to decide which word becomes grammatically male or female. The algorithms are often based on stereotypes, as in this example.<\/span><\/figcaption><\/figure><\/div>\n\n\n\n<p>These kinds of systems are created using large amounts of language data \u2013 and this naturally occurring language data contains biases: systematic and unfair discrimination against certain individuals or groups of individuals in favour of others. The way these systems are created <a href=\"https:\/\/aclanthology.org\/D17-1323\/\">can even amplify these pre-existing biases<\/a>. 
In this blog post, we ask where such distortions come from and whether there is anything we can do to reduce them.

We will start with a technical explanation of how individual words from large amounts of language data are transformed into numerical representations – so-called embeddings – so that computers can make sense of them. This isn't to bore you but to make clear that bias in word embeddings is no coincidence but rather a logical byproduct. We will then discuss what happens when biased embeddings are used in practical applications.

## Word embeddings: what, why, and how?

Word embeddings are simply lists of numbers. When you search Google for "women", what the program sees is a list of numbers, and the results are based on calculations on those numbers. Computers cannot process words directly. And this is why we need embeddings.

*Figure 2: Numerical representation, or word embedding, of the word "woman". It is nothing but a long list, or vector, of numbers.*

The main goal is to transform words into numbers in such a way that only a minimum of information is lost. Take the following example, in which each word is represented by a single number.

*Figure 3: A toy example in which each word is mapped to a single number.*
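To make this concrete, here is a minimal Python sketch of such a naive numbering scheme (our own illustration, not the code behind the figure): every new word simply receives the next free integer.

```python
# A minimal sketch of the naive scheme from Figure 3 (our own illustration):
# every new word in the vocabulary simply receives the next free integer.
sentence = "the nurse and the doctor work at the hospital"

vocabulary = {}
for word in sentence.split():
    # Only assign a new number if we have not seen this word before.
    vocabulary.setdefault(word, len(vocabulary))

print(vocabulary)
# {'the': 0, 'nurse': 1, 'and': 2, 'doctor': 3, 'work': 4, 'at': 5, 'hospital': 6}
```

Note the problem: "nurse" (1) is no closer to "doctor" (3) than it is to "at" (5). The numbers are arbitrary and carry no meaning.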
However, we can capture even more information. For instance, "nurse", "doctor", and "hospital" are somehow similar. We might want similarity to be represented in the embeddings, too. And this is why we use more sophisticated methods.

## Word embeddings and Machine Learning

We will focus on a particular embedding method called [GloVe](https://nlp.stanford.edu/projects/glove/). GloVe falls into the category of machine learning. This means that we do not manually assign numbers to a word like we did before but rather learn the embeddings from a large corpus of language (and by large we mean billions of words).

The basic idea is often expressed by the phrase: "[You shall know a word by the company it keeps](https://en.wikipedia.org/wiki/John_Rupert_Firth)". This simply means that the context of a word says much about its meaning. Take our example sentence from above again: if we were to mask one of the words, you could still roughly guess what it is: "The ___ and the doctor work at the hospital." This idea forms the foundation of GloVe, and its algorithm consists of three steps.

1. **First, we sweep a window over a sentence from our language corpus.** The size of the window is up to us and depends on how big we want the surrounding context of a word to be. There have been [experiments](https://aclanthology.org/P14-2050/) showing that with a window size of 2 (like below) "Hogwarts" is more strongly associated with other fictional schools like "Sunnydale" or "Evernight", whereas with a size of 5 it is associated with words like "Dumbledore" or "Malfoy".

*Figure 4: Sweeping a context window over the example sentence.*
2. **Second, we count the co-occurrences of the words.** By co-occurrence we mean the words which appear in the same window as the given word. "Nurse", for example, co-occurs with "and" and twice with "the". The result can be displayed in a table (see the sketch after this list).

*Figure 5: The co-occurrence counts displayed as a table.*

3. **The third step is the actual learning.** We won't go into detail because this would require a lot of maths. However, the basic idea is simple: we calculate co-occurrence probabilities. Words that co-occur often, such as "doctor" and "nurse", get a high probability, and unlikely pairs like "doctor" and "elephant" get a low probability. Finally, we assign the embeddings according to the probabilities. The embedding for "doctor" and the one for "nurse", for instance, will have some similar numbers, and they will be quite different from the numbers for "elephant".
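Here is the sketch promised above: a toy version of steps 1 and 2 on a one-sentence "corpus". Real GloVe training sweeps over billions of words and additionally weights co-occurrences by their distance within the window, which we skip here for clarity.

```python
# A toy sketch of steps 1 and 2 (window sweeping and co-occurrence counting).
# Real GloVe training runs over billions of words and weights co-occurrences
# by distance within the window; we skip that here for clarity.
from collections import Counter

WINDOW = 2  # words up to 2 positions to the left or right count as context

corpus = ["the nurse and the doctor work at the hospital"]
cooccurrences = Counter()

for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        # Every word within WINDOW positions of `word` co-occurs with it.
        for j in range(max(0, i - WINDOW), min(len(words), i + WINDOW + 1)):
            if i != j:
                cooccurrences[(word, words[j])] += 1

print(cooccurrences[("nurse", "the")])  # 2 -- "nurse" co-occurs twice with "the"
print(cooccurrences[("nurse", "and")])  # 1
```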
*Figure 6: This code snippet outputs the 20 most similar words to "banana". The numbers next to the words display their similarity to "banana"; high numbers mean high similarity.*

That's it! Those are all the steps needed to create GloVe embeddings. All that was done was to sweep a window over millions of sentences, count the co-occurrences, calculate the probabilities, and assign the embeddings.

*Figure 7: Word embeddings let you do actual maths on words. If you subtract the embedding for "man" from the one for "king" and add the embedding for "woman", your result is "queen".*
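If you want to try this yourself, something along the following lines reproduces Figures 6 and 7. The exact snippet behind Figure 6 is not shown in the post, so treat this as a sketch; we assume the pre-trained "glove-wiki-gigaword-100" vectors available through the gensim library's downloader.

```python
# A sketch in the spirit of Figures 6 and 7, assuming the pre-trained
# "glove-wiki-gigaword-100" vectors shipped with the gensim library.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads the vectors once

# Figure 6: the 20 words most similar to "banana", with similarity scores.
print(glove.most_similar("banana", topn=20))

# Figure 7: maths on words -- king - man + woman lands near "queen".
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```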
While GloVe is no longer used in most modern technologies and has been replaced by slightly more advanced methods, the underlying principles are the same: similarity (and meaning) is assigned based on co-occurrence.

## Bias in Natural Language Processing: Is it just a fantasy?

But back to the original topic: what is it about this method that makes it so vulnerable to biases? Imagine that words like "engineer" often co-occur with "man", or "nurse" with "woman". GloVe will assign strong similarity to those pairs, too. And this is a problem. To emphasise this point: bias in word embeddings is no coincidence. The very same mechanism that gives them the ability to capture the meaning of words is also responsible for gender, race and other biases.

Bias in embeddings is certainly not cool. But is it merely a theoretical problem or are there also real-world implications? Well, here are some examples to convince you of the latter.

Word embeddings constitute the input of most modern systems, including many ubiquitous technologies that we use daily. From search engines to speech recognition, translation tools to predictive text, embeddings form the foundation of all of these. And when the foundation is biased, there is a good chance that the bias spreads to the entire system.

## Male-trained word embeddings fuel gender bias: It's real life

When you type "machine learning" into a search engine, you will get thousands of results – too many to go through them all. So you need to rank them by relevance. But how do you determine relevance? As has been [shown](https://dl.acm.org/doi/10.1145/2872518.2889361), one efficient way is to use word embeddings and rank by similarity: results that include words similar to "machine learning" will get higher rankings. However, it has also been [shown](https://arxiv.org/abs/1608.07187) that embeddings are often gender biased. Say you search for "Computer science PhD student at Humboldt University". [What might happen](https://arxiv.org/abs/1607.06520) is that the webpages of male PhD students are ranked higher because their names are more strongly associated with "computer science". And this in turn reduces the visibility of women in computer science.
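To illustrate the mechanism in a much-simplified form (real search engines combine many signals), ranking by embedding similarity can look like the sketch below; the example pages and the crude averaged-vector text representation are our own illustrative assumptions.

```python
# A much-simplified sketch of ranking by embedding similarity. The pages
# and the averaged-vector representation are illustrative assumptions.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

def embed(text):
    # Represent a text as the average of its word vectors,
    # skipping words that are not in the vocabulary.
    return np.mean([glove[w] for w in text.lower().split() if w in glove], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("machine learning")
pages = ["introduction to neural networks",
         "history of renaissance painting"]

# Pages containing words similar to the query rank higher.
for page in sorted(pages, key=lambda p: cosine(query, embed(p)), reverse=True):
    print(page)
```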
<a href=\"https:\/\/rc.library.uta.edu\/uta-ir\/handle\/10106\/29572\">This might be<\/a> because only about 18% of biographies on (English) Wikipedia are about women and only 8\u201315% of contributors are female. Truth is, finding large balanced language data sets is a pretty tough job.<\/p>\n\n\n\n<p>We could <a href=\"http:\/\/arxiv.org\/abs\/2005.14050\">go on<\/a> and <a href=\"http:\/\/arxiv.org\/abs\/2006.03955\">go on<\/a> and <a href=\"https:\/\/aclanthology.org\/W17-1606\/\">go on<\/a> and <a href=\"https:\/\/www.buzzfeednews.com\/article\/nidhisubbaraman\/robot-racism-through-language\">go on<\/a> with examples but let us now turn to something more optimistic: solutions!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>So what can we&nbsp;do against&nbsp; bias in Natural Language Processing?<\/strong><\/h2>\n\n\n\n<p>So, the bad news: language-based technology is often racist, sexist, ableist, ageist\u2026 the list goes on. But there is a faint glimmer of some good news: we can do something about it! Even better news: awareness that these problems exist is already the first step. We can also attempt to nip the problem in the bud, and try to create a dataset that contains less bias from the offset.&nbsp;<\/p>\n\n\n\n<p>As this is nigh on impossible, <a href=\"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00041\/43452\/Data-Statements-for-Natural-Language-Processing\">as linguistic data will always include pre-existing biases<\/a>, an alternative is to properly <a href=\"https:\/\/arxiv.org\/abs\/1803.09010\">describe your dataset<\/a> \u2013 that way, both research and industry can ensure that they only use appropriate datasets and can assess more thoroughly what impact using a certain dataset in a certain context will have.&nbsp;<\/p>\n\n\n\n<p>On the technical level, various techniques have been proposed to reduce bias in the actual system. Some of these techniques involve judging the appropriateness of a distinction, for example, the gender distinction between <a href=\"https:\/\/blog.conceptnet.io\/posts\/2017\/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors\/\">\u2018mother\u2019 vs. \u2018father\u2019 is appropriate, whereas \u2018homemaker\u2019 vs. \u2018programmer\u2019 rather inappropriate<\/a>. The word embeddings for these inappropriate word pairs are then adjusted accordingly. How successful these methods actually are in practice is <a href=\"https:\/\/aclanthology.org\/N19-1061\/\">debatable<\/a>.&nbsp;<\/p>\n\n\n\n<p>Basically, the people who may be affected by a technology \u2013 particularly those affected adversely \u2013 should be at the centre of research. This also involves consulting with all potential stakeholders at all stages of the development process \u2013 the earlier the better!&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Watch your data and try participation<\/strong><\/h2>\n\n\n\n<p>Biases that exist in our society are reflected in the language we use; this is further reflected and can even be amplified in technologies that have language as a basis. As a minimum, technologies that have critical outcomes \u2013 such as a system that automatically decides whether or not to grant loans \u2013 should be designed in a participatory manner, from the beginning of the design process to the very end. Just because people can be discriminatory, this does not mean that computers have to be too! 
Basically, the people who may be affected by a technology – particularly those affected adversely – should be at the centre of research. This also involves consulting with all potential stakeholders at all stages of the development process – the earlier the better!

## Watch your data and try participation

Biases that exist in our society are reflected in the language we use; this is further reflected, and can even be amplified, in technologies that have language as their basis. As a minimum, technologies that have critical outcomes – such as a system that automatically decides whether or not to grant loans – should be designed in a participatory manner, from the beginning of the design process to the very end. Just because people can be discriminatory, this does not mean that computers have to be too! There are [tools](https://stereoset.mit.edu/) and [methods](http://wordbias.umiacs.umd.edu/) to decrease the effects, and we can even use word embeddings to do [research](https://www.pnas.org/content/115/16/E3635) on bias. So: be aware of what data you are using, try out technical de-biasing methods and always keep various stakeholders in the loop.
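As a final taste of that research direction, here is a tiny sketch of how embeddings themselves can be used to measure bias. The word lists are our own illustrative choice, loosely inspired by the association tests in the work linked above.

```python
# A tiny sketch of using embeddings to *measure* bias: compare how strongly
# occupation words associate with "he" versus "she". The word lists are our
# own illustrative choice, loosely inspired by published association tests.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for occupation in ["engineer", "nurse", "programmer", "homemaker"]:
    v = glove[occupation]
    # Positive scores lean towards "he", negative towards "she".
    print(occupation, round(cosine(v, glove["he"]) - cosine(v, glove["she"]), 3))
```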