Slovene and Croatian word embeddings in terms of gender occupational analogies
In recent years, the use of deep neural networks and dense vector embeddings for text representation has led to excellent results in the field of computational understanding of natural language. It has also been shown that word embeddings often capture gender, racial and other types of bias. The article focuses on evaluating Slovene and Croatian word embeddings in terms of gender bias using word analogy calculations. We compiled a list of masculine and feminine nouns for occupations in Slovene and evaluated the gender bias of fastText, word2vec and ELMo embeddings with different configurations and different approaches to analogy calculations. The lowest occupational gender bias was observed with the fastText embeddings. Similarly, we compared different fastText embeddings on Croatian occupational analogies.
Gender bias in language and artificial intelligence tools
This master's thesis represents an interdisciplinary approach to understanding gender bias manifested in the output of artificial intelligence tools based on language models. Biases and stereotypes become problematic when we systematically exhibit an unfair positive or negative inclination towards a particular group, which can lead to heuristics and to wrong and unfair decision-making. The work sheds light on how bias in AI tools arises in the first place through a series of steps: first, it explains the methods of natural language processing systems and the modelling of word meaning through word embeddings. It then examines how gender bias manifests in language itself and how the language of a given training dataset influences the resulting word embeddings. In addition to the choice of training datasets, the work identifies labelling, input data representation, models and research conceptualisation as sources of gender bias in AI tools. Based on our own studies, published between 2019 and 2021, the thesis shows concrete ways in which natural language processing tools exhibit gender bias. The thesis also deals with gender bias from the perspective of human interaction with biased AI tools.
Anthropomorphisation and the apparent objectivity of artificial intelligence are among the main ways in which such tools can negatively influence human decision-making. The main contribution of this thesis is to propose, through a broad review of literature from linguistics, computer science, philosophy and other disciplines, as well as our own studies, several guidelines that could reduce gender bias in tools such as large language models. Firstly, gender bias needs to be precisely defined with the help of the social sciences. At the same time, AI tools must meet high ethical standards, be inclusive and represent a diversity of experiences. A clear framework of accountability for tool developers needs to be established. Companies must commit to continuous improvement of their tools and to transparency, even when these do not benefit them financially. Despite the usefulness of tools that mimic humans, I believe it would be beneficial for developers to commit to reducing the anthropomorphisation of tools, since it can undesirably influence human decision-making and thus interfere with personal autonomy. One of the guidelines proposed in the thesis is the prevention of the so-called "feedback loop phenomenon" that arises when AI-generated texts are fed back as training data for subsequent tools. Lastly, the thesis proposes a leading role for education in AI and gender bias, which can empower people both in their use of these tools and in living in the new reality.
List of single-word male and female occupations in Slovenian
The list of single-word occupations in Slovene is based on the Slovene Standard Classification of Occupations (https://www.uradni-list.si/glasilo-uradni-list-rs/vsebina?urlid=199728&stevilka=1641).
The list includes 234 occupation pairs. For each occupation, it contains the masculine word form (e.g. fotograf), a possible synonym of the masculine form, the feminine equivalent (e.g. fotografka) and a possible synonym of the feminine form (e.g. fotografinja). Cases where no synonym was added for a specific occupation are denoted with the label 0 (note that only synonyms with the same root are considered).
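A record with this structure could be parsed as sketched below. Note that the exact file layout is an assumption here: the sketch supposes a tab-separated line per occupation with the columns in the order described above (masculine form, masculine synonym, feminine form, feminine synonym, with 0 marking a missing synonym), and the sample rows beyond fotograf are hypothetical.

```python
import csv
import io

# Hypothetical sample in the assumed tab-separated layout:
# masculine form, masculine synonym (0 if none), feminine form,
# feminine synonym (0 if none). The second row is invented for illustration.
sample = (
    "fotograf\t0\tfotografka\tfotografinja\n"
    "učitelj\t0\tučiteljica\t0\n"
)

pairs = []
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    masc, masc_syn, fem, fem_syn = row
    pairs.append({
        "masculine": masc,
        "masculine_synonym": None if masc_syn == "0" else masc_syn,
        "feminine": fem,
        "feminine_synonym": None if fem_syn == "0" else fem_syn,
    })

print(pairs[0]["feminine"])  # → fotografka
```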
Several conditions were applied for the inclusion or exclusion of an occupation in the list:
- Our list contains only single-word occupation pairs, while the majority of the occupations in the aforementioned classification are multi-word expressions.
- An occupation has to exist in both the feminine and the masculine grammatical gender (gender-neutral words such as pismonoša [en. postman] are not included in the list).
- At least one of the variants of an occupation (masculine or feminine) occurs at least 500 times in the Corpus of Written Standard Slovene Gigafida 2.0.
- Occupations that are also proper names in Slovene, e.g. kovač [en. blacksmith], were filtered out if the proper-name form exists in the Slovene Morphological Lexicon Sloleks 2.0 (Dobrovoljc et al., 2019).
- Occupations that could be easily associated with a context unrelated to occupations (e.g. čarovnik/čarovnica [en. wizard/witch]) or where a male or female variant is a homograph of a common noun (e.g. detektivka [en. detective] also denotes a detective novel) were excluded from the final set of occupations.
When a more established version of an occupation exists, we manually added a synonym with the same root (e.g. in the case of fotografka, the arguably more established fotografinja was added [en. photographer]).
If the standard classification does not include the feminine (e.g. dramatik [en. playwright]) or the masculine version (e.g. prostitutka [en. prostitute]) of an occupation, the missing version was manually added if it exists and appears in the Gigafida corpus (e.g. no established feminine version of postrešček [en. porter] or masculine version of hostesa [en. hostess] exists, so these were not added).
The list of occupations can be used for different natural language processing tasks, including the evaluation of word embedding models through analogies, which can point to bias in language use.
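The analogy-based evaluation mentioned above can be sketched with the standard 3CosAdd method: given the pair moški:ženska [en. man:woman] and a masculine occupation, the nearest word to vec(ženska) − vec(moški) + vec(fotograf) should ideally be the feminine form fotografka. The toy 4-dimensional vectors below are invented for illustration only; a real evaluation would use pretrained Slovene embeddings such as fastText, word2vec or ELMo.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(emb, a, b, c):
    # 3CosAdd: return the word d whose vector is closest to b - a + c,
    # excluding the three query words themselves.
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = cosine(vec, target)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy vectors for illustration only (not real embeddings).
emb = {
    "moški":      np.array([1.0, 0.0, 0.0, 0.0]),
    "ženska":     np.array([0.0, 1.0, 0.0, 0.0]),
    "fotograf":   np.array([1.0, 0.0, 1.0, 0.0]),
    "fotografka": np.array([0.0, 1.0, 1.0, 0.0]),
    "kovač":      np.array([1.0, 0.0, 0.0, 1.0]),
}

# moški : ženska :: fotograf : ?
print(analogy(emb, "moški", "ženska", "fotograf"))  # → fotografka
```

Bias can then be quantified, for example, by how far down the ranked neighbour list the correct feminine form appears for each occupation pair.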
If you use the dataset, please cite the following paper: SUPEJ, Anka, ULČAR, Matej, ROBNIK ŠIKONJA, Marko, POLLAK, Senja (2020). Primerjava slovenskih besednih vektorskih vložitev z vidika spola na analogijah poklicev. Zbornik konference Jezikovne tehnologije in digitalna humanistika / Proc. of the Conference on Language Technologies and Digital Humanities, pp. 93–100.