
    Recalibrating machine learning for social biases: demonstrating a new methodology through a case study classifying gender biases in archival documentation

    This thesis proposes a recalibration of Machine Learning for social biases to minimize harms from existing approaches and practices in the field. Prioritizing quality over quantity, accuracy over efficiency, representativeness over convenience, and situated thinking over universal thinking, the thesis demonstrates an alternative approach to creating Machine Learning models. Drawing on GLAM, the Humanities, the Social Sciences, and Design, the thesis focuses on understanding and communicating biases in a specific use case. 11,888 metadata descriptions from the University of Edinburgh Heritage Collections' Archives catalog were manually annotated for gender biases and text classification models were then trained on the resulting dataset of 55,260 annotations. Evaluations of the models' performance demonstrate that annotating gender biases can be automated; however, the subjectivity of bias as a concept complicates the generalizability of any one approach. The contributions are: (1) an interdisciplinary and participatory Bias-Aware Methodology, (2) a Taxonomy of Gendered and Gender Biased Language, (3) data annotated for gender biased language, (4) gender biased text classification models, and (5) a human-centered approach to model evaluation. The contributions have implications for Machine Learning, demonstrating how bias is inherent to all data and models; more specifically for Natural Language Processing, providing an annotation taxonomy, annotated datasets and classification models for analyzing gender biased language at scale; for the Gallery, Library, Archives, and Museum sector, offering guidance to institutions seeking to reconcile with histories of marginalizing communities through their documentation practices; and for historians, who utilize cultural heritage documentation to study and interpret the past.
Through a real-world application of the Bias-Aware Methodology in a case study, the thesis illustrates the need to shift away from removing social biases and towards acknowledging them, creating data and models that surface the uncertainty and multiplicity characteristic of human societies.
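The pipeline the abstract describes, manual annotations feeding a text classifier, can be sketched with a tiny Naive Bayes baseline. This is a minimal illustration only: the example descriptions, the two labels, and the Naive Bayes choice are assumptions, not the thesis's actual taxonomy or models.

```python
from collections import Counter, defaultdict
import math

def train(examples):
    """Fit a unigram Naive Bayes model on (text, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, label_counts, word_counts, vocab):
    """Return the most probable label under add-one smoothing."""
    total = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, count in label_counts.items():
        logp = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical annotations in the spirit of archival metadata descriptions.
annotations = [
    ("he was a great man of science", "gendered"),
    ("she served as the first woman fellow", "gendered"),
    ("papers relating to the estate accounts", "not_gendered"),
    ("correspondence concerning library purchases", "not_gendered"),
]
model = train(annotations)
label = predict("letters from a man of the parish", *model)
```

Even this toy version surfaces the abstract's point: the prediction depends entirely on what the annotators counted as "gendered", so the model inherits the subjectivity of the annotation scheme.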

    Dataflow Programming and Acceleration of Computationally-Intensive Algorithms

    The volume of unstructured textual information continues to grow due to recent technological advancements. This has resulted in an exponential growth of information generated in various formats, including blogs, posts, social networking, and enterprise documents. Numerous Enterprise Architecture (EA) documents are also created daily, such as reports, contracts, agreements, frameworks, architecture requirements, designs, and operational guides. The processing and computation of this massive amount of unstructured information necessitate substantial computing capabilities and the implementation of new techniques. It is critical to manage this unstructured information through a centralized knowledge management platform. Knowledge management is the process of managing information within an organization. This involves creating, collecting, organizing, and storing information in a way that makes it easily accessible and usable. The research involved the development of a textual knowledge management system, and two use cases were considered for extracting textual knowledge from documents. The first case study focused on the safety-critical documents of a railway enterprise. Safety is of paramount importance in the railway industry. There are several EA documents including manuals, operational procedures, and technical guidelines that contain critical information. Digitalization of these documents is essential for analysing vast amounts of textual knowledge that exist in these documents to improve the safety and security of railway operations. A case study was conducted between the University of Huddersfield and the Railway Safety Standard Board (RSSB) to analyse EA safety documents using Natural language processing (NLP). A graphical user interface was developed that includes various document processing features such as semantic search, document mapping, text summarization, and visualization of key trends.
For the second case study, open-source data was utilized, and textual knowledge was extracted. Several features were also developed, including kernel distribution, analysis of key trends, and sentiment analysis of words (such as unique, positive, and negative) within the documents. Additionally, a heterogeneous framework was designed using CPU/GPU and FPGAs to analyse the computational performance of document mapping.
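The semantic-search feature mentioned above can be sketched with a TF-IDF bag-of-words baseline. The abstract does not specify the retrieval model, so the TF-IDF approach and the sample "manuals" below are illustrative assumptions only.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """TF-IDF bag-of-words vectors over a small corpus."""
    tokenised = [text.lower().split() for text in texts]
    df = Counter(word for tokens in tokenised for word in set(tokens))
    n = len(texts)
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({w: (c / len(tokens)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse word->weight vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, documents):
    """Index of the document most similar to the query."""
    vectors = tfidf_vectors(documents + [query])
    query_vec, doc_vecs = vectors[-1], vectors[:-1]
    return max(range(len(documents)), key=lambda i: cosine(query_vec, doc_vecs[i]))

# Invented stand-ins for the railway safety documents.
manuals = [
    "procedure for signal maintenance on the east line",
    "guidelines for trackside worker safety equipment",
    "contract terms for rolling stock procurement",
]
best = search("safety equipment for workers", manuals)
```

A production system would add stemming and likely dense embeddings, but the ranking idea, weight rare shared terms heavily and compare by cosine, is the same.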

    Self-supervised learning for transferable representations

    Machine learning has undeniably achieved remarkable advances thanks to large labelled datasets and supervised learning. However, this progress is constrained by the labour-intensive annotation process. It is not feasible to generate extensive labelled datasets for every problem we aim to address. Consequently, there has been a notable shift in recent times toward approaches that solely leverage raw data. Among these, self-supervised learning has emerged as a particularly powerful approach, offering scalability to massive datasets and showcasing considerable potential for effective knowledge transfer. This thesis investigates self-supervised representation learning with a strong focus on computer vision applications. We provide a comprehensive survey of self-supervised methods across various modalities, introducing a taxonomy that categorises them into four distinct families while also highlighting practical considerations for real-world implementation. Our focus thenceforth is on the computer vision modality, where we perform a comprehensive benchmark evaluation of state-of-the-art self-supervised models against many diverse downstream transfer tasks. Our findings reveal that self-supervised models often outperform supervised learning across a spectrum of tasks, albeit with correlations weakening as tasks transition beyond classification, particularly for datasets with distribution shifts. Digging deeper, we investigate the influence of data augmentation on the transferability of contrastive learners, uncovering a trade-off between spatial and appearance-based invariances that generalise to real-world transformations. This begins to explain the differing empirical performances achieved by self-supervised learners on different downstream tasks, and it showcases the advantages of specialised representations produced with tailored augmentation.
Finally, we introduce a novel self-supervised pre-training algorithm for object detection, aligning pre-training with downstream architecture and objectives, leading to reduced localisation errors and improved label efficiency. In conclusion, this thesis contributes a comprehensive understanding of self-supervised representation learning and its role in enabling effective transfer across computer vision tasks.
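The contrastive learners discussed above typically optimise an InfoNCE-style objective: pull two augmented views of the same image together, push views of other images apart. A minimal sketch on toy embeddings follows; the vectors and the temperature value are illustrative, not from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: cross-entropy of picking the positive
    among the positive and all negatives, at the given temperature."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # stabilise the log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

anchor    = [1.0, 0.0]                  # embedding of one augmented view
positive  = [0.9, 0.1]                  # the other view of the same image
negatives = [[0.0, 1.0], [-1.0, 0.0]]   # views of different images
loss = info_nce(anchor, positive, negatives)
```

The trade-off the thesis identifies lives in the augmentations that produce `anchor` and `positive`: aggressive cropping buys spatial invariance, colour jitter buys appearance invariance, and the chosen mix shapes which downstream tasks the representation transfers to.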

    The Mogadishu Effect: America's Failure-Driven Foreign Policy

    The October 1993 Battle of Mogadishu, commonly referred to as "Black Hawk Down," transformed American foreign policy in its wake. One of the largest special operations missions in recent history, the failures in Somalia left not only the United States government and military in shock, but also the American people. After the nation's most elite fighting forces suffered a nearly 50 percent casualty rate at the hands of Somali warlords during what many Americans thought was a humanitarian operation, Congress and the American people erupted in anger. Although the United States has continued to be seen as an overbearing global peacekeeping force in the thirty years since Somalia, the Battle of Mogadishu served as the turning point for a generational foreign policy shift that significantly limited future global intervention because of the overt publicization of the battle's aftermath in the media, domestic and international reactions, and a fear of repeating the same mistakes elsewhere. The first major American loss of life after the Cold War, the battle and the reaction that followed, known as the "Mogadishu effect," forced President Clinton to rethink the United States' role internationally. Clinton and his administration struggled to convince the American people that involvement overseas, especially global peacekeeping, was vital to international order after becoming the world's sole superpower. Congressional hearings, presidential correspondence, government documents, poll results, and numerous media releases across Clinton's presidency mark the distinct shift in American foreign policy that took place after Mogadishu. Although he inherited involvement in the United Nations mission in Somalia from George H.W.
Bush, the failures in Somalia transformed Clinton's humanitarian involvement in Haiti, Bosnia, and Rwanda, tarnishing the remainder of his presidency and shifting expectations of significant American involvement in international peacekeeping after the Cold War.

    Exploring acceptance of autonomous vehicle policies using KeyBERT and SNA: Targeting engineering students

    This study aims to explore user acceptance of Autonomous Vehicle (AV) policies with improved text-mining methods. Recently, South Korean policymakers have viewed the Autonomous Driving Car (ADC) and Autonomous Driving Robot (ADR) as next-generation means of transportation that will reduce the cost of transporting passengers and goods. They support the construction of V2I and V2V communication infrastructures for the ADC and recognize the ADR as equivalent to pedestrians to promote its deployment onto sidewalks. To fill the gap where end-user acceptance of these policies is not well considered, this study applied two text-mining methods to the comments of graduate students in the fields of Industrial, Mechanical, and Electronics-Electrical-Computer engineering. One is Co-occurrence Network Analysis (CNA) based on TF-IWF and the Dice coefficient, and the other is Contextual Semantic Network Analysis (C-SNA) based on both KeyBERT, which extracts keywords that contextually represent the comments, and double cosine similarity. The reason for comparing these approaches is to balance interest not only in the implications for the AV policies but also in the need to apply quality text mining to this research domain. Significantly, the limitation of frequency-based text mining, which does not reflect textual context, and the trade-off of adjusting thresholds in Semantic Network Analysis (SNA) were considered. Comparing the two approaches, the C-SNA provided the information necessary to understand users' voices using fewer nodes and features than the CNA. The users who pre-emptively understood the AV policies based on their engineering literacy and the given texts revealed potential risks of the AV accident policies. This study adds suggestions to manage these risks to support the successful deployment of AVs on public roads.
    Comment: 29 pages with 11 figures
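The Dice-coefficient edge weighting behind the CNA can be sketched directly: two words are linked by dice(a, b) = 2·n(a, b) / (n(a) + n(b)), counted over comments. The sample comments below are invented, and the TF-IWF term weighting the study combines with this is omitted for brevity.

```python
from collections import defaultdict
from itertools import combinations

def dice_edges(comments, min_dice=0.5):
    """Co-occurrence edges weighted by the Dice coefficient:
    dice(a, b) = 2 * n(a and b) / (n(a) + n(b)), counted per comment."""
    occurrences = defaultdict(int)
    cooccurrences = defaultdict(int)
    for comment in comments:
        words = set(comment.lower().split())
        for word in words:
            occurrences[word] += 1
        for a, b in combinations(sorted(words), 2):
            cooccurrences[(a, b)] += 1
    edges = {}
    for (a, b), n_ab in cooccurrences.items():
        dice = 2 * n_ab / (occurrences[a] + occurrences[b])
        if dice >= min_dice:  # the threshold trade-off the study discusses
            edges[(a, b)] = dice
    return edges

# Invented comments standing in for the engineering students' responses.
comments = [
    "autonomous robots share the sidewalk with pedestrians",
    "autonomous robots may collide with pedestrians",
    "infrastructure costs for autonomous driving are high",
]
edges = dice_edges(comments)
```

The `min_dice` cut-off is exactly the SNA threshold trade-off noted in the abstract: raise it and the network loses context, lower it and frequency noise floods in, which is the gap the KeyBERT-based C-SNA is meant to close.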

    The regulation of digital platforms: the case of pagoPA

    How can EU regulation affect innovation? Digital revolution: How big data have changed the world and the legal landscape. The regulation of digital platforms in Europe. Digital revolution: How distributed ledger technologies are changing the world and the legal landscape. Regulation of digital payments: the case of pagoPA.

    Effects of Data Duplication in Pretraining

    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šค๋Œ€ํ•™์› ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธ์Šคํ•™๊ณผ, 2023. 2. ์ด์žฌ์ง„.This paper studies the effect of deduplication in training data on language models, such as BERT (the encoder-based model) and GPT-2 (the decoder-based model). Previous studies focus on memorizing duplicates in the training dataset whereas we perform several experiments with data deduplication. The pretraining data is first clustered by MinhashLSH, a stochastic method for finding near-duplicate documents in large corpus data, and then deduplicated by Jaccard similarity with various threshold values. Then, the models are finetuned with different downstream tasks. The experimental result indicates that GPT-2 works better with the deduplication, whereas BERT works differently depending on the tasks. It is due to the difference in self-supervised learning methods between BERT and GPT-2. The duplicated data may work on BERT as data augmentation through random masking in its data preprocessing stage. Data duplication may introduce biases and lead to overfitting, but the effect depends on the amount of duplicated data. To improve performance, data deduplication with proper granularity is essential in language model training.์ด ์—ฐ๊ตฌ๋Š” BERT(์ธ์ฝ”๋” ๊ธฐ๋ฐ˜ ๋ชจ๋ธ) ๋ฐ GPT-2(๋””์ฝ”๋” ๊ธฐ๋ฐ˜ ๋ชจ๋ธ)์™€ ๊ฐ™์€ ์–ธ์–ด ๋ชจ๋ธ์— ๋Œ€ํ•œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ์ค‘๋ณต ์ œ๊ฑฐ ํšจ๊ณผ๋ฅผ ์ œ์‹œํ•˜๋Š” ๋ฐ ๋ชฉ์ ์ด ์žˆ๋‹ค. ๊ธฐ์กด ์—ฐ๊ตฌ์—์„œ๋Š” ์ƒ์„ฑ ๋ชจ๋ธ์— ํ•œํ•˜์—ฌ ์ค‘๋ณต ์ œ๊ฑฐ์˜ ์ด์ ์„ ๋ฐํ˜”์œผ๋ฉฐ, ๋ชจ๋ธ์ด ์•”๊ธฐ๋œ ํ…์ŠคํŠธ๋ฅผ ๋œ ์ƒ์„ฑํ•˜๊ณ  ๋ชจ๋ธ์˜ ํ›ˆ๋ จ ๋‹จ๊ณ„๊ฐ€ ๋” ์ ๊ฒŒ ํ•„์š”ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•˜์˜€๋‹ค. ์ด์— ๋ง๋ถ™์—ฌ ํ˜„ ์—ฐ๊ตฌ์—์„œ๋Š” ๋ฐ์ดํ„ฐ ์ค‘๋ณต ์ œ๊ฑฐ์— ๋Œ€ํ•ด ๋ช‡ ๊ฐ€์ง€ ์ถ”๊ฐ€์ ์ธ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. 
์‚ฌ์ „ ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” ์šฐ์„  MinhashLSH(๋Œ€๊ทœ๋ชจ ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ์—์„œ ์œ ์‚ฌํ•œ ๋ฌธ์„œ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•œ ํ™•๋ฅ ๋ก ์  ๋ฐฉ๋ฒ•)๋กœ ํด๋Ÿฌ์Šคํ„ฐ๋ง ํ•œ ๋‹ค์Œ, ๋‹ค์–‘ํ•œ ์ž„๊ณ„๊ฐ’์˜ Jaccard ์œ ์‚ฌ์„ฑ์œผ๋กœ ์ค‘๋ณต document๋ฅผ ์ œ๊ฑฐํ•˜๋Š” ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์„ ๊ฑฐ์นœ๋‹ค. ๊ตฌ์„ฑ๋œ ๋ฐ์ดํ„ฐ์…‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์‚ฌ์ „ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๊ณ , ์ดํ›„ ๋‹ค์–‘ํ•œ downstream ์ž‘์—…์— finetuningํ•œ๋‹ค. GPT-2๋Š” ์ค‘๋ณต ์ œ๊ฑฐ๋œ ๋ชจ๋ธ์—์„œ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๋ฐ˜๋ฉด, BERT๋Š” downstream ์ž‘์—…์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค. ์ด๋Š” BERT์™€ GPT-2์˜ self-supervised learning ๋ฐฉ์‹์˜ ์ฐจ์ด ๋•Œ๋ฌธ์ด๋‹ค. BERT์—์„œ๋Š” ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋‹จ๊ณ„์—์„œ ๋žœ๋ค ๋งˆ์Šคํ‚น ๋ฐฉ์‹์„ ํ†ตํ•ด ์ค‘๋ณต๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์˜คํžˆ๋ ค ๋ฐ์ดํ„ฐ augmentation์œผ๋กœ ์ž‘์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ฐ์ดํ„ฐ ์ค‘๋ณต์€ ํŽธํ–ฅ์„ ๋„์ž…ํ•˜๊ณ  ๊ณผ์ ํ•ฉ์œผ๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๊ทธ ํšจ๊ณผ๋Š” ์ค‘๋ณต ๋ฐ์ดํ„ฐ์˜ ์–‘์— ๋”ฐ๋ผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„  ์–ธ์–ด ๋ชจ๋ธ ํ›ˆ๋ จ์—์„œ ์ ์ ˆํ•œ ์ž„๊ณ„๊ฐ’์˜ ๋ฐ์ดํ„ฐ ์ค‘๋ณต ์ œ๊ฑฐ๊ฐ€ ํ•„์ˆ˜์ ์ด๋‹ค.Chapter 1. Introduction ๏ผ‘ 1.1. Study Background ๏ผ‘ 1.2. Purpose of Research ๏ผ“ 1.3. Related Work ๏ผ” Chapter 2. Approach ๏ผ– 2.1. Pretraining Models ๏ผ– 2.2. Pretraining Dataset ๏ผ— 2.3. Near Deduplication ๏ผ— 2.4. Injection of Exact Document Duplication ๏ผ‘๏ผ Chapter 3. Experiments ๏ผ‘๏ผ’ 3.1. Near Deduplication Results ๏ผ‘๏ผ’ 3.2. Duplication Injection Results ๏ผ‘๏ผ” Chapter 4. Conclusion ๏ผ‘๏ผ– 4.1. Discussion and Future work ๏ผ‘๏ผ–์„
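The near-deduplication step (cluster candidates with MinhashLSH, then filter by Jaccard similarity over a threshold) can be illustrated at small scale. The exhaustive pairwise pass below stands in for the LSH bucketing, and the character-shingle size and threshold are illustrative choices, not the thesis's settings.

```python
def shingles(text, k=3):
    """Character k-shingles, the unit MinHash signatures are built from."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| over shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def deduplicate(documents, threshold=0.8):
    """Keep a document only if it stays below the Jaccard threshold
    against every document kept so far. MinhashLSH would first bucket
    likely pairs so this comparison never runs on the full corpus."""
    kept = []
    for document in documents:
        if all(jaccard(document, other) < threshold for other in kept):
            kept.append(document)
    return kept

documents = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",  # near-duplicate
    "pretraining corpora often contain repeated passages",
]
clean = deduplicate(documents)
```

Sweeping `threshold` is the "various threshold values" experiment: a strict threshold removes only exact copies, a loose one also merges paraphrases, and the paper's finding is that the right granularity differs between GPT-2 and BERT.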

    Freebirthing in the UK: A Narrative Study
