1,792 research outputs found

    Recalibrating machine learning for social biases: demonstrating a new methodology through a case study classifying gender biases in archival documentation

    Get PDF
    This thesis proposes a recalibration of Machine Learning for social biases to minimize harms from existing approaches and practices in the field. Prioritizing quality over quantity, accuracy over efficiency, representativeness over convenience, and situated thinking over universal thinking, the thesis demonstrates an alternative approach to creating Machine Learning models. Drawing on GLAM, the Humanities, the Social Sciences, and Design, the thesis focuses on understanding and communicating biases in a specific use case. 11,888 metadata descriptions from the University of Edinburgh Heritage Collections' Archives catalog were manually annotated for gender biases, and text classification models were then trained on the resulting dataset of 55,260 annotations. Evaluations of the models' performance demonstrate that annotating gender biases can be automated; however, the subjectivity of bias as a concept complicates the generalizability of any one approach. The contributions are: (1) an interdisciplinary and participatory Bias-Aware Methodology, (2) a Taxonomy of Gendered and Gender Biased Language, (3) data annotated for gender biased language, (4) gender biased text classification models, and (5) a human-centered approach to model evaluation. The contributions have implications for Machine Learning, demonstrating how bias is inherent to all data and models; more specifically for Natural Language Processing, providing an annotation taxonomy, annotated datasets and classification models for analyzing gender biased language at scale; for the Gallery, Library, Archives, and Museum sector, offering guidance to institutions seeking to reconcile with histories of marginalizing communities through their documentation practices; and for historians, who utilize cultural heritage documentation to study and interpret the past.
Through a real-world application of the Bias-Aware Methodology in a case study, the thesis illustrates the need to shift away from removing social biases and towards acknowledging them, creating data and models that surface the uncertainty and multiplicity characteristic of human societies.
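As a minimal illustration of the training step this abstract describes (text classification over manually annotated catalog descriptions), the sketch below trains a bag-of-words Naive Bayes classifier on a handful of invented examples. The example texts, labels, and the two class names are hypothetical; the thesis's actual taxonomy and models are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and strip simple punctuation from whitespace-separated tokens."""
    return [t for t in (w.strip(".,;:()").lower() for w in text.split()) if t]

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counter in self.word_counts.values() for w in counter}
        return self

    def predict(self, text):
        def log_score(label):
            prior = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            return prior + sum(
                math.log((self.word_counts[label][w] + 1) / denom)
                for w in tokenize(text) if w in self.vocab
            )
        return max(self.label_counts, key=log_score)

# Invented stand-ins for annotated archival metadata descriptions.
texts = [
    "wife of the colonel",
    "wife and daughter of the reverend",
    "papers relating to estates",
    "letters concerning property",
]
labels = ["gendered", "gendered", "neutral", "neutral"]
model = NaiveBayes().fit(texts, labels)
```

In the thesis's setting each description would carry annotations from the Taxonomy of Gendered and Gender Biased Language rather than these two toy classes.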

    Dataflow Programming and Acceleration of Computationally-Intensive Algorithms

    Get PDF
    The volume of unstructured textual information continues to grow due to recent technological advancements. This has resulted in an exponential growth of information generated in various formats, including blogs, posts, social networking, and enterprise documents. Numerous Enterprise Architecture (EA) documents are also created daily, such as reports, contracts, agreements, frameworks, architecture requirements, designs, and operational guides. The processing and computation of this massive amount of unstructured information necessitate substantial computing capabilities and the implementation of new techniques. It is critical to manage this unstructured information through a centralized knowledge management platform. Knowledge management is the process of managing information within an organization. This involves creating, collecting, organizing, and storing information in a way that makes it easily accessible and usable. The research involved the development of a textual knowledge management system, and two use cases were considered for extracting textual knowledge from documents. The first case study focused on the safety-critical documents of a railway enterprise. Safety is of paramount importance in the railway industry. There are several EA documents, including manuals, operational procedures, and technical guidelines, that contain critical information. Digitalization of these documents is essential for analysing the vast amount of textual knowledge that exists in them to improve the safety and security of railway operations. A case study was conducted between the University of Huddersfield and the Rail Safety and Standards Board (RSSB) to analyse EA safety documents using Natural Language Processing (NLP). A graphical user interface was developed that includes various document processing features such as semantic search, document mapping, text summarization, and visualization of key trends.
For the second case study, open-source data was utilized, and textual knowledge was extracted. Several features were also developed, including kernel distribution, analysis of key trends, and sentiment analysis of words (such as unique, positive, and negative) within the documents. Additionally, a heterogeneous framework was designed using CPUs/GPUs and FPGAs to analyse the computational performance of document mapping.
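The word-level sentiment feature mentioned above (counts of unique, positive, and negative words per document) can be sketched as follows. The lexicons here are tiny invented placeholders, not the resources used in the project.

```python
# Hypothetical sentiment lexicons; a real system would load a full lexicon resource.
POSITIVE_WORDS = {"safe", "reliable", "improved", "effective", "robust"}
NEGATIVE_WORDS = {"hazard", "failure", "risk", "defect", "delay"}

def word_sentiment_profile(document):
    """Count unique, positive, and negative words in one document."""
    words = [w.strip(".,;:()").lower() for w in document.split()]
    words = [w for w in words if w]
    return {
        "unique": len(set(words)),
        "positive": sum(w in POSITIVE_WORDS for w in words),
        "negative": sum(w in NEGATIVE_WORDS for w in words),
    }

profile = word_sentiment_profile("The improved braking system reduces risk of failure.")
```

Profiles like this, computed per document, give the kind of per-document trend statistics the GUI visualizes.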

    Predicting Paid Certification in Massive Open Online Courses

    Get PDF
    Massive open online courses (MOOCs) have been proliferating because of the free or low-cost offering of content for learners, attracting the attention of many stakeholders across the entire educational landscape. Since 2012, coined “the Year of the MOOCs”, several platforms have gathered millions of learners in just a decade. Nevertheless, the certification rate of both free and paid courses has been low: only about 4.5–13% and 1–3%, respectively, of the total number of enrolled learners obtain a certificate at the end of their courses. Still, most research concentrates on completion, ignoring the certification problem, and especially its financial aspects. Thus, the research described in the present thesis aimed to investigate paid certification in MOOCs, for the first time, in a comprehensive way, and as early as the first week of the course, by exploring its various levels. First, the latent correlation between learner activities and their paid certification decisions was examined by (1) statistically comparing the activities of non-paying learners with those of course purchasers and (2) predicting paid certification using different machine learning (ML) techniques. Our temporal (weekly) analysis showed statistical significance at various levels when comparing the activities of non-paying learners with those of the certificate purchasers across the five courses analysed. Furthermore, we used the learners’ activities (number of step accesses, attempts, correct and wrong answers, and time spent on learning steps) to build our paid certification predictor, which achieved promising balanced accuracies (BAs), ranging from 0.77 to 0.95. Having employed simple predictions based on a few clickstream variables, we then analysed in more depth what other information can be extracted from MOOC interaction (namely discussion forums) for paid certification prediction.
However, to better explore the learners’ discussion forums, we built, as an original contribution, MOOCSent, a cross-platform review-based sentiment classifier, using over 1.2 million MOOC sentiment-labelled reviews. MOOCSent addresses various limitations of current sentiment classifiers, including (1) reliance on a single source of data (previous literature on sentiment classification in MOOCs was based on single platforms only, and hence less generalisable, with a relatively low number of instances compared to our obtained dataset); (2) limited model outputs, where most current models are based on a 2-polar classifier (positive or negative only); (3) disregarding important sentiment indicators, such as emojis and emoticons, during text embedding; and (4) reporting average performance metrics only, preventing the evaluation of model performance at the level of class (sentiment). Finally, and with the help of MOOCSent, we used the learners’ discussion forums to predict paid certification after annotating learners’ comments and replies with sentiment using MOOCSent. This multi-input model combines raw data (learner textual inputs), sentiment classifications generated by MOOCSent, computed features (number of likes received for each textual input), and several features extracted from the texts (character counts, word counts, and part-of-speech (POS) tags for each textual instance). This experiment adopted various deep predictive approaches, specifically those that allow a multi-input architecture, to investigate early (i.e., weekly) whether data obtained from MOOC learners’ interaction in discussion forums can predict learners’ purchase decisions (certification).
Considering the staggeringly low rate of paid certification in MOOCs, the present thesis contributes to the field of MOOC learner analytics by predicting paid certification, for the first time, at a comprehensive (with data from over 200 thousand learners from 5 courses in different disciplines), actionable (analysing learners’ decisions from the first week of the course) and longitudinal (with 23 runs from 2013 to 2017) scale. The present thesis contributes by (1) investigating various conventional and deep ML approaches for predicting paid certification in MOOCs using learner clickstreams (Chapter 5) and course discussion forums (Chapter 7); (2) building the largest MOOC sentiment classifier (MOOCSent), based on learners’ reviews of courses from the leading MOOC platforms, namely Coursera, FutureLearn and Udemy, which handles emojis and emoticons using dedicated lexicons containing over three thousand corresponding explanatory words/phrases; and (3) proposing and developing, for the first time, a multi-input model for predicting certification based on data from discussion forums, which synchronously processes the textual (comments and replies) and numerical (number of likes posted and received, sentiments) data from the forums, adapting a suitable classifier for each type of data, as explained in detail in Chapter 7.
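The balanced accuracy (BA) figures quoted above average recall over the classes, so a predictor cannot score well simply by always guessing the majority (non-paying) class. A minimal sketch of the metric, with illustrative data of our own:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; robust to class imbalance."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)

# With 4 non-payers and 1 purchaser, always predicting "non-payer" gives
# 80% plain accuracy but only 0.5 balanced accuracy.
y_true = ["non-payer"] * 4 + ["purchaser"]
y_pred = ["non-payer"] * 5
```

This is why a BA of 0.77–0.95 is informative even though certificate purchasers are a small minority of learners.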

    A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks

    Full text link
    The transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data. Unlike conventional neural networks or updated versions of Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM), transformer models excel in handling long dependencies between input sequence elements and enable parallel processing. As a result, transformer-based models have attracted substantial interest among researchers in the field of artificial intelligence. This can be attributed to their immense potential and remarkable achievements, not only in Natural Language Processing (NLP) tasks but also in a wide range of domains, including computer vision, audio and speech processing, healthcare, and the Internet of Things (IoT). Although several survey papers have been published highlighting the transformer's contributions in specific fields, architectural differences, or performance evaluations, there is still a significant absence of a comprehensive survey paper encompassing its major applications across various domains. Therefore, we undertook the task of filling this gap by conducting an extensive survey of proposed transformer models from 2017 to 2022. Our survey encompasses the identification of the top five application domains for transformer-based models, namely: NLP, Computer Vision, Multi-Modality, Audio and Speech Processing, and Signal Processing. We analyze the impact of highly influential transformer-based models in these domains and subsequently classify them based on their respective tasks using a proposed taxonomy. Our aim is to shed light on the existing potential and future possibilities of transformers for enthusiastic researchers, thus contributing to the broader understanding of this groundbreaking technology.
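The self-attention mechanism at the heart of the transformer can be sketched in a few lines of NumPy. This is a generic single-head, scaled dot-product illustration with variable names of our own choosing, not code from the survey:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) input embeddings.
    Wq, Wk, Wv: projection matrices for queries, keys, and values.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # pairwise relevance of positions
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights

# Every output position mixes information from all input positions at once,
# which is what gives transformers their parallelism and long-range reach.
X = np.random.default_rng(0).normal(size=(4, 8))
Wq, Wk, Wv = (np.random.default_rng(i).normal(size=(8, 8)) for i in (1, 2, 3))
output, attn = self_attention(X, Wq, Wk, Wv)
```

Each row of `attn` is a probability distribution over the input positions, which is the "contextual relationship" the abstract refers to.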

    Specialized translation at work for a small expanding company: my experience of internationalization into Chinese for Bioretics© S.r.l.

    Get PDF
    Global markets are currently immersed in two all-encompassing and unstoppable processes: internationalization and globalization. While the former pushes companies to look beyond the borders of their country of origin to forge relationships with foreign trading partners, the latter fosters standardization across countries by reducing spatiotemporal distances and breaking down geographical, political, economic and socio-cultural barriers. In recent decades, another domain has emerged to propel these unifying drives: Artificial Intelligence, together with its advanced technologies aiming to implement human cognitive abilities in machinery. The “Language Toolkit – Le lingue straniere al servizio dell’internazionalizzazione dell’impresa” project, promoted by the Department of Interpreting and Translation (Forlì Campus) in collaboration with the Romagna Chamber of Commerce (Forlì-Cesena and Rimini), seeks to help Italian SMEs make their way into the global market. It is precisely within this project that this dissertation was conceived. Indeed, its purpose is to present the translation and localization project from English into Chinese of a series of texts produced by Bioretics© S.r.l.: an investor deck, the company website and part of the installation and use manual of the Aliquis© framework software, its flagship product. This dissertation is structured as follows: Chapter 1 presents the project and the company in detail; Chapter 2 outlines the internationalization and globalization processes and the Artificial Intelligence market both in Italy and in China; Chapter 3 provides the theoretical foundations for every aspect related to Specialized Translation, including website localization; Chapter 4 describes the resources and tools used to perform the translations; Chapter 5 proposes an analysis of the source texts; Chapter 6 is a commentary on translation strategies and choices.

    Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

    Full text link
    The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures. Comment: 50 pages plus 11 pages of references and 23 pages of appendices.

    A study on island-region development methods applying text mining: the cases of Bali and Lombok, Indonesia

    Get PDF
    Master's thesis, Seoul National University, Graduate School of International Studies, Department of International Studies (International Area Studies major), February 2023. Eun Ki-Soo. Islands are prime destinations for international travelers motivated to experience an explorative, exotic island lifestyle. People's preference for island destinations greatly increased during the COVID-19 pandemic, over busy and crowded landmarks or tourist attractions in the centers of big cities. Not all islands, however, attract tourists, as most islands inherently share similar natural endowments, including beaches and marine ecosystems. Due to the wide spectrum of maturity in the services and amenities of each island's tourism industry, there is ceaseless competition even between islands with similar geographic conditions. This research focuses on investigating the key determinants that account for the prominent differences in the size, maturity and popularity of the tourism sectors of Bali and Lombok, through an in-depth analysis of the differences in the socio-religious contexts of the two regions. Adopting John Stuart Mill's Method of Difference as a framework, the study interpreted the social and cultural fabric of the target islands, borrowing local terminology and values (Agama, Adat, Dinas) and applying the anthropology-derived emic technique. A two-way methodology was employed for in-depth analysis of the sociocultural context of the target islands: the first part, referred to as "Historical Analysis," categorizes and links the historical events that affected the social structures of the target destinations; the second, known as "Empirical Analysis," uses text mining of visitor-review big-data sets to examine whether the interpreted dynamics of social structures are also reflected at actual tourism sites. This analysis led the researcher to identify the dynamics of religion (agama) and norms (adat/dinas) as the determinant that set the socio-religious structures of the two islands on different paths.
In conclusion, this research demonstrates that understanding the key elements determining social structure, through Historical Analysis based on the concepts of Agama, Adat and Dinas and Empirical Analysis of big data in the tourism sector, can suggest meaningful strategic implications for researching and developing areas of unique religious and social structure and cultural diversity, particularly island destinations. The purpose of this study is to present an effective approach to regional development, centered on the tourism industry, for island regions whose essential attribute is socio-religious originality. By analysing in depth the differences in the socio-religious contexts of Bali and Lombok, two Indonesian islands with similar geographic conditions and historical backgrounds but differing scale and maturity in their tourism industries, the study aims to make this analytical methodology useful in formulating and implementing island development policies. To identify the root causes of the differences between the tourism industries of Bali and Lombok, the study drew on John Stuart Mill's Method of Difference and the emic approach of comparative cultural anthropology, analysing and interpreting the religious, social and cultural contexts formed differently in Bali and Lombok as they responded to major historical events in Indonesia from the seventh century to the present, and verifying the reality of these differences through big-data text mining. The historical research showed that the main cause of the difference in tourism development between Bali and Lombok lies in the differing openness of social norms and networks, shaped by the different definitions of, and relations between, agama and adat within the two islands, which in turn affected investment and the acceptance of outsiders. The reality of this difference was then corroborated by keyword-frequency and visualization, word co-occurrence and correspondence analyses of big data containing the itinerary information and experience evaluations left by visitors on TripAdvisor, the world's largest travel information site. These findings can be applied to the study of diverse island regions whose geographic isolation gives them unique religious, social and cultural contexts; in particular, when the concepts of agama, adat and dinas treated here are considered as key variables in identifying the determinants of a region's social structure, more effective and meaningful regional development and tourism policies can be formulated.
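The word co-occurrence analysis applied to the visitor reviews can be illustrated with a short sketch; the reviews below are invented examples, not the TripAdvisor data used in the thesis.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(reviews):
    """Count how often each pair of distinct words appears in the same review."""
    counts = Counter()
    for review in reviews:
        words = sorted(set(review.lower().split()))  # one vote per review per pair
        counts.update(combinations(words, 2))
    return counts

reviews = [
    "beautiful beach near the temple",
    "temple ceremony at sunrise",
    "quiet beach sunset",
]
pair_counts = cooccurrence_counts(reviews)
```

Pair counts like these form the edge weights of the word co-occurrence network; the correspondence analysis then compares how such word profiles differ between the two islands' reviews.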

    Referring to discourse participants in Ibero-Romance languages

    Get PDF
    Synopsis: This volume brings together contributions by researchers focusing on personal pronouns in Ibero-Romance languages, going beyond the well-established variable of expressed vs. non-expressed subjects. While factors such as agreement morphology, topic shift and contrast or emphasis have been argued to account for variable subject expression, several corpus studies on Ibero-Romance languages have shown that the expression of subject pronouns goes beyond these traditionally established factors and is also subject to considerable dialectal variation. One of the factors affecting the choice and expression of personal pronouns or other referential devices is whether the construction is used personally or impersonally. The use and emergence of new impersonal constructions, eventually also new (im)personal pronouns, as well as the variation found in the expression of human impersonality in different Ibero-Romance language varieties, is another interesting research area that has gained ground in recent years. In addition to variable subject expression, similar methods and theoretical approaches have been applied to study the expression of objects. Finally, the reference to the addressee(s) using different address pronouns and other address forms is an important field of study that is closely connected to the variable expression of pronouns. The present book sheds light on all these aspects of reference to discourse participants. The volume contains contributions with a strong empirical background and various methods and both written and spoken corpus data from Ibero-Romance languages. The focus on discourse participants highlights the special properties of first and second person referents and the factors affecting them, which are often different from those of the anaphoric third person. The chapters are organized into three thematic sections: (i) Variable expression of subjects and objects, (ii) Between personal and impersonal, and (iii) Reference to the addressee.

    Predicate Matrix: an interoperable lexical knowledge base for predicates

    Get PDF
    183 p. The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, including FrameNet, VerbNet, PropBank and WordNet. The Predicate Matrix provides an extensive and robust lexicon that improves the interoperability among the semantic resources mentioned above. Its creation is based on the integration of SemLink and new mappings obtained using automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis in multiple languages.

    GENDER, HUMAN RIGHTS AND EDUCATION IN AFRICA

    Get PDF
    Proceedings of the 2023 International Conference of the Association for the Promotion of African Studies (APAS) held at the University of Nigeria Nsukka on 24th - 27th Ma