676 research outputs found

    Improving Cross-Lingual Transfer Learning for Event Detection

    Get PDF
    The widespread adoption of applications powered by Artificial Intelligence (AI) backbones has unquestionably changed the way we interact with the world around us. Applications such as automated personal assistants, automatic question answering, and machine-based translation systems have become mainstays of modern culture thanks to the recent considerable advances in Natural Language Processing (NLP) research. Nonetheless, with over 7000 spoken languages in the world, there still remain a considerable number of marginalized communities that are unable to benefit from these technological advancements largely due to the language they speak. Cross-Lingual Learning (CLL) looks to address this issue by transferring the knowledge acquired from a popular, high-resource source language (e.g., English, Chinese, or Spanish) to a less favored, lower-resourced target language (e.g., Urdu or Swahili). This dissertation leverages the Event Detection (ED) sub-task of Information Extraction (IE) as a testbed and presents three novel approaches that improve cross-lingual transfer learning from distinct perspectives: (1) direct knowledge transfer, (2) hybrid knowledge transfer, and (3) few-shot learning

    Less is More: Restricted Representations for Better Interpretability and Generalizability

    Get PDF
    Deep neural networks are prevalent in supervised learning for large amounts of tasks such as image classification, machine translation and even scientific discovery. Their success is often at the sacrifice of interpretability and generalizability. The increasing complexity of models and involvement of the pre-training process make the inexplicability more imminent. The outstanding performance when labeled data are abundant while prone to overfit when labeled data are limited demonstrates the difficulty of deep neural networks' generalizability to different datasets. This thesis aims to improve interpretability and generalizability by restricting representations. We choose to approach interpretability by focusing on attribution analysis to understand which features contribute to prediction on BERT, and to approach generalizability by focusing on effective methods in a low-data regime. We consider two strategies of restricting representations: (1) adding bottleneck, and (2) introducing compression. Given input x, suppose we want to learn y with the latent representation z (i.e. x→z→y), adding bottleneck means adding function R such that L(R(z)) < L(z) and introducing compression means adding function R so that L(R(y)) < L(y) where L refers to the number of bits. In other words, the restriction is added either in the middle of the pipeline or at the end of it. We first introduce how adding information bottleneck can help attribution analysis and apply it to investigate BERT's behavior on text classification in Chapter 3. We then extend this attribution method to analyze passage reranking in Chapter 4, where we conduct a detailed analysis to understand cross-layer and cross-passage behavior. Adding bottleneck can not only provide insight to understand deep neural networks but can also be used to increase generalizability. In Chapter 5, we demonstrate the equivalence between adding bottleneck and doing neural compression. We then leverage this finding with a framework called Non-Parametric learning by Compression with Latent Variables (NPC-LV), and show how optimizing neural compressors can be used in the non-parametric image classification with few labeled data. To further investigate how compression alone helps non-parametric learning without latent variables (NPC), we carry out experiments with a universal compressor gzip on text classification in Chapter 6. In Chapter 7, we elucidate methods of adopting the perspective of doing compression but without the actual process of compression using T5. Using experimental results in passage reranking, we show that our method is highly effective in a low-data regime when only one thousand query-passage pairs are available. In addition to the weakly supervised scenario, we also extend our method to large language models like GPT under almost no supervision --- in one-shot and zero-shot settings. The experiments show that without extra parameters or in-context learning, GPT can be used for semantic similarity, text classification, and text ranking and outperform strong baselines, which is presented in Chapter 8. The thesis proposes to tackle two big challenges in machine learning --- "interpretability" and "generalizability" through restricting representation. We provide both theoretical derivation and empirical results to show the effectiveness of using information-theoretic approaches. We not only design new algorithms but also provide numerous insights on why and how "compression" is so important in understanding deep neural networks and improving generalizability

    Chinese Fine-Grained Financial Sentiment Analysis with Large Language Models

    Full text link
    Entity-level fine-grained sentiment analysis in the financial domain is a crucial subtask of sentiment analysis and currently faces numerous challenges. The primary challenge stems from the lack of high-quality and large-scale annotated corpora specifically designed for financial text sentiment analysis, which in turn limits the availability of data necessary for developing effective text processing techniques. Recent advancements in large language models (LLMs) have yielded remarkable performance in natural language processing tasks, primarily centered around language pattern matching. In this paper, we propose a novel and extensive Chinese fine-grained financial sentiment analysis dataset, FinChina SA, for enterprise early warning. We thoroughly evaluate and experiment with well-known existing open-source LLMs using our dataset. We firmly believe that our dataset will serve as a valuable resource to advance the exploration of real-world financial sentiment analysis tasks, which should be the focus of future research. The FinChina SA dataset is publicly available at https://github.com/YerayL/FinChina-SAComment: FinLLM Symposium at IJCAI 202

    An exploration of the attitudes and beliefs of teacher trainers and teacher trainees concerning the use of the L1 in the EFL classroom

    Get PDF
    One important conflict within English language teaching methodology is concerning the use or exclusion of learners’ first languages (L1) when learning English. Perspectives on the topic range from those in favour of complete avoidance of the L1 in the EFL classroom, constantly striving for an exclusively L2 classroom to those who believe in the value and learning benefit of allowing and, to some extent, encouraging the use of all manner of languages available to the learner. This thesis conducted interviews and surveys in order to provide an in-depth exploration of the attitudes and beliefs of teacher trainers and teacher trainees in North Rhine Westphalia concerning the use of the L1, as well as other potential languages, in the English language classroom. Although the two groups of participants held many similar attitudes and beliefs concerning L1 use, some significant and interesting differences were found. Teacher trainees showed themselves to be more open concerning the use of the L1 than their more experienced counterparts. It remains, however, unclear what exactly the reason for these differences is. A further aspect which became apparent is how the pressures of language choice and of exclusive L2 instruction in the EFL classroom during observed and examination lessons is felt by teacher trainees. This is potentially adding to the overall burden of the teacher training period in NRW. The thesis concludes that an increase in evidence-based teacher education, concerning not only the aspect of L1 use in the EFL classroom but also many other aspects of language teaching could be prudent in the continued development of well-informed best-practice approaches. This thesis holds the standpoint that complete eradication of the L1 in the EFL classroom is counterproductive to successful language learning. Judicious use of the L1 and the development of a more plurilingusitic attitude to language learning, enabling learners to make use of any available linguistic resources, can offer both learners and teachers helpful scaffolding which can facilitate the successful learning of further languages

    Bias and Fairness in Large Language Models: A Survey

    Full text link
    Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, namely metrics and datasets, and one for mitigation. Our first taxonomy of metrics for bias evaluation disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. Our second taxonomy of datasets for bias evaluation categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly-available datasets for improved access. Our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing, with granular subcategories that elucidate research trends. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in LLMs

    Current Challenges in the Application of Algorithms in Multi-institutional Clinical Settings

    Get PDF
    The Coronavirus disease pandemic has highlighted the importance of artificial intelligence in multi-institutional clinical settings. Particularly in situations where the healthcare system is overloaded, and a lot of data is generated, artificial intelligence has great potential to provide automated solutions and to unlock the untapped potential of acquired data. This includes the areas of care, logistics, and diagnosis. For example, automated decision support applications could tremendously help physicians in their daily clinical routine. Especially in radiology and oncology, the exponential growth of imaging data, triggered by a rising number of patients, leads to a permanent overload of the healthcare system, making the use of artificial intelligence inevitable. However, the efficient and advantageous application of artificial intelligence in multi-institutional clinical settings faces several challenges, such as accountability and regulation hurdles, implementation challenges, and fairness considerations. This work focuses on the implementation challenges, which include the following questions: How to ensure well-curated and standardized data, how do algorithms from other domains perform on multi-institutional medical datasets, and how to train more robust and generalizable models? Also, questions of how to interpret results and whether there exist correlations between the performance of the models and the characteristics of the underlying data are part of the work. Therefore, besides presenting a technical solution for manual data annotation and tagging for medical images, a real-world federated learning implementation for image segmentation is introduced. Experiments on a multi-institutional prostate magnetic resonance imaging dataset showcase that models trained by federated learning can achieve similar performance to training on pooled data. Furthermore, Natural Language Processing algorithms with the tasks of semantic textual similarity, text classification, and text summarization are applied to multi-institutional, structured and free-text, oncology reports. The results show that performance gains are achieved by customizing state-of-the-art algorithms to the peculiarities of the medical datasets, such as the occurrence of medications, numbers, or dates. In addition, performance influences are observed depending on the characteristics of the data, such as lexical complexity. The generated results, human baselines, and retrospective human evaluations demonstrate that artificial intelligence algorithms have great potential for use in clinical settings. However, due to the difficulty of processing domain-specific data, there still exists a performance gap between the algorithms and the medical experts. In the future, it is therefore essential to improve the interoperability and standardization of data, as well as to continue working on algorithms to perform well on medical, possibly, domain-shifted data from multiple clinical centers

    FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

    Full text link
    Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs) where there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language agnostic BERT-based approach, it is an efficient solution to increase low-resource corpora with few human efforts and by only using already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to evaluating the quality and effectiveness of semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2'051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French natural language processing (NLP) applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French

    Novel Methods for Natural Language Modeling and Pretraining

    Get PDF
    This thesis is about modeling language sequences to achieve lower perplexity, better generation, and benefit downstream language tasks; specifically, this thesis addresses the importance of natural language features including the segmentation feature, lexical feature, and alignment feature. We present three new techniques that improve language sequence modeling with different language features. 1. Segment-Aware Language Modeling is a novel model architecture leveraging the text segementation feature for text sequence modeling. It encodes richer positional information for language modeling, by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token. By applying our approach to Transformer-XL, we train a new language model, Segatron-XL, that achieves a 6.6-7.8% relative reduction in perplexity. Additionally, BERT pretrained with our method -- SegaBERT -- outperforms BERT on general language understanding, sentence representation learning, and machine reading comprehension tasks. Furthermore, our SegaBERT-large model outperforms RoBERTa-large on zero-shot STS tasks. These experimental results demonstrate that our proposed Segatron works on both language models with relative position embeddings and pretrained language models with absolute position embeddings. 2. Hypernym-Instructed Language Modeling is a novel training method leveraging the lexical feature for rare word modeling. It maps words that have a common WordNet hypernym to the same class and trains large neural LMs by gradually annealing from predicting the class to token prediction during training. Class-based prediction leads to an implicit context aggregation for similar words and thus can improve generalization for rare words. Empirically, this curriculum learning strategy consistently reduces perplexity over various large, highly-performant state-of-the-art Transformer-based models on two datasets, WikiText-103 and ArXiv. Our analysis shows that the performance improvement is achieved without sacrificing performance on rare words. 3. Alignment-Aware Acoustic and Text Modeling is a novel pretraining method leveraging both the segmentation and alignment features for text-speech sequence modeling. It reconstructs masked acoustic signals with text input and acoustic-text alignment during training. In this way, the pretrained model can generate high quality of reconstructed spectrogram, which can be applied to the speech editing and new speaker TTS directly. Experiments show A3T outperforms SOTA models on speech editing and improves multi-speaker speech synthesis without the external speaker verification model

    NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

    Full text link
    Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the \datasetname{} benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical experiment results using existing multilingual large language models conclude the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes

    Semantic similarity prediction is better than other semantic similarity measures

    Full text link
    Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the STS-B from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.Comment: Under revie
    • …
    corecore