    Methods for improving entity linking and exploiting social media messages across crises

    Entity Linking (EL) is the task of automatically identifying entity mentions in texts and resolving them to a corresponding entity in a reference knowledge base (KB). A large number of tools are available for different types of documents and domains; however, the entity linking literature has shown that the quality of a tool varies across corpora and depends on the specific characteristics of the corpus it is applied to. Moreover, a lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real-world applications. In the first part of this thesis I explore an approximation of the difficulty of linking entity mentions and frame it as a supervised classification task. Classifying difficult-to-disambiguate entity mentions can help identify critical cases in a semi-automated system, while detecting latent corpus characteristics that affect entity linking performance. Moreover, despite the large number of entity linking tools proposed over the past years, some work better on short mentions while others perform better when more contextual information is available. To this end, I propose a solution that exploits the results of distinct entity linking tools on the same corpus, leveraging their individual strengths on a per-mention basis. The proposed solution proved effective and outperformed the individual entity linking systems employed in a series of experiments. An important component in the majority of entity linking tools is the probability that a mention links to a particular entity in a reference knowledge base, and this probability is usually computed over a static snapshot of the KB. However, an entity’s popularity is temporally sensitive and may change due to short-term events; these changes may then be reflected in the KB, so EL tools can produce different results for the same mention at different times. I investigated how this prior probability changes over time and how overall disambiguation performance is affected when KB snapshots from different time periods are used. The second part of this thesis is mainly concerned with short texts. Social media has become an integral part of modern society; Twitter, for instance, is one of the most popular social media platforms worldwide and enables people to share their opinions and post short messages about any subject on a daily basis. First, I present an approach based on deep learning techniques for identifying informative messages posted during catastrophic events. Automatically detecting informative messages posted by users during major events allows professionals involved in crisis management to better estimate damages from the relevant information shared on social media channels and to act immediately. I also performed an analysis of Twitter messages posted during the Covid-19 pandemic: I collected 4 million tweets posted in Portuguese since the beginning of the pandemic and analysed the debate around it, using topic modeling, sentiment analysis and hashtag recommendation techniques to provide insights into the online discussion of the Covid-19 pandemic.
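
    Below is a minimal illustrative sketch, not the thesis implementation, of the temporally sensitive prior probability discussed above: p(e | m) is estimated from anchor-text counts in a KB snapshot, and comparing two invented snapshots shows how short-term events can shift the prior that popularity-based disambiguation relies on. All names and data are hypothetical.

    # Sketch: mention-to-entity prior probability from KB snapshots (invented data).
    from collections import Counter, defaultdict

    def build_prior(anchor_links):
        """anchor_links: iterable of (mention_text, entity_id) pairs from one KB snapshot."""
        counts = defaultdict(Counter)
        for mention, entity in anchor_links:
            counts[mention.lower()][entity] += 1
        prior = {}
        for mention, entity_counts in counts.items():
            total = sum(entity_counts.values())
            prior[mention] = {e: c / total for e, c in entity_counts.items()}
        return prior

    # Hypothetical snapshots: the same mention "Corona" links mostly to the beer
    # brand in one snapshot but mostly to the virus in a later one.
    snapshot_2019 = [("Corona", "Corona_(beer)")] * 90 + [("Corona", "Coronavirus")] * 10
    snapshot_2020 = [("Corona", "Corona_(beer)")] * 20 + [("Corona", "Coronavirus")] * 80

    for year, snap in [(2019, snapshot_2019), (2020, snapshot_2020)]:
        p = build_prior(snap)
        print(year, p["corona"])  # the prior for the same mention differs per snapshot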

    A methodology for the resolution of cashtag collisions on Twitter – A natural language processing & data fusion approach

    Investors utilise social media such as Twitter as a means of sharing news surrounding financial stocks listed on international stock exchanges. Company ticker symbols uniquely identify companies listed on stock exchanges and can be embedded within tweets to create clickable hyperlinks referred to as cashtags, allowing investors to associate their tweets with specific companies. The main limitation is that identical ticker symbols are present on exchanges all over the world, so searching for such a cashtag on Twitter returns a stream of tweets matching any company to which the cashtag could refer; we refer to this as a cashtag collision. Colliding cashtags can sow confusion for investors seeking news about a specific company, and resolving them would benefit investors who rely on the speed of tweets for financial information, saving them precious time. We propose a methodology that combines Natural Language Processing and Data Fusion to construct company-specific corpora to aid in the detection and resolution of colliding cashtags, so that tweets can be classified as being related to a specific stock exchange or not. Supervised machine learning classifiers are trained twice on each tweet: once on a count vectorisation of the tweet text, and again with the assistance of features contained in the company-specific corpora. We validate the cashtag collision methodology in an experiment involving companies listed on the London Stock Exchange. Results show that several machine learning classifiers benefit from the use of the custom corpora, yielding higher classification accuracy in the prediction and resolution of colliding cashtags.
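
    The sketch below illustrates, on invented tweets and an invented company-specific term list, the two training setups described above: a supervised classifier trained on a count vectorisation of the tweet text alone, and the same classifier with one additional corpus-derived feature (here a simple token-overlap count). It does not reproduce the paper's actual corpora or feature set.

    # Sketch: cashtag-collision classification with and without corpus features (invented data).
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical labelled tweets: 1 = refers to the LSE-listed company, 0 = collision.
    tweets = [
        "$SMT Scottish Mortgage posts strong NAV growth on the LSE",
        "$SMT great phone deals this week",
        "$SMT trust update from Baillie Gifford investors",
        "$SMT new firmware released for the tablet",
    ]
    labels = [1, 0, 1, 0]

    # Hypothetical company-specific corpus terms for the LSE-listed company.
    corpus_terms = {"scottish", "mortgage", "lse", "trust", "baillie", "gifford", "nav"}

    # Setup 1: count vectorisation of the tweet text only.
    text_clf = make_pipeline(CountVectorizer(), LogisticRegression())
    text_clf.fit(tweets, labels)

    # Setup 2: append a corpus-overlap feature to the count vectors.
    vec = CountVectorizer()
    X_text = vec.fit_transform(tweets).toarray()
    overlap = np.array([[len(set(t.lower().split()) & corpus_terms)] for t in tweets])
    X_full = np.hstack([X_text, overlap])
    full_clf = LogisticRegression().fit(X_full, labels)

    print(text_clf.predict(["$SMT investors cheer the trust's NAV"]))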

    The Pisa Audio-visual Corpus Project: A multimodal approach to ESP research and teaching

    This paper presents an ongoing project sponsored by the University of Pisa Language Centre to compile an audiovisual corpus of specialized types of discourse of particular relevance to ESP learners in higher education. The first phase of the project focuses on collecting digitally available video clips that encode specialized language in a range of genres along an ‘authentic’ to ‘fictional’ continuum. The video clips will be analyzed from a multimodal perspective to determine how various semiotic resources work together to construct meaning. They will then be utilized in the ESP classroom to increase learners’ awareness of the key contribution of different modes in specialized communication. We present some exploratory multimodal analyses performed on video clips that encode instances of political discourse across two genres at the extreme poles of the continuum: a fictional political drama film and an authentic political science lecture.

    Causality Management and Analysis in Requirement Manuscript for Software Designs

    For software design tasks involving natural language, the results of a causal investigation provide valuable and robust semantic information, especially for identifying key variables during product (software) design and product optimization. As the interest in analytical data science shifts from correlations to a better understanding of causality, there is an equally important task of accurately extracting causality from textual artifacts to aid requirement engineering (RE) decisions. This thesis focuses on identifying, extracting, and classifying causal phrases using word and sentence labeling based on the Bi-directional Encoder Representations from Transformers (BERT) deep learning language model and five machine learning models. The aim is to understand the form and degree of causality based on its impact and prevalence in RE practice. Methodologically, our analysis is centered on RE practice, and we considered 12,438 sentences extracted from 50 requirement engineering manuscripts (REM) to train our machine learning models. Our research reports that causal expressions constitute about 32% of the sentences in the REM. We applied four evaluation metrics, namely recall, accuracy, precision, and F1, to assess our models’ performance and ensure that the results conform with our study goal. The highest model accuracy, 85%, was achieved by Naive Bayes. Finally, we note that our causal analytic framework is relevant to practitioners in different roles, for example generating test cases for requirement engineers and software developers and auditing product performance for management stakeholders.
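
    The sketch below illustrates, on invented requirement sentences, the sentence-level causal classification step and its evaluation with the four metrics named above; it uses a Naive Bayes classifier over TF-IDF features and does not reproduce the thesis' BERT-based word labeling or its dataset.

    # Sketch: classifying requirement sentences as causal vs. non-causal (invented data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Hypothetical requirement sentences: 1 = contains a causal expression, 0 = does not.
    train_sents = [
        "If the user enters an invalid password, the system shall lock the account.",
        "The report is generated because the nightly batch job has completed.",
        "The system shall display the main dashboard after login.",
        "All timestamps shall be stored in UTC.",
    ]
    train_labels = [1, 1, 0, 0]

    test_sents = [
        "Since the sensor value exceeds the threshold, an alarm is raised.",
        "The interface shall use the corporate colour scheme.",
    ]
    test_labels = [1, 0]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(train_sents, train_labels)
    pred = clf.predict(test_sents)

    print("accuracy ", accuracy_score(test_labels, pred))
    print("precision", precision_score(test_labels, pred, zero_division=0))
    print("recall   ", recall_score(test_labels, pred, zero_division=0))
    print("F1       ", f1_score(test_labels, pred, zero_division=0))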

    Large Language Models to Identify Social Determinants of Health in Electronic Health Records

    Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected in electronic health records (EHR). This study investigated the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text in improving the extraction of these scarcely documented, yet extremely valuable, clinical data. 800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated. The study also experimented with synthetic data generation and assessed for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL (macro-F1 0.71) for any SDoH, and Flan-T5 XXL (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with smaller Flan-T5 models (base and large) showing the greatest improvements in performance (delta F1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and few-shot performance of ChatGPT-family models for both tasks. These fine-tuned models were less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p<0.05). At the patient level, our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method can effectively extract SDoH information from clinical notes, performing better than GPT zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients who need social support.
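
    The sketch below shows how a fine-tuned Flan-T5 model could be prompted to extract SDoH categories from a clinical note using the Hugging Face transformers library. The checkpoint used here is the public base Flan-T5 standing in for a fine-tuned SDoH checkpoint, and the prompt format and category list are illustrative assumptions, not the paper's exact setup.

    # Sketch: querying a (placeholder) Flan-T5 model for SDoH categories in a note.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "google/flan-t5-base"  # stand-in; replace with a fine-tuned SDoH checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Invented clinical note for illustration only.
    note = ("Patient lives alone, recently lost employment, and reports "
            "difficulty affording transportation to appointments.")
    prompt = (
        "List any social determinants of health mentioned in the note, choosing from: "
        "housing, employment, transportation, social support, none.\n\nNote: " + note
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))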