
    ABDN at SemEval-2018 Task 10: recognising discriminative attributes using context embeddings and WordNet

    This paper describes the system that we submitted for SemEval-2018 Task 10: capturing discriminative attributes. Our system is built on a simple idea: measuring the attribute word's similarity to each of the two semantically similar words, based on an extended word-embedding method and WordNet. Instead of computing the similarities between the attribute and the semantically similar words with standard word embeddings, we propose a novel method that combines word and context embeddings to better measure similarity. Our model is simple and effective, achieving an average F1 score of 0.62 on the test set.
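The combination scheme described above can be sketched with toy vectors. A minimal stdlib illustration, assuming concatenation as the combination and a similarity margin as the decision rule (the vectors, vocabulary, and margin below are hypothetical, not taken from the paper):

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def combined(word_vec, ctx_vec):
    # one simple way to combine word and context embeddings: concatenation
    return word_vec + ctx_vec

# toy 2-dimensional embeddings (hypothetical values)
word = {"banana": [0.9, 0.1], "apple": [0.2, 0.8], "yellow": [0.8, 0.2]}
ctx  = {"banana": [0.7, 0.3], "apple": [0.1, 0.9], "yellow": [0.6, 0.4]}

def is_discriminative(w1, w2, attr, margin=0.1):
    # the attribute discriminates w1 from w2 when it is clearly
    # more similar to w1 than to w2 under the combined embeddings
    s1 = cosine(combined(word[w1], ctx[w1]), combined(word[attr], ctx[attr]))
    s2 = cosine(combined(word[w2], ctx[w2]), combined(word[attr], ctx[attr]))
    return (s1 - s2) > margin

print(is_discriminative("banana", "apple", "yellow"))  # True with these toy vectors
```

With these vectors, "yellow" sits close to "banana" and far from "apple", so it is flagged as discriminative; the paper's actual combination method and thresholds may differ.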

    How we do things with words: Analyzing text as social and cultural data

    In this article we describe our experiences with computational text analysis. We hope to achieve three primary goals. First, we aim to shed light on thorny issues that are not always at the forefront of discussions about computational text analysis methods. Second, we hope to provide a set of best practices for working with thick social and cultural concepts. Our guidance is based on our own experiences and is therefore inherently imperfect. Still, given our diversity of disciplinary backgrounds and research practices, we hope to capture a range of ideas and identify commonalities that will resonate with many. And this leads to our final goal: to help promote interdisciplinary collaborations. Interdisciplinary insights and partnerships are essential for realizing the full potential of any computational text analysis that involves social and cultural concepts, and the more we are able to bridge these divides, the more fruitful we believe our work will be.

    Automatic Extraction of Adverse Drug Reactions from Summary of Product Characteristics

    The summary of product characteristics from the European Medicines Agency is a reference document on medicines in the EU. It contains textual information for clinical experts on how to use medicines safely, including adverse drug reactions. Using natural language processing (NLP) techniques to automatically extract adverse drug reactions from such unstructured text helps clinical experts use it effectively and efficiently in daily practice. Such techniques have been developed for Structured Product Labels from the Food and Drug Administration (FDA), but no research has focused on extraction from the summary of product characteristics. In this work, we built an NLP pipeline that automatically scrapes summaries of product characteristics online and then extracts adverse drug reactions from them. In addition, we have made the method and its output publicly available so that they can be reused and further evaluated in clinical practice. In total, we extracted 32,797 common adverse drug reactions for 647 common medicines scraped from the Electronic Medicines Compendium. A manual review of 37 commonly used medicines indicated good performance, with a recall and precision of 0.99 and 0.934, respectively.
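A minimal sketch of the extraction step, assuming the reactions are listed under the SmPC section "4.8 Undesirable effects" (the excerpt and the parsing rules below are hypothetical stand-ins for the paper's pipeline, which works on documents scraped from the Electronic Medicines Compendium):

```python
import re

# hypothetical SmPC excerpt for illustration only
smpc_text = """
4.7 Effects on ability to drive
No studies performed.
4.8 Undesirable effects
Common: headache, nausea, dizziness.
Rare: rash.
4.9 Overdose
...
"""

def extract_adrs(text):
    # isolate section 4.8 (Undesirable effects), which ends where 4.9 begins
    m = re.search(r"4\.8 Undesirable effects\n(.*?)\n4\.9", text, re.S)
    if not m:
        return []
    adrs = []
    for line in m.group(1).splitlines():
        # drop the frequency prefix ("Common:", "Rare:"), then split the terms
        line = re.sub(r"^\w+:\s*", "", line).rstrip(".")
        adrs.extend(t.strip() for t in line.split(",") if t.strip())
    return adrs

print(extract_adrs(smpc_text))  # ['headache', 'nausea', 'dizziness', 'rash']
```

Real SmPC documents use richer layouts (tables, MedDRA system-organ classes), so a production pipeline needs considerably more robust section and term handling than this sketch.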

    Text Mining Business Policy Documents: Applied Data Science in Finance

    In a time when the employment of natural language processing techniques in domains such as biomedicine, national security, finance, and law is flourishing, this study takes a deep look at their application to policy documents. Besides providing an overview of the current state of the literature on these concepts, the authors implement a set of natural language processing techniques on internal bank policies. The implementation of these techniques, together with the results of the experiments and expert evaluation, introduces a meta-algorithmic modelling framework for processing internal business policies. This framework relies on three natural language processing techniques, namely information extraction, automatic summarization, and automatic keyword extraction. For the reference extraction and keyword extraction tasks, the authors calculated precision, recall, and F-scores, obtaining 0.99, 0.84, and 0.89 for the former and 0.79, 0.87, and 0.83 for the latter, respectively. Finally, the summary extraction approach was positively evaluated through a qualitative assessment.
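The F-score reported here is the harmonic mean of precision and recall; for the keyword-extraction figures this can be checked directly:

```python
def f1(precision, recall):
    # F1: the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# keyword-extraction figures from the abstract: P=0.79, R=0.87
print(round(f1(0.79, 0.87), 2))  # 0.83, matching the reported F-score
```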

    Cybersecurity Standardization for SMEs: Stakeholders’ Perspectives and a Research Agenda

    There are various challenges regarding the development and use of cybersecurity standards for SMEs. In particular, SMEs need guidance in interpreting and implementing cybersecurity practices and in adapting the standards to their specific needs. As an empirical study, the workshop "Cybersecurity Standards: What Impacts and Gaps for SMEs" was co-organized by the StandICT.eu and SMESEC Horizon 2020 projects with the aim of identifying cybersecurity standardisation needs and gaps for SMEs. The workshop participants came from key stakeholder groups, including policymakers, standards-developing organisations, SME alliances, and cybersecurity organisations. This paper highlights the key discussions and outcomes of the workshop and presents the themes, current initiatives, and plans towards cybersecurity standardisation for SMEs. The findings from the workshop and multivocal literature searches were used to formulate an agenda for future research.

    Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets

    Existing datasets for scoring text pairs in terms of semantic similarity contain instances whose resolution differs according to their degree of difficulty. This paper proposes to distinguish obvious from non-obvious text pairs based on superficial lexical overlap and ground-truth labels. We characterise existing datasets in terms of how many difficult cases they contain and find that recently proposed models struggle to capture the non-obvious cases of semantic similarity. We describe metrics that emphasise cases of similarity requiring more complex inference and propose that these be used for evaluating systems for semantic similarity.
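One way to operationalise an obvious/non-obvious split by lexical overlap: a pair is obvious when high overlap coincides with a "similar" label, or low overlap with a "dissimilar" label. The Jaccard measure, threshold, and label scheme below are illustrative assumptions, not the paper's exact criteria:

```python
def jaccard(a, b):
    # word-level Jaccard overlap between two sentences
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def is_obvious(s1, s2, label, threshold=0.5):
    # obvious case: the superficial overlap already agrees with the gold label
    overlap = jaccard(s1, s2)
    return (overlap >= threshold) == (label == "similar")

print(is_obvious("a cat sat", "a cat sat down", "similar"))           # True: overlap agrees
print(is_obvious("the sky is blue", "stocks fell today", "similar"))  # False: non-obvious pair
```

Under this split, a model can look strong on the obvious cases while failing exactly on the non-obvious pairs the paper's metrics emphasise.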

    Self-Service Data Science in Healthcare with Automated Machine Learning

    (1) Background: This work investigates whether and how researcher-physicians can be supported in their knowledge discovery process by employing Automated Machine Learning (AutoML). (2) Methods: We take a design science research approach and select the Tree-based Pipeline Optimization Tool (TPOT) as the AutoML method based on a benchmark test and requirements from researcher-physicians. We then integrate TPOT into two artefacts: a web application and a notebook. We evaluate these artefacts with researcher-physicians to examine which approach suits them best. Both artefacts have a similar workflow but different user interfaces because of a conflict in requirements. (3) Results: Artefact A, a web application, was perceived as better for uploading a dataset and comparing results. Artefact B, a Jupyter notebook, was perceived as better regarding the workflow and being in control of model construction. (4) Conclusions: Thus, a hybrid artefact would be best for researcher-physicians. However, both artefacts lacked model explainability and an explanation of variable importance for their created models. Hence, deployment of AutoML technologies in healthcare currently remains limited to the exploratory data analysis phase.

    Dialect Variation on Social Media


    Evaluation of classification models for retrieving experimental sections from full-text publications

    In recent years, reporting scientific experiments has become a challenge for scientists working in data-intensive research fields. One of these challenges is to accurately report experimental work that relies on computational activities. In this report, an exploratory computational experiment is conducted. We evaluate the performance of a set of classification models in extracting experimental paragraphs from full-text scientific publications in an unsupervised fashion. The results show that the best-performing classification model (Multinomial Naive Bayes), trained on 30 publications in the proteomics domain, achieves a recall of 87.12% and an accuracy of 80.63%. Successful unsupervised extraction of experimental paragraphs from reports can considerably reduce the noise present in full-text publications. This approach could be beneficial for automatically generating domain-specific vocabulary describing experimental designs and processes. As such, this work contributes to the identification of NLP techniques that automate the extraction of domain-specific paragraphs relating to experimental work.
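A Multinomial Naive Bayes paragraph classifier of the kind evaluated here can be sketched with the standard library; the training sentences below are invented toy data, not the paper's proteomics corpus, and real systems would train on far more text:

```python
import math
from collections import Counter, defaultdict

# toy labelled paragraphs (hypothetical examples)
train = [
    ("samples were incubated at 37 degrees", "experimental"),
    ("proteins were digested with trypsin", "experimental"),
    ("previous studies reported similar findings", "other"),
    ("this section reviews related work", "other"),
]

def fit(data):
    # count word frequencies per class and class priors
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in data:
        tokens = text.split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    # score each class by log prior + smoothed log likelihoods, pick the best
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n in class_counts.items():
        lp = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.split():
            lp += math.log((word_counts[label][tok] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(train)
print(predict("cells were incubated with trypsin", *model))  # 'experimental'
```

Lab-protocol vocabulary ("incubated", "trypsin") pulls the prediction towards the experimental class, which is the signal such a classifier exploits when filtering experimental paragraphs out of full text.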