24 research outputs found

    Named Entity Recognition for Bacterial Type IV Secretion Systems

    Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems, genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.

    Text and data mining in higher education and public research

    This study uses case studies from researchers in the UK and France to assess the value of a copyright exception for text and data mining, and identify the steps needed to realise its potential. It was commissioned by the ADBU, the French association of directors and senior staff in university and research libraries, and delivered by Research Consulting, a UK consultancy specialising in the management and dissemination of research.

    OSSMETER: Automated measurement and analysis of open source software

    Deciding whether an open source software (OSS) meets the required standards for adoption in terms of quality, maturity, activity of development and user support is not a straightforward process. It involves analysing various sources of information, including the project’s source code repositories, communication channels, and bug tracking systems. OSSMETER extends state-of-the-art techniques in the field of automated analysis and measurement of open-source software (OSS), and develops a platform that supports decision makers in the process of discovering, comparing, assessing and monitoring the health, quality, impact and activity of open-source software. To achieve this, OSSMETER computes trustworthy quality indicators by performing advanced analysis and integration of information from diverse sources including the project metadata, source code repositories, communication channels and bug tracking systems of OSS projects.

    Supporting Custom Quality Models to Analyse and Compare Open-Source Software

    The analysis and comparison of open source software can be improved by means of quality models supporting the evaluation of the software systems being compared and the final decision about which of them has to be adopted. Since software quality can mean different things in different scenarios, quality models should be flexible in order to accommodate the needs of different users. Over the years several quality models have been proposed. Even though some of them are tool supported, they are not designed to be extended or customized to better accommodate the requirements of specific business contexts. In this paper, instead of having a fixed model, we propose a workflow and a tool chain to support the specification of custom quality models, which can guide the automated analysis of open source software.

    Disentangled Variational Autoencoder for Emotion Recognition in Conversations

    In Emotion Recognition in Conversations (ERC), the emotions of target utterances are closely dependent on their context. Therefore, existing works train the model to generate the response of the target utterance, which aims to recognise emotions leveraging contextual information. However, adjacent response generation ignores long-range dependencies and provides limited affective information in many cases. In addition, most ERC models learn a unified distributed representation for each utterance, which lacks interpretability and robustness. To address these issues, we propose a VAD-disentangled Variational AutoEncoder (VAD-VAE), which first introduces a target utterance reconstruction task based on Variational Autoencoder, then disentangles three affect representations Valence-Arousal-Dominance (VAD) from the latent space. We also enhance the disentangled representations by introducing VAD supervision signals from a sentiment lexicon and minimising the mutual information between VAD distributions. Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets. Further analysis proves the effectiveness of each proposed module and the quality of disentangled VAD representations. The code is available at https://github.com/SteveKGYang/VAD-VAE. Comment: Accepted by IEEE Transactions on Affective Computing.

    Sustainability strategy and plans beyond the end of the project

    The central objective of the Metanet4u project is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing, and supports a new generation of exchange facilities for them. Preprint.

    Trans-disciplinarity and digital humanity: lessons learned from developing text mining tools for textual analysis

    This peer-reviewed chapter advances social science research on text mining and data mining, which are key artificial intelligence technologies applied in the digital humanities. The chapter provides a detailed documentation of an interdisciplinary project conducted by a team consisting of social scientists, linguists and software engineers to develop a set of bespoke text-mining tools for researchers in the humanities. Through looking at the user-participatory development processes of the text-mining tools, this chapter aims to improve our understandings of digital humanities in the context of scholarly research and, from a pragmatist perspective, to highlight its trans-disciplinary potential. The paper both analyses and produces an empirical account of interdisciplinary research practices across the social sciences and humanities. It concludes with a discussion of some methodological and socio-technical challenges of the 'digital humanity' emerging in this shift towards trans-disciplinarity, particularly focusing on the topic of 'interpretative flexibility'. The edited collection, which is interdisciplinary in nature, develops knowledge of how the application of new computational techniques and visualisation technologies in the arts and humanities is resulting in fresh approaches and methodologies for the study of new and traditional corpora. It includes articles from internationally significant scholars such as N. Katherine Hayles and Lev Manovich. The realisation of this piece has benefited from discussion at the 2009 Media, Communication and Cultural Studies Association (MeCCSA) conference at Bradford, 14–16 January 2009, and the Computational Turn Workshop at Swansea on 9 March 2010, where an earlier version of this paper was presented

    COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature

    Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. In order to alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities. Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences. Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving biodiversity.

    Text Mining the History of Medicine

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.