24 research outputs found
Named Entity Recognition for Bacterial Type IV Secretion Systems
Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems, genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) were achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
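One of the strategies that large terminological resources enable is simple dictionary-based matching, which can complement the machine-learned recognizers described above. The sketch below is illustrative only, not the authors' system; the gazetteer entries and labels are invented examples.

```python
# Minimal sketch of gazetteer-based entity tagging for the four entity types.
# All terms and labels here are invented examples, not the paper's resources.
import re

GAZETTEER = {
    "Helicobacter pylori": "BACTERIUM",
    "Agrobacterium tumefaciens": "BACTERIUM",
    "conjugation": "BIOLOGICAL_PROCESS",
    "ATPase activity": "MOLECULAR_FUNCTION",
}

def tag_entities(text):
    """Return (surface form, label, start offset) for every gazetteer hit."""
    hits = []
    for term, label in GAZETTEER.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((m.group(0), label, m.start()))
    return sorted(hits, key=lambda h: h[2])

print(tag_entities("Conjugation in Helicobacter pylori requires ATPase activity."))
```

In practice such dictionary matches are typically used as features for, or combined with, a statistical recognizer rather than used alone.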
Text and data mining in higher education and public research
This study uses case studies from researchers in the UK and France to assess the value of a copyright exception for text and data mining, and to identify the steps needed to realise its potential. It was commissioned by the ADBU, the French association of directors and senior staff in university and research libraries, and delivered by Research Consulting, a UK consultancy specialising in the management and dissemination of research.
OSSMETER: Automated measurement and analysis of open source software
Deciding whether an open source software (OSS) project meets the required standards for adoption in terms of quality, maturity, activity of development and user support is not a straightforward process. It involves analysing various sources of information, including the project's source code repositories, communication channels, and bug tracking systems. OSSMETER extends state-of-the-art techniques in the field of automated analysis and measurement of open source software, and develops a platform that supports decision makers in the process of discovering, comparing, assessing and monitoring the health, quality, impact and activity of open source software. To achieve this, OSSMETER computes trustworthy quality indicators by performing advanced analysis and integration of information from diverse sources including the project metadata, source code repositories, communication channels and bug tracking systems of OSS projects.
Supporting Custom Quality Models to Analyse and Compare Open-Source Software
The analysis and comparison of open source software can be improved by means of quality models supporting the evaluation of the software systems being compared and the final decision about which of them to adopt. Since software quality can mean different things in different scenarios, quality models should be flexible in order to accommodate the needs of different users. Over the years several quality models have been proposed. Even though some of them are tool supported, they are not designed to be extended or customized to better accommodate the requirements of specific business contexts. In this paper, instead of having a fixed model, we propose a workflow and a tool chain to support the specification of custom quality models, which can guide the automated analysis of open source software.
Disentangled Variational Autoencoder for Emotion Recognition in Conversations
In Emotion Recognition in Conversations (ERC), the emotions of target utterances are closely dependent on their context. Therefore, existing works train the model to generate the response of the target utterance, which aims to recognise emotions leveraging contextual information. However, adjacent response generation ignores long-range dependencies and provides limited affective information in many cases. In addition, most ERC models learn a unified distributed representation for each utterance, which lacks interpretability and robustness. To address these issues, we propose a VAD-disentangled Variational AutoEncoder (VAD-VAE), which first introduces a target utterance reconstruction task based on the Variational Autoencoder, then disentangles three affect representations, Valence-Arousal-Dominance (VAD), from the latent space. We also enhance the disentangled representations by introducing VAD supervision signals from a sentiment lexicon and minimising the mutual information between VAD distributions. Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets. Further analysis proves the effectiveness of each proposed module and the quality of disentangled VAD representations. The code is available at https://github.com/SteveKGYang/VAD-VAE. Comment: Accepted by IEEE Transactions on Affective Computing.
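The lexicon-based VAD supervision signal mentioned above can be approximated, at its simplest, by averaging per-word Valence-Arousal-Dominance triples over an utterance. This is a toy sketch of that one step only, not the VAD-VAE model; the lexicon entries and the neutral default are invented.

```python
# Toy sketch: an utterance-level VAD supervision signal from a word lexicon.
# Lexicon values and the (0.5, 0.5, 0.5) neutral default are invented here.
VAD_LEXICON = {
    "happy": (0.96, 0.73, 0.71),
    "sad":   (0.23, 0.33, 0.29),
    "calm":  (0.78, 0.21, 0.57),
}

def utterance_vad(tokens, default=(0.5, 0.5, 0.5)):
    """Average the VAD triples of the tokens found in the lexicon."""
    triples = [VAD_LEXICON[t] for t in tokens if t in VAD_LEXICON]
    if not triples:
        return default          # no affective words: fall back to neutral
    n = len(triples)
    return tuple(sum(dim) / n for dim in zip(*triples))

print(utterance_vad("i am happy but calm".split()))
```

In the paper these signals supervise the three disentangled latent factors; the mutual-information minimisation between the VAD distributions is a separate training objective not shown here.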
Sustainability strategy and plans beyond the end of the project
The central objective of the Metanet4u project is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing, and supports a new generation of exchange facilities for them.
Trans-disciplinarity and digital humanity: lessons learned from developing text mining tools for textual analysis
This peer-reviewed chapter advances social science research on text mining and data mining, which are key artificial intelligence technologies applied in the digital humanities. The chapter provides a detailed documentation of an interdisciplinary project conducted by a team consisting of social scientists, linguists and software engineers to develop a set of bespoke text-mining tools for researchers in the humanities. Through looking at the user-participatory development processes of the text-mining tools, this chapter aims to improve our understandings of digital humanities in the context of scholarly research and, from a pragmatist perspective, to highlight its trans-disciplinary potential. The paper both analyses and produces an empirical account of interdisciplinary research practices across the social sciences and humanities. It concludes with a discussion of some methodological and socio-technical challenges of the 'digital humanity' emerging in this shift towards trans-disciplinarity, particularly focusing on the topic of 'interpretative flexibility'.
The edited collection, which is interdisciplinary in nature, develops knowledge of how the application of new computational techniques and visualisation technologies in the arts and humanities is resulting in fresh approaches and methodologies for the study of new and traditional corpora. It includes articles from internationally significant scholars such as N. Katherine Hayles and Lev Manovich.
The realisation of this piece has benefited from discussion at the 2009 Media, Communication and Cultural Studies Association (MeCCSA) conference at Bradford, 14–16 January 2009, and the Computational Turn Workshop at Swansea on 9 March 2010, where an earlier version of this paper was presented.
COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature
Background
Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. In order to alleviate this issue, we have constructed the COPIOUS corpus—a gold standard corpus that covers a wide range of biodiversity entities.
Results
Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.
Conclusion
The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature—a useful task for monitoring species distribution and preserving biodiversity.
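The span-level inter-annotator agreement F-score reported in the Results can be illustrated by treating one annotator's entities as "gold" and the other's as "predicted". This is a generic sketch of that standard calculation; the example entity tuples are invented, and the paper's exact matching criteria (e.g. exact vs. partial span match) may differ.

```python
# Sketch of span-level inter-annotator agreement as an F-score between two
# annotators. Entities are (start, end, label) tuples; examples are invented.
def f_score(ann_a, ann_b):
    """F-score with annotator A as gold and annotator B as predicted."""
    a, b = set(ann_a), set(ann_b)
    if not a and not b:
        return 1.0                              # both empty: perfect agreement
    matched = len(a & b)                        # exact span + label matches
    precision = matched / len(b) if b else 0.0
    recall = matched / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ann_a = {(0, 12, "TAXON"), (20, 28, "HABITAT"), (30, 34, "PERSON")}
ann_b = {(0, 12, "TAXON"), (20, 27, "HABITAT")}   # habitat span boundary differs
print(round(f_score(ann_a, ann_b), 4))
```

Note how the near-miss on the habitat span counts as a full disagreement under exact matching, which is consistent with habitat entities producing the lowest agreement in the corpus.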
Text Mining the History of Medicine
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results, or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolution in vocabulary, terminology, language structure and style compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system.
The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
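The synonym-suggestion search functionality described above rests on a simple idea: expand a user's query term with its known historical variants before searching. The sketch below shows only that idea; the variant table entries are invented examples, not the project's actual resources.

```python
# Minimal sketch of query expansion with historical term variants.
# The variant table below is an invented example, not the project's resource.
VARIANTS = {
    "tuberculosis": ["phthisis", "consumption"],
    "typhoid": ["enteric fever"],
}

def expand_query(term):
    """Return the query term plus any known synonyms/variant forms."""
    return [term] + VARIANTS.get(term.lower(), [])

print(expand_query("tuberculosis"))
```

A real system would draw these variants from the period-specific terminological resources the article describes, so that a modern query term also retrieves documents using older vocabulary.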