12 research outputs found

    Identification of research hypotheses and new knowledge from scientific literature

    Get PDF
    Background: Text mining (TM) methods have been used extensively to extract relations and events from the literature. In addition, TM techniques have been used to extract various types or dimensions of interpretative information, known as Meta-Knowledge (MK), from the context of relations and events, e.g. negation, speculation, certainty and knowledge type. However, most existing methods have focussed on the extraction of individual dimensions of MK, without investigating how they can be combined to obtain even richer contextual information. In this paper, we describe a novel, supervised method to extract new MK dimensions that encode Research Hypotheses (an author’s intended knowledge gain) and New Knowledge (an author’s findings). The method incorporates various features, including a combination of simple MK dimensions. Methods: We identify previously explored dimensions and then use a random forest to combine these with linguistic features into a classification model. To facilitate evaluation of the model, we have enriched two existing corpora annotated with relations and events, i.e., a subset of the GENIA-MK corpus and the EU-ADR corpus, by adding attributes to encode whether each relation or event corresponds to Research Hypothesis or New Knowledge. In the GENIA-MK corpus, these new attributes complement simpler MK dimensions that had previously been annotated. Results: We show that our approach is able to assign different types of MK dimensions to relations and events with a high degree of accuracy. Firstly, our method improves upon the previously reported state-of-the-art performance for an existing dimension, i.e., Knowledge Type. Secondly, we demonstrate high F1-scores in predicting the new dimensions of Research Hypothesis (GENIA: 0.914, EU-ADR: 0.802) and New Knowledge (GENIA: 0.829, EU-ADR: 0.836). Conclusion: We have presented a novel approach for predicting New Knowledge and Research Hypothesis, which combines simple MK dimensions to achieve high F1-scores. The extraction of such information is valuable for a number of practical TM applications.
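
    The method description above lends itself to a short illustration. The following is a minimal sketch, not the authors' code, of a random forest that combines simple MK dimension values with a linguistic cue feature, written in Python with scikit-learn; every feature name, value and label below is an invented placeholder, not the authors' actual feature set.

        from sklearn.feature_extraction import DictVectorizer
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.pipeline import make_pipeline

        # Each relation/event is represented by its simple MK dimensions
        # plus a linguistic cue feature (all values are illustrative).
        events = [
            {"knowledge_type": "Investigation", "certainty": "L3", "negated": "no", "cue": "we hypothesise"},
            {"knowledge_type": "Observation", "certainty": "L3", "negated": "no", "cue": "we found"},
            {"knowledge_type": "Analysis", "certainty": "L1", "negated": "no", "cue": "may indicate"},
        ]
        labels = ["ResearchHypothesis", "NewKnowledge", "Other"]

        # DictVectorizer one-hot encodes the categorical features; the random
        # forest then combines them, as the abstract describes.
        model = make_pipeline(DictVectorizer(), RandomForestClassifier(n_estimators=100, random_state=0))
        model.fit(events, labels)
        print(model.predict([{"knowledge_type": "Investigation", "certainty": "L3", "negated": "no", "cue": "we hypothesise"}]))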

    Supporting systematic reviews using LDA-based document representations

    No full text
    Abstract Background Identifying relevant studies for inclusion in a systematic review (i.e. screening) is a complex, laborious and expensive task. Recently, a number of studies have shown that the use of machine learning and text mining methods to automatically identify relevant studies has the potential to drastically decrease the workload involved in the screening phase. The vast majority of these machine learning methods exploit the same underlying principle, i.e. a study is modelled as a bag-of-words (BOW). Methods We explore the use of topic modelling methods to derive a more informative representation of studies. We apply Latent Dirichlet allocation (LDA), an unsupervised topic modelling approach, to automatically identify topics in a collection of studies. We then represent each study as a distribution over LDA topics. Additionally, we enrich topics derived using LDA with multi-word terms identified by an automatic term recognition (ATR) tool. For evaluation purposes, we carry out automatic identification of relevant studies using support vector machine (SVM)-based classifiers that employ both our novel topic-based representation and the BOW representation. Results Our results show that the SVM classifier is able to identify a greater number of relevant studies when using the LDA representation than when using the BOW representation. These observations hold for two systematic reviews from the clinical domain and three from the social science domain. Conclusions A topic-based feature representation of documents outperforms the BOW representation when applied to the task of automatic citation screening. The proposed term-enriched topics are more informative and less ambiguous to systematic reviewers.
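
    As a rough illustration of the pipeline described above (not the authors' implementation), the sketch below builds an LDA topic representation of a handful of toy study titles and trains an SVM on it, in Python with scikit-learn; the documents, labels and topic count are placeholders.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.svm import LinearSVC
        from sklearn.pipeline import make_pipeline

        # Toy studies: 1 = relevant to the review, 0 = not relevant.
        studies = [
            "randomised trial of statin therapy in adults",
            "survey of teacher attitudes to homework policy",
            "cohort study of blood pressure medication adherence",
            "qualitative study of classroom behaviour interventions",
        ]
        relevant = [1, 0, 1, 0]

        # BOW counts -> LDA topic distribution -> linear SVM, mirroring the
        # topic-based representation the abstract compares against plain BOW.
        screener = make_pipeline(
            CountVectorizer(stop_words="english"),
            LatentDirichletAllocation(n_components=2, random_state=0),
            LinearSVC(),
        )
        screener.fit(studies, relevant)
        print(screener.predict(["trial of antihypertensive drug adherence"]))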

    Controllable Readability Corpus

    No full text
    The corpus consists of 28,124 peer-reviewed biomedical research papers along with their technical abstracts and plain language summaries (PLSs) from six PLOS journals that cover a broad range of biomedical research subjects, i.e., PLOS Biology, PLOS Computational Biology, PLOS Genetics, PLOS Medicine, PLOS Neglected Tropical Diseases, and PLOS Pathogens.

    HanDeSeT: Hansard Debates with Sentiment Tags

    No full text
    A corpus of Hansard UK Parliament debates for use in the evaluation of sentiment analysis systems. The corpus consists of 1251 motion-speech units taken from 129 separate debates in the UK House of Commons, 1997-2017. Each unit comprises a parliamentary speech of up to five utterances and an associated debate motion. Debates comprise between one and 30 speeches, and speeches range in length from 31 to 1049 words, with a mean of 167.8 words. The debates cover a two-decade period from 1997 to 2017 and a wide range of topics, from domestic and foreign affairs to procedural matters concerning the running of the House. Each motion has two sentiment polarity labels: 1. A manually applied sentiment polarity label; and 2. A label derived from the relationship of the MP who proposes the motion to the Government. Each speech has two sentiment polarity labels: 1. A speaker-vote label extracted from the division associated with the corresponding debate; and 2. A manually assigned label. In addition, the following metadata is included with each unit: debate id, speaker party affiliation, motion party affiliation, speaker name, and speaker rebellion rate. Manually applied motion labels are approximately evenly balanced; the other labels are slightly skewed towards the positive class. Hansard transcript data is used under the Open Parliament Licence v3.0. Data on speaker rebellion rates is taken from the Public Whip and used under the Open Data Commons Open Database License (ODbL).
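
    The field list above maps naturally onto a record type. The following Python dataclass is a hypothetical sketch of one motion-speech unit, based only on the fields named in the description; the corpus's actual column names and encodings may differ.

        from dataclasses import dataclass
        from typing import List

        @dataclass
        class MotionSpeechUnit:
            debate_id: str
            motion_text: str
            utterances: List[str]          # the speech, up to five utterances
            motion_label_manual: int       # manually applied polarity
            motion_label_government: int   # derived from proposer's relation to the Government
            speech_label_vote: int         # extracted from the division
            speech_label_manual: int       # manually assigned polarity
            speaker_name: str
            speaker_party: str
            motion_party: str
            speaker_rebellion_rate: float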

    FN-REQ: Labelled Natural Language Requirements Using FrameNet Semantic Frames

    No full text
    The FN-REQ corpus is a dataset of natural language requirements semi-automatically labelled using the FrameNet scheme. If you use our dataset, please cite the following: Alhoshan, Waad; Batista-Navarro, Riza; Zhao, Liping (2018), “FN-REQ: Requirements Annotated in FrameNet Semantic Frames”, Mendeley Data, v1, http://dx.doi.org/10.17632/s7gcp54wbv.1. For further details about the annotation practice, please refer to the annotation guidelines file included with the dataset: "Annotation Guidelines on Labelling Natural Language Requirements Using FrameNet Semantic Frames (Version 1.2).pdf".

    The Value and Challenges of Making Survey and Digital Trace Datasets Available for Open Access

    No full text
    This poster was presented at the University of Manchester Open Research Conference, 9-10 June 2025. This presentation will demonstrate the conceptual and methodological value of, and challenges in, producing anonymised and standardised variables from survey respondents’ digital trace data (DTD). It will do so using existing YouGov datasets collected over two time periods in the US (2020 and 2024), and a third collected in the UK (2022). The US datasets link individual survey responses to respondents’ Twitter feeds, and the UK dataset to their browsing history. All three datasets were designed to address research questions about the effects of digital media consumption and exposure on citizen attitudes and behaviours. The presentation will proceed in three main stages. First, we will identify a range of new anonymised variables that can be created from the DTD to address important new substantive questions about the impact of web and social-media content on individuals’ political engagement. We will also specify a set of more methodologically interesting variables that we can extract from the observational trace data and use to validate the survey responses. After identifying the range of ‘ideal’ variables that could be generated, we will then select a subset of these variables to show how they can be operationalised, and discuss the technical challenges faced in doing so, focusing particularly on comparing Twitter data to browser data. We will select the variables by rating them on two core criteria: utility and scientific value, and ease of computation. In a final stage, we will reflect on the ethical issues raised in this process of linking survey data with digital trace data, and the key ‘take-homes’ that our research has identified for future projects of this type to consider prior to data collection.

    The Proteasix Ontology

    No full text
    Abstract Background The Proteasix Ontology (PxO) is an ontology that supports the Proteasix tool: an open-source, peptide-centric tool that can be used to predict automatically, in silico and on a large scale, the proteases involved in the generation of proteolytic cleavage fragments (peptides). Methods The PxO re-uses parts of the Protein Ontology, the three Gene Ontology sub-ontologies, the Chemical Entities of Biological Interest Ontology and the Sequence Ontology, together with bespoke extensions, in support of a series of roles: 1. To describe the known proteases and their target cleavage sites. 2. To enable the description of proteolytic cleavage fragments as the outputs of observed and predicted proteolysis. 3. To use knowledge about the function, species and cellular location of a protease and protein substrate to support the prioritisation of proteases in observed and predicted proteolysis. Results The PxO is designed to describe the biological underpinnings of the generation of peptides. The peptide-centric PxO seeks to support the Proteasix tool by separating domain knowledge from the operational knowledge used in protease prediction by Proteasix, and to support the confirmation of its analyses and results. Availability The Proteasix Ontology may be found at http://bioportal.bioontology.org/ontologies/PXO. This ontology is free and open for use by everyone.
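
    For readers who want to browse the ontology programmatically, the snippet below is a minimal sketch using the owlready2 Python library, assuming the OWL file has first been downloaded from the BioPortal link above; the local filename is a placeholder.

        from owlready2 import get_ontology

        # Load a locally downloaded copy of the PxO (filename is a placeholder).
        onto = get_ontology("file://./pxo.owl").load()

        # List a few classes, e.g. those describing proteases and cleavage fragments.
        for cls in list(onto.classes())[:10]:
            print(cls.iri, cls.label)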