
    A Taxonomy of Academic Abstract Sentence Classification Modelling

    Background: Abstract sentence classification modelling has the potential to advance literature discovery capability for the array of academic literature information systems; however, no artefact exists that categorises known models and identifies their key characteristics. Aims: To systematically categorise known abstract sentence classification models and make this knowledge readily available to future researchers and professionals concerned with abstract sentence classification model development and deployment. Method: An information systems taxonomy development methodology was adopted after a literature review to categorise 23 abstract sentence classification models identified from the literature. Corresponding dimensions and characteristics were derived from this process with the resulting taxonomy presented. Results: Abstract sentence classification modelling has evolved significantly, with state-of-the-art models now leveraging neural networks to achieve high-performance sentence classification. The resulting taxonomy provides a novel means to observe the development of this research field and enables us to consider how such models can be further improved or deployed in real-world applications.

    Automated PDF highlighting to support faster curation of literature for Parkinson's and Alzheimer's disease

    Neurodegenerative disorders such as Parkinson’s and Alzheimer’s disease are devastating and costly illnesses, a source of major global burden. In order to provide successful interventions for patients and reduce costs, both causes and pathological processes need to be understood. The ApiNATOMY project aims to contribute to our understanding of neurodegenerative disorders by manually curating and abstracting data from the vast body of literature amassed on these illnesses. As curation is labour-intensive, we aimed to speed up the process by automatically highlighting those parts of the PDF document of primary importance to the curator. Using techniques similar to those of summarisation, we developed an algorithm that relies on linguistic, semantic and spatial features. Employing this algorithm on a test set manually corrected for tool imprecision, we achieved a macro F1-measure of 0.51, which is an increase of 132% compared to the best bag-of-words baseline model. A user-based evaluation was also conducted to assess the usefulness of the methodology on 40 unseen publications; it revealed that in 85% of cases all highlighted sentences are relevant to the curation task and in about 65% of the cases the highlights are sufficient to support the knowledge curation task without needing to consult the full text. In conclusion, we believe that these are promising results for a step in automating the recognition of curation-relevant sentences. Refining our approach to pre-digest papers will lead to faster processing and cost reduction in the curation process.

    Large Scale Subject Category Classification of Scholarly Papers With Deep Attentive Neural Networks

    Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76, with F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TF-IDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.

    Automatic Population of Structured Reports from Narrative Pathology Reports

    There are a number of advantages to using structured pathology reports: they can ensure the accuracy and completeness of pathology reporting, and it is easier for referring doctors to glean pertinent information from them. The goal of this thesis is to extract pertinent information from free-text pathology reports and automatically populate structured reports for cancer diseases and identify the commonalities and differences in processing principles to obtain maximum accuracy. Three pathology corpora were annotated with entities and relationships between the entities in this study, namely the melanoma corpus, the colorectal cancer corpus and the lymphoma corpus. A supervised machine-learning-based approach, utilising conditional random fields learners, was developed to recognise medical entities from the corpora. By feature engineering, the best feature configurations were attained, which boosted the F-scores significantly from 4.2% to 6.8% on the training sets. Without proper negation and uncertainty detection, the quality of the structured reports will be diminished. The negation and uncertainty detection modules were built to handle this problem. The modules obtained overall F-scores ranging from 76.6% to 91.0% on the test sets. A relation extraction system was presented to extract four relations from the lymphoma corpus. The system achieved very good performance on the training set, with 100% F-score obtained by the rule-based module and 97.2% F-score attained by the support vector machines classifier. Rule-based approaches were used to generate the structured outputs and populate them into predefined templates. The rule-based system attained over 97% F-scores on the training sets. A pipeline system was implemented with an assembly of all the components described above. It achieved promising results in the end-to-end evaluations, with 86.5%, 84.2% and 78.9% F-scores on the melanoma, colorectal cancer and lymphoma test sets respectively.

    An ontological framework for the formal representation and management of human stress knowledge

    There is a great deal of information on the topic of human stress which is embedded within numerous papers across various databases. However, this information is often stored, retrieved, and used in a discrete and dispersed manner. As a result, discovery and identification of the links and interrelatedness between different aspects of knowledge on stress is difficult. This restricts the effective search and retrieval of desired information. There is a need to organize this knowledge under a unifying framework, linking and analysing it in mutual combinations so that we can obtain an inclusive view of the related phenomena and new knowledge can emerge. Furthermore, there is a need to establish evidence-based and evolving relationships between the ontology concepts. Previous efforts to classify and organize stress-related phenomena have not been sufficiently inclusive and none of them has considered the use of ontology as an effective facilitating tool for the abovementioned issues. There have also been some research works on the evolution and refinement of ontology concepts and relationships. However, these fail to provide any proposals for an automatic and systematic methodology with the capacity to establish evidence-based/evolving ontology relationships. In response to these needs, we have developed the Human Stress Ontology (HSO), a formal framework which specifies, organizes, and represents the domain knowledge of human stress. This machine-readable knowledge model is likely to help researchers and clinicians find theoretical relationships between different concepts, resulting in a better understanding of the human stress domain and its related areas. The HSO is formalized using the OWL language and the Protégé tool. With respect to the evolution and evidentiality of ontology relationships in the HSO and other scientific ontologies, we have proposed the Evidence-Based Evolving Ontology (EBEO), a methodology for the refinement and evolution of ontology relationships based on the evidence gleaned from scientific literature. The EBEO is based on the implementation of a Fuzzy Inference System (FIS). Our evaluation results showed that almost all stress-related concepts of the sample articles can be placed under one or more categories of the HSO. Nevertheless, there were a number of limitations in this work which need to be addressed in future undertakings. The developed ontology has the potential to be used for different data integration and interoperation purposes in the domain of human stress. It can also be regarded as a foundation for the future development of semantic search engines in the stress domain.

    Big Data Analytics and Information Science for Business and Biomedical Applications

    The analysis of Big Data in biomedical as well as business and financial research has drawn much attention from researchers worldwide. This book provides a platform for the deep discussion of state-of-the-art statistical methods developed for the analysis of Big Data in these areas. Both applied and theoretical contributions are showcased.