
    Agile in-litero experiments: how can semi-automated information extraction from neuroscientific literature help neuroscience model building?

    In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles in peer-reviewed journals. One challenge for modern neuroinformatics is to design methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and integration into computational models. In this thesis, we introduce novel natural language processing (NLP) models and systems to mine the neuroscientific literature. By analogy with in vivo, in vitro and in silico experiments, we coin the term in litero experiments for the NLP methods developed in this thesis, which aim at analyzing and making accessible the extended body of neuroscientific literature. In particular, we focus on two important neuroscientific entities: brain regions and neural cells. An integrated NLP model is designed to automatically extract brain region connectivity statements from very large corpora. This system is applied to a large corpus of 25M PubMed abstracts and 600K full-text articles. Central to this system is the creation of a searchable database of brain region connectivity statements, allowing neuroscientists to gain an overview of all brain regions connected to a given region of interest. More importantly, the database enables researchers to provide feedback on connectivity results and links back to the original article sentence to provide the relevant context. The database is evaluated by neuroanatomists on real connectomics tasks (targets of the Nucleus Accumbens) and results in a significant effort reduction compared to previous manual methods (from 1 week to 2 h). Subsequently, we introduce neuroNER to identify, normalize and compare instances of neurons in the scientific literature. Our method relies on identifying and analyzing each of the domain features used to annotate a specific neuron mention, like the morphological term 'basket' or the brain region 'hippocampus'.
We apply our method to the same corpus of 25M PubMed abstracts and 600K full-text articles and find over 500K unique neuron type mentions. To demonstrate the utility of our approach, we also apply our method to cross-comparing the NeuroLex and Human Brain Project (HBP) cell type ontologies. By decoupling a neuron mention's identity into its specific compositional features, our method can successfully identify specific neuron types even if they are not explicitly listed within a predefined neuron type lexicon, thus greatly facilitating cross-laboratory studies. To build such large databases, several tools and large-scale NLP infrastructures were developed: a robust pipeline to preprocess full-text PDF articles, as well as bluima, an NLP pipeline specialized in neuroscience that performs text mining at PubMed scale. During the development of those two NLP systems, we recognized the need for novel NLP approaches to rapidly develop custom text mining solutions. This led to the formalization of the agile text mining methodology, which improves communication and collaboration between subject matter experts and text miners. Agile text mining is characterized by short development cycles, frequent task redefinition and continuous performance monitoring through integration tests. To support our approach, we developed Sherlok, an NLP framework designed for the development of agile text mining applications.
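    The feature-decoupling idea described above can be sketched in a few lines. This is an illustrative toy, not the actual neuroNER implementation: the lexicons and canonical forms are invented for the example, and the real system draws its features from curated neuroscience ontologies.

    ```python
    # Toy sketch of compositional neuron-mention comparison (NOT the real
    # neuroNER code): each token is mapped to a canonical domain feature,
    # so two different surface forms can be compared feature-by-feature.
    MORPHOLOGY = {"basket": "basket", "pyramidal": "pyramidal", "chandelier": "chandelier"}
    BRAIN_REGION = {
        "hippocampus": "hippocampus", "hippocampal": "hippocampus",
        "cortex": "cortex", "cortical": "cortex",
    }
    NEUROTRANSMITTER = {"gabaergic": "GABAergic", "glutamatergic": "glutamatergic"}

    def decompose(mention: str) -> dict:
        """Map each token of a neuron mention to a canonical domain feature."""
        features = {"morphology": set(), "brain_region": set(), "neurotransmitter": set()}
        for token in mention.lower().replace("-", " ").split():
            if token in MORPHOLOGY:
                features["morphology"].add(MORPHOLOGY[token])
            elif token in BRAIN_REGION:
                features["brain_region"].add(BRAIN_REGION[token])
            elif token in NEUROTRANSMITTER:
                features["neurotransmitter"].add(NEUROTRANSMITTER[token])
        return features

    def same_type(a: str, b: str) -> bool:
        """Two mentions denote the same neuron type if their feature sets agree."""
        return decompose(a) == decompose(b)
    ```

    Because identity lives in the feature sets rather than in the surface string, 'hippocampal GABAergic basket cell' and 'GABAergic basket cell of the hippocampus' compare as the same type even though neither appears verbatim in any lexicon of neuron type names.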

    A Framework for Collaborative Curation of Neuroscientific Literature

    Large models of complex neuronal circuits require specifying numerous parameters, with values that often need to be extracted from the literature, a tedious and error-prone process. To help establish shareable curated corpora of annotations, we have developed a literature curation framework comprising an annotation format, a Python API (NeuroAnnotation Toolbox; NAT), and a user-friendly graphical interface (NeuroCurator). This framework allows the systematic annotation of relevant statements and model parameters. The context of the annotated content is made explicit in a standard way by associating it with ontological terms (e.g., species, cell types, brain regions). The exact position of the annotated content within a document is specified by the starting character of the annotated text, or by the number of the figure, equation, or table, depending on the context. Alternatively, the provenance of parameters can be specified by bounding boxes. Parameter types are linked to curated experimental values so that they can be systematically integrated into models. We demonstrate the use of this approach by releasing a corpus describing different modeling parameters associated with thalamo-cortical circuitry. The proposed framework supports rigorous management of large sets of parameters, solving common difficulties in their traceability. Further, it allows easier classification of literature information and more efficient and systematic integration of such information into models and analyses.
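    The kind of record this framework produces can be sketched as follows. The field names and identifiers below are invented for illustration and do not reproduce the actual NAT annotation schema; the point is only to show how localization, ontological context, and a parameter value travel together in one annotation.

    ```python
    import json

    # Hypothetical annotation record (field names are illustrative, NOT the
    # real NAT format): it ties an annotated span to its document position,
    # its ontological context, and a curated parameter value.
    annotation = {
        "publication": "PMID:12345678",          # placeholder article identifier
        "localization": {
            "type": "text",                      # alternatives: figure, equation, table, bounding-box
            "start_char": 10432,                 # starting character of the annotated text
            "text": "the membrane time constant was 18 ms",
        },
        "ontology_terms": {                      # explicit experimental context
            "species": "rat",
            "cell_type": "thalamo-cortical relay cell",
            "brain_region": "thalamus",
        },
        "parameter": {
            "type": "membrane time constant",
            "value": 18.0,
            "unit": "ms",
        },
    }

    # Serializing to JSON makes the corpus shareable and diff-friendly.
    record = json.dumps(annotation, indent=2)
    ```

    Keeping the context as structured ontology terms, rather than free text, is what lets downstream model-building code filter annotations by species, cell type, or region before integrating the values.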

    Adopting Text Mining on Rehabilitation Therapy Repositioning for Stroke

    Stroke is a common disabling disease that severely affects the daily life of patients. Accumulating evidence indicates that rehabilitation therapy can improve movement function. However, existing guidelines lack specific and effective rehabilitation therapy schemes, and the development of new rehabilitation techniques has been relatively slow. This study used a text mining approach, the ABC model, to identify the existing candidate rehabilitation therapy that is most likely to be repositioned for stroke. In the model, we built the internal links among stroke (A), assessment scales (B), and rehabilitation therapies (C) in PubMed, where the links were related to upper limb function measurements for patients with stroke. In the first step, using the E-utilities, we retrieved both stroke-related assessment scale and rehabilitation therapy records and compiled two datasets, called Stroke_Scales and Stroke_Therapies, respectively. In the next step, we crawled all rehabilitation therapies co-occurring with the Stroke_Scales and named them All_Therapies. Therapies already included in Stroke_Therapies were deleted from All_Therapies; the remaining therapies were the potential rehabilitation therapies that could be repositioned for stroke after subsequent filtration by a manual check. We identified the top-ranked repositioning rehabilitation therapy and subsequently examined its clinical validation. Hand-arm bimanual intensive training (HABIT) was ranked first among our repositioning rehabilitation therapies and had the most interaction links with Stroke_Scales. HABIT significantly improved clinical scores on assessment scales [Fugl-Meyer Assessment (FMA) and Action Research Arm Test (ARAT)] in the clinical validation study for acute stroke patients with upper limb dysfunction.
Therefore, based on the ABC model and clinical validation, HABIT is a promising repositioned rehabilitation therapy for stroke, and the ABC model is an effective text mining approach for rehabilitation therapy repositioning. The findings of this study may be helpful in clinical knowledge discovery.
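    The ABC filtering step described above can be sketched with toy data. The co-occurrence links here are invented for illustration; the study obtained them from PubMed via the E-utilities rather than from a hard-coded list.

    ```python
    from collections import Counter

    # Toy sketch of the ABC repositioning step (made-up co-occurrence data).
    stroke_scales = {"FMA", "ARAT", "WMFT"}        # B: scales linked to stroke (A)
    stroke_therapies = {"CIMT", "mirror therapy"}  # C: therapies already linked to stroke

    # Therapy-scale co-occurrence links harvested from the literature (toy data).
    links = [
        ("HABIT", "FMA"), ("HABIT", "ARAT"), ("HABIT", "WMFT"),
        ("CIMT", "FMA"), ("robot-assisted training", "FMA"),
    ]

    # Candidate therapies: linked to stroke assessment scales (A-B-C bridge)
    # but not yet linked to stroke directly; ranked by number of scale links.
    scores = Counter(
        therapy for therapy, scale in links
        if scale in stroke_scales and therapy not in stroke_therapies
    )
    ranked = [therapy for therapy, _ in scores.most_common()]
    ```

    In this toy example HABIT surfaces first because it has the most links to stroke assessment scales while being absent from the known stroke-therapy set, mirroring how it emerged as the top candidate in the study before manual filtering and clinical validation.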

    Text mining for neuroanatomy using WhiteText with an updated corpus and a new web application

    We describe the WhiteText project, and its progress towards automatically extracting statements of neuroanatomical connectivity from text. We review progress to date on the three main steps of the project: recognition of brain region mentions, standardization of brain region mentions to neuroanatomical nomenclature, and connectivity statement extraction. We further describe a new version of our manually curated corpus that adds 2,111 connectivity statements from 1,828 additional abstracts. Cross-validation classification within the new corpus replicates results on our original corpus, recalling 51% of connectivity statements at 67% precision. The resulting merged corpus provides 5,208 connectivity statements that can be used to seed species-specific connectivity matrices and to better train automated techniques. Finally, we present a new web application that allows fast interactive browsing of the over 70,000 sentences indexed by the system, as a tool for accessing the data and assisting in further curation. Software and data are freely available at http://www.chibi.ubc.ca/WhiteText/
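    Seeding a connectivity matrix from extracted statements, as mentioned above, amounts to counting supporting statements per region pair. A minimal sketch, with illustrative region names rather than WhiteText's actual nomenclature:

    ```python
    # Minimal sketch: seed a connectivity matrix from extracted
    # (source, target) statements; region names are illustrative only.
    statements = [
        ("nucleus accumbens", "ventral pallidum"),
        ("hippocampus", "nucleus accumbens"),
        ("nucleus accumbens", "ventral pallidum"),  # second supporting sentence
    ]

    regions = sorted({region for pair in statements for region in pair})
    index = {region: i for i, region in enumerate(regions)}

    # matrix[i][j] counts statements supporting a connection i -> j
    matrix = [[0] * len(regions) for _ in regions]
    for src, tgt in statements:
        matrix[index[src]][index[tgt]] += 1
    ```

    Counting statements rather than recording a binary link keeps the evidence strength visible, which matters at the reported 67% precision: a connection backed by many independent sentences is more trustworthy than one backed by a single extraction.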