4,597 research outputs found

    Text-mining and ontologies: new approaches to knowledge discovery of microbial diversity

    Full text link
    Microbiology research has access to a very large amount of public information on the habitats of microorganisms. Many areas of microbiology research uses this information, primarily in biodiversity studies. However the habitat information is expressed in unstructured natural language form, which hinders its exploitation at large-scale. It is very common for similar habitats to be described by different terms, which makes them hard to compare automatically, e.g. intestine and gut. The use of a common reference to standardize these habitat descriptions as claimed by (Ivana et al., 2010) is a necessity. We propose the ontology called OntoBiotope that we have been developing since 2010. The OntoBiotope ontology is in a formal machine-readable representation that enables indexing of information as well as conceptualization and reasoning.Comment: 5 page

    Multimodal Machine Learning for Automated ICD Coding

    Full text link
    This study presents a multimodal machine learning model to predict ICD-10 diagnostic codes. We developed separate machine learning models that can handle data from different modalities, including unstructured text, semi-structured text and structured tabular data. We further employed an ensemble method to integrate all modality-specific models to generate ICD-10 codes. Key evidence was also extracted to make our prediction more convincing and explainable. We used the Medical Information Mart for Intensive Care III (MIMIC -III) dataset to validate our approach. For ICD code prediction, our best-performing model (micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms other baseline models including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability, our approach achieves a Jaccard Similarity Coefficient (JSC) of 0.1806 on text data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780 and 0.5002 respectively.Comment: Machine Learning for Healthcare 201

    Firsthand Opiates Abuse on Social Media: Monitoring Geospatial Patterns of Interest Through a Digital Cohort

    Get PDF
    In the last decade drug overdose deaths reached staggering proportions in the US. Besides the raw yearly deaths count that is worrisome per se, an alarming picture comes from the steep acceleration of such rate that increased by 21% from 2015 to 2016. While traditional public health surveillance suffers from its own biases and limitations, digital epidemiology offers a new lens to extract signals from Web and Social Media that might be complementary to official statistics. In this paper we present a computational approach to identify a digital cohort that might provide an updated and complementary view on the opioid crisis. We introduce an information retrieval algorithm suitable to identify relevant subspaces of discussion on social media, for mining data from users showing explicit interest in discussions about opioid consumption in Reddit. Moreover, despite the pseudonymous nature of the user base, almost 1.5 million users were geolocated at the US state level, resembling the census population distribution with a good agreement. A measure of prevalence of interest in opiate consumption has been estimated at the state level, producing a novel indicator with information that is not entirely encoded in the standard surveillance. Finally, we further provide a domain specific vocabulary containing informal lexicon and street nomenclature extracted by user-generated content that can be used by researchers and practitioners to implement novel digital public health surveillance methodologies for supporting policy makers in fighting the opioid epidemic.Comment: Proceedings of the 2019 World Wide Web Conference (WWW '19

    Recent acquisition of Helicobacter pylori by Baka Pygmies

    Get PDF
    Both anatomically modern humans and the gastric pathogen Helicobacter pylori originated in Africa, and both species have been associated for at least 100,000 years. Seven geographically distinct H. pylori populations exist, three of which are indigenous to Africa: hpAfrica1, hpAfrica2, and hpNEAfrica. The oldest and most divergent population, hpAfrica2, evolved within San hunter-gatherers, who represent one of the deepest branches of the human population tree. Anticipating the presence of ancient H. pylori lineages within all hunter-gatherer populations, we investigated the prevalence and population structure of H. pylori within Baka Pygmies in Cameroon. Gastric biopsies were obtained by esophagogastroduodenoscopy from 77 Baka from two geographically separated populations, and from 101 non-Baka individuals from neighboring agriculturalist populations, and subsequently cultured for H. pylori. Unexpectedly, Baka Pygmies showed a significantly lower H. pylori infection rate (20.8%) than non-Baka (80.2%). We generated multilocus haplotypes for each H. pylori isolate by DNA sequencing, but were not able to identify Baka-specific lineages, and most isolates in our sample were assigned to hpNEAfrica or hpAfrica1. The population hpNEAfrica, a marker for the expansion of the Nilo-Saharan language family, was divided into East African and Central West African subpopulations. Similarly, a new hpAfrica1 subpopulation, identified mainly among Cameroonians, supports eastern and western expansions of Bantu languages. An age-structured transmission model shows that the low H. pylori prevalence among Baka Pygmies is achievable within the timeframe of a few hundred years and suggests that demographic factors such as small population size and unusually low life expectancy can lead to the eradication of H. pylori from individual human populations. The Baka were thus either H. pylori-free or lost their ancient lineages during past demographic fluctuations. Using coalescent simulations and phylogenetic inference, we show that Baka almost certainly acquired their extant H. pylori through secondary contact with their agriculturalist neighbors

    Textpresso for Neuroscience: Searching the Full Text of Thousands of Neuroscience Research Papers

    Get PDF
    Textpresso is a text-mining system for scientific literature. Its two major features are access to the full text of research papers and the development and use of categories of biological concepts as well as categories that describe or relate objects. A search engine enables the user to search for one or a combination of these categories and/or keywords within an entire literature. Here we describe Textpresso for Neuroscience, part of the core Neuroscience Information Framework (NIF). The Textpresso site currently consists of 67,500 full text papers and 131,300 abstracts. We show that using categories in literature can make a pure keyword query more refined and meaningful. We also show how semantic queries can be formulated with categories only. We explain the build and content of the database and describe the main features of the web pages and the advanced search options. We also give detailed illustrations of the web service developed to provide programmatic access to Textpresso. This web service is used by the NIF interface to access Textpresso. The standalone website of Textpresso for Neuroscience can be accessed at http://www.textpresso.org/neuroscience

    Replication issues in syntax-based aspect extraction for opinion mining

    Full text link
    Reproducing experiments is an important instrument to validate previous work and build upon existing approaches. It has been tackled numerous times in different areas of science. In this paper, we introduce an empirical replicability study of three well-known algorithms for syntactic centric aspect-based opinion mining. We show that reproducing results continues to be a difficult endeavor, mainly due to the lack of details regarding preprocessing and parameter setting, as well as due to the absence of available implementations that clarify these details. We consider these are important threats to validity of the research on the field, specifically when compared to other problems in NLP where public datasets and code availability are critical validity components. We conclude by encouraging code-based research, which we think has a key role in helping researchers to understand the meaning of the state-of-the-art better and to generate continuous advances.Comment: Accepted in the EACL 2017 SR

    Representing and coding the knowledge embedded in texts of Health Science Web published articles

    Get PDF
    Despite the fact that electronic publishing is a common activity to scholars electronic journals are still based in the print model and do not take full advantage of the facilities offered by the Semantic Web environment. This is a report of the results of a research project with the aim of investigating the possibilities of electronic publishing journal articles both as text for human reading and in machine readable format recording the new knowledge contained in the article. This knowledge is identified with the scientific methodology elements such as problem, methodology, hypothesis, results, and conclusions. A model integrating all those elements is proposed which makes explicit and records the knowledge embedded in the text of scientific articles as an ontology. Knowledge thus represented enables its processing by intelligent software agents The proposed model aims to take advantage of these facilities enabling semantic retrieval and validation of the knowledge contained in articles. To validate and enhance the model a set of electronic journal articles were analyzed
    corecore