212 research outputs found

    Ensemble labeling towards scientific information extraction (ELSIE)

    Extracting scientific facts from unstructured text is difficult because of the ambiguity of the language and the complexity of the scientific named entities and relations to be extracted. This problem is well illustrated by the extraction of polymer names and their properties. Even when the property is a temperature, identifying the polymer name associated with that temperature may require expertise because of acronyms, synonyms, complicated naming conventions, and the fact that new polymer names are being “introduced” to the vernacular as polymer science advances. While there exist domain-specific machine learning toolkits that address these challenges, perhaps the greatest challenge is the lack of labeled data to train these machine learning models, since producing such labels is time-consuming, error-prone and costly. Our work repurposes Snorkel, a data programming tool, in a novel way: to identify sentences that contain the relation of interest in order to generate training data, as a first step towards extracting the entities themselves. We achieve 94% recall and demonstrate the importance of identifying the complex sentences prior to extraction by comparing to a state-of-the-art domain-aware natural language processing toolkit. We also show that our system captures sentences missed by both the toolkit and the expert labelers.

    Text summarization towards scientific information extraction

    Despite the exponential growth in scientific textual content, research publications are still the primary means for disseminating vital discoveries to experts within their respective fields. These texts are predominantly written for human consumption, resulting in two primary challenges: experts cannot efficiently remain well-informed to leverage the latest discoveries, and applications that rely on valuable insights buried in these texts cannot effectively build upon published results. As a result, scientific progress stalls. Automatic Text Summarization (ATS) and Information Extraction (IE) are two essential fields that address this problem. While the two research topics are often studied independently, this work proposes to look at ATS in the context of IE, specifically in relation to Scientific IE. However, Scientific IE faces several challenges, chiefly the scarcity of relevant entities and insufficient training data. In this paper, we focus on extractive ATS, which identifies the most valuable sentences from textual content for the purpose of ultimately extracting scientific relations. We account for the associated challenges by means of an ensemble method that integrates three weakly supervised learning models, one for each entity of the target relation. It is important to note that while the relation is well defined, we do not require previously annotated data for the entities composing the relation. Our central objective is to generate balanced training data, which many advanced natural language processing models require. We apply our idea in the domain of materials science, extracting the polymer-glass transition temperature relation, and achieve 94.7% recall (i.e., sentences that contain relations annotated by humans) while reducing the text by 99.3% of the original document.
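
    The two figures reported in this abstract (recall over human-annotated sentences and the fraction of text removed) can be illustrated with a minimal sketch. The function names, the sentence-level unit of measurement, and the toy numbers below are assumptions made for illustration; the abstract does not state how the authors computed either figure.

```python
def sentence_recall(selected_ids, gold_ids):
    """Fraction of gold (human-annotated) sentences kept by the summarizer."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    return len(gold & set(selected_ids)) / len(gold)


def text_reduction(n_selected, n_total):
    """Fraction of the document removed, counted in sentences (an assumption)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 1.0 - n_selected / n_total


# Hypothetical toy document: 1000 sentences, 3 kept, 2 annotated by humans.
selected = {3, 17, 42}
gold = {3, 42}
recall = sentence_recall(selected, gold)
reduction = text_reduction(len(selected), 1000)
```

    In this toy case every annotated sentence is retained while almost all of the document is discarded, which is the trade-off the abstract describes.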

    Regularized Data Programming with Automated Bayesian Prior Selection

    The cost of manual data labeling can be a significant obstacle in supervised learning. Data programming (DP) offers a weakly supervised solution for training dataset creation, wherein the outputs of user-defined programmatic labeling functions (LFs) are reconciled through unsupervised learning. However, DP can fail to outperform an unweighted majority vote in some scenarios, including low-data contexts. This work introduces a Bayesian extension of classical DP that mitigates failures of unsupervised learning by augmenting the DP objective with regularization terms. Regularized learning is achieved through maximum a posteriori estimation with informative priors. Majority vote is proposed as a proxy signal for automated prior parameter selection. Results suggest that regularized DP improves performance relative to maximum likelihood and majority voting, confers greater interpretability, and bolsters performance in low-data regimes.
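
    As a rough illustration of the two ingredients named in this abstract, the sketch below shows an unweighted majority vote over labeling-function outputs and a maximum a posteriori (MAP) estimate of a single accuracy parameter under a Beta prior. This is a toy stand-in and not the paper's model: the names `majority_vote` and `map_accuracy`, the abstain value, the Beta prior, and the example votes are all assumptions.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention for "this labeling function does not vote"


def majority_vote(votes):
    """Unweighted majority vote over one example's LF outputs; ties abstain."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]


def map_accuracy(agree, total, alpha, beta):
    """MAP estimate of an accuracy with a Beta(alpha, beta) prior (alpha, beta > 1).

    With alpha = beta = 1 this reduces to the maximum likelihood estimate.
    """
    return (agree + alpha - 1) / (total + alpha + beta - 2)


# Hypothetical votes: rows are examples, columns are three labeling functions.
votes = [
    [1, 1, 0],
    [0, ABSTAIN, 0],
    [1, 0, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
labels = [majority_vote(row) for row in votes]

# LF 0 agrees with the majority vote on both examples where each casts a label.
mle = 2 / 2
regularized = map_accuracy(agree=2, total=2, alpha=2.0, beta=2.0)
```

    With only two observations, the maximum likelihood estimate claims perfect accuracy, while the prior pulls the MAP estimate back towards 0.5. This is the kind of low-data regularization effect the abstract refers to, shown here on a single parameter only.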

    Link communities reveal multiscale complexity in networks

    Networks have become a key approach to understanding systems of interacting objects, unifying the study of diverse phenomena including biological organisms and human society. One crucial step when studying the structure and dynamics of networks is to identify communities: groups of related nodes that correspond to functional subunits such as protein complexes or social spheres. Communities in networks often overlap such that nodes simultaneously belong to several groups. Meanwhile, many networks are known to possess hierarchical organization, where communities are recursively grouped into a hierarchical structure. However, the fact that many real networks have communities with pervasive overlap, where each and every node belongs to more than one group, has the consequence that a global hierarchy of nodes cannot capture the relationships between overlapping groups. Here we reinvent communities as groups of links rather than nodes and show that this unorthodox approach successfully reconciles the antagonistic organizing principles of overlapping communities and hierarchy. In contrast to the existing literature, which has entirely focused on grouping nodes, link communities naturally incorporate overlap while revealing hierarchical organization. We find relevant link communities in many networks, including major biological networks such as protein-protein interaction and metabolic networks, and show that a large social network contains hierarchically organized community structures spanning inner-city to regional scales while maintaining pervasive overlap. Our results imply that link communities are fundamental building blocks that reveal overlap and hierarchical organization in networks to be two aspects of the same phenomenon.
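
    The idea of grouping links rather than nodes can be sketched in a few lines. The abstract does not say how links are compared or grouped, so the choices below are assumptions made for illustration: a Jaccard similarity between the inclusive neighborhoods of the unshared endpoints of two adjacent links, and a single threshold cut in place of a full hierarchy. The sketch also assumes a simple undirected graph without self-loops.

```python
from itertools import combinations


def link_communities(edges, threshold):
    """Group edges whose similarity is at least `threshold` (single-linkage cut)."""
    edges = [tuple(sorted(e)) for e in edges]
    # Inclusive neighborhoods: each node's neighbors plus the node itself.
    neigh = {}
    for u, v in edges:
        neigh.setdefault(u, {u}).add(v)
        neigh.setdefault(v, {v}).add(u)

    parent = {e: e for e in edges}  # union-find over edges

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for e1, e2 in combinations(edges, 2):
        shared = set(e1) & set(e2)
        if len(shared) != 1:
            continue  # only links that meet at exactly one node are compared
        (i,) = set(e1) - shared
        (j,) = set(e2) - shared
        sim = len(neigh[i] & neigh[j]) / len(neigh[i] | neigh[j])
        if sim >= threshold:
            parent[find(e1)] = find(e2)

    groups = {}
    for e in edges:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())


# Hypothetical toy graph: two triangles that share the node "c".
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
communities = link_communities(edges, threshold=0.5)
memberships_of_c = sum(any("c" in e for e in com) for com in communities)
```

    Each link ends up in exactly one community, yet node "c" belongs to both, so overlap between node groups arises without any special handling. Varying the threshold would produce coarser or finer groupings, which is one simple way to read the hierarchical aspect described in the abstract.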

    Dissolved organic carbon compounds in deep-sea hydrothermal vent fluids from the East Pacific Rise at 9°50′N

    Author Posting. © The Author(s), 2018. This is the author's version of the work. It is posted here under a nonexclusive, irrevocable, paid-up, worldwide license granted to WHOI. It is made available for personal use, not for redistribution. The definitive version was published in Organic Geochemistry 125 (2018): 41-49, doi:10.1016/j.orggeochem.2018.08.004. Deep-sea hydrothermal vents are unique ecosystems that may release chemically distinct dissolved organic matter to the deep ocean. Here, we describe the composition and concentrations of polar dissolved organic compounds observed in low and high temperature hydrothermal vent fluids at 9°50′N on the East Pacific Rise. The concentration of dissolved organic carbon was 46 μM in the low temperature hydrothermal fluids and 14 μM in the high temperature hydrothermal fluids. In the low temperature vent fluids, quantifiable dissolved organic compounds were dominated by water-soluble vitamins and amino acids. Derivatives of benzoic acid and the organic sulfur compound 2,3-dihydroxypropane-1-sulfonate (DHPS) were also present in low and high temperature hydrothermal fluids. The low temperature vent fluids contain organic compounds that are central to biological processes, suggesting that they are a by-product of biological activity in the subseafloor. These compounds may fuel heterotrophic and other metabolic processes at deep-sea hydrothermal vents and beyond. This project was funded by a grant from WHOI’s Deep Ocean Exploration Institute and WHOI’s Ocean Ridge Initiative (to EBK and SMS) and by NSF OCE-1154320 (to EBK and KL), OCE-1136727 (to SMS and JSS), and OCE-1131095 (to SMS).

    When Silver Is As Good As Gold: Using Weak Supervision to Train Machine Learning Models on Social Media Data

    Over the last decade, advances in machine learning have led to exponential growth in artificial intelligence, i.e., machine learning models capable of learning from vast amounts of data to perform tasks such as text classification, regression, machine translation, speech recognition, and many others. While massive volumes of data are available, only a percentage of the data is used to train machine learning models because of the manual curation process involved in generating training datasets. The process of labeling data with a ground-truth value is extremely tedious and expensive, and is the major bottleneck of supervised learning. To address this, the theory of noisy learning can be employed, where data labeled through heuristics, knowledge bases and weak classifiers are used for training instead of data obtained through manual annotation. The assumption here is that a large volume of training data, which contains noise and is acquired through an automated process, can compensate for the lack of manual labels. In this study, we utilize heuristic-based approaches to create noisy silver standard datasets. We extensively tested the theory of noisy learning on four different applications by training several machine learning models using the silver standard dataset with several sample sizes and class imbalances, and tested the performance using a gold standard dataset. Our evaluations on the four applications indicate the success of silver standard datasets in identifying a gold standard dataset. We conclude the study with evidence that noisy social media data can be utilized for weak supervision.

    Refining and Casting of Steel

    Steel has become the most requested material all over the world during the rapid technological evolution of recent centuries. As our civilization grows and its technological development becomes connected with more demanding processes, it is more and more challenging for each steel producer to meet the required physical and mechanical properties across its huge portfolio of steel grades. It is necessary to improve the refining and casting processes continuously to meet customer requirements and to lower production costs in order to remain competitive. New challenges related to both the precise design of steel properties and the reduction in production costs are combined with paying special attention to environmental protection. These contradictory demands are the theme of this book.

    Unmasking The Language Of Science Through Textual Analyses On Biomedical Preprints And Published Papers

    Scientific communication is essential for science as it enables the field to grow. This task is often accomplished through a written form such as preprints and published papers. We can obtain a high-level understanding of science and how scientific trends adapt over time by analyzing these resources. This thesis focuses on conducting multiple analyses using biomedical preprints and published papers. In Chapter 2, we explore the language contained within preprints and examine how this language changes due to the peer-review process. We find that token differences between published papers and preprints are stylistically based, suggesting that peer review results in modest textual changes. We also discovered that preprints are eventually published and adopted quickly within the life science community. Chapter 3 investigates how biomedical terms and tokens change their meaning and usage through time. We show that multiple machine learning models can correct for the latent variation contained within the biomedical text. Also, we provide the scientific community with a listing of over 43,000 potential change points. Tokens with notable change points such as “sars” and “cas9” appear within our listing, providing some validation for our approach. In Chapter 4, we use the weak supervision paradigm to examine the possibility of speeding up the labeling function generation process for multiple biomedical relationship types. We found that the language used to describe a biomedical relationship is often distinct, leading to modest performance in terms of transferability. Exceptions to this trend are the Compound-binds-Gene and Gene-interacts-Gene relationship types.