
    Integrating high dimensional bi-directional parsing models for gene mention tagging

    Motivation: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail our gene mention tagger, which participated in the BioCreative 2 challenge, and analyze what contributes to its good performance. Our tagger is based on the conditional random fields (CRF) model, the most prevalent method for the gene mention tagging task in BioCreative 2. Our tagger is interesting because it achieved the highest F-score among CRF-based methods and the second highest overall. Moreover, we obtained our results mostly by applying open source packages, making them easy to reproduce.

    Overview of BioCreative II gene mention recognition.

    Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used and the results varied, with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.

    The gene normalization task in BioCreative III

    BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
    By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
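
The improvement percentages quoted in the RESULTS part of this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the percentages are relative gains of the composite system over the best single-team TAP-k score at each k, and the function and variable names are hypothetical, not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage gain of `improved` over `baseline` (assumed definition)."""
    return (improved - baseline) / baseline * 100.0


# TAP-k scores on the gold standard, as quoted in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

# Relative gain of the composite system at each k, rounded to one decimal.
gains = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
print(gains)  # {5: 12.4, 10: 21.8, 20: 26.6}
```

Under that assumption the computed gains match the 12.4%, 21.8%, and 26.6% reported in the abstract.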

    Effective Field Theory Methods in Gravitational Physics and Tests of Gravity

    In this PhD thesis I make use of the "Effective Field Theory of Gravity for Extended Objects" by Goldberger and Rothstein in order to investigate theories of gravity and to take a different point of view on the physical information that can be extracted from experiments. In the first work I present, I study a scalar-tensor theory of gravity and I address the renormalization of the energy-momentum tensor for point-like and string-like sources. The second and third studies I report are set in the context of testing gravity. So far experiments have probed dynamical regimes only up to order (v/c)^5 in the post-Newtonian expansion, which corresponds to the very first term of the radiative sector in General Relativity. In contrast, by means of gravitational-wave astronomy, one aims at testing General Relativity up to (v/c)^12! It is then relevant to envisage testing frameworks which are appropriate to this strong-field/radiative regime. In the last two chapters of this thesis a new such framework is presented. Using the effective field theory approach, General Relativity non-linearities are described by Feynman diagrams in which classical gravitons interact with matter sources and among themselves. By tagging the self-interaction vertices of gravitons with parameters it is possible, for example, to translate the measurement of the period decay of the Hulse-Taylor pulsar into a constraint on the three-graviton vertex at the 0.1% level; for comparison, LEP constraints on the triple-gauge-boson couplings of weak interactions are accurate at 3%. With future observations of gravitational waves, higher order graviton vertices can in principle be constrained through a Fisher matrix analysis. Comment: This PhD Thesis has been conducted at the University of Geneva (Switzerland) under the direction of Professor Michele Maggiore and the codirection of Doctor Riccardo Sturani. Version 2: abstract slightly changed; one typo corrected; layout issue fixed.

    Text Classification

    There is an abundance of text data in this world but most of it is raw. We need to extract information from this data to make use of it. One way to extract this information from raw text is to apply informative labels drawn from a pre-defined fixed set, i.e., text classification. In this thesis, we focus on the general problem of text classification, and work towards solving challenges associated with binary/multi-class/multi-label classification. More specifically, we deal with the problems of (i) zero-shot labels during testing; (ii) active learning for text screening; (iii) multi-label classification under low supervision; (iv) structured label space; (v) classifying pairs of words in raw text, i.e., relation extraction. For (i), we use a zero-shot classification model that utilizes independently learned semantic embeddings. Regarding (ii), we propose a novel active learning algorithm that reduces the problem of bias in naive active learning algorithms. For (iii), we propose a neural candidate-selector architecture that starts from a set of high-recall candidate labels to obtain high-precision predictions. In the case of (iv), we propose an attention-based neural tree decoder that recursively decodes an abstract into the ontology tree. For (v), we propose using second-order relations that are derived by explicitly connecting pairs of words via context token(s) for improved relation extraction. We use a wide variety of both traditional and deep machine learning tools. More specifically, we use traditional machine learning models like multi-valued linear regression and logistic regression for (i, ii), deep convolutional neural networks for (iii), recurrent neural networks for (iv) and transformer networks for (v).

    Identifying the Genetic Population Structure Knowledge Gaps Hindering an Improved Management of the Spurdog (Squalus acanthias) stock in the Northeast Atlantic: A Systematic Review

    Despite its long history of exploitation, there is limited information about the spurdog. Therefore, it is important to identify which general and genetic information is available for the species and what is missing to resolve stock structure and advise future management schemes. The goal of this study was to identify the knowledge gaps, in terms of genetic population structure and diversity, which could inform an improved fisheries management for the spurdog in the Northeast Atlantic Ocean and Mediterranean Sea. To achieve this, a systematic review and a series of phylogenetic analyses using the NADH2 marker were conducted. Results from the review showed there is very limited general information about the species in the study regions, with only 38 documents found out of over 6000 hits. Only 3 studies were found concerning its genetic structure and diversity, with high diversity found in all studies but no genetic differentiation, except for the subpopulation in the Adriatic Sea. No genetic structure was found for the species in the Northeast Atlantic, but fine structuring was found for the Mediterranean Sea, indicating different stocks. The phylogenetic trees showed complex taxonomical relationships within the Squalus genus and, for sequences of S. acanthias, no clear formation of monophyletic clades according to the location in which the samples were taken. Major conclusions indicate the need for collecting more information, particularly with less invasive methods, given the zero TAC in the areas and the limitations of fisheries surveys, as well as more sampling efforts in regions other than Norway, the United Kingdom, and the Adriatic Sea. Furthermore, given that the phylogenetic analysis of the species with the NADH2 marker was inconclusive, the use of more efficient genetic markers, such as microsatellites or single nucleotide polymorphisms, is recommended for identifying fine genetic structuring in the populations.