3 research outputs found

    Linking named entities to Wikipedia

    Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities, just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case, Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to; or, if the KB does not contain the correct entry, return NIL. Entity linking systems can be complex, and we present a framework for analysing their different components. We use this framework to analyse three seminal systems, evaluated on a common dataset, and show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. This not only makes for a more precise match, but also allows us to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.

    Genomic signatures of sex, selection and speciation in the microbial world

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Computational and Systems Biology Program, 2010. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 218-228). Understanding the microbial world is key to understanding global biogeochemistry, human health and disease, yet this world is largely inaccessible. Microbial genomes, an increasingly accessible data source, provide an ideal entry point. The genome sequences of different microbes may be compared using the tools of population genetics to infer the important genetic changes that allow them to diversify ecologically and adapt to distinct ecological niches. Yet the toolkit of population genetics was developed largely with sexual eukaryotes in mind. In this work, I assess and develop tools for inferring natural selection in microbial genomes. Many tools rely on population genetics theory, and thus require defining distinct populations, or species, of bacteria. Because sex (recombination) is not required for reproduction, some bacteria recombine only rarely, while others are extremely promiscuous, exchanging genes across great genetic distances. This behavior poses a challenge for defining microbial population boundaries. This thesis begins with a discussion of how recombination and positive selection interact to promote ecological adaptation. I then describe a general pipeline for quantifying the impacts of mutation, recombination and selection on microbial genomes, and apply it to two closely related, yet ecologically distinct populations of Vibrio splendidus, each with its own microhabitat preference. I introduce a new tool, STARRInIGHTS, for inferring homologous recombination events. By assessing rates of recombination within and between ecological populations, I conclude that ecological differentiation is driven by a small number of habitat-specific alleles, while most loci are shared freely across habitats. The remainder of the thesis focuses on lineage-specific changes in natural selection among anciently diverged species of gamma proteobacteria. I develop two new metrics, selective signatures and slow:fast, for detecting deviations from the expected rate of evolution in 'core' proteins (present in single copy in most species). Because they rely on empirical distributions of evolutionary rates across species, these methods should become increasingly powerful as more microbial genomes are sampled. Overall, the methods described here significantly expand the repertoire of tools available for microbial population genomics, both for investigating the process of ecological differentiation at the finest of time scales, and over billions of years of microbial evolution. by B. Jesse Shapiro. Ph.D.