20 research outputs found
Enhancing navigation in biomedical databases by community voting and database-driven text classification
Background: The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries, and to efficiently retrieve them.
Results: Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best, and no other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process, increasing speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly.
Conclusion: Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.
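The ensemble idea lends itself to a short sketch: each member is trained on a bootstrap resample of the labelled abstracts, and the fraction of members voting for a class doubles as the class probability estimate that drives the confidence visualization. A minimal stdlib sketch, with invented toy abstracts and one-token "decision stumps" standing in for the full decision trees used in the paper:

```python
import random

def tokenize(text):
    return text.lower().split()

def train_stump(docs, labels):
    """Pick the single token whose presence best predicts the positive class."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    best_tok, best_acc = vocab[0], -1.0
    for tok in vocab:
        preds = [1 if tok in tokenize(d) else 0 for d in docs]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_tok, best_acc = tok, acc
    return best_tok

def bagged_ensemble(docs, labels, n_estimators=25, seed=0):
    """Train each stump on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(docs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(train_stump([docs[i] for i in idx],
                                  [labels[i] for i in idx]))
    return stumps

def predict_proba(stumps, doc):
    """Fraction of ensemble members voting positive, read as a probability."""
    toks = set(tokenize(doc))
    return sum(s in toks for s in stumps) / len(stumps)

# Invented toy training data (1 = cancer-related)
docs = ["tumor growth and angiogenesis assay",
        "peptide imaging probe synthesis",
        "cancer cell apoptosis pathway",
        "mass spectrometry sample protocol"]
labels = [1, 0, 1, 0]
ensemble = bagged_ensemble(docs, labels)
p = predict_proba(ensemble, "novel cancer peptide targets tumor cells")
```

The vote fraction `p` is exactly the kind of per-class confidence a heat map over search results could display.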
Provenance, propagation and quality of biological annotation
PhD Thesis
Biological databases have become an integral part of the life sciences, being used
to store, organise and share ever-increasing quantities and types of data. Biological
databases are typically centred around raw data, with individual entries being
assigned to a single piece of biological data, such as a DNA sequence. Although essential,
a reader can obtain little information from the raw data alone. Therefore,
many databases aim to supplement their entries with annotation, allowing the current
knowledge about the underlying data to be conveyed to a reader. Although annotations
come in many different forms, most databases provide some form of free text
annotation.
Given that annotations can form the foundations of future work, it is important that a
user is able to evaluate the quality and correctness of an annotation. However, this is
rarely straightforward. The amount of annotation, and the way in which it is curated,
varies between databases. For example, the production of an annotation in some
databases is entirely automated, without any manual intervention. Further, sections
of annotations may be reused, being propagated between entries and, potentially,
external databases. This provenance and curation information is not always apparent
to a user.
The work described within this thesis explores issues relating to biological annotation
quality. While the most valuable annotation is often contained within free text, its lack
of structure makes it hard to assess. Initially, this work describes a generic approach
that allows textual annotations to be quantitatively measured. This approach is based
upon the application of Zipf's Law to the words within textual annotation, resulting in a
single value. The relationship between this value and Zipf's principle of least effort
provides an indication of the annotation's quality, whilst also allowing annotations
to be quantitatively compared.
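One plausible instantiation of such a Zipf-based measure is the least-squares slope of the word rank-frequency curve in log-log space; the sketch below is an illustrative assumption, not necessarily the thesis's exact formulation:

```python
import math
from collections import Counter

def zipf_exponent(text):
    """Least-squares slope of log(frequency) vs log(rank).

    Strongly Zipfian text gives a slope near -1; deviations from that
    value are one way a single number could reflect textual character.
    """
    freqs = sorted(Counter(text.lower().split()).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Invented example annotation text
annotation = ("the protein binds the receptor and the complex activates "
              "the kinase pathway in the cell")
slope = zipf_exponent(annotation)
```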
Secondly, the thesis focuses on determining annotation provenance and tracking any
subsequent propagation. This is achieved through the development of a visualisation
framework, which exploits the reuse of sentences within annotations. Utilising this
framework, a number of propagation patterns were identified, which on analysis appear
to indicate low quality and erroneous annotation.
Together, these approaches increase our understanding of the textual characteristics
of biological annotation, and suggest that this understanding can be used to increase
the overall quality of these resources.
Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches
While the biomedical informatics community widely acknowledges the utility of domain ontologies, many barriers to their effective use remain. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain's concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontologies from free-text documents. In this dissertation, I extended these methodologies to biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This review, the first of its kind, was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches, covering all three OL methods. The significance of this work is as follows: 1) The results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method yielded average new-concept acceptance rates of 21% and 11%, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%.
As a result of this study (published in the Journal of Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) The evaluation framework is flexible and general enough to analyze the performance of ontology enrichment methods in many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships are missed as domain knowledge evolves.
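The symbolic (Hearst) approach rests on lexico-syntactic patterns such as "NP such as NP", which suggest candidate is-a links for ontology enrichment. A deliberately simplified sketch, matching only single-word noun phrases with two illustrative patterns:

```python
import re

# Two classic Hearst-style patterns, simplified to single-word phrases.
HEARST_PATTERNS = [
    r"(\w+)\s+such as\s+(\w+)",          # "diseases such as melanoma"
    r"(\w+)\s*,?\s+including\s+(\w+)",   # "findings, including edema"
]

def extract_isa(sentence):
    """Return (hyponym, hypernym) candidate pairs found in the sentence."""
    pairs = []
    for pat in HEARST_PATTERNS:
        for m in re.finditer(pat, sentence, re.IGNORECASE):
            hypernym, hyponym = m.group(1), m.group(2)
            pairs.append((hyponym.lower(), hypernym.lower()))
    return pairs

# Invented clinical-style sentence
pairs = extract_isa("Malignancies such as melanoma were noted, "
                    "with findings including edema.")
# pairs contains ('melanoma', 'malignancies') and ('edema', 'findings')
```

Real systems would match full noun phrases from a parser and filter candidates against the target ontology before human review.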
Biomedical information mining from scientific literature
Joint MAP-i doctoral programme. The rapid evolution and proliferation of a world-wide computerized network,
the Internet, resulted in an overwhelming and constantly growing
amount of publicly available data and information, a fact that was also verified
in biomedicine. However, the lack of structure of textual data inhibits
its direct processing by computational solutions. Information extraction is
the task of text mining that intends to automatically collect information
from unstructured text data sources. The goal of the work described in this
thesis was to build innovative solutions for biomedical information extraction
from scientific literature, through the development of simple software
artifacts for developers and biocurators, delivering more accurate, usable
and faster results. We started by tackling named entity recognition - a crucial
initial task - with the development of Gimli, a machine-learning-based
solution that follows an incremental approach to optimize extracted linguistic
characteristics for each concept type. Afterwards, Totum was built to
harmonize concept names provided by heterogeneous systems, delivering a
robust solution with improved performance. This approach takes
advantage of heterogeneous corpora to deliver cross-corpus harmonization
that is not constrained to the characteristics of a single corpus. Since previous solutions
do not provide links to knowledge bases, Neji was built to streamline the
development of complex and custom solutions for biomedical concept name
recognition and normalization. This was achieved through a modular and
flexible framework focused on speed and performance, integrating a large
amount of processing modules optimized for the biomedical domain. To
offer on-demand heterogeneous biomedical concept identification, we developed
BeCAS, a web application, service and widget. We also tackled relation
mining by developing TrigNER, a machine-learning-based solution for
biomedical event trigger recognition, which applies an automatic algorithm
to obtain the best linguistic features and model parameters for each event
type. Finally, in order to assist biocurators, Egas was developed to support
rapid, interactive and real-time collaborative curation of biomedical documents,
through manual and automatic in-line annotation of concepts and
relations. Overall, the research work presented in this thesis contributed
to a more accurate update of current biomedical knowledge bases, towards
improved hypothesis generation and knowledge discovery.
Systems Analytics and Integration of Big Omics Data
A "genotype" is essentially an organism's full hereditary information, obtained from its parents. A "phenotype" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, and metabolism. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This "Big Data" is so large and complex that traditional data processing applications are not up to the task. Challenges arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. This Special Issue focuses on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.
Word-sense disambiguation in biomedical ontologies
With the ever-increasing volume of biomedical literature, text mining has emerged as an important technology to support bio-curation and search. Word sense disambiguation (WSD), the correct identification of terms in text in the light of ambiguity, is an important problem in text mining. Since the late 1940s, many approaches based on supervised (decision trees, naive Bayes, neural networks, support vector machines) and unsupervised machine learning (context clustering, word clustering, co-occurrence graphs) have been developed. Knowledge-based methods that make use of the WordNet computational lexicon have also been developed. But only a few make use of ontologies, i.e. hierarchical controlled vocabularies, to solve the problem, and none exploit inference over ontologies or the use of metadata from publications.
This thesis addresses the WSD problem in biomedical ontologies by suggesting different approaches for word sense disambiguation that use ontologies and metadata. The "Closest Sense" method assumes that the ontology defines multiple senses of the term; it computes the shortest path of co-occurring terms in the document to one of these senses. The "Term Cooc" method defines a log-odds ratio for co-occurring terms, including inferred co-occurrences. The "MetaData" approach trains a classifier on metadata; it does not require any ontology, but it requires training data, which the other methods do not. These approaches are compared to each other when applied to a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches, over all conditions, achieve an 80% success rate on average. The MetaData approach performs best, with 96% when trained on high-quality data; its performance deteriorates as the quality of the training data decreases. The Term Cooc approach performs better on the Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The Closest Sense approach achieves an 80% success rate on average.
Furthermore, the thesis showcases applications ranging from ontology design to semantic search in which WSD is important.
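A minimal sketch in the spirit of the "Term Cooc" method: each sense keeps a profile of terms seen with it, and an ambiguous mention receives the sense whose profile yields the higher summed log-odds over the context words. The sense labels and counts below are invented, and the thesis's inference of co-occurrences over the ontology hierarchy is omitted:

```python
import math
from collections import Counter

def log_odds(term, sense_counts, other_counts, alpha=1.0):
    """Smoothed log-odds of seeing `term` with this sense vs the others."""
    a = sense_counts.get(term, 0) + alpha   # add-alpha smoothing
    b = other_counts.get(term, 0) + alpha
    return math.log(a / b)

def disambiguate(context, profiles):
    """Return the sense whose profile best explains the context words."""
    scores = {}
    for sense, counts in profiles.items():
        other = Counter()
        for s, c in profiles.items():
            if s != sense:
                other.update(c)
        scores[sense] = sum(log_odds(w, counts, other) for w in context)
    return max(scores, key=scores.get)

# Invented co-occurrence profiles for an ambiguous term
profiles = {
    "GO:transcription": Counter({"polymerase": 8, "promoter": 5, "rna": 7}),
    "psych:transcription": Counter({"interview": 6, "audio": 4, "notes": 5}),
}
sense = disambiguate(["rna", "polymerase", "binding"], profiles)
# sense == "GO:transcription"
```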
Towards generic relation extraction
A vast amount of usable electronic data is in the form of unstructured text. The relation
extraction task aims to identify useful information in text (e.g., PersonW works
for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational
database that can be more effectively used for querying and automated reasoning.
However, adapting conventional relation extraction systems to new domains
or tasks requires significant effort from annotators and developers. Furthermore, previous
adaptation approaches based on bootstrapping start from example instances of
the target relations, thus requiring that the correct relation type schema be known in
advance. Generic relation extraction (GRE) addresses the adaptation problem by applying
generic techniques that achieve comparable accuracy when transferred, without
modification of model parameters, across domains and tasks.
Previous work on GRE has relied extensively on various lexical and shallow syntactic
indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency
information. I also introduce a dimensionality reduction step into the GRE
relation characterisation sub-task, which serves to capture latent semantic information
and leads to significant improvements over an unreduced model. Comparison of dimensionality
reduction techniques suggests that latent Dirichlet allocation (LDA) – a
probabilistic generative approach – successfully incorporates a larger and more interdependent
feature set than a model based on singular value decomposition (SVD) and
performs as well as or better than SVD in all experimental settings. Finally, I
introduce multi-document summarisation as an extrinsic test bed for GRE and present
results which demonstrate that the relative performance of GRE models is consistent
across tasks and that the GRE-based representation leads to significant improvements
over a standard baseline from the literature.
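The SVD-based reduction can be illustrated in a few lines: relation mentions become rows of a feature co-occurrence matrix, and a truncated SVD projects them into a latent space where mentions of the same relation sit closer together. The matrix below is an invented toy example; the thesis compares this SVD baseline against an LDA reduction over a much richer feature set:

```python
import numpy as np

# Rows: relation mentions; columns: lexical/syntactic features (toy data).
X = np.array([
    [3, 2, 0, 0],   # mentions of one relation type ("works-for"-like)
    [2, 3, 1, 0],
    [0, 0, 3, 2],   # mentions of another type ("encodes"-like)
    [0, 1, 2, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                      # keep the top-k latent dimensions
Z = U[:, :k] * s[:k]       # reduced representation of each mention

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cos(Z[0], Z[1])   # mentions of the same relation type
diff = cos(Z[0], Z[2])   # mentions of different relation types
```

In the latent space, mentions of the same relation should score higher similarity than mentions of different relations, which is what relation characterisation clustering relies on.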
Taken together, the experimental results 1) show that GRE can be improved using
dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE
for the content selection step of extractive summarisation and 3) validate the GRE
claim of modification-free adaptation for the first time with respect to both domain and
task. This thesis also introduces data sets derived from publicly available corpora for
the purpose of rigorous intrinsic evaluation in the news and biomedical domains
Network Analysis and Comparative Phylogenomics of MicroRNAs and their Respective Messenger RNA Targets Using Twelve Drosophila species
MicroRNAs represent a special class of small (~21–25 nucleotide) non-coding RNA molecules that exert powerful post-transcriptional control over gene expression in eukaryotes. Indeed, microRNAs likely represent the most abundant class of regulators in animal gene regulatory networks. This study describes the recovery and network analyses of a suite of homologous microRNA targets recovered through two different prediction methods for whole gene regions across twelve Drosophila species. Phylogenetic criteria under an accepted tree topology were used as a reference frame to 1) make inferences about microRNA-target predictions, 2) study mathematical properties of microRNA-gene regulatory networks, and 3) conduct novel phylogenetic analyses using character data derived from the weighted edges of the microRNA-target networks. The study investigates the evidence of natural selection and the phylogenetic signatures inherent within the microRNA regulatory networks, and quantifies the time and mutation necessary to rewire a microRNA regulatory network. Selective factors that appear to operate upon seed aptamers include cooperativity (redundancy) of interactions and transcript length. Topological analyses of microRNA regulatory networks recovered significant enrichment for a motif possessing a redundant link in all twelve species sampled. This suggests that optimization of the whole interactome topology has itself been historically subject to natural selection, where resilience to attack has offered a selective advantage. Only a modest number of microRNA–mRNA interactions exhibit conservation over Drosophila cladogenesis. The decrease in conserved microRNA-target interactions with increasing phylogenetic distance exhibited a curve typical of a saturation phenomenon. Scale-free properties of a network intersection of microRNA target prediction methods were found to transect the taxonomic hierarchy.
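The "redundant link" idea can be made concrete with a toy bipartite network: two microRNAs that share a common target form the simplest redundant motif. The edges below are invented for illustration and are not from the study's Drosophila data:

```python
from itertools import combinations

# Invented microRNA -> target edges for a toy bipartite network.
targets = {
    "miR-1":  {"geneA", "geneB", "geneC"},
    "miR-8":  {"geneB", "geneC", "geneD"},
    "miR-14": {"geneE"},
}

def redundant_pairs(targets):
    """Map each microRNA pair to the number of targets they share.

    Shared targets are the simplest proxy for the cooperative
    (redundant) regulation discussed in the abstract.
    """
    shared = {}
    for m1, m2 in combinations(sorted(targets), 2):
        common = targets[m1] & targets[m2]
        if common:
            shared[(m1, m2)] = len(common)
    return shared

shared = redundant_pairs(targets)
# shared == {("miR-1", "miR-8"): 2}
```

Motif-enrichment analyses then compare such counts against randomized networks with the same degree distribution.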