Text Mining applied to Molecular Biology
This thesis describes the development of text-mining
algorithms for molecular biology, in particular for DNA microarray
data analysis. Concept profiles were introduced, which characterize
the context in which a gene is mentioned in literature, to retrieve
functional associations between genes. The method was shown to
efficiently annotate DNA microarray data and complement existing
methods. Concept profiles were also used for other types of concepts
and were successfully applied for functional annotation of genes
through automatic assignment of Gene Ontology terms to genes. A
generic framework has been developed based on concept profiles, dubbed
Anni (www.biosemantics.org/anni), to provide researchers with an
ontology-based interface to the literature and we demonstrated its
utility for literature-based knowledge discovery. The thesis covers the
use and development of text-mining tools to identify relations between
genes and to automatically annotate sets of genes resulting from
microarray experiments.
Comparing DNA microarray studies can reveal interesting parallels.
However, such analyses are hampered by the large influence of design,
technical and statistical factors on which genes are found to be
differentially expressed. Comparisons based on perturbed biological
processes could be more robust. Concept profiles were used to reveal
overlapping biological processes between microarray studies in a
comparative meta-analysis of 102 muscle-related microarray studies. We
demonstrated
that many more biologically meaningful links could be retrieved
between studies, even between studies without differentially expressed
genes in common
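
As a rough illustration of how concept profiles might be compared, the sketch below treats a profile as a mapping from concept to weight and scores two profiles by cosine similarity. The concept names, the weights, and the choice of cosine similarity are assumptions made for illustration; the abstract does not state which measure was used.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two concept profiles (dict: concept -> weight)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical profiles; real profiles would be derived from the literature.
profile_a = {"muscle contraction": 3.0, "calcium signaling": 4.0}
profile_b = {"muscle contraction": 4.0, "apoptosis": 3.0}
profile_c = {"DNA repair": 2.0}

sim_ab = cosine_similarity(profile_a, profile_b)  # one shared concept
sim_ac = cosine_similarity(profile_a, profile_c)  # no shared concepts
print(sim_ab, sim_ac)
```

Two studies or genes can score above zero here without sharing any gene, as long as their profiles share concepts, which is the property the abstract relies on.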
Anni 2.0: a multipurpose text-mining tool for the life sciences
Anni 2.0 provides an ontology-based interface to MEDLINE
CoPub Mapper: mining MEDLINE based on search term co-publication
BACKGROUND: High-throughput microarray analyses result in many differentially expressed genes that are potentially responsible for the biological process of interest. In order to identify biological similarities between genes, publications from MEDLINE were identified in which pairs of gene names and combinations of gene names with specific keywords were co-mentioned. RESULTS: MEDLINE search strings for 15,621 known genes and 3,731 keywords were generated and validated. PubMed IDs were retrieved from MEDLINE and the relative probability of co-occurrence of all gene-gene and gene-keyword pairs was determined. To assess gene clustering according to literature co-publication, 150 genes consisting of 8 sets with known connections (same pathway, same protein complex, same cellular localization, etc.) were run through the program. Receiver operating characteristic (ROC) analyses showed that most gene sets were clustered much better than expected by random chance. To test grouping of genes from real microarray data, 221 differentially expressed genes from a microarray experiment were analyzed with CoPub Mapper, which resulted in several relevant clusters of genes with biological process and disease keywords. In addition, all genes versus keywords were hierarchically clustered to reveal a complete grouping of published genes based on co-occurrence. CONCLUSION: The CoPub Mapper program allows for quick and versatile querying of co-published genes and keywords and can be successfully used to cluster predefined groups of genes and microarray data
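
The co-occurrence scoring described above can be sketched as follows. This is a minimal illustration, not the CoPub Mapper implementation: it assumes each search term has already been mapped to a set of PubMed IDs, and it assumes the relative score is the observed co-occurrence probability divided by the probability expected if the two terms were independent. The function name, term names, and IDs are hypothetical.

```python
from itertools import combinations


def relative_cooccurrence(term_to_pmids, total_docs):
    """Score every term pair as P(a, b) / (P(a) * P(b)).

    Values above 1 mean the pair is co-published more often than
    expected under independence; values below 1 mean less often.
    """
    scores = {}
    for a, b in combinations(sorted(term_to_pmids), 2):
        docs_a, docs_b = term_to_pmids[a], term_to_pmids[b]
        if not docs_a or not docs_b:
            continue  # a term without publications cannot be scored
        shared = len(docs_a & docs_b)
        scores[(a, b)] = shared * total_docs / (len(docs_a) * len(docs_b))
    return scores


# Toy corpus of 10 documents with made-up PubMed IDs.
toy = {
    "GENE_A": {1, 2, 3, 4},
    "GENE_B": {3, 4, 5},
    "muscle": {1, 2, 3, 4, 6, 7, 8, 9},
}
scores = relative_cooccurrence(toy, total_docs=10)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

A matrix of such scores for all genes against all keywords is the kind of input that hierarchical clustering could then group.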
Mining microarray datasets aided by knowledge stored in literature
DNA microarray technology produces large amounts of data. For data mining
of these datasets, background information on genes can be helpful.
Unfortunately, most information is stored in free text. Here, we present an
approach to use this information for DNA microarray data mining
Literature-aided meta-analysis of microarray data: a compendium study on muscle development and disease
Background: Comparative analysis of expression microarray studies is difficult due to the large influence of technical factors on experimental outcome. Still, the identified differentially expressed genes may hint at the same biological processes. However, manually curated assignment of genes to biological processes, such as pursued by the Gene Ontology (GO) consortium, is incomplete and limited. We hypothesised that automatic association of genes with biological processes through thesaurus-controlled mining of Medline abstracts would be more effective. Therefore, we developed a novel algorithm (LAMA: Literature-Aided Meta-Analysis) to quantify the similarity between transcriptomics studies. We evaluated our algorithm on a large compendium of 102 microarray studies published in the field of muscle development and disease, and compared it to similarity measures based on gene overlap and over-representation of biological processes assigned by GO. Results: While the overlap in both genes and overrepresented GO-terms was poor, LAMA retrieved many more biologically meaningful links between studies, with substantially lower influence of technical factors. LAMA correctly grouped muscular dystrophy, regeneration and myositis studies, and linked patient and corresponding mouse model studies. LAMA also retrieves the connecting biological concepts. Among other new discoveries, we associated cullin proteins, a class of ubiquitinylation proteins, with genes down-regulated during muscle regeneration, whereas ubiquitinylation was previously reported to be activated during the inverse process: muscle atrophy. Conclusion: Our literature-based association analysis is capable of finding hidden common biological denominators in microarray studies, and circumvents the need for raw data analysis or curated gene annotation databases
Adaptation to high ethanol reveals complex evolutionary pathways
Tolerance to high levels of ethanol is an ecologically and industrially relevant phenotype of microbes, but the molecular mechanisms underlying this complex trait remain largely unknown. Here, we use long-term experimental evolution of isogenic yeast populations of different initial ploidy to study adaptation to increasing levels of ethanol. Whole-genome sequencing of more than 30 evolved populations and over 100 adapted clones isolated throughout this two-year evolution experiment revealed how a complex interplay of de novo single nucleotide mutations, copy number variation, ploidy changes, mutator phenotypes, and clonal interference led to a significant increase in ethanol tolerance. Although the specific mutations differ between different evolved lineages, application of a novel computational pipeline, PheNetic, revealed that many mutations target functional modules involved in stress response, cell cycle regulation, DNA repair and respiration. Measuring the fitness effects of selected mutations introduced in non-evolved ethanol-sensitive cells revealed several adaptive mutations that had previously not been implicated in ethanol tolerance, including mutations in PRT1, VPS70 and MEX67. Interestingly, variation in VPS70 was recently identified as a QTL for ethanol tolerance in an industrial bio-ethanol strain. Taken together, our results show how, in contrast to adaptation to some other stresses, adaptation to a continuous complex and severe stress involves interplay of different evolutionary mechanisms. In addition, our study reveals functional modules involved in ethanol resistance and identifies several mutations that could help to improve the ethanol tolerance of industrial yeasts
Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection
Genes are discovered almost on a daily basis and new names have to be
found. Although there are guidelines for gene nomenclature, the naming
process is highly creative. Human genes are often named with a gene symbol
and a longer, more descriptive term; the short form is very often an
abbreviation of the long form. Abbreviations in biomedical language are
highly ambiguous, i.e., one gene symbol often refers to more than one
gene. Using an existing abbreviation expansion algorithm, we explore MEDLINE
for the use of human gene symbols derived from LocusLink. It turns out
that just over 40% of these symbols occur in MEDLINE; however, many of
these occurrences are not related to genes. In the process of making an
inventory, a disambiguation test collection is constructed automatically
Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes
MOTIVATION: The advent of high-throughput experiments in molecular biology creates a need for methods to efficiently extract and use information for large numbers of genes. Recently, the associative concept space (ACS) has been developed for the representation of information extracted from biomedical literature. The ACS is a Euclidean space in which thesaurus concepts are positioned and the distances between concepts indicate their relatedness. The ACS uses co-occurrence of concepts as a source of information. In this paper we evaluate how well the system can retrieve functionally related genes and we compare its performance with a simple gene co-occurrence method. RESULTS: To assess the performance of the ACS we composed a test set of five groups of functionally related genes. With the ACS, good scores were obtained for four of the five groups. When compared to the gene co-occurrence method, the ACS is capable of revealing more functional biological relations and can achieve results with less literature available per gene. Hierarchical clustering was performed on the ACS output, as a potential aid to users, and was found to provide useful clusters. Our results suggest that the algorithm can be of value for researchers studying large numbers of genes. AVAILABILITY: The ACS program is available upon request from the authors
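
To make the idea of distance as relatedness concrete, the sketch below ranks concepts by Euclidean distance to a query concept. It only illustrates retrieval once positions exist: the coordinates are invented, the two-dimensional space is chosen for readability, and how the ACS actually derives positions from co-occurrence is not described in the abstract and is not modelled here.

```python
import math


def nearest_concepts(positions, query, k=2):
    """Return the k concepts closest to `query` by Euclidean distance."""
    origin = positions[query]
    ranked = sorted(
        (math.dist(origin, point), name)
        for name, point in positions.items()
        if name != query
    )
    return [name for _, name in ranked[:k]]


# Hypothetical concept coordinates; a real space would have many dimensions.
positions = {
    "gene_1": (0.0, 0.0),
    "gene_2": (1.0, 0.0),
    "gene_3": (0.0, 2.0),
    "gene_4": (5.0, 5.0),
}
ranked = nearest_concepts(positions, "gene_1", k=2)
print(ranked)
```

The same pairwise distances could feed a hierarchical clustering step like the one mentioned in the abstract.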
Using contextual queries
Search engines generally treat search requests in isolation. The results
for a given query are identical, independent of the user, or the context
in which the user made the request. An approach is demonstrated that
explores implicit contexts as obtained from a document the user is
reading. The approach inserts into an original (web) document
functionality to directly activate context-driven queries that yield
related articles obtained from various information sources