Semantic distillation: a method for clustering objects by their contextual specificity
Techniques for data mining, latent semantic analysis, contextual search of
databases, etc. were developed long ago by computer scientists working on
information retrieval (IR). Experimental scientists from all disciplines,
having to analyse large collections of raw experimental data (astronomical,
physical, biological, etc.) have developed powerful methods for their
statistical analysis and for clustering, categorising, and classifying objects.
Finally, physicists have developed a theory of quantum measurement, unifying
the logical, algebraic, and probabilistic aspects of queries into a single
formalism. The purpose of this paper is twofold: first to show that when
formulated at an abstract level, problems from IR, from statistical data
analysis, and from physical measurement theories are very similar and hence can
profitably be cross-fertilised, and, secondly, to propose a novel method of
fuzzy hierarchical clustering, termed semantic distillation and strongly
inspired by the theory of quantum measurement, that we developed to
analyse raw data coming from various types of experiments on DNA arrays. We
illustrate the method by analysing DNA array experiments and clustering the
genes of the array according to their specificity.
Comment: Accepted for publication in Studies in Computational Intelligence,
Springer-Verlag
09081 Abstracts Collection -- Similarity-based learning on structures
From 15.02. to 20.02.2009, the Dagstuhl Seminar 09081 "Similarity-based learning on structures" was held in Schloss Dagstuhl - Leibniz Center for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
An application in bioinformatics: a comparison of Affymetrix and Compugen human genome microarrays
The human genome microarrays from Compugen® and Affymetrix® were compared in the context of the emerging field of computational biology. The two premier database servers for genomic sequence data, the National Center for Biotechnology Information and the European Bioinformatics Institute, were described in detail. The various databases and data mining tools available through these data servers were also discussed. Microarrays were examined from a historical perspective, and their main current applications (expression analysis, mutation analysis, and comparative genomic hybridization) were discussed. The two main types of microarrays, cDNA spotted microarrays and high-density spotted microarrays, were analyzed by exploring the human genome microarray from Compugen® and the HG-U133 Set from Affymetrix®, respectively. Array design issues, sequence collection and analysis, and probe selection processes for the two representative types of arrays were described. The respective chip designs of the two types of microarrays were also analyzed. It was found that the human genome microarray from Compugen® contains probes that interrogate 1,119,840 bases corresponding to 18,664 genes, while the HG-U133 Set from Affymetrix® contains probes that interrogate only 825,000 bases corresponding to 33,000 genes. Based on this, the efficiency of the 25-mer probes of the HG-U133 Set from Affymetrix® compared to the 60-mer probes of the microarray from Compugen® was questioned.
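The figures quoted in this abstract can be checked with a short calculation: dividing the interrogated bases by the number of genes gives the bases covered per gene on each platform. The sketch below uses only the numbers stated in the abstract; the function name `bases_per_gene` is a hypothetical helper introduced for illustration.

```python
def bases_per_gene(total_bases: int, genes: int) -> float:
    """Average number of interrogated bases per gene."""
    return total_bases / genes


# Figures as quoted in the abstract.
compugen = bases_per_gene(1_119_840, 18_664)
affymetrix = bases_per_gene(825_000, 33_000)

print(f"Compugen:   {compugen:.0f} bases per gene")
print(f"Affymetrix: {affymetrix:.0f} bases per gene")
```

With these inputs the ratios come out to 60 and 25 bases per gene, which match the 60-mer and 25-mer probe lengths named in the abstract.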
Machine learning methods for histopathological image analysis
Abundant accumulation of digital histopathological images has led to the
increased demand for their analysis, such as computer-aided diagnosis using
machine learning techniques. However, digital pathological images and related
tasks have some issues to be considered. In this mini-review, we introduce the
application of digital pathological image analysis using machine learning
algorithms, address some problems specific to such analysis, and propose
possible solutions.
Comment: 23 pages, 4 figures
Bioinformatics protocols for analysis of functional genomics data applied to neuropathy microarray datasets
Microarray technology allows the simultaneous measurement of the
abundance of thousands of transcripts in living cells. The high-throughput
nature of microarray technology means that automatic analytical procedures
are required to handle the sheer amount of data typically generated in a single
microarray experiment. Along these lines, this work presents a contribution to
the automatic analysis of microarray data by attempting to construct protocols
for the validation of publicly available microarray analysis methods.
At the experimental level, an evaluation of amplification of RNA targets prior
to hybridisation with the physical array was undertaken. This had the
important consequence of revealing the extent to which the significance of
intensity ratios between varying biological conditions may be compromised
following amplification as well as identifying the underlying cause of this
effect. On the basis of these findings, recommendations regarding the usability
of RNA amplification protocols with microarray screening were drawn in the
context of varying microarray experimental conditions.
On the data analysis side, this work has had the important outcome of
developing an automatic framework for the validation of functional analysis
methods for microarray data. This is based on using a GO semantic similarity
scoring metric to assess the similarity between functional terms found
enriched by functional analysis of a model dataset and those anticipated from
prior knowledge of the biological phenomenon under study. Using this
validation system, this work has shown, for the first time, that ‘Catmap’, an
early functional analysis method, performs better than the more recent and
most popular methods of its kind. Crucially, the effectiveness of this
validation system implies that such a system may be reliably adopted for the
validation of newly developed functional analysis methods for microarray data.
Meta-analysis of muscle transcriptome data using the MADMuscle database reveals biologically relevant gene patterns
Background: DNA microarray technology has had a great impact on muscle research, and microarray gene expression data have been widely used to identify gene signatures characteristic of the studied conditions. With the rapid accumulation of muscle microarray data, it is of great interest to understand how to compare and combine data across multiple studies. Meta-analysis of transcriptome data is a valuable way to achieve this: it makes it possible to highlight gene signatures conserved across multiple independent studies. However, its use is made difficult by the diversity of the available data: different microarray platforms, different gene nomenclature, different species studied, etc.
Description: We have developed a system dedicated to muscle transcriptome data. This system comprises a collection of microarray data as well as a query tool. The latter allows the user to extract similar clusters of co-expressed genes from the database, using an input gene list. Common and relevant gene signatures can thus be searched more easily. The dedicated database consists of a large compendium of public data (more than 500 data sets) related to muscle (skeletal and heart). These studies included seven different animal species, from invertebrates (Drosophila melanogaster, Caenorhabditis elegans) and vertebrates (Homo sapiens, Mus musculus, Rattus norvegicus, Canis familiaris, Gallus gallus). After a renormalization step, clusters of co-expressed genes were identified in each dataset. The lists of co-expressed genes were annotated using a unified re-annotation procedure. These gene lists were compared to find significant overlaps between studies.
Conclusions: Applied to this large compendium of data sets, meta-analyses demonstrated that conserved patterns between species could be identified. Focusing on a specific pathology (Duchenne Muscular Dystrophy), we validated results across independent studies and revealed robust biomarkers and new pathways of interest. The meta-analyses performed with MADMuscle show the usefulness of this approach. Our method can be applied to all public transcriptome data.
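The abstract does not say how the significance of an overlap between two gene lists was assessed. One common choice, assumed here purely for illustration, is a one-sided hypergeometric test: given a shared gene universe of a known size, it gives the probability of seeing at least the observed overlap by chance. The function name, parameters and gene identifiers below are hypothetical and are not taken from MADMuscle.

```python
from math import comb


def overlap_p_value(universe_size, list_a, list_b):
    """Return (overlap, p) where p = P(X >= overlap) under a hypergeometric model.

    Assumes both lists are drawn from the same universe of `universe_size` genes.
    """
    a, b = set(list_a), set(list_b)
    k = len(a & b)                      # observed overlap
    K, n = len(a), len(b)               # sizes of the two lists
    total = comb(universe_size, n)      # all ways to draw a list of size n
    tail = sum(
        comb(K, i) * comb(universe_size - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return k, tail / total


# Toy example with made-up gene identifiers.
k, p = overlap_p_value(10, ["g1", "g2", "g3"], ["g2", "g3", "g4"])
print(k, round(p, 4))
```

Exact integer binomials from `math.comb` keep the sketch dependency-free. For realistic universe sizes a statistics library would be the more practical choice, and a correction for multiple comparisons would be needed when many list pairs are tested.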
Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data
Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes at the molecular level. The interpretation of microarray gene expression experiments profits from knowledge of the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free text articles. Text mining makes it possible to automatically extract and use information from texts.
This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results.
Part I deals with biomedical text mining:
Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues.
In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the databases used (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005).
In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007).
In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to specify relation types further, e.g., as activating, direct physical, or gene regulatory relations.
Part II deals with gene expression data analysis:
Gene expression data needs to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks.
Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is in the focus of the analyzed gene expression data sets.
In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed, which is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006).
The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models.
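The two processing steps named for Part II, normalization and quantification of differential expression by fold change, can be illustrated with a minimal sketch. It is not the method proposed in the thesis: median centering is assumed as the normalization, the fold change is computed as a difference of mean log2 intensities, and p-value determination is left out. All function names and intensity values are hypothetical.

```python
import math
from statistics import mean, median


def median_center(log_values):
    """Shift log-scale intensities of one array so that their median is zero."""
    m = median(log_values)
    return [v - m for v in log_values]


def log2_fold_change(treated, control):
    """Difference of mean log2 intensities for one gene (treated minus control)."""
    return mean(math.log2(x) for x in treated) - mean(math.log2(x) for x in control)


# Toy intensities for a single gene measured on two arrays per condition.
lfc = log2_fold_change([8, 8], [2, 2])
centered = median_center([1.0, 2.0, 4.0])
print(lfc, centered)
```

Working on the log2 scale makes up- and down-regulation symmetric around zero, so a value of 2 here corresponds to a four-fold increase.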
Part III deals with integrated approaches and thus provides the connection between parts I and II:
Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining.
In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment together with literature information that supports interpretation.
Finally, in Chapter 11, ideas on how the described methods can contribute to current research and possible future directions are presented.