Mixture of experts models to exploit global sequence similarity on biomolecular sequence labeling
Background: Identification of functionally important sites in biomolecular sequences has broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Experimental determination of such sites lags far behind the number of known biomolecular sequences. Hence, there is a need to develop reliable computational methods for identifying functionally important sites from biomolecular sequences.
Results: We present a mixture of experts approach to biomolecular sequence labeling that takes into account the global similarity between biomolecular sequences. Our approach combines unsupervised and supervised learning techniques. Given a set of sequences and a similarity measure defined on pairs of sequences, we learn a mixture of experts model by using spectral clustering to learn the hierarchical structure of the model and by using Bayesian techniques to combine the predictions of the experts. We evaluate our approach on two biomolecular sequence labeling problems: RNA-protein and DNA-protein interface prediction problems. The results of our experiments show that global sequence similarity can be exploited to improve the performance of classifiers trained to label biomolecular sequence data.
Conclusion: The mixture of experts model helps improve the performance of machine learning methods for identifying functionally important sites in biomolecular sequences. This is a proceeding from IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 10 (2009): S4, doi: 10.1186/1471-2105-10-S4-S4. Posted with permission.
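The combination step described in the Results can be illustrated with a minimal sketch. It assumes the simplest form of a mixture of experts, in which each expert's predicted probability is weighted by how strongly the query sequence belongs to that expert's cluster. The function name, the gate weights, and the numbers are hypothetical and are not taken from the paper, whose hierarchical model and Bayesian combination may differ in detail.

```python
from typing import Sequence


def combine_expert_predictions(
    gate_weights: Sequence[float],
    expert_probs: Sequence[float],
) -> float:
    """Weighted average of expert probabilities (illustrative only).

    gate_weights: non-negative cluster-membership scores for one sequence,
                  e.g. derived from its similarity to each cluster (assumed).
    expert_probs: each expert's probability that a residue is an interface site.
    """
    if len(gate_weights) != len(expert_probs):
        raise ValueError("need one gate weight per expert")
    total = sum(gate_weights)
    if total <= 0:
        raise ValueError("gate weights must sum to a positive value")
    # Normalise the gate weights so they act as a distribution over experts.
    return sum(w / total * p for w, p in zip(gate_weights, expert_probs))


# Hypothetical example: the sequence is three times closer to cluster 0
# than to cluster 1, and the two experts disagree about the residue.
p_site = combine_expert_predictions([0.75, 0.25], [0.8, 0.4])
print(round(p_site, 3))  # prints 0.7
```

In this sketch an expert trained on sequences similar to the query dominates the final label, which is one way global sequence similarity can influence a per-residue prediction.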
Mixture of experts models to exploit global sequence similarity on biomolecular sequence labeling
Article discussing models for increasing the reliability of computational methods for identifying functionally important sites from biomolecular sequences
UNIPred: Unbalance-aware Network Integration and Prediction of protein functions
The proper integration of multiple sources of data and the unbalance between annotated and unannotated proteins represent two of the main issues of the Automated Function Prediction (AFP) problem. Most supervised and semi-supervised learning algorithms for AFP proposed in the literature do not consider these two issues jointly. This hurts both sensitivity and precision, because most functional classes are characterized by an unbalance between annotated and unannotated proteins and because each available source of data carries specific and complementary information. We propose UNIPred (Unbalance-aware Network Integration and Prediction of protein functions), an algorithm that properly combines different biomolecular networks and predicts protein functions using parametric semi-supervised neural models. The algorithm explicitly takes into account the unbalance between unannotated and annotated proteins both to construct the integrated network and to predict protein annotations for each functional class. Full-genome and ontology-wide experiments with three Eukaryotic model organisms show that the proposed method compares favourably with state-of-the-art learning algorithms for AFP.
From Text to Knowledge
The global information space provided by the World Wide Web has dramatically changed the way knowledge is shared all over the world. To make this unbelievably huge information space accessible, search engines index the uploaded contents and provide efficient algorithmic machinery for ranking the importance of documents with respect to an input query. All major search engines such as Google, Yahoo or Bing are keyword-based, which is indisputably a very powerful approach for satisfying information needs centered around documents.
However, this unstructured, document-oriented paradigm of the World Wide Web has serious drawbacks when searching for specific knowledge about real-world entities.
When asked for advanced facts about entities, today's search engines are not very good at providing accurate answers. Hand-built knowledge bases such as Wikipedia or its structured counterpart DBpedia are excellent sources that provide common facts. However, these knowledge bases are far from complete, and most of the knowledge still lies buried in unstructured documents.
Statistical machine learning methods have great potential to help bridge the gap between text and knowledge by (semi-)automatically transforming the unstructured representation of today's World Wide Web into a more structured representation. This thesis is devoted to reducing this gap with Probabilistic Graphical Models. Probabilistic Graphical Models play a crucial role in modern pattern recognition as they merge two important fields of applied mathematics: Graph Theory and Probability Theory.
The first part of the thesis will present a novel system called Text2SemRel that is able to (semi-)automatically construct knowledge bases from textual document collections. The resulting knowledge base consists of facts centered around entities and their relations.
An essential part of the system is a novel algorithm for extracting relations between entity mentions that is based on Conditional Random Fields, which are Undirected Probabilistic Graphical Models.
In the second part of the thesis, we will use the power of Directed Probabilistic Graphical Models to solve important knowledge discovery tasks in large, semantically annotated document collections. In particular, we present extensions of the Latent Dirichlet Allocation framework that are able to learn, in an unsupervised way, the statistical semantic dependencies between unstructured representations such as documents and their semantic annotations. Semantic annotations of documents might refer to concepts originating from a thesaurus or ontology, but also to user-generated informal tags in social tagging systems. These forms of annotations represent a first step towards the conversion to a more structured form of the World Wide Web.
In the last part of the thesis, we demonstrate the large-scale applicability of the proposed fact extraction system Text2SemRel. In particular, we extract semantic relations between genes and diseases from a large biomedical textual repository. The resulting knowledge base contains far more potential disease genes than are currently stored in curated databases. Thus, the proposed system is able to unlock knowledge currently buried in the literature. The literature-derived human gene-disease network is the subject of further analysis with respect to existing curated state-of-the-art databases. We analyze the derived knowledge base quantitatively by comparing it with several curated databases with regard to the size of the databases and the properties of known disease genes, among other things. Our experimental analysis shows that the facts extracted from the literature are of high quality.
A Study on the Generalization of Features for Named Entity Recognition
Degree type: Doctorate by coursework. University of Tokyo (東京大学)
Machine Learning
Machine Learning can be defined in various ways, all related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some human-like intelligent behavior. More specifically, machine learning addresses the ability to improve automatically through experience.
Variational Multi-Task Models for Image Analysis: Applications to Magnetic Resonance Imaging
This thesis deals with the study and development of several variational multi-task models for solving inverse problems in imaging, with a particular focus on Magnetic Resonance Imaging (MRI). In most image processing problems, one usually deals with the reconstruction task, i.e., the task of reconstructing an image from indirect measurements, and then performs various operations, one after the other (i.e. sequentially), to improve the quality of the reconstruction and to extract useful information.
However, recent developments in a variational context have shown that performing those tasks jointly (i.e. in a multi-task framework) offers great benefits, and this is the perspective that we follow in this thesis. We go beyond traditional sequential approaches and set a new basis for variational multi-task methods for MRI analysis. We demonstrate that by sharing representations between tasks and carefully interconnecting them, one can create synergies across challenging problems and reduce error propagation.
More precisely, we first propose a multi-task variational model to tackle the problems of image reconstruction and image segmentation using non-convex Bregman iteration. We describe theoretical and numerical details of the problem and its optimisation scheme. Moreover, we show that our multi-task model achieves better results in several examples and MRI applications than existing approaches in the same context.
Secondly, we show that our approach can be extended to a multi-task reconstruction and segmentation model for the nonlinear inverse problem of velocity-encoded MRI. In this context, the aim is to estimate not only the magnitude from MRI data, but also the phase and its flow information, whilst simultaneously identifying regions of interest through the segmentation task.
Finally, we go beyond two-task frameworks and introduce for the first time a variational multi-task model to handle three imaging tasks. To this end, we design a variational multi-task framework addressing reconstruction, super-resolution and registration for improving the quality of MRI reconstruction. We demonstrate that our model is theoretically well-motivated and that it outperforms sequential models whilst incurring a lower computational cost. Furthermore, we show through experimental results the potential of this approach for clinical applications.