Search CORE

37,347 research outputs found

An empirical comparison of supervised machine learning techniques in bioinformatics

Author: Gilbert D
Tan A C
Publication venue: Australian Computer Society
Publication date: 01/01/2003
Field of study

Research in bioinformatics is driven by the experimental data. Current biological databases are populated by vast amounts of experimental data. Machine learning has been widely applied to bioinformatics and has gained a lot of success in this research area. At present, with various learning algorithms available in the literature, researchers are facing difficulties in choosing the best method that can apply to their data. We performed an empirical study on 7 individual learning systems and 9 different combined methods on 4 different biological data sets, and provide some suggested issues to be considered when answering the following questions: (i) How does one choose which algorithm is best suitable for their data set? (ii) Are combined methods better than a single approach? (iii) How does one compare the effectiveness of a particular algorithm to the others

CiteSeerX

Brunel University Research Archive

Identification of functionally related enzymes by learning-to-rank methods

Author: Airola Antti
De Baets Bernard
Fober Thomas
Glinca Serghei
Hüllermeier Eyke
Klebe Gerhard
Pahikkala Tapio
Stock Michiel
Waegeman Willem
Publication venue
Publication date: 01/01/2014
Field of study

Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes

arXiv.org e-Print Archive

Ghent University Academic Bibliography

Extracting protein-protein interactions from text using rich feature vectors and feature selection

Author: De Baets Bernard
Saeys Yvan
Van de Peer Yves
Van Landeghem Sofie
Publication venue: Turku Centre for Computer Sciences (TUCS)
Publication date: 01/01/2008
Field of study

Because of the intrinsic complexity of natural language, automatically extracting accurate information from text remains a challenge. We have applied rich featurevectors derived from dependency graphs to predict protein-protein interactions using machine learning techniques. We present the first extensive analysis of applyingfeature selection in this domain, and show that it can produce more cost-effective models. For the first time, our technique was also evaluated on several large-scalecross-dataset experiments, which offers a more realistic view on model performance. During benchmarking, we encountered several fundamental problems hindering comparability with other methods. We present a set of practical guidelines to set up ameaningful evaluation. Finally, we have analysed the feature sets from our experiments before and after feature selection, and evaluated the contribution of both lexical and syntacticinformation to our method. The gained insight will be useful to develop better performing methods in this domain

Ghent University Academic Bibliography

DNALinux Virtual Desktop Edition

Author: Sebastian Bassi
Virginia V. C. Gonzalez
Publication venue
Publication date: 09/08/2007
Field of study

The new version of DNALinux (VDE) is presented. DNALinux VDE is a departure from traditional distributions since it uses a virtual machine to bundle together the operating system and bioinformatics applications. The main advantage of this approach is that a virtualized environment doesn't affect a installed system. With a virtual machine a Linux system can be run under a Windows system, provided that the virtual machine player is installed. The included programs are listed and specifications to add more programs are explained. We believe that DNALinux could be used as a standardized virtual machine for learning, using, developing and testing bioinformatics applications

Crossref

Nature Precedings

Recommended from our members

Integrative machine learning approach for multi-class SCOP protein fold classification

Author: Deville Y
Gilbert D
Tan A C
Publication venue: GCB
Publication date: 01/01/2003
Field of study

Classification and prediction of protein structure has been a central research theme in structural bioinformatics. Due to the imbalanced distribution of proteins over multi SCOP classification, most discriminative machine learning suffers the well-known ‘False Positives ’ problem when learning over these types of problems. We have devised eKISS, an ensemble machine learning specifically designed to increase the coverage of positive examples when learning under multiclass imbalanced data sets. We have applied eKISS to classify 25 SCOP folds and show that our learning system improved over classical learning methods

Brunel University Research Archive

Bioinformatics tools in predictive ecology: Applications to fisheries

Author: Allan Tucker
Anvar Y.
Bishop C. M.
Bundy A.
Choi J. S.
Daniel Duplisea
Ghahramani Z.
Hand D. J.
Hartemink A. J.
Imoto S.
Langley P.
Liang S.
Pe'er D.
Pe'er D.
Pearl J.
Spirtes P.
Steele E.
Publication venue: 'The Royal Society'
Publication date: 19/01/2012
Field of study

This article is made available throught the Brunel Open Access Publishing Fund - Copygith @ 2012 Tucker et al.There has been a huge effort in the advancement of analytical techniques for molecular biological data over the past decade. This has led to many novel algorithms that are specialized to deal with data associated with biological phenomena, such as gene expression and protein interactions. In contrast, ecological data analysis has remained focused to some degree on off-the-shelf statistical techniques though this is starting to change with the adoption of state-of-the-art methods, where few assumptions can be made about the data and a more explorative approach is required, for example, through the use of Bayesian networks. In this paper, some novel bioinformatics tools for microarray data are discussed along with their ‘crossover potential’ with an application to fisheries data. In particular, a focus is made on the development of models that identify functionally equivalent species in different fish communities with the aim of predicting functional collapse

Crossref

PubMed Central

Brunel University Research Archive

Exact and efficient top-K inference for multi-target prediction by querying separable linear relational models

Author: De Baets Bernard
Dembczynski Krzysztof
Stock Michiel
Waegeman Willem
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

Many complex multi-target prediction problems that concern large target spaces are characterised by a need for efficient prediction strategies that avoid the computation of predictions for all targets explicitly. Examples of such problems emerge in several subfields of machine learning, such as collaborative filtering, multi-label classification, dyadic prediction and biological network inference. In this article we analyse efficient and exact algorithms for computing the top-

K

predictions in the above problem settings, using a general class of models that we refer to as separable linear relational models. We show how to use those inference algorithms, which are modifications of well-known information retrieval methods, in a variety of machine learning settings. Furthermore, we study the possibility of scoring items incompletely, while still retaining an exact top-K retrieval. Experimental results in several application domains reveal that the so-called threshold algorithm is very scalable, performing often many orders of magnitude more efficiently than the naive approach

arXiv.org e-Print Archive

Ghent University Academic Bibliography

ProLanGO: Protein Function Prediction Using Neural~Machine Translation Based on a Recurrent Neural Network

Author: Cao Renzhi
Chan Leong
Chen Zhangxin
Freitas Colton
Jiang Haiqing
Sun Miao
Publication venue
Publication date: 01/10/2017
Field of study

With the development of next generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from protein sequences because of limitations of traditional biological experimental techniques. Protein function prediction has been a long standing challenge to fill the gap between the huge amount of protein sequences and the known function. In this paper, we propose a novel method to convert the protein function problem into a language translation problem by the new proposed protein sequence language "ProLan" to the protein function language "GOLan", and build a neural machine translation model based on recurrent neural networks to translate "ProLan" language to "GOLan" language. We blindly tested our method by attending the latest third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluate the performance of our methods on selected proteins whose function was released after CAFA competition. The good performance on the training and testing datasets demonstrates that our new proposed method is a promising direction for protein function prediction. In summary, we first time propose a method which converts the protein function prediction problem to a language translation problem and applies a neural machine translation model for protein function prediction.Comment: 13 pages, 5 figure

arXiv.org e-Print Archive

Directory of Open Access Journals

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California