20 research outputs found
Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences
The amino acid sequence is the most intrinsic form of a protein and expresses
its primary structure. The order of amino acids in a sequence enables a
protein to acquire the particular stable conformation that is responsible for
the functions of the protein. This relationship between a sequence and its
function motivates the need to analyse sequences for predicting protein functions.
Early-generation computational methods such as BLAST and FASTA perform
function transfer based on sequence similarity against existing databases but are
computationally slow. Although machine-learning-based approaches are fast, they
fail to perform well on long protein sequences (i.e., protein sequences with
more than 300 amino acid residues). In this paper, we introduce a novel method
for constructing two separate feature sets for protein sequences based on the
analysis of 1) single fixed-sized segments and 2) multi-sized segments, using a
bi-directional long short-term memory (Bi-LSTM) network. Further, the model based on the
proposed feature sets is combined with the state-of-the-art Multi-label Linear
Discriminant Analysis (MLDA) based model to improve accuracy.
Extensive evaluations using separate datasets for biological processes and
molecular functions demonstrate promising results for both single-sized and
multi-sized segment based feature sets. The former shows improvements of
+3.37% and +5.48%, and the latter improvements of +5.38% and +8.00%,
on the two datasets over the state-of-the-art MLDA based classifier.
Combining the two models yields significant improvements of +7.41% and
+9.21% on the two datasets compared to the MLDA based classifier.
In particular, the proposed approach performs well for long protein
sequences and delivers superior overall performance.
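A minimal sketch of the segmentation step that precedes feature extraction: a variable-length sequence is cut into single fixed-sized segments and, separately, into segments at several sizes. The segment sizes and function names here are illustrative assumptions; the Bi-LSTM that consumes the segments is omitted.

```python
def fixed_segments(seq, size):
    """Split seq into non-overlapping segments of `size` residues;
    the trailing partial segment is kept as-is."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def multi_segments(seq, sizes=(50, 100, 200)):
    """Segment the same sequence at several sizes, giving one
    segment list per size (the multi-sized view)."""
    return {s: fixed_segments(seq, s) for s in sizes}

# Toy 360-residue sequence, i.e. longer than the 300-residue threshold.
seq = "MKT" * 120
print(len(fixed_segments(seq, 100)))  # → 4 (segments of 100 + 100 + 100 + 60)
```

Each segment would then be embedded and fed to the Bi-LSTM, with the per-segment outputs pooled into a fixed-length feature vector for the full sequence.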
Noisy multi-label semi-supervised dimensionality reduction
Noisy labeled data are a rich source of information that is often
easily accessible and cheap to obtain, but label noise can have many
negative consequences if not accounted for. How to fully utilize noisy labels
has been studied extensively within the framework of standard supervised
machine learning over a period of several decades. However, very little
research has been conducted on solving the challenge posed by noisy labels in
non-standard settings. This includes situations where only a fraction of the
samples are labeled (semi-supervised) and each high-dimensional sample is
associated with multiple labels. In this work, we present a novel
semi-supervised and multi-label dimensionality reduction method that
effectively utilizes information from both noisy multi-labels and unlabeled
data. With the proposed Noisy multi-label semi-supervised dimensionality
reduction (NMLSDR) method, the noisy multi-labels are denoised and unlabeled
data are labeled simultaneously via a specially designed label propagation
algorithm. NMLSDR then learns a projection matrix for reducing the
dimensionality by maximizing the dependence between the enlarged and denoised
multi-label space and the features in the projected space. Extensive
experiments on synthetic data, benchmark datasets, as well as a real-world case
study, demonstrate the effectiveness of the proposed algorithm and show that it
outperforms state-of-the-art multi-label feature extraction algorithms.
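A minimal sketch of the label-propagation idea, assuming a standard diffusion scheme over a sample-similarity graph: labels spread to unlabeled points and noisy multi-labels are smoothed toward their neighbours' labels. The Gaussian kernel, alpha, and iteration count are illustrative assumptions, not the authors' exact design.

```python
import numpy as np

def propagate_labels(X, Y, alpha=0.9, n_iter=50, sigma=1.0):
    """Diffuse a (possibly noisy, partially empty) multi-label matrix Y
    over a Gaussian-affinity graph built from the samples X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                 # no self-loops
    S = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y  # diffuse, stay anchored to Y
    return F                                 # soft scores; threshold to denoise

# Two 1-D clusters with one labeled sample each: the unlabeled samples
# pick up the label of their own cluster.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
Y = np.zeros((6, 2)); Y[0, 0] = 1.0; Y[3, 1] = 1.0
F = propagate_labels(X, Y)
print(F.argmax(1))  # → [0 0 0 1 1 1]
```

The subsequent projection step would then maximize a dependence measure (e.g. an HSIC-style criterion, as an assumption here) between these denoised labels and the features in the projected space.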
Fast Label Embeddings via Randomized Linear Algebra
Many modern multiclass and multilabel problems are characterized by
increasingly large output spaces. For these problems, label embeddings have
been shown to be a useful primitive that can improve computational and
statistical efficiency. In this work we utilize a correspondence between rank
constrained estimation and low dimensional label embeddings that uncovers a
fast label embedding algorithm which works in both the multiclass and
multilabel settings. The result is a randomized algorithm whose running time is
exponentially faster than naive algorithms. We demonstrate our techniques on
two large-scale public datasets, from the Large Scale Hierarchical Text
Challenge and the Open Directory Project, where we obtain state of the art
results.
Comment: To appear in the proceedings of the ECML/PKDD 2015 conference.
Reference implementation available at https://github.com/pmineiro/randembe
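The core primitive can be sketched with a generic randomized range finder: sketch the label matrix with a Gaussian test matrix, orthonormalize, and take an SVD of the small factor. This illustrates the randomized-linear-algebra idea only; it is not the authors' exact rank-constrained estimator.

```python
import numpy as np

def randomized_label_embedding(Y, k, oversample=5, seed=0):
    """Embed an (n_samples x n_labels) indicator matrix into k dimensions
    via a Gaussian sketch, QR, and an SVD of the small factor."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((Y.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Y @ Omega)       # orthonormal basis for range(Y @ Omega)
    B = Q.T @ Y                          # small matrix sharing Y's top row space
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    V = Vt[:k].T                         # top-k label-embedding directions
    return Y @ V                         # n_samples x k embedded labels

# A 1000-dimensional label space compressed to 4 dimensions.
Y = np.zeros((8, 1000))
Y[np.arange(8), np.arange(8) * 100] = 1.0
Z = randomized_label_embedding(Y, k=4)
print(Z.shape)  # → (8, 4)
```

The dominant cost is the sketch `Y @ Omega`, which is cheap when `Y` is sparse; that is what lets this kind of embedding scale to very large output spaces.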
Structured Sparse Methods for Imaging Genetics
Imaging genetics is an emerging and promising technique that investigates how genetic variations affect brain development, structure, and function. By exploiting disorder-related neuroimaging phenotypes, this class of studies provides a novel direction to reveal and understand the complex genetic mechanisms. Oftentimes, imaging genetics studies are challenging due to the relatively small number of subjects but the extremely high dimensionality of both imaging data and genomic data. In this dissertation, I carry on my research on imaging genetics with particular focus on two tasks: building predictive models between neuroimaging data and genomic data, and identifying disorder-related genetic risk factors through image-based biomarkers. To this end, I consider a suite of structured sparse methods, which can produce interpretable models and are robust to overfitting, for imaging genetics. With carefully designed sparsity-inducing regularizers, different biological priors are incorporated into the learning models. More specifically, in the Allen brain image--gene expression study, I adopt an advanced sparse coding approach for image feature extraction and employ a multi-task learning approach for multi-class annotation. Moreover, I propose a label-structure-based two-stage learning framework, which utilizes the hierarchical structure among labels, for multi-label annotation.
In the Alzheimer's disease neuroimaging initiative (ADNI) imaging genetics study, I employ Lasso together with EDPP (enhanced dual polytope projections) screening rules to quickly identify Alzheimer's disease risk SNPs. I also adopt the tree-structured group Lasso with MLFre (multi-layer feature reduction) screening rules to incorporate linkage disequilibrium information into the modeling. Moreover, I propose a novel absolute fused Lasso model for ADNI imaging genetics. This method utilizes SNP spatial structure and is robust to the choice of reference alleles in genotype coding.
In addition, I propose a two-level structured sparse model that incorporates gene-level networks through a graph penalty into SNP-level model construction. Lastly, I explore a convolutional neural network approach for accurately predicting Alzheimer's disease related imaging phenotypes. Experimental results on real-world imaging genetics applications demonstrate the efficiency and effectiveness of the proposed structured sparse methods.
Dissertation/Thesis: Doctoral Dissertation, Computer Science, 201
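The shared building block behind these Lasso variants is the soft-thresholding proximal step; a minimal ISTA solver for the plain Lasso illustrates it. The step size, lambda, and synthetic data below are illustrative assumptions and do not reproduce any model from the dissertation.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t * ||.||_1: shrink entries toward zero, zeroing small ones."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_w (1/2n)||Xw - y||^2 + lam * ||w||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]   # 3 "risk SNP" features
y = X @ w_true + 0.01 * rng.standard_normal(100)
w = lasso_ista(X, y, lam=0.05)
print(np.flatnonzero(np.abs(w) > 0.2))  # only the relevant features survive
```

Fused and tree-structured group variants replace this elementwise prox with proxes over coefficient differences or groups, and screening rules such as EDPP discard features whose coefficients are provably zero before solving.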
Semi-supervised deep embedded clustering
National Research Foundation (NRF) Singapore