984 research outputs found
Structured Sparse Methods for Imaging Genetics
Imaging genetics is an emerging and promising technique that investigates how genetic variations affect brain development, structure, and function. By exploiting disorder-related neuroimaging phenotypes, this class of studies provides a novel direction to reveal and understand the complex genetic mechanisms. Imaging genetics studies are often challenging because the number of subjects is relatively small while both the imaging data and the genomic data are extremely high-dimensional. In this dissertation, I pursue research on imaging genetics with a particular focus on two tasks: building predictive models between neuroimaging data and genomic data, and identifying disorder-related genetic risk factors through image-based biomarkers. To this end, I consider a suite of structured sparse methods for imaging genetics that produce interpretable models and are robust to overfitting. With carefully designed sparsity-inducing regularizers, different biological priors are incorporated into learning models. More specifically, in the Allen brain image--gene expression study, I adopt an advanced sparse coding approach for image feature extraction and employ a multi-task learning approach for multi-class annotation. Moreover, I propose a label-structure-based two-stage learning framework, which utilizes the hierarchical structure among labels, for multi-label annotation. In the Alzheimer's disease neuroimaging initiative (ADNI) imaging genetics study, I employ Lasso together with EDPP (enhanced dual polytope projections) screening rules to quickly identify Alzheimer's disease risk SNPs. I also adopt the tree-structured group Lasso with MLFre (multi-layer feature reduction) screening rules to incorporate linkage disequilibrium information into modeling. Moreover, I propose a novel absolute fused Lasso model for ADNI imaging genetics. This method utilizes SNP spatial structure and is robust to the choice of reference alleles of genotype coding. In addition, I propose a two-level structured sparse model that incorporates gene-level networks through a graph penalty into SNP-level model construction. Lastly, I explore a convolutional neural network approach for accurately predicting Alzheimer's disease related imaging phenotypes. Experimental results on real-world imaging genetics applications demonstrate the efficiency and effectiveness of the proposed structured sparse methods.
Dissertation/Thesis. Doctoral Dissertation, Computer Science 201
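As a rough illustration of the plain Lasso model mentioned in the abstract above, here is a minimal sketch of cyclic coordinate descent with soft-thresholding. It assumes the objective (1/2n)||y - Xw||^2 + lam * ||w||_1; the function names, the toy data, and the value of `lam` are hypothetical, and the sketch does not implement the EDPP or MLFre screening rules, nor is it the dissertation's own solver.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal map of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent.

    X is a list of rows (samples), y a list of targets. Illustrative only.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w


# Toy data (hypothetical): y depends only on the first feature.
X_toy = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y_toy = [2.0, 4.0, 6.0, 8.0]
w_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
print(w_toy)
```

On this toy data the L1 penalty sets the coefficient of the irrelevant second feature exactly to zero, which is the sparsity property that makes such models interpretable.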
DEEPred: Automated Protein Function Prediction with Multi-task Feed-forward Deep Neural Networks
Automated protein function prediction is critical for the annotation of uncharacterized protein sequences, where accurate prediction methods are still required. Recently, deep-learning-based methods have outperformed conventional algorithms in computer vision and natural language processing due to the prevention of overfitting and efficient training. Here, we propose DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, as a solution to Gene Ontology (GO) based protein function prediction. DEEPred was optimized through rigorous hyper-parameter tests, and benchmarked using three types of protein descriptors, training datasets of varying sizes, and GO terms from different levels. Furthermore, in order to explore how training with larger but potentially noisy data would change the performance, electronically made GO annotations were also included in the training process. The overall predictive performance of DEEPred was assessed using CAFA2 and CAFA3 challenge datasets, in comparison with the state-of-the-art protein function prediction methods. Finally, we evaluated selected novel annotations produced by DEEPred with a literature-based case study considering the 'biofilm formation process' in Pseudomonas aeruginosa. This study reports that deep learning algorithms have significant potential in protein function prediction, particularly when the source data is large. The neural network architecture of DEEPred can also be applied to the prediction of other types of ontological associations. The source code and all datasets used in this study are available at: https://github.com/cansyl/DEEPred
Network modularity and local environment similarity as descriptors of protein structure
As the number of solved protein structures increases, the opportunities for meta-analysis of this dataset increase too. Here we explore two approaches for analysing protein structure, both starting from the three-dimensional co-ordinates of each atom within the structure, which are then abstracted into a more useful form.
The first method transforms the protein into a network in which its amino acids are the nodes, and where the edges are generated using a simple proximity test. By applying the Infomap community detection algorithm, we can fragment the protein into highly intra-connected subregions - these subregions are compact and globular, and can be compared with known structural and functional subunits of the protein (also known as domains). By performing this fragmentation process systematically across a large set of proteins, and checking for structurally conserved fragments, we can search for novel candidate domains. This method for automatically decomposing a protein into compact substructures may also be useful in coarse-graining molecular dynamics, analysing the protein’s topology, in de novo protein design, or in fitting electron density maps derived from single particle electron microscopy.
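The network construction described above can be sketched as follows. This is a minimal illustration, assuming one representative 3D coordinate per amino acid and a distance cutoff as the "simple proximity test"; the function name, the cutoff value, and the toy coordinates are hypothetical, and the Infomap community detection step is not implemented here.

```python
import math


def build_contact_network(coords, cutoff):
    """Return an adjacency dict linking residues closer than cutoff.

    coords: list of (x, y, z) tuples, one per residue (an assumption of
    this sketch; the abstract starts from all-atom coordinates).
    """
    adjacency = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


# Toy example (hypothetical coordinates): residues 0 and 1 are in contact,
# residue 2 is far from both.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(build_contact_network(toy_coords, cutoff=2.0))
```

The resulting adjacency structure is the kind of input a community detection algorithm would then partition into highly intra-connected subregions.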
The second method calculates a descriptor for each atom of the protein based on its local environment, known as a Smooth Overlap of Atomic Positions (SOAP) descriptor. Using these descriptors we can perform overall comparisons of the subregions identified above. In addition, by comparing the descriptors of a set of proteins known to share common structural or functional features (such as binding of a particular ligand), we can automatically identify the most highly conserved atoms of the set. These atoms may line ligand binding pockets or correspond to allosteric sites, which could inform drug design
Predict gram-positive and gram-negative subcellular localization via incorporating evolutionary information and physicochemical features into Chou’s general PseAAC
In this study, we used structural and evolutionary-based features to represent the sequences of gram-positive and gram-negative subcellular localizations. To do this, we proposed a normalization method to construct a normalized Position Specific Scoring Matrix (PSSM) using the information from the original PSSM. To investigate the effectiveness of the proposed method, we computed feature vectors from the normalized PSSM and, by applying a Support Vector Machine (SVM) and a Naïve Bayes classifier, respectively, compared the achieved results with previously reported results. We also computed features from the original PSSM and the normalized PSSM and compared their results. The achieved results show enhancement in gram-positive and gram-negative subcellular localizations. Evaluating localization for each feature, our results indicate that employing SVM with concatenated features (amino acid composition feature, Dubchak feature (physicochemical-based features), normalized PSSM based auto-covariance feature and normalized PSSM based bigram feature) yields higher accuracy, while employing the Naïve Bayes classifier with the normalized PSSM based auto-covariance feature proves to have high sensitivity for both benchmarks. Our reported overall locative accuracy is 84.8% and overall absolute accuracy is 85.16% for the gram-positive dataset; for the gram-negative dataset, overall locative accuracy is 85.4% and overall absolute accuracy is 86.3%
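To make the "PSSM based bigram feature" more concrete, here is a minimal sketch under an assumed definition: entry (m, n) is the sum over consecutive sequence positions i of P[i][m] * P[i+1][n]. The abstract does not state the exact formula or the normalization it applies to the PSSM, so the definition, the function name, and the two-column toy matrix below are assumptions for illustration only.

```python
def pssm_bigram_features(pssm):
    """Flattened bigram feature vector from a PSSM-like matrix.

    pssm: list of rows, one per sequence position, each row holding one
    score per amino-acid type. Entry (m, n) of the result accumulates
    pssm[i][m] * pssm[i + 1][n] over consecutive positions i.
    """
    n_types = len(pssm[0])
    bigram = [[0.0] * n_types for _ in range(n_types)]
    for i in range(len(pssm) - 1):
        for m in range(n_types):
            for n in range(n_types):
                bigram[m][n] += pssm[i][m] * pssm[i + 1][n]
    # Flatten row by row so the result has a fixed length of n_types ** 2.
    return [value for row in bigram for value in row]


# Toy matrix (hypothetical) with 3 positions and 2 "amino-acid" columns.
toy_pssm = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(pssm_bigram_features(toy_pssm))
```

A useful property of this kind of feature is that its length depends only on the number of columns, not on the sequence length, so sequences of different lengths map to vectors of the same size for a classifier such as an SVM.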
Clustering cliques for graph-based summarization of the biomedical research literature
BACKGROUND: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts). RESULTS: SemRep is used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings. CONCLUSIONS: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10% better than the baseline (43% versus 33%). When compared to the reference standard from MeSH headings, the results for recall, precision and F-score were 0.64, 0.65, and 0.65 respectively
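For readers unfamiliar with silhouette-style validity, the sketch below computes the standard silhouette coefficient, which combines cohesion (mean distance to the point's own cluster, a) and separation (smallest mean distance to another cluster, b) as (b - a) / max(a, b). How the paper derives its baseline from this measure is not stated in the abstract; the function name, the one-dimensional toy points, and the distance function are hypothetical.

```python
def silhouette_scores(points, labels, distance):
    """Per-point silhouette coefficients for a labelled clustering."""
    cluster_ids = set(labels)
    if len(cluster_ids) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, point in enumerate(points):
        # Cohesion: mean distance to the other members of the same cluster.
        same = [distance(point, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # Separation: smallest mean distance to any other cluster.
        b = min(
            sum(distance(point, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in cluster_ids if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy example (hypothetical): two well-separated 1-D clusters.
toy_scores = silhouette_scores([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1],
                               lambda p, q: abs(p - q))
print(sum(toy_scores) / len(toy_scores))
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.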
Previsão e análise da estrutura e dinâmica de redes biológicas (Prediction and analysis of the structure and dynamics of biological networks)
Increasing knowledge about the biological processes that govern the
dynamics of living organisms has fostered a better understanding of the
origin of many diseases as well as the identification of potential therapeutic
targets. Biological systems can be modeled through biological networks,
which allows methods from graph theory to be applied and explored in their
investigation and characterization. The main motivation of this work was the inference of
patterns and rules that underlie the organization of biological networks.
Through the integration of different types of data, such as gene expression,
interaction between proteins and other biomedical concepts, computational
methods have been developed so that they can be used to predict and study
diseases.
The first contribution was the characterization of a subsystem of the human
protein interactome through the topological properties of the networks that
model it. As a second contribution, an unsupervised method using biological
criteria and network topology was used to improve the understanding of
the genetic mechanisms and risk factors of a disease through co-expression
networks. As a third contribution, a methodology was developed to remove
noise (denoise) in protein networks, to obtain more accurate models, using
the network topology. As a fourth contribution, a supervised methodology
was proposed to model the protein interactome dynamics, using exclusively
the topology of protein interactions networks that are part of the dynamic
model of the system.
The proposed methodologies contribute to the creation of more precise,
static and dynamic biological models through the identification and use of
topological patterns of protein interaction networks, which can be used to
predict and study diseases.
Programa Doutoral em Engenharia Informátic
Quantitative and evolutionary global analysis of enzyme reaction mechanisms
The most widely used classification system describing enzyme-catalysed reactions
is the Enzyme Commission (EC) number. Understanding enzyme
function is important for both fundamental scientific and pharmaceutical
reasons. The EC classification is essentially unrelated to the reaction mechanism.
In this work we address two important questions related to enzyme
function diversity. First, what is the relationship between the reaction
mechanisms as described in the MACiE (Mechanism, Annotation,
and Classification in Enzymes) database and the main top-level class of the
EC classification? Second, how well is the biocatalysis of these enzymes
adapted in nature?
In this thesis, we have retrieved 335 enzyme reactions from the MACiE
database. We consider two ways of encoding the reaction mechanism in
descriptors, and three approaches that encode only the overall chemical
reaction.
We first develop a basic model to cluster
the enzymatic reactions. A global study of enzyme reaction mechanisms
may provide important insights for better understanding of the diversity of
chemical reactions of enzymes. Clustering analysis in such research is very
common practice. Clustering algorithms suffer from various issues, such as
requiring determination of the input parameters and stopping criteria, and
very often a need to specify the number of clusters in advance.
Using several well-known metrics, we tried to optimize the clustering
outputs for each of the algorithms, with equivocal results that suggested the
existence of between two and over a hundred clusters. This motivated us to
design and implement our own algorithm, PFClust (Parameter-Free Clustering),
which requires no prior information to determine the number of clusters.
The analysis highlights the structure of the overall and mechanistic
enzyme reactions. This suggests that mechanistic similarity can influence approaches
for function prediction and automatic annotation of newly discovered protein
and gene sequences.
We then develop and evaluate the method for enzyme function prediction
using machine learning methods. Our results suggest that pairs of similar
enzyme reactions tend to proceed by different mechanisms. The machine
learning method needs only chemoinformatics descriptors as an input and
is applicable for regression analysis.
The last phase of this work is to test the evolution of chemical mechanisms
mapped onto ancestral enzymes. Domain occurrence and abundance
in modern proteins have shown that the / architecture is probably
the oldest fold design. These observations have important implications for
the origins of biochemistry and for exploring structure-function relationships.
Over half of the known mechanisms were introduced before architectural
diversification over evolutionary time. The remaining mechanisms
were invented gradually over the evolutionary timeline, just after organismal
diversification. Moreover, many common mechanisms, including fundamental
building blocks of enzyme chemistry, were found to be associated
with the ancestral fold