2,791 research outputs found

    Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In <it>supervised learning</it>, traditional approaches to building a classifier use two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required which might be infeasible in certain cases, especially those dealing with biological data. Such is the case for membrane-binding peripheral domains that play important roles in many biological processes, including cell signaling and membrane trafficking by reversibly binding to membranes. For these domains, a well-defined <it>positive </it>set is available with domains known to bind membrane along with a large <it>unlabeled </it>set of domains whose membrane binding affinities have not been measured. The aforementioned limitation can be addressed by a special class of <it>semi-supervised </it>machine learning called <it>positive-unlabeled (PU) </it>learning that uses a positive set with a large unlabeled set.</p> <p>Methods</p> <p>In this study, we implement the first application of <it>PU-learning </it>to a protein function prediction problem: identification of peripheral domains. <it>PU-learning </it>starts by identifying reliable negative (<it>RN</it>) examples iteratively from the unlabeled set until convergence and builds a classifier using the positive and the final <it>RN </it>set. A data set of 232 positive cases and ~3750 unlabeled ones were used to construct and validate the protocol.</p> <p>Results</p> <p>Holdout evaluation of the protocol on a left-out positive set showed that the accuracy of prediction reached up to 95% during two independent implementations.</p> <p>Conclusion</p> <p>These results suggest that our protocol can be used for predicting membrane-binding properties of a wide variety of modular domains. Protocols like the one presented here become particularly useful in the case of availability of information from one class only.</p

    Global gene expression profiling of healthy human brain and its application in studying neurological disorders

    Get PDF
    The human brain is the most complex structure known to mankind and one of the greatest challenges in modern biology is to understand how it is built and organized. The power of the brain arises from its variety of cells and structures, and ultimately where and when different genes are switched on and off throughout the brain tissue. In other words, brain function depends on the precise regulation of gene expression in its sub-anatomical structures. But, our understanding of the complexity and dynamics of the transcriptome of the human brain is still incomplete. To fill in the need, we designed a gene expression model that accurately defines the consistent blueprint of the brain transcriptome; thereby, identifying the core brain specific transcriptional processes conserved across individuals. Functionally characterizing this model would provide profound insights into the transcriptional landscape, biological pathways and the expression distribution of neurotransmitter systems. Here, in this dissertation we developed an expression model by capturing the similarly expressed gene patterns across congruently annotated brain structures in six individual brains by using data from the Allen Brain Atlas (ABA). We found that 84% of genes are expressed in at least one of the 190 brain structures. By employing hierarchical clustering we were able to show that distinct structures of a bigger brain region can cluster together while still retaining their expression identity. Further, weighted correlation network analysis identified 19 robust modules of coexpressing genes in the brain that demonstrated a wide range of functional associations. Since signatures of local phenomena can be masked by larger signatures, we performed local analysis on each distinct brain structure. Pathway and gene ontology enrichment analysis on these structures showed, striking enrichment for brain region specific processes. Besides, we also mapped the structural distribution of the gene expression profiles of genes associated with major neurotransmission systems in the human. We also postulated the utility of healthy brain tissue gene expression to predict potential genes involved in a neurological disorder, in the absence of data from diseased tissues. To this end, we developed a supervised classification model, which achieved an accuracy of 84% and an AUC (Area Under the Curve) of 0.81 from ROC plots, for predicting autism-implicated genes using the healthy expression model as the baseline. This study represents the first use of healthy brain gene expression to predict the scope of genes in autism implication and this generic methodology can be applied to predict genes involved in other neurological disorders

    The effect of organelle discovery upon sub-cellular protein localisation.

    Get PDF
    Prediction of protein sub-cellular localisation by employing quantitative mass spectrometry experiments is an expanding field. Several methods have led to the assignment of proteins to specific subcellular localisations by partial separation of organelles across a fractionation scheme coupled with computational analysis. Methods developed to analyse organelle data have largely employed supervised machine learning algorithms to map unannotated abundance profiles to known protein–organelle associations. Such approaches are likely to make association errors if organelle-related groupings present in experimental output are not included in data used to create a protein–organelle classifier. Currently, there is no automated way to detect organelle-specific clusters within such datasets. In order to address the above issues we adapted a phenotype discovery algorithm, originally created to filter image-based output for RNAi screens, to identify putative subcellular groupings in organelle proteomics experiments. We were able to mine datasets to a deeper level and extract interesting phenotype clusters for more comprehensive evaluation in an unbiased fashion upon application of this approach. Organelle-related protein clusters were identified beyond those sufficiently annotated for use as training data. Furthermore, we propose avenues for the incorporation of observations made into general practice for the classification of protein–organelle membership from quantitative MS experiments. Biological significance Protein sub-cellular localisation plays an important role in molecular interactions, signalling and transport mechanisms. The prediction of protein localisation by quantitative mass-spectrometry (MS) proteomics is a growing field and an important endeavour in improving protein annotation. Several such approaches use gradient-based separation of cellular organelle content to measure relative protein abundance across distinct gradient fractions. The distribution profiles are commonly mapped in silico to known protein–organelle associations via supervised machine learning algorithms, to create classifiers that associate unannotated proteins to specific organelles. These strategies are prone to error, however, if organelle-related groupings present in experimental output are not represented, for example owing to the lack of existing annotation, when creating the protein–organelle mapping. Here, the application of a phenotype discovery approach to LOPIT gradient-based MS data identifies candidate organelle phenotypes for further evaluation in an unbiased fashion. Software implementation and usage guidelines are provided for application to wider protein–organelle association experiments. In the wider context, semi-supervised organelle discovery is discussed as a paradigm with which to generate new protein annotations from MS-based organelle proteomics experiments. This article is part of a Special Issue entitled: New Horizons and Applications for Proteomics [EuPA 2012]

    Algorithms to Explore the Structure and Evolution of Biological Networks

    Get PDF
    High-throughput experimental protocols have revealed thousands of relationships amongst genes and proteins under various conditions. These putative associations are being aggressively mined to decipher the structural and functional architecture of the cell. One useful tool for exploring this data has been computational network analysis. In this thesis, we propose a collection of novel algorithms to explore the structure and evolution of large, noisy, and sparsely annotated biological networks. We first introduce two information-theoretic algorithms to extract interesting patterns and modules embedded in large graphs. The first, graph summarization, uses the minimum description length principle to find compressible parts of the graph. The second, VI-Cut, uses the variation of information to non-parametrically find groups of topologically cohesive and similarly annotated nodes in the network. We show that both algorithms find structure in biological data that is consistent with known biological processes, protein complexes, genetic diseases, and operational taxonomic units. We also propose several algorithms to systematically generate an ensemble of near-optimal network clusterings and show how these multiple views can be used together to identify clustering dynamics that any single solution approach would miss. To facilitate the study of ancient networks, we introduce a framework called ``network archaeology'') for reconstructing the node-by-node and edge-by-edge arrival history of a network. Starting with a present-day network, we apply a probabilistic growth model backwards in time to find high-likelihood previous states of the graph. This allows us to explore how interactions and modules may have evolved over time. In experiments with real-world social and biological networks, we find that our algorithms can recover significant features of ancestral networks that have long since disappeared. Our work is motivated by the need to understand large and complex biological systems that are being revealed to us by imperfect data. As data continues to pour in, we believe that computational network analysis will continue to be an essential tool towards this end

    Drug-drug interactions: A machine learning approach

    Get PDF
    Automatic detection of drug-drug interaction (DDI) is a difficult problem in pharmaco-surveillance. Recent practice for in vitro and in vivo pharmacokinetic drug-drug interaction studies have been based on carefully selected drug characteristics such as their pharmacological effects, and on drug-target networks, in order to identify and comprehend anomalies in a drug\u27s biochemical function upon co-administration.;In this work, we present a novel DDI prediction framework that combines several drug-attribute similarity measures to construct a feature space from which we train three machine learning algorithms: Support Vector Machine (SVM), J48 Decision Tree and K-Nearest Neighbor (KNN) using a partially supervised classification algorithm called Positive Unlabeled Learning (PU-Learning) tailored specifically to suit our framework.;In summary, we extracted 1,300 U.S. Food and Drug Administration-approved pharmaceutical drugs and paired them to create 1,688,700 feature vectors. Out of 397 drug-pairs known to interact prior to our experiments, our system was able to correctly identify 80% of them and from the remaining 1,688,303 pairs for which no interaction had been determined, we were able to predict 181 potential DDIs with confidence levels greater than 97%. The latter is a set of DDIs unrecognized by our source of ground truth at the time of study.;Evaluation of the effectiveness of our system involved querying the U.S. Food and Drug Administration\u27s Adverse Effect Reporting System (AERS) database for cases involving drug-pairs used in this study. The results returned from the query listed incidents reported for a number of patients, some of whom had experienced severe adverse reactions leading to outcomes such as prolonged hospitalization, diminished medicinal effect of one or more drugs, and in some cases, death

    Machine Learning Approaches for Cancer Analysis

    Get PDF
    In addition, we propose many machine learning models that serve as contributions to solve a biological problem. First, we present Zseq, a linear time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into the account other factors, such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. Zseq is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we reveal that the combination of proper clustering, distance function and Index validation for clusters are suitable in identifying outlier transcripts, which show different trending than the majority of the transcripts, the trending of the transcript is the abundance throughout different stages of prostate cancer. We compare this model with standard hierarchical time-series clustering method based on Euclidean distance. Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species termed outlier transcripts that exhibit unique trending patterns as compared to most other transcripts during disease progression. This method is able to identify those outliers rather than finding patterns among the trending transcripts compared to the hierarchical clustering method based on Euclidean distance. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a lot of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between the survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment, which is either hormone therapy, radiotherapy, or surgery will survive beyond five years after the treatment therapy. Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease

    Dementia with Lewy Bodies: Genomics, Transcriptomics, and Its Future with Data Science

    Get PDF
    Dementia with Lewy bodies (DLB) is a significant public health issue. It is the second most common neurodegenerative dementia and presents with severe neuropsychiatric symptoms. Genomic and transcriptomic analyses have provided some insight into disease pathology. Variants within SNCA, GBA, APOE, SNCB, and MAPT have been shown to be associated with DLB in repeated genomic studies. Transcriptomic analysis, conducted predominantly on candidate genes, has identified signatures of synuclein aggregation, protein degradation, amyloid deposition, neuroinflammation, mitochondrial dysfunction, and the upregulation of heat-shock proteins in DLB. Yet, the understanding of DLB molecular pathology is incomplete. This precipitates the current clinical position whereby there are no available disease-modifying treatments or blood-based diagnostic biomarkers. Data science methods have the potential to improve disease understanding, optimising therapeutic intervention and drug development, to reduce disease burden. Genomic prediction will facilitate the early identification of cases and the timely application of future disease-modifying treatments. Transcript-level analyses across the entire transcriptome and machine learning analysis of multi-omic data will uncover novel signatures that may provide clues to DLB pathology and improve drug development. This review will discuss the current genomic and transcriptomic understanding of DLB, highlight gaps in the literature, and describe data science methods that may advance the field
    • …
    corecore