7 research outputs found

    A Machine Learning Model for Discovery of Protein Isoforms as Biomarkers

    Prostate cancer is the most common cancer in men. One in eight Canadian men will be diagnosed with prostate cancer in their lifetime. The accurate detection of the disease’s subtypes is critical for providing adequate therapy; hence, it is critical for increasing both survival rates and quality of life. Next generation sequencing can be beneficial when studying cancer. This technology generates a large amount of data that can be used to extract information about biomarkers. This thesis proposes a model that discovers protein isoforms for different stages of prostate cancer progression. A tool has been developed that utilizes RNA-Seq data to infer open reading frames (ORFs) corresponding to transcripts. These ORFs are used as features for classification. A quantification measurement, Adaptive Fragments Per Kilobase of transcript per Million mapped reads (AFPKM), is proposed to compute the expression level for ORFs. The new measurement considers the actual length of the ORF and the length of the transcript. Using these ORFs and the new expression measure, several classifiers were built using different machine learning techniques. This enabled the identification of some protein isoforms related to prostate cancer progression. The biomarkers have had a great impact on the discrimination of prostate cancer stages and are worth further investigation.

    IsoSVM – Distinguishing isoforms and paralogs on the protein level

    BACKGROUND: Recent progress in cDNA and EST sequencing is yielding a deluge of sequence data. Like database search results and proteome databases, this data gives rise to inferred protein sequences without ready access to the underlying genomic data. Analysis of this information (e.g. for EST clustering or phylogenetic reconstruction from proteome data) is hampered because it is not known if two protein sequences are isoforms (splice variants) or not (i.e. paralogs/orthologs). However, even without knowing the intron/exon structure, visual analysis of the pattern of similarity across the alignment of the two protein sequences is usually helpful since paralogs and orthologs feature substitutions with respect to each other, as opposed to isoforms, which do not. RESULTS: The IsoSVM tool introduces an automated approach to identifying isoforms on the protein level using a support vector machine (SVM) classifier. Based on three specific features used as input of the SVM classifier, it is possible to automatically identify isoforms with little effort and with an accuracy of more than 97%. We show that the SVM is superior to a radial basis function network and to a linear classifier. As an example application we use IsoSVM to estimate that a set of Xenopus laevis EST clusters consists of approximately 81% cases where sequences are each other's paralogs and 19% cases where sequences are each other's isoforms. The number of isoforms and paralogs in this allotetraploid species is of interest in the study of evolution. CONCLUSION: We developed an SVM classifier that can be used to distinguish isoforms from paralogs with high accuracy and without access to the genomic data. It can be used to analyze, for example, EST data and database search results. Our software is freely available on the Web, under the name IsoSVM.

    New Statistical Learning Approaches with Applications to RNA-seq Data

    This dissertation examines statistical learning problems in both the supervised and unsupervised settings. The dissertation is composed of three major parts. In the first two, we address the important question of significance of clustering, and in the third, we describe a novel framework for unifying hard and soft classification through a spectrum of binary learning problems. In the unsupervised task of clustering, determining whether the identified clusters represent important underlying structure, or are artifacts of natural sampling variation, has been a critical and challenging question. In this dissertation, we introduce two new methods for addressing this question using statistical significance. In the first part of the dissertation, we describe SigFuge, an approach for identifying genomic loci exhibiting differential transcription patterns across many RNA-seq samples. In the second part of this dissertation, we describe statistical Significance of Hierarchical Clustering (SHC), a Monte Carlo based approach for testing significance in hierarchical clustering, and demonstrate the power of the method to identify significant clustering using two cancer gene expression datasets. Both methods were implemented and made available as open source packages in R. In the final part of this dissertation, we propose a spectrum of supervised learning problems which spans the hard and soft classification tasks based on fitting multiple decision rules to a dataset. By doing so, we reveal a novel collection of binary supervised learning problems. We study the problems using the framework of large-margin classification and a class of piecewise linear surrogate losses, for which we derive statistical properties. We evaluate our approach using simulations and a magnetic resonance imaging (MRI) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Doctor of Philosophy

    Understanding Neuromuscular Health and Disease: Advances in Genetics, Omics, and Molecular Function

    This compilation focuses on recent advances in the molecular and cellular understanding of neuromuscular biology, and the treatment of neuromuscular disease. These advances are at the forefront of modern molecular methodologies, often integrating across wet-lab cell and tissue models, dry-lab computational approaches, and clinical studies. The continuing development and application of multiomics methods offer particular challenges and opportunities in the field, not least in the potential for personalized medicine.

    A classification of alternatively spliced cassette exons using AdaBoost-based algorithm

    No full text

    Investigation of IRQ domain containing proteins in Arabidopsis thaliana

    The endomembrane system in eukaryotic cells plays a vital role in the movement of membranes and substances around the cell in response to abiotic and biotic stimuli. Recent work on an actin binding protein, NETWORKED4B (NET4B), revealed that this vacuolar localised protein (Deeks et al. 2012) contains a domain termed the IRQ domain responsible for interacting with particular regulatory proteins of the endomembrane system. Bioinformatic analysis revealed that this IRQ domain was present in six novel proteins not containing the characteristic NET Actin Binding (NAB) domain. These proteins were termed the IRQ proteins (IRQ1-6). Work outlined in this thesis explores the evolution and localisation of expression of these proteins but in particular looks at IRQ4. Phylogenetic analysis revealed that the IRQ proteins represent a eudicot-specific group of proteins and that they evolved from the NET proteins. The IRQ proteins can be subdivided based on sequence similarity into three groups: IRQ1 and IRQ6, IRQ2 and IRQ3, IRQ4 and IRQ5. Promoter GUS lines for IRQ1 and IRQ6 revealed that these proteins may be involved in the initiation or regulation of lateral root growth. IRQ4 was expressed most strongly in the root. Subcellular localisation analysis using promIRQ4::IRQ4-GFP and live cell imaging showed that IRQ4 localises to the prevacuolar compartment (PVC)/multivesicular body (MVB). Immunogold labelling using an IRQ4-specific antibody revealed an additional localisation to autophagosomes. This project investigates a group of novel eudicot-specific proteins and shows that IRQ4 may be involved in key endomembrane pathways in plants.