285 research outputs found

    A Machine Learning Model for Discovery of Protein Isoforms as Biomarkers

    Prostate cancer is the most common cancer in men. One in eight Canadian men will be diagnosed with prostate cancer in their lifetime. Accurate detection of the disease's subtypes is critical for providing adequate therapy and therefore for increasing both survival rates and quality of life. Next-generation sequencing can be beneficial when studying cancer: it generates a large amount of data from which information about biomarkers can be extracted. This thesis proposes a model that discovers protein isoforms for different stages of prostate cancer progression. A tool has been developed that uses RNA-Seq data to infer open reading frames (ORFs) corresponding to transcripts. These ORFs are used as features for classification. A quantification measurement, Adaptive Fragments Per Kilobase of transcript per Million mapped reads (AFPKM), is proposed to compute the expression level of ORFs. The new measurement considers the actual length of the ORF and the length of the transcript. Using these ORFs and the new expression measure, several classifiers were built with different machine learning techniques. This enabled the identification of some protein isoforms related to prostate cancer progression. The biomarkers have had a great impact on the discrimination of prostate cancer stages and are worth further investigation.
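
The abstract above describes inferring ORFs from transcripts and scoring them with a length-aware expression measure, but it does not state the AFPKM formula. The following is only a minimal sketch under stated assumptions: `longest_orf` is a hypothetical forward-strand ATG-to-stop scan, and `fpkm` is the standard FPKM definition, not AFPKM. Evaluating it once with the transcript length and once with the ORF length shows why the choice of length matters.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. Hypothetical helper:
    the thesis's own ORF inference procedure is not described in the abstract.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def fpkm(fragments: int, length_bp: int, total_mapped_fragments: int) -> float:
    """Standard FPKM: fragments per kilobase of feature per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


transcript = "CCATGAAATTTTAGGG"
orf = longest_orf(transcript)
if orf is not None:
    orf_len = orf[1] - orf[0]
    # Same fragment count, two different length normalizations.
    by_transcript = fpkm(50, len(transcript), 1_000_000)
    by_orf = fpkm(50, orf_len, 1_000_000)
    print(orf, by_transcript, by_orf)
```

Because the ORF is never longer than its transcript, normalizing by ORF length gives a value at least as large as normalizing by transcript length for the same fragment count.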

    An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets

    Investigating Disease Mechanisms and Drug Response Differences in Transcriptomics Sequencing Data

    Indiana University-Purdue University Indianapolis (IUPUI). In eukaryotes, genetic information is encoded by DNA, transcribed to precursor messenger RNA (pre-mRNA), processed into mature messenger RNA (mRNA), and translated into functional proteins. Splicing of pre-mRNA is an important epigenetic process that alters the function of proteins by modifying the exon structure of mature mRNA transcripts and is known to contribute greatly to the diversity of the human proteome. The vast majority of human genes are expressed through multiple transcript isoforms. Expression of genes through splicing of pre-mRNA plays crucial roles in cellular development, identity, and processes. Both the identity of the genes selected for transcription and the specific transcript isoforms that are expressed are essential for normal cellular function. Deviations in gene expression or isoform proportion can be an indication or a cause of disease. RNA sequencing (RNAseq) is a high-throughput next-generation sequencing technology that allows gene expression to be interrogated on a massive scale. RNAseq generates short sequences that reflect pieces of the mRNAs present in a sample. It can therefore be used to explore differences in gene expression, reveal transcript isoform identities, and compare changes in isoform proportions. In this dissertation, I design and apply advanced analysis techniques to RNAseq, phenotypic, and drug response data to investigate disease mechanisms and drug sensitivity. Research Goals: The work described in this dissertation accomplishes four aims. Aim 1) Evaluate the gene expression signature of concussion in collegiate athletes and identify potential biomarkers for response and recovery. Aim 2) Implement a machine-learning algorithm to determine whether splicing can predict drug response in cancer cell lines. Aim 3) Design a fast, scalable method to identify differentially spliced events related to cancer drug response. Aim 4) Construct a drug-splicing network and use a systems biology approach to search for similarities in underlying splicing events.
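
The abstract above refers to comparing changes in isoform proportions. As a rough illustration only, the sketch below computes each transcript's share of its gene's total abundance. The function name, the `(gene, transcript)` key layout, and the example values are assumptions made for illustration and are not taken from the dissertation.

```python
from collections import defaultdict
from typing import Dict, Tuple

IsoformKey = Tuple[str, str]  # (gene_id, transcript_id), hypothetical layout


def isoform_proportions(abundance: Dict[IsoformKey, float]) -> Dict[IsoformKey, float]:
    """Return each isoform's fraction of its gene's summed abundance."""
    gene_totals: Dict[str, float] = defaultdict(float)
    for (gene, _transcript), value in abundance.items():
        gene_totals[gene] += value
    proportions = {}
    for key, value in abundance.items():
        total = gene_totals[key[0]]
        # A gene with zero total abundance gets proportion 0.0 for every isoform.
        proportions[key] = value / total if total > 0 else 0.0
    return proportions


# Made-up abundances for two conditions of one gene with two isoforms.
control = {("GENE1", "tx1"): 30.0, ("GENE1", "tx2"): 10.0}
treated = {("GENE1", "tx1"): 10.0, ("GENE1", "tx2"): 30.0}
shift = {
    key: isoform_proportions(treated)[key] - isoform_proportions(control)[key]
    for key in control
}
print(shift)
```

Here the gene's total abundance is identical in both conditions, so a gene-level comparison would see no change, while the per-isoform proportions shift by 0.5 in opposite directions.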

    Improving the Performance and Precision of Bioinformatics Algorithms

    Recent advances in biotechnology have enabled scientists to generate and collect huge amounts of biological experimental data. Software tools for analyzing both genomic (DNA) and proteomic (protein) data with high speed and accuracy have thus become very important in modern biological research. This thesis presents several techniques for improving the performance and precision of bioinformatics algorithms used by biologists. Improvements in both the speed and cost of automated DNA sequencers have allowed scientists to sequence the DNA of an increasing number of organisms. One way biologists can take advantage of this genomic DNA data is to use it in conjunction with expressed sequence tag (EST) and cDNA sequences to find genes and their splice sites. This thesis describes ESTmapper, a tool designed to use an eager write-only top-down (WOTD) suffix tree to efficiently align DNA sequences against known genomes. Experimental results show that ESTmapper can be much faster than previous techniques for aligning and clustering DNA sequences, and produces alignments of comparable or better quality. Peptide identification by tandem mass spectrometry (MS/MS) is becoming the dominant high-throughput proteomics workflow for protein characterization in complex samples. Biologists currently rely on protein database search engines to identify peptides producing experimentally observed mass spectra. This thesis describes two approaches for improving peptide identification precision using statistical machine learning. HMMatch (HMM MS/MS Match) is a hidden Markov model approach to spectral matching, in which many examples of a peptide fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. Experimental results show that HMMatch can identify many peptides missed by traditional spectral matching and search engines. 
PepArML (Peptide Identification Arbiter by Machine Learning) is a machine-learning-based framework for improving the precision of peptide identification. It uses classification algorithms to combine spectral features and scores from multiple search engines in a single model-free framework that can be trained in an unsupervised manner. Experimental results show that PepArML can improve the sensitivity of peptide identification for several synthetic protein mixtures compared with individual search engines.

    Doctor of Philosophy

    The MAKER genome annotation and curation software tool was developed in response to increased demand for genome annotation services, secondary to decreased genome sequencing costs. MAKER currently has over 1000 registered users throughout the world. This wide adoption of MAKER has uncovered the need for additional functionality. Here I addressed moving MAKER into the domain of plant annotation, expanding MAKER to include new methods of gene and noncoding RNA annotation, and improving the usability of MAKER through documentation and community outreach. To move MAKER into the plant annotation domain, I benchmarked MAKER on the well-annotated Arabidopsis thaliana genome. MAKER performs well on the Arabidopsis genome in de novo genome annotation and was able to improve the current TAIR10 gene models by incorporating mRNA-seq data not available during the original annotation efforts. In addition to this benchmarking, I annotated the genome of the sacred lotus, Nelumbo nucifera. I enabled noncoding RNA annotation in MAKER by adding the ability for MAKER to run and process the outputs of tRNAscan-SE and snoscan. These functionalities were tested on the Arabidopsis genome, and I used MAKER to annotate tRNAs and snoRNAs in Zea mays. The resulting version of MAKER was named MAKER-P. I added the functionality of a combiner by adding EVidenceModeler to the MAKER code base. As the number of MAKER users has grown, so have the help requests sent to the MAKER developers list. Motivated by the belief that improving the MAKER documentation would obviate the need for many of these requests, I created a media wiki that was linked to the MAKER download page, and the MAKER developers list was made searchable. Additionally, I have written a unit on genome annotation using MAKER for Current Protocols in Bioinformatics. In response to these efforts I have seen a corresponding decrease in help requests, even though the number of registered MAKER users continues to increase. Taken together, these products and activities have moved MAKER into the domain of plant annotation, expanded MAKER to include new methods of gene and noncoding RNA annotation, and improved the usability of MAKER through documentation and community outreach.

    Splice site prediction using transfer learning

    One of the open problems in bioinformatics is automatic gene prediction, that is, identifying the nucleotide sequences that encode proteins. More specifically, researchers try to predict the positions that correspond to the beginning and the end of genes within a genome. These positions are known as splice sites. Several machine learning techniques have been applied to this problem. Nevertheless, acquiring the annotated data needed for supervised learning is a significant challenge because the cost is very high. One approach to addressing this problem is transfer learning. The aim of this work is to study the representation of genes, so that the sequence of nucleotides within a genome is taken into account, and the role of this representation in transfer learning methods.
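
The abstract above centers on how a nucleotide sequence is represented before learning, without naming a specific encoding. As a generic illustration only (not the representation studied in the thesis), the sketch below contrasts two common encodings of a sequence window: a position-wise one-hot matrix and a position-independent k-mer count vector. The function names and the example window are assumptions.

```python
from itertools import product
from typing import List

ALPHABET = "ACGT"


def one_hot(window: str) -> List[List[int]]:
    """Encode each position as a 4-element indicator over A, C, G, T.

    Characters outside the alphabet (for example N) become all-zero rows.
    """
    return [[1 if base == b else 0 for b in ALPHABET] for base in window.upper()]


def kmer_counts(window: str, k: int = 2) -> List[int]:
    """Count overlapping k-mers, ordered lexicographically over ACGT.

    k-mers containing characters outside the alphabet are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    window = window.upper()
    for i in range(len(window) - k + 1):
        j = index.get(window[i:i + k])
        if j is not None:
            counts[j] += 1
    return counts


window = "CAGGTAAG"  # made-up window for illustration
print(one_hot(window)[:2])
print(kmer_counts(window, k=2))
```

The one-hot matrix preserves exact positions, whereas k-mer counts preserve local nucleotide order but discard position, which is one way a representation can "take the sequence into account" to a different degree.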

    The role of Fused in Sarcoma (FUS) in the alternative splicing of TAU

    Patients with neurodegenerative diseases suffer from cognitive decline and/or motor dysfunction, depending on the regions affected by neuron loss. With aging being the major risk factor and life expectancy increasing, there is an urgent need to develop new effective treatments to alleviate the situation faced by patients, their families, and society. Although neurodegenerative diseases including Alzheimer's disease (AD), amyotrophic lateral sclerosis (ALS), and frontotemporal dementia (FTD) lead to different clinical symptoms, they share common pathomechanisms, such as protein aggregation and altered RNA metabolism. A subset of ALS and FTD cases, for instance, is pathologically characterized by neuronal cytoplasmic inclusions containing aggregated Fused in Sarcoma (FUS) protein. There is also a genetic link, since FUS mutations cause ALS with FUS pathology. FUS is a DNA/RNA-binding protein known to regulate different steps of RNA metabolism; however, its exact function and target genes in neurons were unknown. In this study, I evaluated the neuronal role of FUS in alternative splicing using a candidate approach focusing on the microtubule-associated protein TAU. TAU is one of the most widely studied proteins in neurodegeneration research due to its aggregation in different tauopathies, most notably AD. Mutations in the TAU gene MAPT that affect alternative splicing of exon 10 are known to cause another subtype of FTD. Here, I demonstrate that FUS-depleted rat neurons, although having normal viability, show aberrant alternative splicing of TAU, with increased inclusion of exon 3 and exon 10, resulting in higher expression of the 2N and 4R TAU isoforms. Importantly, reintroduction of human FUS rescues aberrant splicing of TAU in FUS-depleted neurons. Accordingly, overexpression of FUS decreases expression of the 2N and 4R TAU isoforms. 
In mouse brain lysates, I detected direct FUS binding to TAU pre-mRNA, with strong binding around the regulated exon 10, often at AUU-rich RNA stretches. Since TAU splicing is regulated differently in humans and rodents, I also confirmed the role of human FUS in TAU exon 10 splicing using a TAU minigene and a human neuronal cell line. In addition, I analyzed the morphology and development of axons to evaluate the functional consequences of FUS depletion in neurons. Although FUS-depleted neurons develop neurites normally, their axons are significantly shorter than those of control cells. Similar to observations in TAU/MAP1B knockout neurons, axons of FUS-depleted neurons develop significantly larger growth cones with abnormal cytoskeletal organization. The development of growth cones in vivo is an essential step in axonal maintenance and repair. Altogether, this study identified TAU as the first physiological splice target of FUS in neurons. The newly discovered role of FUS in regulating the axonal cytoskeleton indicates that aberrant axonal function could contribute to the neuron loss seen in ALS/FTD cases with FUS aggregates.