503 research outputs found

    Evaluation of noise reduction techniques in the splice junction recognition problem

    The Human Genome Project has generated a large amount of sequence data. A number of works are currently concerned with analyzing these data. One of the analyses carried out is the identification of genes' structures on the sequences obtained. As such, one can search for particular signals associated with gene expression. Splice junctions represent a type of signal present on eukaryote genes. Many studies have applied Machine Learning techniques in the recognition of such regions. However, most of the genetic databases are characterized by the presence of noisy data, which can affect the performance of the learning techniques. This paper evaluates the effectiveness of five data pre-processing algorithms in the elimination of noisy instances from two splice junction recognition datasets. After the pre-processing phase, two learning techniques, Decision Trees and Support Vector Machines, are employed in the recognition process

    Combined optimization algorithms applied to pattern classification

    Accurate classification by minimizing the error on test samples is the main goal in pattern classification. Combinatorial optimization is a well-known method for solving minimization problems; however, only a few examples of classifiers are described in the literature where combinatorial optimization is used in pattern classification. Recently, there has been a growing interest in combining classifiers and improving the consensus of results for a greater accuracy. In the light of the "No Free Lunch Theorems", we analyse the combination of simulated annealing, a powerful combinatorial optimization method that produces high quality results, with the classical perceptron algorithm. This combination is called the LSA machine. Our analysis aims at finding paradigms for problem-dependent parameter settings that ensure high classification results. Our computational experiments on a large number of benchmark problems lead to results that either outperform or are at least competitive with results published in the literature. Apart from parameter settings, our analysis focuses on a difficult problem in computation theory, namely the network complexity problem. The depth vs size problem of neural networks is one of the hardest problems in theoretical computing, with very little progress over the past decades. In order to investigate this problem, we introduce a new recursive learning method for training hidden layers in constant depth circuits. Our findings make contributions to a) the field of Machine Learning, as the proposed method is applicable in training feedforward neural networks, and to b) the field of circuit complexity by proposing an upper bound for the number of hidden units sufficient to achieve a high classification rate. 
One of the major findings of our research is that the size of the network can be bounded by the input size of the problem, with an approximate upper bound of 8·√(2^n/n) threshold gates being sufficient for a small error rate, where n := log|S_L| and S_L is the training set.

    Prediction of Alternative Splice Sites in Human Genes

    This thesis addresses the problem of predicting alternative splice sites in human genes. The most common way to identify alternative splice sites is the use of expressed sequence tags and microarray data. Since genes only produce alternative proteins under certain conditions, these methods are limited to detecting alternative splice sites in genes whose alternative protein forms are expressed under the tested conditions. I have introduced three multiclass support vector machines that predict upstream and downstream alternative 3’ splice sites, upstream and downstream alternative 5’ splice sites, and the 3’ splice site of skipped and cryptic exons. On a test set extracted from the Alternative Splice Annotation Project database, I was able to correctly classify about 68% of the splice sites in the alternative 3’ set, about 62% of the splice sites in the alternative 5’ set, and about 66% in the exon skipping set.

    Machine Learning-based Predictive Maintenance for Optical Networks

    Optical networks provide the backbone of modern telecommunications by connecting the world faster than ever before. However, such networks are susceptible to several failures (e.g., optical fiber cuts, malfunctioning optical devices), which might result in degradation in the network operation, massive data loss, and network disruption. It is challenging to accurately and quickly detect and localize such failures due to the complexity of such networks, the time required to identify the fault and pinpoint it using conventional approaches, and the lack of proactive efficient fault management mechanisms. Therefore, it is highly beneficial to perform fault management in optical communication systems in order to reduce the mean time to repair, to meet service level agreements more easily, and to enhance the network reliability. In this thesis, the aforementioned challenges and needs are tackled by investigating the use of machine learning (ML) techniques for implementing efficient proactive fault detection, diagnosis, and localization schemes for optical communication systems. In particular, the adoption of ML methods for solving the following problems is explored: degradation prediction of semiconductor lasers; lifetime (mean time to failure) prediction of semiconductor lasers; remaining useful life (the length of time a machine is likely to operate before it requires repair or replacement) prediction of semiconductor lasers; optical fiber fault detection, localization, characterization, and identification for different optical network architectures; and anomaly detection in optical fiber monitoring. Such ML approaches outperform the conventionally employed methods for all the investigated use cases by achieving better prediction accuracy and earlier prediction or detection capability.

    Pre-processing for noise detection in gene expression classification data

    Due to the imprecise nature of biological experiments, biological data is often characterized by the presence of redundant and noisy data. This may be due to errors that occurred during data collection, such as contamination of laboratory samples. This is the case for gene expression data, where the equipment and tools currently used frequently produce noisy biological data. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. This evaluation analyzes the effectiveness of the techniques investigated in removing noisy data, measured by the accuracy obtained by different Machine Learning classifiers over the pre-processed data. São Paulo State Research Foundation (FAPESP) CNP
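
    As an illustration of what a distance-based noise filter can look like, the sketch below implements a Wilson-style Edited Nearest Neighbours rule: an instance is dropped when its label disagrees with the majority label of its k nearest neighbours. This is an assumption chosen for illustration; the abstract does not say which distance-based techniques the paper evaluates. The function name `edited_nearest_neighbours`, the Euclidean metric, and the toy data are all hypothetical.

```python
from collections import Counter
from math import dist


def edited_nearest_neighbours(samples, labels, k=3):
    """Return indices of samples kept by a Wilson-style ENN filter.

    Illustrative only: a sample is treated as noisy when its label
    disagrees with the majority label of its k nearest neighbours
    (Euclidean distance).
    """
    kept = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Distances from sample i to every other sample, nearest first.
        neighbours = sorted(
            (dist(x, other), j)
            for j, other in enumerate(samples)
            if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == y:
            kept.append(i)
    return kept


# Toy data (hypothetical): two separated clusters, one mislabelled point.
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
           (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels = ["a", "a", "a", "b", "b", "b", "b"]
kept = edited_nearest_neighbours(samples, labels, k=3)
print(kept)
```

    In this toy set the point at index 3 sits inside the "a" cluster but carries label "b", so the filter discards it and keeps the rest. An odd k with two classes avoids tied votes; real gene expression data would likely need feature scaling and a more careful choice of metric.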

    Alternative splicing: regulation, function and evolution

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Medicina, Departamento de Bioquímica. Defense date: 13-01-2021. Introns populate eukaryotic genes to a variable extent across species, being widespread in vertebrates and mammals. While the evolutionary advantages, if any, of introns remain unclear, their expansion has provided the opportunity to splice genes in more than a single way, allowing the production of different mRNAs from a single gene through alternative splicing (AS). AS patterns change during the development of complex organisms and diverge across different tissues and experimental conditions. These highly reproducible changes are evidence of a regulatory network that ensures repeatable responses to certain stimuli and suggest that at least some of them play a role in the overall physiological response or adaptation. Not surprisingly, perturbation of some elements of this network is often associated with pathological conditions. However, not only are we far from a complete characterization of the molecular mechanisms that drive AS changes in most pathologies, like those affecting the heart, but the computational tools that are currently used to study these regulatory networks are limiting our ability to extract all the information that is hidden in the data. It has long been hypothesized that AS contributes to a great expansion of the proteome and facilitates the evolution of new functions from pre-existing ones without gene duplication. While there are very well known examples of how AS enables the production of different functional proteins or mRNAs, the proportion of AS isoforms that are actually functional remains largely unknown. 
Indeed, recent studies from different perspectives, including transcriptomic, proteomic and sequence evolutionary analyses, suggest that this percentage may be rather small and that much of the observed transcriptomic diversity is driven by non-functional noise in the splicing process. In this thesis, we have studied global AS patterns through computational analysis of large RNA-seq datasets to characterize the causes and consequences of AS changes from different perspectives. First, we have analyzed how global AS patterns change during heart development and disease using data from a variety of mouse models. We found that AS changes modulate different biological processes than gene expression changes do and are associated with isoform-specific protein-protein interactions. Disease patterns partially recapitulate developmental patterns, probably through the upregulation of PTBP1, which is sufficient to induce pathological changes in the heart. Second, in an attempt to improve computational tools for identification of regulatory elements, we have developed dSreg. This tool leverages the power of Bayesian inference and hierarchical models to pool information across the whole transcriptome to infer not only the changes in the activities of the underlying regulatory elements, but also the changes in inclusion rates, outperforming competing methods and tools made for both purposes separately. Finally, we have studied the evolutionary process driving AS divergence during mammalian evolution using models of phenotypic evolution in a phylogenetic framework. We found that AS patterns have evolved under weak stabilizing selection that allows widespread variability in AS patterns across species, with only about 5% of the genes probably encoding AS isoforms with different functions. Rates of neutral evolution are high, preventing the identification of adaptive changes at this long evolutionary scale. 
In summary, this thesis provides new computational tools and knowledge about the evolution and regulation of AS in different biological conditions and helps to better understand its relevance from different perspectives.

    Identification and Functional Annotation of Alternatively Spliced Isoforms

    Alternative splicing is a key mechanism for increasing the complexity of the transcriptome and proteome in eukaryotic cells. A large portion of multi-exon genes in humans undergo alternative splicing, and this can have significant functional consequences, as the proteins translated from alternatively spliced mRNA might have different amino acid sequences and structures. The study of alternative splicing events has been accelerated by next-generation sequencing technology. However, reconstruction of transcripts from short-read RNA sequencing is not sufficiently accurate. Recent progress in single-molecule long-read sequencing has provided researchers with alternative ways to help solve this problem. With the help of both short and long RNA sequencing technologies, tens of thousands of splice isoforms have been catalogued in humans and other species, but relatively few of the protein products of splice isoforms have been characterized functionally, structurally and biochemically. The scope of this dissertation includes using short and long RNA sequencing reads together for the purpose of transcript reconstruction, and using high-throughput RNA-sequencing data and gene-level gene ontology functional annotations to predict functions for alternatively spliced isoforms in mouse and human. In the first chapter, I give an introduction to alternative splicing and discuss the existing studies where next-generation sequencing is used for transcript identification. Then, I define the isoform function prediction problem and explain how it differs from the better-known gene function prediction problem. In the second chapter of this dissertation, I describe our study where the overall transcriptome of the kidney is studied using both long reads from the PacBio platform and RNA-seq short reads from the Illumina platform. 
We used short reads to validate full-length transcripts found by long PacBio reads, and generated two high quality sets of transcript isoforms that are expressed in glomerular and tubulointerstitial compartments. In the third chapter, I describe our generic framework, where we implemented and evaluated several related algorithms for isoform function prediction for mouse isoforms. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm is the first effort to predict and differentiate isoform functions through large-scale genomic data integration. In the fourth chapter, I present the extension of the isoform function prediction study to the protein coding isoforms in human. We used a similar multiple instance learning (MIL)-based approach for predicting the function of protein coding splice variants in human. We evaluated our predictions using literature evidence for the ADAM15, LMNA/C, and DMXL2 genes. In the fifth and final chapter, I give a summary of the previous chapters and outline future directions for alternatively spliced isoform reconstruction and function prediction studies. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144017/1/ridvan_1.pd

    A novel approach to handwritten character recognition

    A number of new techniques and approaches for off-line handwritten character recognition are presented which individually make significant advancements in the field. First, an outline-based vectorization algorithm is described which gives improved accuracy in producing vector representations of the pen strokes used to draw characters. Later, vectorization and other types of preprocessing are criticized and an approach to recognition is suggested which avoids separate preprocessing stages by incorporating them into later stages. Apart from the increased speed of this approach, it allows more effective alteration of the character images since more is known about them at the later stages. It also allows the possibility of alterations being corrected if they are initially detrimental to recognition. A new feature measurement, the Radial Distance/Sector Area feature, is presented which is highly robust, tolerant to noise, distortion and style variation, and gives high accuracy results when used for training and testing in a statistical or neural classifier. A very powerful classifier is therefore obtained for recognizing correctly segmented characters. The segmentation task is explored in a simple system of integrated over-segmentation, character classification and approximate dictionary checking. This can be extended to a full system for handprinted word recognition. In addition to the advancements made by these methods, a powerful new approach to handwritten character recognition is proposed as a direction for future research. This proposal combines the ideas and techniques developed in this thesis in a hierarchical network of classifier modules to achieve context-sensitive, off-line recognition of handwritten text. A new type of "intelligent" feedback is used to direct the search to contextually sensible classifications. A powerful adaptive segmentation system is proposed which, when used as the bottom layer in the hierarchical network, allows initially incorrect segmentations to be adjusted according to the hypotheses of the higher level context modules.