689 research outputs found

    PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text

    Current computational methods for predicting exon-intron structure from a cluster of transcript (EST, mRNA) data do not exhibit the time and space efficiency necessary to process large clusters of more than 20,000 ESTs and genes longer than 1 Mb. Guaranteeing both accuracy and efficiency seems to be a computational goal still far from being achieved, since accuracy is strictly related to exploiting the inherent redundancy of information present in a large cluster. We propose a fast method for the problem that combines two ideas: a novel algorithm, with proven low time complexity, for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that support splice-site junctions highly confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings: sequences obtained from paths of a graph structure, called the Embedding Graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the size of P, T and the output. PIntron, the software tool implementing our methodology, is able to process in a few seconds some critical genes that are not manageable by other gene structure prediction tools. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when compared with ENCODE data. Detailed experimental data, additional results and the PIntron software are available at http://www.algolab.eu/PIntron
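The core notion of a maximal pairing between the EST P and the genome T can be illustrated with a small sketch. This is a naive O(|P|·|T|) dynamic program, not PIntron's linear-time index-based construction, and `min_len` is an illustrative parameter:

```python
def maximal_pairings(P, T, min_len=3):
    """Enumerate maximal common substrings (pairings) of pattern P and text T.

    A pairing (i, j, l) means P[i:i+l] == T[j:j+l] and the match cannot be
    extended to the left or right.  Simple O(|P|*|T|) dynamic programming;
    PIntron achieves better bounds with dedicated index structures.
    """
    n, m = len(P), len(T)
    # length[i][j] = length of the longest common suffix of P[:i] and T[:j]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if P[i - 1] == T[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
    pairings = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            l = length[i][j]
            # right-maximal: the next characters differ or a sequence ends;
            # left-maximality is implied by taking the longest common suffix
            if l >= min_len and (i == n or j == m or P[i] != T[j]):
                pairings.append((i - l, j - l, l))
    return pairings
```

In PIntron these pairings become the vertices of the Embedding Graph, whose paths yield the maximal embeddings used for spliced alignment.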

    A knowledge engineering approach to the recognition of genomic coding regions

    Supported by a research grant from Suranaree University of Technology, fiscal year B.E. 2556–255

    NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION

    Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed for analyzing the large volumes of data arising in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomic sequence classification. A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Building on MapSplice3, we further develop a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. As huge amounts of RNA-seq data have been shared through public datasets, they provide invaluable resources for researchers to test hypotheses by reusing existing data. To meet the need to efficiently query large-scale sequencing data, a novel method called SeqOthello has been developed. It efficiently queries sequence k-mers against large-scale datasets and determines whether a given sequence is present. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms, so efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures.
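The k-mer query semantics behind SeqOthello can be sketched with an ordinary hash map standing in for its compressed Othello hashing structure; the dataset names and the hit-ratio threshold `theta` below are assumptions for illustration:

```python
def build_kmer_index(datasets, k=5):
    """Map each k-mer to the set of dataset ids containing it.

    A plain dict stands in for SeqOthello's space-efficient Othello
    structure; the query semantics (k-mer -> which datasets) are the same.
    """
    index = {}
    for ds_id, seq in datasets.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(ds_id)
    return index

def query_sequence(index, query, k=5, theta=0.8):
    """Report datasets whose k-mer hit ratio over the query reaches theta."""
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    hits = {}
    for km in kmers:
        for ds_id in index.get(km, ()):
            hits[ds_id] = hits.get(ds_id, 0) + 1
    return {ds for ds, h in hits.items() if h / len(kmers) >= theta}
```

For example, `query_sequence(build_kmer_index({"d1": "ACGTACGTAC", "d2": "TTTTTTTTTT"}), "ACGTACGT")` reports only `d1`, since every query k-mer occurs there.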

    Intelligent Radio Spectrum Monitoring

    Spectrum monitoring is an important part of the radio spectrum management process, providing feedback on the workflow that allows for our current wirelessly interconnected lifestyle. The constantly increasing number of users and uses of wireless technologies is pushing the limits and capabilities of the existing infrastructure, demanding new alternatives to manage and analyse the extremely large volume of data produced by existing spectrum monitoring networks. This study addresses this problem by proposing an information management system architecture able to increase the analytical level of a spectrum monitoring measurement network. This proposal includes an alternative to manage the data produced by such a network, methods to analyse the spectrum data, and a way to automate the data gathering process. The study was conducted employing system requirements from the Brazilian National Telecommunications Agency, and related functional concepts were aggregated from the reviewed scientific literature and publications from the International Telecommunication Union. The proposed solution employs a microservice architecture to manage the data, including tasks such as format conversion, analysis, optimization and automation. To enable efficient data exchange between services, we proposed the use of a hierarchical structure created using the HDF5 format. The suggested architecture was partially implemented as a pilot project, which allowed us to demonstrate the viability of the presented ideas and perform an initial refinement of the proposed data format and analytical algorithms. The results pointed to the potential of the solution to solve some of the limitations of the existing spectrum monitoring workflow.
The proposed system may play a crucial role in the integration of spectrum monitoring activities into open data initiatives, promoting transparency and data reusability for this important public service.

    Santos Lobão, F. (2019). Intelligent Radio Spectrum Monitoring. http://hdl.handle.net/10251/128850
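The hierarchical layout proposed for exchanging spectrum data can be sketched with nested dicts standing in for HDF5 groups (in practice a library such as h5py would manage the actual HDF5 file); the station/band/timestamp schema below is an assumed example, not the exact format adopted by the pilot project:

```python
def store_measurement(root, station, band, timestamp, samples):
    """Insert a spectrum sweep under a hierarchical path, HDF5-style.

    Nested dicts stand in for HDF5 groups; each timestamped sweep becomes
    a leaf dataset.  The station/band/timestamp hierarchy is an assumed
    example schema for illustration only.
    """
    group = root.setdefault(station, {}).setdefault(band, {})
    group[timestamp] = list(samples)   # leaf "dataset": raw sweep samples
    return root

def list_sweeps(root, station, band):
    """Return the sorted timestamps recorded for one station/band group."""
    return sorted(root.get(station, {}).get(band, {}))
```

Grouping sweeps this way keeps related measurements addressable by path, which is the property that makes a hierarchical container convenient for exchange between microservices.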

    A review of estimation of distribution algorithms in bioinformatics

    Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics applications. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
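The univariate end of the EDA taxonomy can be made concrete with a minimal sketch: UMDA on the classic OneMax problem, where the probabilistic model is an independent Bernoulli marginal per bit. All parameter values here are illustrative:

```python
import random

def umda_onemax(n=20, pop=60, elite=30, gens=40, seed=1):
    """Univariate Marginal Distribution Algorithm on the OneMax problem.

    Each generation fits an independent Bernoulli model to the best
    solutions and samples the next population from it -- the simplest
    EDA variant (no dependencies between variables are modelled).
    """
    rng = random.Random(seed)
    p = [0.5] * n                       # marginal probability of bit = 1
    best = [0] * n
    for _ in range(gens):
        population = [[1 if rng.random() < p[i] else 0 for i in range(n)]
                      for _ in range(pop)]
        population.sort(key=sum, reverse=True)
        if sum(population[0]) > sum(best):
            best = population[0]
        selected = population[:elite]
        # re-estimate marginals from the promising solutions, clamped so
        # that no bit value becomes unreachable
        p = [min(0.95, max(0.05, sum(ind[i] for ind in selected) / elite))
             for i in range(n)]
    return best
```

More complex EDA variants in the taxonomy replace the independent marginals with tree- or network-structured models, at the cost of a more expensive model-learning step per generation.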

    Large Scale Genomic Sequence SVM Classifiers

    In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree (WD) kernel. In particular, we suggest several extensions using suffix trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the Spectrum kernel and the WD kernel, large-scale SVM training can be accelerated by factors of 20 and 4, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of support vectors). Our method allows us to train on sets as large as one million sequences.
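The Spectrum kernel itself is simple to state: the inner product of k-mer count vectors. A naive hash-map version can serve as a reference point (the paper's contribution is computing the equivalent quantity with suffix trees so that training and evaluation scale):

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k-spectrum kernel: inner product of the k-mer count vectors of x, y.

    Computed naively with hash maps here; suffix-tree based evaluation
    avoids materializing the counts for every support vector at test time.
    """
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[km] * cy[km] for km in cx if km in cy)
```

For example, `spectrum_kernel("ACGTACG", "ACGT")` counts the shared 3-mers ACG (twice in the first string) and CGT, giving 3.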

    Combining similarity in time and space for training set formation under concept drift

    Concept drift is a challenge in supervised learning for sequential data. It describes the phenomenon in which the data distribution changes over time. In such cases, classifier accuracy benefits from selective sampling of the training data. We develop a method for training set selection that is particularly relevant when the expected drift is gradual. Training set selection at each time step is based on the distance to the target instance, using a distance function that combines similarity in space and in time. The method determines an optimal training set size online at every time step using cross-validation. It is a wrapper approach and can be used with different base classifiers plugged in. The proposed method shows the best accuracy in its peer group on real and artificial drifting data, and its complexity is reasonable for field applications.
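A minimal sketch of the combined distance idea, assuming a scalar feature and a fixed training-set size rather than the cross-validated online size selection described above:

```python
def select_training_set(history, target_x, target_t, alpha=0.5, size=3):
    """Rank past instances by a combined space/time distance to the target.

    distance = alpha * |x - target_x| + (1 - alpha) * (target_t - t)

    history is a list of (t, x, y) triples; alpha trades off similarity in
    feature space against recency.  The fixed `size` replaces the paper's
    cross-validated choice, purely for illustration.
    """
    scored = sorted(history,
                    key=lambda e: alpha * abs(e[1] - target_x)
                                  + (1 - alpha) * (target_t - e[0]))
    return scored[:size]
```

With alpha near 1 the selection degenerates to nearest neighbours in feature space; with alpha near 0 it degenerates to a sliding window of the most recent instances, which is why blending the two suits gradual drift.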

    BEAT: Bioinformatics Exon Array Tool to store, analyze and visualize Affymetrix GeneChip Human Exon Array data from disease experiments

    <p>Abstract</p> <p>Background</p> <p>It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase Chain Reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. Last generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile providing information on the AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows performing analyses at both gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon arrays datasets. It combines a data warehouse approach with some rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at <url>http://beat.ba.itb.cnr.it</url>.</p> <p>Results</p> <p>BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza.</p> <p>To highlight all possible AS events, alternative names, accession Ids, Gene Ontology terms and biochemical pathways annotations are integrated with exon and gene level expression plots. 
The user can customize the results choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis.</p> <p>Conclusions</p> <p>Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics.</p>
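One standard exon-array statistic underlying this kind of analysis is the splicing index, which normalizes exon-level signal by gene-level signal before comparing two conditions; whether BEAT applies exactly this formula is an assumption:

```python
import math

def splicing_index(exon_a, gene_a, exon_b, gene_b):
    """log2 splicing index of one exon between conditions a and b.

    Exon-level signal is first divided by the gene-level signal, so a
    shift in the index reflects differential inclusion of the exon rather
    than overall differential expression of the gene.  Illustrative only;
    not necessarily the statistic implemented in BEAT.
    """
    return math.log2((exon_a / gene_a) / (exon_b / gene_b))
```

An index near zero indicates the exon follows its gene; large positive or negative values flag candidate alternative splicing events worth inspecting in the exon- and gene-level expression plots.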

    Automatic intron detection in metagenomes using neural networks.

    This thesis addresses the detection of introns in fungal metagenomes with deep neural networks. The exact biological mechanisms of intron recognition and splicing are not yet fully known, and their automated detection remains an open problem. Detection and removal of introns from DNA sequences is important for identifying genes in metagenomes and for searching for their homologs among the known DNA sequences available in public databases. Gene prediction and the discovery of homologs allow the identification of both known and new species and their taxonomic classification. Two neural network models were developed as part of this thesis; they detect intron starts and ends, the so-called donor and acceptor splice sites. The detected splice sites are then combined into candidate introns, and overlapping candidates are filtered by a simple score-based overlap-resolving algorithm. The work builds on an existing solution based on support vector machines (SVMs). The resulting neural networks achieve better results than the SVM while requiring more than an order of magnitude less computation to process an equally large genome.
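The score-based overlap resolution step can be sketched greedily; the (start, end, score) representation and the keep-best-then-discard rule are assumptions for illustration, since the abstract does not spell out the exact algorithm:

```python
def resolve_overlaps(candidates):
    """Greedy score-based filtering of overlapping candidate introns.

    candidates: list of (start, end, score) tuples built by pairing
    detected donor and acceptor splice sites.  Keep the highest-scoring
    intron, drop everything overlapping it, and repeat -- a simple
    stand-in for the thesis' scoring algorithm.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        # keep cand only if it lies entirely outside every kept interval
        if all(cand[1] <= k[0] or cand[0] >= k[1] for k in kept):
            kept.append(cand)
    return sorted(kept)
```

Given candidates at (10, 50), (40, 80) and (60, 100) with decreasing scores, the middle one overlaps the best-scoring interval and is discarded, leaving two non-overlapping introns.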