689 research outputs found
PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text
Current computational methods for predicting exon-intron structure from a cluster of transcript (EST, mRNA) data do not exhibit the time and space efficiency needed to process large clusters of more than 20,000 ESTs and genes longer than 1 Mb. Guaranteeing both accuracy and efficiency remains a distant computational goal, since accuracy is strictly related to exploiting the inherent redundancy of information present in a large cluster. We propose a fast method for the problem that combines two ideas: a novel algorithm, with proven low time complexity, for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow the inference of splice junctions highly confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, i.e. sequences obtained from paths of a graph structure, called the Embedding Graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the size of P, T and the output. PIntron, the software tool implementing our methodology, is able to process in a few seconds some critical genes that are not manageable by other gene-structure prediction tools. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when compared with ENCODE data. Detailed experimental data, additional results and the PIntron software are available at
http://www.algolab.eu/PIntron
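The central object of the method, a maximal pairing of the genome T and the EST P, is an occurrence of a common substring that cannot be extended to the left or right. The quadratic scan below is only a hedged illustration of that definition, not PIntron's algorithm, which computes the pairings efficiently and links them into the Embedding Graph; the sequences and the `min_len` threshold are invented for the example.

```python
def maximal_pairings(P, T, min_len=3):
    """All maximal pairings: common substrings of P and T, length >= min_len,
    that cannot be extended to the left or to the right."""
    pairs = []
    for p in range(len(P)):
        for t in range(len(T)):
            # left-maximality: the preceding characters must differ,
            # unless we are at the start of P or T
            if p and t and P[p - 1] == T[t - 1]:
                continue
            l = 0
            while p + l < len(P) and t + l < len(T) and P[p + l] == T[t + l]:
                l += 1
            if l >= min_len:  # right-maximal by construction of the scan
                pairs.append((p, t, l))
    return pairs

# An EST whose two "exons" flank a (lowercased) intron in the genome:
print(maximal_pairings("ACGTTTAA", "ACGTgtcagTTTAA"))  # → [(0, 0, 4), (3, 9, 5)]
```

Each pairing reported here corresponds to one putative exon alignment; in the real method such pairings become vertices of the Embedding Graph.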
A knowledge engineering approach to the recognition of genomic coding regions
Supported by a research grant from Suranaree University of Technology, fiscal year B.E. 2556-255
Novel Computational Methods for Sequencing Data Analysis: Mapping, Query, and Classification
Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed to analyze the resulting large volumes of data across different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomic sequence classification.
A critical step of RNA-seq data analysis is mapping the RNA-seq reads onto a reference genome. This dissertation presents a novel splice-alignment tool, MapSplice3. It achieves high read-alignment and base-mapping yields and can simultaneously detect splice junctions, gene fusions, and circular RNAs. Building on MapSplice3, we further developed a lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. As huge amounts of RNA-seq data have been shared through public datasets, they provide invaluable resources for researchers to test hypotheses by reusing existing data. To meet the need to efficiently query large-scale sequencing data, a novel method called SeqOthello has been developed. It efficiently queries sequence k-mers against large-scale datasets and determines the presence of a given sequence. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms, so efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures.
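As a rough illustration of the kind of query SeqOthello answers, the sketch below decomposes a query sequence into k-mers, checks each against per-dataset indexes, and reports presence where the hit fraction passes a threshold. This is an assumption-laden stand-in: plain Python sets replace SeqOthello's compressed Othello hashing structure, and the dataset names, k, and threshold are invented.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(reads, k):
    """Collect one dataset's k-mer set (a stand-in for the real index)."""
    index = set()
    for read in reads:
        index.update(kmers(read, k))
    return index

def query(seq, indexes, k, threshold=0.8):
    """For each dataset, report whether enough query k-mers are present."""
    qk = kmers(seq, k)
    return {name: sum(km in index for km in qk) / len(qk) >= threshold
            for name, index in indexes.items()}

datasets = {
    "sample_A": build_index(["ACGTACGTAC", "GTACGTTT"], k=4),
    "sample_B": build_index(["TTTTCCCCGG"], k=4),
}
print(query("ACGTACGT", datasets, k=4))  # → {'sample_A': True, 'sample_B': False}
```

The threshold on the k-mer hit fraction is what turns per-k-mer membership answers into a presence/absence call for the whole query sequence.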
Intelligent Radio Spectrum Monitoring
Spectrum monitoring is an important part of the radio spectrum management
process, providing feedback to the workflow that enables our current wirelessly
interconnected lifestyle. The constantly increasing number of users and uses of wireless
technologies is pushing the limits and capabilities of the existing infrastructure,
demanding new alternatives to manage and analyse the extremely large volume of data
produced by existing spectrum monitoring networks. This study addresses this problem
by proposing an information management system architecture able to increase the
analytical level of a spectrum monitoring measurement network. This proposal includes
an alternative to manage the data produced by such network, methods to analyse the
spectrum data and to automate the data gathering process. The study was conducted
employing system requirements from the Brazilian National Telecommunications
Agency and related functional concepts were aggregated from the reviewed scientific
literature and publications from the International Telecommunication Union. The
proposed solution employs a microservice architecture to manage the data, including
tasks such as format conversion, analysis, optimization and automation. To enable efficient
data exchange between services, we proposed the use of a hierarchical structure created
using the HDF5 format. The suggested architecture was partially implemented as a pilot
project, which allowed us to demonstrate the viability of the presented ideas and to
perform an initial refinement of the proposed data format and analytical algorithms. The results
pointed to the potential of the solution to solve some of the limitations of the existing
spectrum monitoring workflow. The proposed system may play a crucial role in the
integration of the spectrum monitoring activities into open data initiatives, promoting
transparency and data reusability for this important public service.

Santos Lobão, F. (2019). Intelligent Radio Spectrum Monitoring. http://hdl.handle.net/10251/128850 (TFG)
A review of estimation of distribution algorithms in bioinformatics
Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics applications. Genetic algorithms, the best-known and most representative evolutionary search technique, have been the subject of the majority of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
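A minimal sketch of the simplest EDA variant, the univariate marginal distribution algorithm (UMDA), may make the paradigm concrete: sample a population from a product of Bernoulli marginals, select the promising solutions, and re-estimate the marginals from them. The OneMax objective and all parameters below are illustrative, not drawn from the reviewed works.

```python
import random

def umda(fitness, n_bits, pop_size=100, select=30, generations=50, seed=0):
    """Univariate marginal distribution algorithm (UMDA), the simplest EDA."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                        # initial Bernoulli marginals
    pop = []
    for _ in range(generations):
        # sample a population from the current probabilistic model
        pop = [[int(rng.random() < p[i]) for i in range(n_bits)]
               for _ in range(pop_size)]
        pop.sort(key=fitness, reverse=True)
        elite = pop[:select]                  # the promising solutions
        # re-estimate each bit's marginal probability from the elite set
        p = [sum(ind[i] for ind in elite) / select for i in range(n_bits)]
        p = [min(max(q, 0.02), 0.98) for q in p]  # avoid premature fixation
    return max(pop, key=fitness)

# Toy objective: maximize the number of ones (OneMax)
best = umda(fitness=sum, n_bits=20)
print(sum(best))
```

More expressive EDA variants replace the independent marginals with models that capture dependencies between variables, which is exactly the axis along which the paper's taxonomy is organized.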
Large Scale Genomic Sequence SVM Classifiers
In genomic sequence analysis tasks such as splice-site recognition or promoter identification, large amounts of training sequences are available, and are indeed needed to achieve sufficiently high classification performance. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree (WD) kernel. In particular, we suggest several extensions using suffix trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the Spectrum and WD kernels, large-scale SVM training can be accelerated by factors of 20 and 4, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of support vectors). Our method allows us to train on sets as large as one million sequences.
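The Spectrum kernel itself has a compact definition: the inner product of the two sequences' k-mer count vectors. The brute-force sketch below only illustrates that definition; the paper's contribution lies in evaluating it at scale with suffix trees, which this toy code does not attempt.

```python
from collections import Counter

def spectrum(seq, k):
    """k-mer count vector of a sequence, kept sparse as a Counter."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k):
    """Inner product of the two k-mer spectra."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[kmer] for kmer, count in sx.items())

print(spectrum_kernel("ACGTACGT", "ACGTT", k=3))  # → 4 (shared 3-mers: ACG, CGT)
```

Because the spectra are sparse, iterating over one Counter and probing the other already avoids materializing the full 4^k-dimensional feature space.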
Combining similarity in time and space for training set formation under concept drift
Concept drift is a challenge in supervised learning for sequential data. It describes the phenomenon of data distributions changing over time. In such cases, classifier accuracy benefits from selective sampling of the training data. We develop a method for training set selection that is particularly relevant when the expected drift is gradual. Training set selection at each time step is based on the distance to the target instance; the distance function combines similarity in space with similarity in time. The method determines an optimal training set size online at every time step using cross-validation. As a wrapper approach, it can be used with different base classifiers plugged in. The proposed method shows the best accuracy in its peer group on real and artificial drifting data, and its complexity is reasonable for field applications.
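A hedged sketch of the core idea, under assumed details: rank historical instances by a distance mixing feature-space dissimilarity with temporal distance, train only on the nearest ones, and classify. The weight `alpha`, the fixed training-set size, and the 1-NN base classifier are illustrative choices; the paper's cross-validated online selection of the training-set size is omitted here.

```python
import math

def combined_distance(x, t, xi, ti, alpha=0.1):
    """Distance mixing dissimilarity in feature space with distance in time."""
    return math.dist(x, xi) + alpha * abs(t - ti)

def select_training_set(history, x, t, size, alpha=0.1):
    """history: list of (features, timestamp, label); keep the `size` nearest."""
    ranked = sorted(history,
                    key=lambda h: combined_distance(x, t, h[0], h[1], alpha))
    return ranked[:size]

def predict_1nn(history, x, t, size=5):
    """Classify with 1-NN trained only on the selected instances."""
    train = select_training_set(history, x, t, size)
    nearest = min(train, key=lambda h: math.dist(x, h[0]))
    return nearest[2]

# Toy stream with gradual drift: the class boundary moves over time.
history = [((i / 10,), i, int(i >= 5)) for i in range(10)]
print(predict_1nn(history, (0.85,), t=10))
```

Because recent instances are penalized less by the time term, the selected training set automatically leans toward the current concept as the drift progresses.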
BEAT: Bioinformatics Exon Array Tool to store, analyze and visualize Affymetrix GeneChip Human Exon Array data from disease experiments
Background: It is known from recent studies that more than 90% of human multi-exon genes are subject to Alternative Splicing (AS), a key molecular mechanism in which multiple transcripts may be generated from a single gene. It is widely recognized that a breakdown in AS mechanisms plays an important role in cellular differentiation and pathologies. Polymerase chain reactions, microarrays and sequencing technologies have been applied to the study of transcript diversity arising from alternative expression. Last-generation Affymetrix GeneChip Human Exon 1.0 ST Arrays offer a more detailed view of the gene expression profile, providing information on AS patterns. The exon array technology, with more than five million data points, can detect approximately one million exons, and it allows analyses at both the gene and exon level. In this paper we describe BEAT, an integrated user-friendly bioinformatics framework to store, analyze and visualize exon array datasets. It combines a data warehouse approach with rigorous statistical methods for assessing the AS of genes involved in diseases. Meta statistics are proposed as a novel approach to explore the analysis results. BEAT is available at http://beat.ba.itb.cnr.it.
Results: BEAT is a web tool which allows uploading and analyzing exon array datasets using standard statistical methods and an easy-to-use graphical web front-end. BEAT has been tested on a dataset with 173 samples and tuned using new datasets of exon array experiments from 28 colorectal cancer and 26 renal cell cancer samples produced at the Medical Genetics Unit of IRCCS Casa Sollievo della Sofferenza. To highlight all possible AS events, alternative names, accession IDs, Gene Ontology terms and biochemical pathway annotations are integrated with exon- and gene-level expression plots. The user can customize the results by choosing custom thresholds for the statistical parameters and exploiting the available clinical data of the samples for a multivariate AS analysis.
Conclusions: Despite exon array chips being widely used for transcriptomics studies, there is a lack of analysis tools offering advanced statistical features and requiring no programming knowledge. BEAT provides a user-friendly platform for a comprehensive study of AS events in human diseases, displaying the analysis results with easily interpretable and interactive tables and graphics.
Automatic intron detection in metagenomes using neural networks.
This work concerns the detection of introns in fungal metagenomes with deep neural networks. The exact biological mechanisms of intron recognition and splicing are not yet fully known, and their automated detection is not considered a solved problem. Detection and removal of introns from DNA sequences is important for identifying genes in metagenomes and for searching for their homologs among the known DNA sequences available in public databases. Gene prediction and the discovery of homologs allow the identification of both known and new species and their taxonomic classification. Two neural network models were developed as part of this thesis; they detect the starts and ends of introns, the so-called donor and acceptor splice sites. The detected splice sites are then combined into candidate introns, and overlapping candidates are filtered out by a simple score-based overlap-resolution algorithm. The work builds on an existing solution based on support vector machines (SVMs). The resulting neural networks achieve better results than the SVM while requiring more than an order of magnitude less computation time to process an equally large genome.
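The candidate-combination step described above can be sketched as follows, with simple motif matching standing in for the neural networks' splice-site predictions: donor (GT) and acceptor (AG) positions are paired into candidate introns, and overlapping candidates are resolved greedily by score. The score (here just the intron length), the length bounds, and the sequence are toy assumptions, not the thesis's actual scoring.

```python
def find_sites(seq, motif):
    """Positions where a 2-bp splice-site motif occurs."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == motif]

def candidate_introns(seq, min_len=4, max_len=60):
    """Pair each donor (GT) with each downstream acceptor (AG) in range.
    Each candidate is (start, end, score); the score is the intron length,
    a toy stand-in for a neural-network confidence."""
    cands = []
    for d in find_sites(seq, "GT"):
        for a in find_sites(seq, "AG"):
            length = a + 2 - d
            if min_len <= length <= max_len:
                cands.append((d, a + 2, length))
    return cands

def resolve_overlaps(cands):
    """Greedily keep the highest-scoring candidates that do not overlap."""
    kept = []
    for start, end, score in sorted(cands, key=lambda c: -c[2]):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)

seq = "AAGTCCCCAGTTGTAAAAGGG"
print(resolve_overlaps(candidate_introns(seq)))  # → [(2, 19, 17)]
```

In the real pipeline, the per-site scores emitted by the donor and acceptor networks would replace the length-based score fed to the same greedy overlap resolution.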
- …