
    Designing Efficient Spaced Seeds for SOLiD Read Mapping

    The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies, on the one hand, on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, on a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach with several seed designs and demonstrate their efficiency.
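    The core idea of a spaced seed can be sketched as follows: a mask marks which positions must match ('#') and which are don't-care positions ('-'), and only the marked positions form the index key. The toy Python sketch below (an illustration of the general technique, not one of the paper's designed seeds, and ignoring SOLiD color space) shows the key-extraction step:

```python
def seed_keys(read, seed):
    """Slide the seed mask over the read; at each offset, keep only the
    characters under '#' positions to form the lookup key."""
    span = len(seed)
    match_positions = [i for i, c in enumerate(seed) if c == "#"]
    keys = []
    for off in range(len(read) - span + 1):
        keys.append("".join(read[off + p] for p in match_positions))
    return keys

print(seed_keys("ACGTACGT", "##-#"))  # → ['ACT', 'CGA', 'GTC', 'TAG', 'ACT']
```

A mismatch falling under a '-' position leaves the key unchanged, which is what lets a spaced seed tolerate an SNP or reading error at a don't-care position while still retrieving the candidate mapping location.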

    Locality-Sensitive Bucketing Functions for the Edit Distance

    Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁, d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds on these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights into the hardness of designing ungapped LSH functions.
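    As a toy illustration of the bucketing idea (a folklore single-deletion construction, not one of the paper's optimal LSB functions), a length-n sequence can be hashed into the n buckets named by its single-character deletions; two equal-length sequences differing by one substitution are then guaranteed to share a bucket, since deleting the substituted position from both yields the same string:

```python
def deletion_buckets(seq):
    """Map a sequence to the set of bucket names obtained by deleting
    exactly one character."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}

def share_bucket(s, t):
    """True if the two sequences are hashed into at least one common bucket."""
    return bool(deletion_buckets(s) & deletion_buckets(t))

# 'ACGT' and 'AGGT' (one substitution) share bucket 'AGT';
# 'AAAA' and 'TTTT' map to disjoint bucket sets.
```

Proving tight (d₁, d₂) guarantees and bucket-count lower bounds for constructions of this flavor is exactly the kind of analysis the paper formalizes.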

    Eukaryotic Plant Pathogen Detection Through High Throughput DNA/RNA Sequencing Data Analysis

    Plant pathogen detection is crucial for developing appropriate management techniques. A variety of tools are available for rapid plant pathogen detection. Most tools rely on unique features of the pathogen to detect its presence: immunoassays rely on unique proteins, while genetic approaches rely on unique DNA signatures. However, most of these tools can detect only a limited number of pathogens at once. E-probe Diagnostics Nucleic acid Analysis (EDNA) is a bioinformatic tool originally designed as a theoretical approach to detect multiple plant pathogens at once. EDNA uses metagenomic databases and bioinformatics to infer the presence or absence of plant pathogens in a given sample. Additionally, EDNA relies on the continuous design and curation of unique signatures termed e-probes. EDNA has been successfully validated on viral, bacterial and eukaryotic plant pathogens. However, most of these validations have been performed solely at the species level and only using DNA sequencing. My thesis involved refining EDNA to extend its detection scope to plant pathogens at the strain/isolate level. Additional refinements increased EDNA's capacity to use transcriptomic analysis to detect actively infecting plant pathogens and metabolic pathways. Detection of actively infecting/growing plant pathogens was demonstrated using Sclerotinia minor as a eukaryotic model system. We sequenced and annotated the genome of S. minor in order to use it for e-probe generation. In vitro detection of actively growing S. minor was successfully achieved using EDNA for RNA sequencing analysis; however, actively infecting S. minor in peanut was not detectable. EDNA's capacity to detect the aflatoxin metabolic pathway was also assessed. An actively aflatoxin-producing A. flavus strain (AF70) was successfully used to differentially detect aflatoxin production when A. flavus grows in an environment conducive to aflatoxin production (maize). 
Finally, EDNA's detection scope was assessed with eukaryotic strains having very low genetic diversity within their species (Pythium aphanidermatum). We were able to successfully discriminate the P. aphanidermatum P16 strain from P. aphanidermatum BR444; concomitantly, these two strains were differentiated from other related species (Globisporangium irregulare and Pythium deliense) in the same detection run.
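    The e-probe idea can be sketched as a presence/absence query over sequencing reads. The following Python toy (with hypothetical probe and read sequences; EDNA itself scores alignment hits statistically rather than requiring exact substrings) shows the shape of the computation:

```python
def probe_hits(reads, probes):
    """Count, for each e-probe, how many reads contain it exactly.
    Exact substring matching is a simplification of EDNA's scoring."""
    hits = {p: 0 for p in probes}
    for read in reads:
        for p in probes:
            if p in read:
                hits[p] += 1
    return hits

# A sample is called positive for a pathogen when enough of its
# e-probes accumulate hits above a significance threshold.
```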

    Faster FPT Algorithm for 5-Path Vertex Cover

    The problem of d-Path Vertex Cover (d-PVC) is to determine a subset F of vertices of a given graph G=(V,E) such that G - F does not contain a path on d vertices. The paths we aim to cover need not be induced. It is known that the d-PVC problem is NP-complete for any d >= 2. When parameterized by the size of the solution k, 5-PVC has a direct trivial algorithm with O(5^k n^{O(1)}) running time and, since d-PVC is a special case of d-Hitting Set, an algorithm running in O(4.0755^k n^{O(1)}) time is known. In this paper we present an iterative compression algorithm that solves the 5-PVC problem in O(4^k n^{O(1)}) time.
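    To make the definition concrete, here is a minimal Python sketch (a brute-force verifier, not the paper's iterative-compression algorithm) that checks the defining condition: F is a d-PVC iff deleting F leaves no path on d vertices:

```python
def has_path_on_d(adj, d):
    """True if the graph (adjacency dict) contains a simple path on d
    vertices, found by depth-first enumeration of simple paths."""
    def dfs(v, visited, length):
        if length == d:
            return True
        for u in adj.get(v, ()):
            if u not in visited and dfs(u, visited | {u}, length + 1):
                return True
        return False
    return any(dfs(v, {v}, 1) for v in adj)

def is_pvc(adj, F, d=5):
    """Check whether F is a d-path vertex cover of the graph."""
    remaining = {v: [u for u in nbrs if u not in F]
                 for v, nbrs in adj.items() if v not in F}
    return not has_path_on_d(remaining, d)
```

On the path 1-2-3-4-5, F = {3} is a valid 5-PVC while the empty set is not; the paper's contribution is deciding in O(4^k n^{O(1)}) time whether a cover of size k exists.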

    Applied Randomized Algorithms for Efficient Genomic Analysis

    The scope and scale of biological data continues to grow at an exponential clip, driven by advances in genetic sequencing, annotation and widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, it has outgrown the practical reach of many traditional algorithmic approaches in both time and space. Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph-structured data. We applied recent advances in randomized algorithms to practical problems. We used MinHash and HyperLogLog, both examples of Locality-Sensitive Hashing, as well as coresets, which are approximate representations for finite sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling. We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
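    The MinHash idea mentioned above can be sketched in a few lines (a simplified salted-hash variant for illustration, not the sketch library's implementation): each of m hash functions keeps its minimum value over a k-mer set, and the fraction of agreeing minima between two signatures estimates the Jaccard similarity of the sets:

```python
import hashlib

def kmers(seq, k=4):
    """The set of length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum 64-bit hash value
    over the set; similar sets agree on many of these minima."""
    sig = []
    for salt in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(),
                                salt=salt.to_bytes(8, "big")).digest()[:8],
                "big")
            for item in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because each signature has fixed size regardless of the input set, pairwise similarity over huge genome collections reduces to comparing small fingerprints, which is the sampling principle the thesis scales up.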

    De novo sequencing, assembly and analysis of the genome and transcriptome of the nematode Panagrolaimus superbus

    The nematode Panagrolaimus superbus can survive for extended periods of time in a desiccated state (anhydrobiosis) and is also freezing tolerant (cryobiotic). These adaptations make it an interesting candidate for genome and transcriptome sequencing using second generation high throughput methods. In this project the transcriptome of P. superbus was sequenced using the 454 (Roche) platform. To enrich for stress-related genes, nematodes were exposed to one of the following stresses: desiccation, cold, heat or oxidation. Equal numbers of nematodes from each stress treatment were combined with unstressed control nematodes prior to RNA extraction. Normalised and unnormalised cDNA libraries were prepared from this mixed population. A de novo assembly of the transcriptome was generated using a variety of assembly programs and strategies. A Sanger-sequenced expressed sequence dataset comprising 3,982 unigenes was fully annotated and integrated into the de novo transcriptome assembly. The de novo assembly has also been annotated and putative stress response genes were identified. The haploid karyotype of P. superbus was determined to be n=4. P. superbus genomic DNA was sequenced using 454 (Roche) methods along with 50 bp and 100 bp paired-end Illumina reads. Eight different gDNA assemblies were prepared, generating predicted genome sizes ranging from 87.9 kilobases to 159.7 kilobases. The longest contigs were obtained from the 454 genomic DNA assembly, while the assemblies of the Illumina reads generated shorter contigs. The gene order of the P. superbus mitochondrial DNA genome was obtained and a draft assembly of the mitochondrial genome is presented. The current transcriptome assembly is a resource suitable for use as a reference for aligning high throughput RNA-Seq reads. 
Both the transcriptome and genome assemblies can be used to generate a protein reference database for the mass spectrometry based identification of the proteome of control and desiccated P. superbus in future studies.

    Study of Strategies for Genetic Variant Discrimination and Detection by Optosensing

    Thesis by compendium. Current medicine is moving towards a more personalized approach based on the patients' molecular diagnosis through the study of specific biomarkers. Diagnosis, prognosis and therapy selection, applying this molecular principle, rely on identifying specific variations in the human genome, such as single nucleotide variations (SNV). A wide range of technologies is available to detect these biomarkers. However, many of the employed methods have limitations such as high cost, complexity, long analysis times, or requiring specialized personnel and equipment, making their massive incorporation in most healthcare systems impossible. Therefore, there is a need to research and develop analytical solutions that provide information on genetic variants and that can be implemented in different health scenarios with competitive and economically feasible performance. The main objective of this thesis has been to develop innovative strategies to solve the challenge of multiple detection of genetic variants that are found in a minority amount in patient samples, covering the demands associated with the clinical setting. Research tasks focused on the combination of allelic discrimination reactions with selective DNA amplification and the development of versatile optical detection systems. In order to meet the wide range of needs, in the first chapter, the analytical performance of the polymerase chain reaction (PCR) was improved by incorporating a thermocycling step and a blocking agent to selectively amplify minority variants, which were monitored by real-time fluorescence. In the second chapter, allelic discrimination was achieved by combining oligonucleotide ligation with recombinase polymerase amplification (RPA), which operates at a constant temperature, allowing point-of-care (POC) detection. 
    SNV identification was carried out by hybridization in microarray format, using Blu-Ray technology as the assay and detection platform. In the third chapter, RPA was integrated with the allele-specific hybridization chain reaction (AS-HCR) in an array format to genotype SNV from genomic DNA on a chip; the results were read using a smartphone. In the last chapter, a new bioluminescent reagent was synthesized and applied to real-time and endpoint DNA biomarker monitoring based on bioluminescence resonance energy transfer (BRET), eliminating the need for an excitation source. All the strategies allowed specific recognition of the target variant, even in samples containing as few as 20 copies of target genomic DNA. Results were sensitive (limit of detection 0.5% variant/total), reproducible (relative standard deviation < 19%), simple (3 steps or fewer) and fast (30-200 min), allowing simultaneous analysis of several genes. As proof of concept, these strategies were applied to detect and identify biomarkers associated with colorectal cancer and cardiological diseases in clinical samples. The results were validated by comparison with reference methods such as NGS and PCR, showing improved technical requirements and cost-effectiveness. In conclusion, this research made it possible to develop versatile genotyping tools with competitive analytical properties, applicable to different healthcare scenarios, from hospitals to limited-resource environments. These results are promising since they respond to the demand for alternative technologies for personalized molecular diagnostics. The authors acknowledge the financial support received from the Generalitat Valenciana PROMETEO/2020/094, GRISOLIA/2014/024 PhD Grant and GVA-FPI-2017 PhD grant, the Spanish Ministry of Economy and Competitiveness MINECO projects CTQ2016-75749-R and PID2019-110713RB-I00 and the European Regional Development Fund (ERDF). Lázaro Zaragozá, A. 
(2022). Study of Strategies for Genetic Variant Discrimination and Detection by Optosensing [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/185216

    The Role of Distributed Computing in Big Data Science: Case Studies in Forensics and Bioinformatics

    2014 - 2015. The era of Big Data is leading to the generation of large amounts of data, which require storage and analysis capabilities that can only be addressed by distributed computing systems. To facilitate large-scale distributed computing, many programming paradigms and frameworks have been proposed, such as MapReduce and Apache Hadoop, which transparently address some issues of distributed systems and hide most of their technical details. Hadoop is currently the most popular and mature framework supporting the MapReduce paradigm, and it is widely used to store and process Big Data using a cluster of computers. Solutions such as Hadoop are attractive, since they simplify the transformation of an application from a non-parallel to a distributed one by means of general utilities and without requiring specialized skills. However, without any algorithm engineering activity, some target applications are not altogether fast and efficient, and they can suffer from several problems and drawbacks when executed on a distributed system. In fact, a distributed implementation is a necessary but not sufficient condition to obtain remarkable performance with respect to a non-parallel counterpart. Therefore, it is necessary to assess how distributed solutions run on a Hadoop cluster, and how their performance can be improved to reduce resource consumption and completion times. In this dissertation, we show how Hadoop-based implementations can be enhanced through careful algorithm engineering, tuning, profiling and code improvements. We also analyze how to achieve these goals by working on some critical points, such as data-local computation, input split size, number and granularity of tasks, cluster configuration, and input/output representation. In particular, to address these issues, we choose case studies from two research areas where the amount of data is rapidly increasing, namely, Digital Image Forensics and Bioinformatics. 
We mainly describe full-fledged implementations to show how to design, engineer, improve and evaluate Hadoop-based solutions for the Source Camera Identification problem, i.e., recognizing the camera used for taking a given digital image, adopting the algorithm by Fridrich et al., and for two of the main problems in Bioinformatics, i.e., alignment-free sequence comparison and extraction of k-mer cumulative or local statistics. The results achieved by our improved implementations show that they are substantially faster than the non-parallel counterparts, and remarkably faster than the corresponding Hadoop-based naive implementations. In some cases, for example, our solution for k-mer statistics is approximately 30× faster than our Hadoop-based naive implementation, and about 40× faster than an analogous tool built on Hadoop. In addition, our applications are also scalable, i.e., execution times are (approximately) halved by doubling the computing units. Indeed, algorithm engineering activities based on the implementation of smart improvements and supported by careful profiling and tuning may lead to much better experimental performance while avoiding potential problems. We also highlight how the proposed solutions, tips, tricks and insights can be used in other research areas and problems. Although Hadoop simplifies some tasks of distributed environments, we must know it thoroughly to achieve remarkable performance. It is not enough to be an expert in the application domain to build Hadoop-based implementations; to achieve good performance, expertise in distributed systems, algorithm engineering, tuning, profiling, etc. is also required. Therefore, the best performance depends heavily on the degree of cooperation between the domain expert and the distributed algorithm engineer. [edited by Author]
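    The k-mer statistics case study maps naturally onto MapReduce. This minimal pure-Python simulation (a sketch of the paradigm, not the dissertation's Hadoop code) shows the three phases for k-mer counting:

```python
from collections import defaultdict

def map_phase(read, k=3):
    """Mapper: emit a (k-mer, 1) pair for every k-mer in the read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Group values by key, as the MapReduce framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each k-mer."""
    return {key: sum(values) for key, values in groups.items()}

reads = ["ACGTA", "CGTAC"]
pairs = [pair for read in reads for pair in map_phase(read)]
counts = reduce_phase(shuffle(pairs))
```

In a real Hadoop job the shuffle is performed by the framework, and the engineering concerns discussed above (input split size, task granularity, combiners that pre-aggregate mapper output) largely determine how efficiently these phases run at scale.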