22 research outputs found

    pWhatsHap: efficient haplotyping for future generation sequencing

    Background: Haplotype phasing is an important problem in the analysis of genomic information. Given a set of DNA fragments from an individual, it consists of determining which of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Such solutions typically have exponential computational complexity. WhatsHap is a recent optimal approach that moves the computational complexity from DNA fragment length to fragment overlap, i.e. coverage, and is hence of particular interest given current trends in sequencing technology, which are producing ever longer fragments.

    Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity, exploring a number of possible solutions that is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a substantial reduction of the execution time for haplotyping, while the results enjoy the same high accuracy as those provided by WhatsHap, which increases with coverage.

    Conclusions: Due to its structure and its management of large datasets, the parallelisation of WhatsHap posed demanding technical challenges, which have been addressed by exploiting a high-level parallel programming framework. The result, pWhatsHap, is a freely available toolkit that improves the efficiency of the analysis of genomic information.
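    WhatsHap's key idea, which pWhatsHap parallelises, is a dynamic program that sweeps across SNP positions and, at each position, enumerates bipartitions of the reads covering it; this is why the running time is exponential in coverage rather than in fragment length. The following is a minimal sketch of that idea for the unweighted Minimum Error Correction objective, assuming a toy data layout (a list of dicts mapping SNP position to allele); the function name is invented for illustration, and this is not the pWhatsHap implementation.

```python
from itertools import product

def mec_phase(reads):
    """Toy dynamic program for unweighted Minimum Error Correction.

    `reads` maps, per read, SNP position -> allele (0/1).  For each
    column, every bipartition of the covering reads is enumerated, so
    the cost is exponential in coverage but linear in the SNP count.
    Illustrative only; not the actual (p)WhatsHap implementation.
    """
    positions = sorted({p for r in reads for p in r})
    best = {(): 0}        # assignment of active reads -> min corrections
    prev_active = ()
    for col in positions:
        active = tuple(i for i, r in enumerate(reads) if col in r)
        shared = set(active) & set(prev_active)
        new_best = {}
        for assign in product((0, 1), repeat=len(active)):
            # Column cost: on each haplotype, correct the minority alleles.
            cost = 0
            for hap in (0, 1):
                alleles = [reads[i][col]
                           for i, h in zip(active, assign) if h == hap]
                cost += min(alleles.count(0), alleles.count(1)) if alleles else 0
            # Cheapest predecessor agreeing on reads spanning both columns.
            prev_cost = min(
                c for prev, c in best.items()
                if all(assign[active.index(i)] == prev[prev_active.index(i)]
                       for i in shared)
            )
            new_best[assign] = prev_cost + cost
        best, prev_active = new_best, active
    return min(best.values())
```

    For example, mec_phase([{0: 0, 1: 0}, {0: 1, 1: 1}]) returns 0, because the two reads can be placed on different haplotypes without correcting any allele.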

    Algorithms For Haplotype Inference And Block Partitioning

    The completion of the Human Genome Project in 2003 paved the way for studies to better understand and catalog variation in the human genome. The International HapMap Project was started in 2002 with the aim of identifying genetic variation in the human genome and studying the distribution of genetic variation across populations of individuals. The information collected by the HapMap Project will enable researchers to associate genetic variations with phenotypic variations. Single Nucleotide Polymorphisms (SNPs) are loci in the genome where two individuals differ in a single base. It is estimated that there are approximately ten million SNPs in the human genome. These ten million SNPs are not completely independent of each other: blocks (contiguous regions) of neighboring SNPs on the same chromosome are inherited together. The pattern of SNPs on a block of the chromosome is called a haplotype. Each block might contain a large number of SNPs, but a small subset of these SNPs is sufficient to uniquely identify each haplotype in the block. The haplotype map, or HapMap, is a map of these haplotype blocks. Haplotypes, rather than individual SNP alleles, are expected to affect a disease phenotype.

    The human genome is diploid, meaning that in each cell there are two copies of each chromosome; i.e., each individual has two haplotypes in any region of the chromosome. With current technology, the cost of empirically collecting haplotype data is prohibitive. Therefore, unordered bi-allelic genotype data is collected experimentally. The genotype data gives the two alleles at each SNP locus in an individual, but does not indicate which allele is on which copy of the chromosome. This necessitates computational techniques for inferring haplotypes from genotype data, a computational problem called the haplotype inference problem. Many statistical approaches have been developed for the haplotype inference problem, and some have been shown to be reasonably accurate on real genotype data. However, these techniques are very computation-intensive. With the International HapMap Project collecting information on nearly ten million SNPs, and with association studies involving thousands of individuals being undertaken, there is a need for more efficient methods for haplotype inference.

    This dissertation is an effort to develop efficient perfect-phylogeny-based combinatorial algorithms for haplotype inference. The perfect phylogeny haplotyping (PPH) problem is to derive a set of haplotypes for a given set of genotypes under the condition that the haplotypes describe a perfect phylogeny. The perfect phylogeny approach to haplotype inference is applicable to the human genome due to its block structure. An important contribution of this dissertation is an optimal O(nm) time algorithm for the PPH problem, where n is the number of genotypes and m is the number of SNPs involved. The complexity of earlier algorithms for this problem was O(nm^2). The O(nm) complexity was achieved by applying transformations to the input data and by making use of the FlexTree data structure, developed as part of this dissertation, which represents all possible PPH solutions for a given set of genotypes.

    Real genotype data does not always admit a perfect phylogeny, even within a block of the human genome. Therefore, it is necessary to extend the perfect phylogeny approach to accommodate deviations from perfect phylogeny, which may occur because of recombination events and repeated or back mutations (also referred to as homoplasy events). Another contribution of this dissertation is a set of fixed-parameter tractable algorithms for constructing near-perfect phylogenies with homoplasy events. For the problem of constructing a near-perfect phylogeny with q homoplasy events, the algorithm presented here takes O(nm^2+m^(n+m)) time. Empirical analysis on simulated data shows that this algorithm produces more accurate results than PHASE (a popular haplotype inference program), while being approximately 1000 times faster.

    Another important problem when dealing with real genotype or haplotype data is the presence of missing entries. The Incomplete Perfect Phylogeny (IPP) problem is to construct a perfect phylogeny on a set of haplotypes with missing entries; the Incomplete Perfect Phylogeny Haplotyping (IPPH) problem is to construct a perfect phylogeny on a set of genotypes with missing entries. Both the IPP and IPPH problems have been shown to be NP-hard. Earlier approaches dealt with restricted versions of these problems, where the root is either available or can be trivially reconstructed from the data, or where certain assumptions were made about the data. We make some novel observations about these problems and present efficient algorithms for their unrestricted versions. The algorithms have worst-case exponential time complexity, but have been shown to be very fast on practical instances.
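    For background on why real data may violate the perfect phylogeny assumption, the classical four-gamete condition characterises the haplotype case: a binary haplotype matrix admits a perfect phylogeny if and only if no pair of sites exhibits all four gamete patterns. The sketch below illustrates this test; it operates on complete haplotypes rather than genotypes and does not reproduce the dissertation's FlexTree-based PPH algorithm.

```python
from itertools import combinations

def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test: binary haplotype rows fit a perfect phylogeny
    iff no pair of SNP columns shows all four patterns 00, 01, 10, 11.
    Illustrative sketch; it handles neither genotypes nor missing data.
    """
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(row[i], row[j]) for row in haplotypes}
        if len(gametes) == 4:   # sites i and j are incompatible
            return False
    return True

# A recombination or back mutation can create the fourth gamete:
print(admits_perfect_phylogeny([(0, 0), (0, 1), (1, 0)]))          # True
print(admits_perfect_phylogeny([(0, 0), (0, 1), (1, 0), (1, 1)]))  # False
```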

    On Solving Selected Nonlinear Integer Programming Problems in Data Mining, Computational Biology, and Sustainability

    This thesis consists of three essays concerning the use of optimization techniques to solve four problems in the fields of data mining, computational biology, and sustainable energy devices. To the best of our knowledge, the particular problems we discuss have not previously been addressed using optimization, which is a specific contribution of this dissertation. We analyze each of the problems to capture its underlying essence, and demonstrate that each can be modeled as a nonlinear (mixed) integer program. We then discuss the design and implementation of solution techniques to locate optimal solutions to these problems. Running throughout this dissertation is the theme of using mixed-integer programming techniques in conjunction with context-dependent algorithms to identify optimal and previously undiscovered underlying structure.

    Estimating the Biogeographical Origin of the Unidentified Bodies of the Shipwreck of April 18th, 2015 in the Mediterranean: Comparison of Genetic and Anthropological Assessments

    The migratory crisis has recently drawn attention to the need to identify the victims of shipwrecks in the Mediterranean Sea. This study is part of a larger project carried out by the Forensic Laboratory of Anthropology and Odontology (LABANOF, Laboratorio di Antropologia e Odontologia Forense) of the Department of Biomedical Sciences of the University of Milan, aimed at identifying the victims of the shipwreck of April 18, 2015. Beyond this identification effort, the present research had the objective of comparing current methods of geographic origin estimation in forensic genetics and anthropology and of contributing to research in the field of identification. To this end, the remains of 150 victims were subjected to genetic investigation at the forensic genetics laboratories of the Universities of Brescia, Turin and Pavia and at Eurofins Genoma for identification purposes, through the typing of autosomal STR markers. On the 49 cases that yielded good-quality profiles, biogeographical ancestry estimation was then performed with different techniques, according to the protocols of the laboratories involved (Brescia and Turin); in particular, Y-chromosomal STR markers were typed, together with SNP markers on both the autosomes and the Y chromosome.

    Furthermore, since the individuals differed in the yield and recovery of genetic information useful for comparison with the profiles of alleged relatives, the possible correlation between the quality of the autosomal marker results and the taphonomic condition of the remains was investigated. The only variable that varied significantly (p-value < 0.01) was the time interval between the shipwreck and the autopsy procedures, during which the bone samples were taken: in almost all cases, the samples with optimal analytical results had been collected earlier (< 200 days). This finding underlines the importance of early victim recovery and identification procedures.

    In parallel, the same 49 cases were analysed anthropologically, applying morphological methods developed on populations comparable to those of origin of the victims, and the resulting estimates were compared with the genetic results. With regard to geographical ancestry, the predictions of the two methods based on cranial morphometric traits, OSSA and hefneR, appear to coincide when the group with the highest associated probability is considered; however, the low number of OSSA findings does not allow further considerations. As for the non-metric dental traits, analysed with the rASUDAS software, the resulting estimates showed little agreement with the cranial morphological predictions, in particular with those of the hefneR software, the more comparable of the two: the OSSA score, unlike the other two methods, does not include an Asian reference population and is therefore less suitable for comparison. The comparison with the predictions of biogeographical origin based on Y-STR polymorphisms showed a consistent pattern of agreement only with the estimates obtained with the hefneR method.

    Confirming these observations on a larger sample, once the identification process is completed, is an ambitious goal for future research. In conclusion, the present research made it possible to identify the potentially most suitable methods for estimating the geographical origin of the sample under analysis, opening new perspectives in the identification process of the unknown victims of the Mediterranean. Accurate anthropological and odontological methods can complement genetic investigations: the comparison of candidate identities can be restricted to individuals from a specific geographical area, the search for relatives can be focused on certain African countries, and suitable reference data can be selected for genetic comparison. In addition, knowledge of the link between the state of preservation of the remains and the quality of the genetic material preserved in them may allow the selection of the most suitable bone portion and/or protocols for subsequent genetic investigations.

    Ant Colony Optimization

    Ant Colony Optimization (ACO) is a prime example of how studies aimed at understanding and modeling the behavior of ants and other social insects can inspire the development of computational algorithms for the solution of difficult mathematical problems. Introduced by Marco Dorigo in his PhD thesis (1992) and initially applied to the travelling salesman problem, the ACO field has experienced tremendous growth and stands today as an important nature-inspired stochastic metaheuristic for hard optimization problems. This book presents state-of-the-art ACO methods and is divided into two parts: (I) Techniques, which includes parallel implementations, and (II) Applications, where recent contributions of ACO to diverse fields such as traffic congestion and control, structural optimization, manufacturing, and genomics are presented.
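    To make the metaheuristic concrete, here is a minimal Ant System sketch for the symmetric travelling salesman problem, in the spirit of Dorigo's original formulation; parameter names and default values are illustrative assumptions, not taken from the book.

```python
import random

def ant_system_tsp(dist, n_ants=20, n_iters=200, alpha=1.0, beta=3.0,
                   rho=0.5, q=1.0, seed=0):
    """Minimal Ant System for the symmetric TSP.

    `dist[i][j]` is the (positive) distance between cities i and j.
    Ants pick the next city with probability proportional to
    pheromone^alpha * (1/distance)^beta; trails evaporate at rate rho,
    and each ant deposits q/length on the edges of its tour.
    """
    rng = random.Random(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]                     # pheromone
    eta = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)]
           for i in range(n)]                               # visibility
    best_tour, best_len = None, float("inf")
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                i = tour[-1]
                cand = list(unvisited)
                w = [tau[i][j] ** alpha * eta[i][j] ** beta for j in cand]
                nxt = rng.choices(cand, weights=w)[0]
                tour.append(nxt)
                unvisited.remove(nxt)
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        for i in range(n):                                  # evaporation
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        for tour, length in tours:                          # deposit
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                tau[a][b] += q / length
                tau[b][a] += q / length
    return best_tour, best_len
```

    Later ACO variants covered in the book, such as MAX-MIN Ant System, mainly change how trail values are bounded and which ants are allowed to deposit pheromone.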

    Novel guidelines for the analysis of single nucleotide polymorphisms in disease association studies

    How genetic mutations such as Single Nucleotide Polymorphisms (SNPs) affect the risk of contracting a specific disease is still an open question for numerous medical conditions. Two problems related to SNP analysis are (i) the selection of computational techniques to discover possible single- and multiple-SNP associations, and (ii) the size of the latest datasets, which may contain millions of SNPs. In order to find associations between SNPs and diseases, two popular techniques are investigated and enhanced. Firstly, the ‘Transmission Disequilibrium Test’ for family-based analysis is considered. The fixed length of the haplotypes provided by this approach potentially limits the quality of the results obtained; for this reason, an adaptation is proposed to select the minimum number of SNPs that are responsible for disease predisposition. Secondly, decision tree algorithms for case-control analysis of unrelated individuals are considered. The application of a single tool may lead to a limited analysis of the genetic association with a specific condition; thus, a novel consensus approach is proposed that exploits the strengths of three different algorithms, ADTree, C4.5 and ID3. The results obtained suggest that the new approach achieves improved performance. The recent explosive growth in the size of SNP databases has highlighted limitations in current techniques. An example is ‘Linkage Disequilibrium’ analysis, which identifies redundancy among multiple SNPs: despite the high accuracy obtained by this method, it exhibits poor scalability for large datasets, which severely impacts its performance. Therefore, a new, fast, scalable tool based on ‘Linkage Disequilibrium’ is developed to reduce dataset size by measuring and eliminating redundancy between the SNPs in the initial dataset. Experimental evidence validates the potentially improved performance of the new method.
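    As an illustration of the redundancy-elimination idea, the sketch below greedily prunes SNPs whose squared genotype correlation, a common proxy for the LD measure r^2, with an already retained SNP exceeds a threshold. The dosage coding (0/1/2), function name, and threshold are assumptions for the example; this naive version is quadratic in the number of SNPs, whereas a scalable tool would, for instance, restrict comparisons to a window along the chromosome.

```python
import numpy as np

def ld_prune(genotypes, r2_threshold=0.8):
    """Greedy LD pruning of an (individuals x SNPs) dosage matrix (0/1/2).

    A SNP is dropped when its squared genotype correlation with an
    already retained SNP exceeds the threshold.  Naive O(m^2) sketch,
    not the thesis's scalable tool.
    """
    g = np.asarray(genotypes, dtype=float)
    keep = []
    for j in range(g.shape[1]):
        if g[:, j].std() == 0.0:          # monomorphic SNP: uninformative
            continue
        redundant = False
        for k in keep:
            r = np.corrcoef(g[:, j], g[:, k])[0, 1]
            if r * r > r2_threshold:      # SNP j is tagged by SNP k
                redundant = True
                break
        if not redundant:
            keep.append(j)
    return keep                           # indices of retained SNPs
```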