40 research outputs found

    Efficient path-based computations on pedigree graphs with compact encodings

    A pedigree is a diagram of family relationships, and it is often used to determine the mode of inheritance (dominant, recessive, etc.) of genetic diseases. With rapidly growing knowledge of genetics and the accumulation of genealogical information, pedigree data are becoming increasingly important. In large pedigree graphs, path-based methods for efficiently computing genealogical measurements, such as inbreeding and kinship coefficients of individuals, depend on efficient identification and processing of paths. In this paper, we propose a new compact path encoding scheme for large pedigrees, accompanied by an efficient algorithm for identifying paths. We demonstrate the use of our method by applying it to inbreeding coefficient computation. We present time and space complexity analyses, and show the efficiency of our method for evaluating inbreeding coefficients compared to previous methods through experiments on pedigree graphs with real and synthetic data. Both theoretical and experimental results demonstrate that our method is more scalable and efficient than previous methods in terms of time and space requirements.
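    The kind of computation the paper targets can be made concrete with a minimal sketch. The code below uses an illustrative pedigree of its own and implements the classical recursive kinship computation, not the paper's path-based encoding; it computes an individual's inbreeding coefficient as the kinship of its parents:

```python
# Minimal sketch (not the paper's method): classical recursive kinship
# computation over a pedigree. Pedigree maps each individual to its
# (father, mother); founders map to (None, None).
from functools import lru_cache

PEDIGREE = {
    "A": (None, None), "B": (None, None),   # founders
    "C": ("A", "B"), "D": ("A", "B"),       # full siblings
    "X": ("C", "D"),                        # offspring of a full-sib mating
}
ORDER = {ind: i for i, ind in enumerate(PEDIGREE)}  # parents precede children

@lru_cache(maxsize=None)
def kinship(a, b):
    if a is None or b is None:
        return 0.0
    if a == b:
        f, m = PEDIGREE[a]
        return 0.5 * (1 + kinship(f, m))
    if ORDER[a] < ORDER[b]:        # recurse through the later-born individual
        a, b = b, a
    f, m = PEDIGREE[a]
    return 0.5 * (kinship(f, b) + kinship(m, b))

def inbreeding(x):
    # An individual's inbreeding coefficient equals the kinship of its parents.
    f, m = PEDIGREE[x]
    return kinship(f, m)

print(inbreeding("X"))  # 0.25 for a full-sib mating
```

    Path-based methods compute the same values by enumerating paths through common ancestors instead of recursing over the whole pedigree, which is where compact path encodings pay off.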

    Path-Counting Formulas for Generalized Kinship Coefficients and Condensed Identity Coefficients

    An important computation on pedigree data is the calculation of condensed identity coefficients, which provide a complete description of the degree of relatedness of two individuals. The applications of condensed identity coefficients range from genetic counseling to disease tracking. Condensed identity coefficients can be computed using linear combinations of generalized kinship coefficients for two, three, or four individuals and for two pairs of individuals, and there are recursive formulas for computing those generalized kinship coefficients (Karigl, 1981). Path-counting formulas have been proposed for the (generalized) kinship coefficients for two (three) individuals, but there have been no path-counting formulas for the other generalized kinship coefficients. It has also been shown that computing the (generalized) kinship coefficients for two (three) individuals using path-counting formulas, together with path encoding schemes tailored to pedigree graphs, is efficient for large pedigrees. In this paper, we propose a framework for deriving path-counting formulas for generalized kinship coefficients. We then present path-counting formulas for all generalized kinship coefficients for which recursive formulas exist and which are sufficient for computing condensed identity coefficients. We also perform experiments comparing the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees.
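    As a hedged illustration of what (generalized) kinship coefficients mean, the sketch below estimates them by gene dropping: a kinship coefficient is the probability that alleles sampled at random, one per listed individual, are identical by descent. This Monte Carlo method is generic and is not the paper's path-counting approach; the pedigree is illustrative.

```python
# Illustrative sketch: estimating kinship coefficients by gene dropping.
# Founders get unique allele labels, alleles are transmitted randomly, and
# we estimate the probability that sampled alleles are identical by descent.
import random

PEDIGREE = {  # individual -> (father, mother); parents listed before children
    "A": (None, None), "B": (None, None),   # founders
    "C": ("A", "B"), "D": ("A", "B"),       # full siblings
}

def drop_genes(rng):
    genotypes, label = {}, 0
    for ind, (f, m) in PEDIGREE.items():
        if f is None:                        # founder: two fresh allele labels
            genotypes[ind] = (label, label + 1)
            label += 2
        else:                                # one random allele from each parent
            genotypes[ind] = (rng.choice(genotypes[f]), rng.choice(genotypes[m]))
    return genotypes

def kinship_mc(individuals, n=20000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        g = drop_genes(rng)
        sampled = {rng.choice(g[i]) for i in individuals}
        hits += len(sampled) == 1            # all sampled alleles identical?
    return hits / n

print(kinship_mc(["C", "D"]))  # close to 0.25, the kinship of full siblings
```

    The same estimator extends to three or more individuals, which is exactly the regime where the paper's path-counting formulas replace simulation or recursion with exact path enumeration.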

    Doctor of Philosophy

    Serving as a record of what happened during a scientific process, often a computational one, provenance has become an important part of computing. The importance of archiving not only data and results but also the lineage of these entities has led to a variety of systems that capture provenance, as well as models and schemas for this information. Despite significant work focused on obtaining and modeling provenance, there has been little work on managing and using this information. Using the provenance from past work, it is possible to mine common computational structure or determine differences between executions. Such information can be used to suggest possible completions for partial workflows, summarize a set of approaches, or extend past work in new directions. These applications require infrastructure to support efficient queries and accessible reuse. In order to support knowledge discovery and reuse from provenance information, managing those data well is important. One component of provenance is the specification of the computations; workflows provide structured abstractions of code and are commonly used for complex tasks. Using change-based provenance, it is possible to store large numbers of similar workflows compactly. This storage also allows efficient computation of differences between specifications. However, querying for specific structure across a large collection of workflows is difficult because comparing graphs depends on computing subgraph isomorphism, which is NP-complete. Graph indexing methods identify features that help distinguish the graphs of a collection in order to filter results for a subgraph containment query and reduce the number of subgraph isomorphism computations. For provenance, this work extends these methods to work for more exploratory queries and for collections with significant overlap.
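    The graph-indexing idea can be sketched briefly: cheap graph features prune candidate graphs before any NP-complete subgraph isomorphism test is attempted. The example below uses node-label counts as the feature; real indexes use richer features (paths, subtrees), and the workflow data here is purely illustrative.

```python
# Hedged sketch of feature-based filtering for subgraph containment queries:
# a data graph can contain the query only if it has at least as many nodes
# of each label, so most candidates are discarded without an expensive
# subgraph isomorphism check.
from collections import Counter

def label_features(graph):
    """graph: dict node -> (label, successor list); feature = label counts."""
    return Counter(label for label, _ in graph.values())

def may_contain(data_feat, query_feat):
    # Necessary (not sufficient) condition for subgraph containment.
    return all(data_feat[l] >= n for l, n in query_feat.items())

workflows = {
    "wf1": {"a": ("read", ["b"]), "b": ("filter", ["c"]), "c": ("plot", [])},
    "wf2": {"a": ("read", ["b"]), "b": ("plot", [])},
}
query = {"q1": ("filter", ["q2"]), "q2": ("plot", [])}

qf = label_features(query)
candidates = [w for w, g in workflows.items()
              if may_contain(label_features(g), qf)]
print(candidates)  # only wf1 survives the filter
```

    Only the surviving candidates would then be handed to an exact subgraph isomorphism routine, which is the expensive step the index exists to avoid.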
However, comparing workflow or provenance graphs may not require exact equality; a match between two graphs may allow paired nodes to be similar yet not equivalent. This work presents techniques to better correlate graphs to help summarize collections. Using this infrastructure, provenance can be reused so that users can learn from their own and others' history. Just as textual search has been augmented with suggested completions based on past or common queries, provenance can be used to suggest how computations can be completed or which steps might connect to a given subworkflow. In addition, provenance can help further science by accelerating publication and reuse. By incorporating provenance into publications, authors can more easily integrate their results, and readers can more easily verify and repeat results. However, reusing past computations requires maintaining stronger associations with any input data and underlying code, as well as providing paths for migrating old work to new hardware or algorithms. This work presents a framework for maintaining data and code as well as supporting upgrades for workflow computations.

    Solving Optimization Problems via Maximum Satisfiability: Encodings and Re-Encodings

    NP-hard combinatorial optimization problems are commonly encountered in numerous different domains. As such, efficient methods for solving instances of such problems can save time, money, and other resources in several different applications. This thesis investigates exact declarative approaches to combinatorial optimization within the maximum satisfiability (MaxSAT) paradigm, using propositional logic as the constraint language of choice. Specifically, we contribute to both MaxSAT solving and encoding techniques. In the first part of the thesis we contribute to MaxSAT solving technology by developing solver-independent MaxSAT preprocessing techniques that re-encode MaxSAT instances into other instances. In order for preprocessing to be effective, the total time spent re-encoding the original instance and solving the new instance should be lower than the time required to solve the original instance directly. We show how the recently proposed label-based framework for MaxSAT preprocessing can be efficiently integrated with state-of-the-art MaxSAT solvers in a way that improves the empirical performance of those solvers. We also investigate the theoretical effect that label-based preprocessing has on the number of iterations MaxSAT solvers need to solve instances. We show that preprocessing does not improve the best-case performance (in the number of iterations) of MaxSAT solvers, but can improve the worst-case performance. Going beyond previously proposed preprocessing rules, we also propose and evaluate a MaxSAT-specific preprocessing technique called subsumed label elimination (SLE). We show that SLE is theoretically different from previously proposed MaxSAT preprocessing rules and that using SLE in conjunction with other preprocessing rules improves the empirical performance of several MaxSAT solvers.
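    For readers unfamiliar with the MaxSAT paradigm, a small sketch may help: hard clauses must be satisfied, and the objective is to minimize the total weight of falsified soft clauses. The brute-force solver below is illustrative only and is unrelated to the solvers and preprocessing techniques studied in the thesis.

```python
# Illustrative sketch: a brute-force weighted MaxSAT solver over clauses in a
# DIMACS-like form (literals are non-zero ints; -v means "v is false").
from itertools import product

def satisfied(clause, assignment):
    # A clause holds if at least one of its literals is true.
    return any(assignment[abs(l)] == (l > 0) for l in clause)

def maxsat(n_vars, hard, soft):
    best_cost, best = None, None
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if not all(satisfied(c, assignment) for c in hard):
            continue                     # hard clauses are mandatory
        cost = sum(w for w, c in soft if not satisfied(c, assignment))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, assignment
    return best_cost, best

# Hard: x1 or x2 must hold. Soft: prefer x1 false (weight 3), x2 false (weight 1).
hard = [(1, 2)]
soft = [(3, (-1,)), (1, (-2,))]
cost, model = maxsat(2, hard, soft)
print(cost)  # 1: x1 false, x2 true falsifies only the weight-1 soft clause
```

    Preprocessing, as studied in the thesis, re-encodes such instances into equivalent ones that real solvers handle faster; the optimum cost is preserved.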
    In the second part of the thesis we propose and evaluate new MaxSAT encodings for two important data analysis tasks: correlation clustering and bounded-treewidth Bayesian network learning. For both problems we empirically evaluate the resulting MaxSAT-based approach against other exact algorithms for the problems. We show that, on many benchmarks, the MaxSAT-based approach is faster and more memory-efficient than other exact approaches. For correlation clustering, we also show that the quality of solutions obtained using MaxSAT is often significantly higher than the quality of solutions obtained by approximate (inexact) algorithms. We end the thesis with a discussion highlighting possible further research directions. (Finnish abstract, translated:) Combinatorial optimization is a widely studied area of mathematics and computer science. In combinatorial optimization problems, a cost function defined over a discrete set of solutions determines the quality of each solution. The task is to find the best possible solution, according to the cost function, from the set of feasible solutions. For example, in the so-called travelling salesman problem, given a set of cities, the goal is to find the shortest possible route by which all of the cities can be visited. The travelling salesman problem, like many other combinatorial optimization problems, is computationally challenging, or more precisely, NP-hard. Challenging combinatorial optimization problems arise in many fields of science and industry; for example, several problems related to machine learning can be expressed as combinatorial optimization problems. The diversity of combinatorial optimization problems motivates the development of efficient solution algorithms. This dissertation develops declarative solution methods for NP-hard optimization problems.
    A declarative solution method assumes that the problem to be solved has a constraint model in some mathematical constraint language, which represents each instance of the problem as a set of mathematical constraints in such a way that an optimal solution of each constraint instance can be interpreted as an optimal solution of the original problem. In a declarative solution method, an instance of the optimization problem is solved by first encoding the instance as a set of constraints using the constraint model and then solving the constraint instance with a solution algorithm for the constraint language. This work uses propositional logic as the constraint language and focuses on the extension of the propositional satisfiability problem (SAT) to optimization problems; this problem is called MaxSAT. The work develops both general MaxSAT solution methods and MaxSAT models for particular optimization problems related to machine learning. The central contributions of the dissertation are presented in two parts. The first part develops MaxSAT solution methods, more specifically MaxSAT preprocessing methods. Preprocessing methods are efficiently computable inference rules (preprocessing rules) with which given MaxSAT instances can be simplified. The goal of preprocessing is to make MaxSAT instances easier to solve in practice. The dissertation: i) presents a way to integrate the central preprocessing rules of the propositional satisfiability problem into modern MaxSAT solution algorithms, ii) analyses the effect of preprocessing on the behaviour of solution algorithms, and iii) presents a new MaxSAT preprocessing rule. All contributions to MaxSAT preprocessing are analysed at both the theoretical and the experimental level. The second part develops MaxSAT models for two optimization problems related to machine learning: correlation clustering and the structure learning problem of Bayesian networks. The developed models are analysed both theoretically and experimentally.
    At the theoretical level, the models are proven correct. At the experimental level, it is shown that the models enable instances of the original problems to be solved efficiently in comparison with exact solution algorithms previously presented for these problems.

    Tools and Algorithms for the Construction and Analysis of Systems

    This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers from the affiliated competition SV-COMP and 1 competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems.

    Visualizing genetic transmission patterns in plant pedigrees.

    Ensuring food security in a world with an increasing population and growing demands on natural resources is becoming ever more pertinent. Plant breeders are using an increasingly diverse range of data types, such as phenotypic and genotypic data, to identify plant lines with desirable characteristics suitable to be taken forward in plant breeding programmes. These characteristics include a number of key morphological and physiological traits, such as disease resistance and yield, that need to be maintained and improved upon if a commercial plant variety is to be successful. The ability to predict and understand the inheritance of alleles that facilitate resistance to pathogens, or any other commercially important characteristic, is crucially important to experimental plant genetics and commercial plant breeding programmes. However, derivation of the inheritance of such traits by traditional molecular techniques is expensive and time consuming, even with recent developments in high-throughput technologies. This is especially true in industrial settings where, due to time constraints relating to growing seasons, many thousands of plant lines may need to be screened quickly, efficiently and economically every year. Thus, computational tools that provide the ability to integrate and visualize diverse data types alongside an associated plant pedigree structure will enable breeders to make more informed, and subsequently better, decisions about the plant lines that are used in crossings. This will help meet the demands both for increased yield and production and for adaptation to climate change. Traditional family-tree-style layouts are commonly used and simple to understand, but are unsuitable for the data densities that are now commonplace in large breeding programmes.
The size and complexity of plant pedigrees mean that there is a cognitive limitation in conceptualising large plant pedigree structures; novel techniques and tools are therefore required by geneticists and plant breeders to improve pedigree comprehension. Taking a user-centred, iterative approach to design, a pedigree visualization system was developed for exploring a large and unique set of experimental barley (H. vulgare) data. This work progressed from the development of a static pedigree visualization to interactive prototypes and finally the Helium pedigree visualization software. At each stage of the development process, user feedback, in the form of informal and more structured user evaluation by domain experts, guided the development lifecycle, with users’ concerns addressed and additional functionality added. Plant pedigrees are very different to those of humans and farmed animals, and consequently the development of the pedigree visualizations described in this work focussed on implementing currently accepted techniques used in pedigree visualization and adapting them to meet the specific demands of plant pedigrees. Helium includes techniques to address problems with user understanding identified through user testing; examples include difficulties arising where crosses between varieties are situated in different regions of the pedigree layout. There are good biological reasons why this happens, but testing has shown that it leads to problems with users’ comprehension of the relatedness of individuals in the pedigree. The inclusion of visual cues and the use of localised layouts have allowed complications like these to be reduced. Other examples include sizing nodes to show the frequency of usage of specific plant lines, which has been shown to act as a positional reference point for users, bringing a secondary level of structure to the pedigree layout.
The use of these novel techniques has allowed the classification of three main types of plant line, which have been coined principal, flanking, and terminal plant lines. This technique has also shown visually the most frequently used plant lines, which, while previously known from text records, were never quantified. Helium’s main contributions are two-fold. Firstly, it has taken visualization techniques used in traditional pedigrees and applied them to the domain of plant pedigrees, addressing problems with handling large experimental plant pedigrees. The scale, complexity, and diversity of data and the number of plant lines that Helium can handle exceed those of other currently available plant pedigree visualization tools. These techniques (including layout and phenotypic and genotypic encoding) have been adapted to deal with the differences between plant and human/mammalian pedigrees, taking account of problems such as the complexity of crosses and routine inbreeding. Secondly, the effectiveness of the visualizations has been verified through user testing with a group of 28 domain experts. The improvements have advanced user understanding of pedigrees and allowed a much greater density and scale of data to be visualized. User testing has shown that the implementation and extension of visualization techniques have improved user comprehension of plant pedigrees when users are asked to perform real-life tasks with barley datasets. Results have shown an increase in correct responses between the prototype interface and Helium, and a SUS analysis has shown a high acceptance rate for Helium.
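    One of the techniques described, sizing nodes by how often a plant line is used, can be sketched in a few lines; the crosses and the size mapping below are illustrative, not Helium's actual data or scale.

```python
# Small sketch: size pedigree nodes by how often a line appears as a parent,
# so heavily used lines stand out as positional reference points.
# The cross data and the size formula are illustrative only.
from collections import Counter

crosses = [("Maythorpe", "Irish Archer"), ("Maythorpe", "Plumage"),
           ("Golden Promise", "Maythorpe")]
usage = Counter(parent for cross in crosses for parent in cross)

# Map usage counts to a node radius: a base size plus a per-use increment.
sizes = {line: 6 + 4 * n for line, n in usage.items()}
print(sizes["Maythorpe"])  # 18: used in three crosses
```

    In a real tool the radii would be clamped and scaled to the layout, but the principle is the same: frequency of use becomes a visual weight.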

    Secure Time-Aware Provenance for Distributed Systems

    Operators of distributed systems often find themselves needing to answer forensic questions as part of a variety of management tasks, including fault detection, system debugging, accountability enforcement, and attack analysis. In this dissertation, we present Secure Time-Aware Provenance (STAP), a novel approach that provides the fundamental functionality required to answer such forensic questions: the capability to “explain” the existence (or change) of a certain distributed system state at a given time in a potentially adversarial environment. This dissertation makes the following contributions. First, we propose the STAP model, which explicitly represents time and state changes. The STAP model allows consistent and complete explanations of system state (and changes) in dynamic environments. Second, we show that it is both possible and practical to maintain and query provenance efficiently and scalably in a distributed fashion, where provenance maintenance and querying are modeled as recursive continuous queries over distributed relations. Third, we present security extensions that allow operators to reliably query provenance information in adversarial environments. Our extensions incorporate tamper-evident properties that guarantee eventual detection of compromised nodes that lie or falsely implicate correct nodes. Finally, the proposed research results in a proof-of-concept prototype, which includes a declarative query language for specifying a range of useful provenance queries, an interactive exploration tool, and a distributed provenance engine for operators to use in analyzing their distributed systems. We discuss the applicability of this tool in several use cases, including Internet routing, overlay routing, and cloud data processing.
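    The notion of "explaining" a state via provenance can be sketched simply: each derived tuple records the tuples and times it was derived from, and an explanation is a recursive traversal of those links. The routing example below is illustrative only and is not STAP's actual model or query language.

```python
# Hedged sketch of time-aware provenance: each derived tuple
# (relation, value, time) records its immediate derivation inputs, and
# "explain" recursively unfolds that record into a derivation tree.
prov = {
    ("route", "A->C", 20): [("route", "A->B", 10), ("link", "B->C", 15)],
    ("route", "A->B", 10): [("link", "A->B", 5)],
    ("link", "A->B", 5): [],    # base facts have no inputs
    ("link", "B->C", 15): [],
}

def explain(tup, depth=0):
    """Return the derivation tree of a tuple as indented text lines."""
    lines = ["  " * depth + f"{tup[0]}({tup[1]}) @t={tup[2]}"]
    for parent in prov[tup]:
        lines.extend(explain(parent, depth + 1))
    return lines

print("\n".join(explain(("route", "A->C", 20))))
```

    A secure variant would additionally authenticate each derivation link so that a compromised node cannot rewrite history undetected, which is the property STAP's tamper-evident extensions target.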