
    Cutset Sampling for Bayesian Networks

    The paper presents a new sampling methodology for Bayesian networks that samples only a subset of variables and applies exact inference to the rest. Cutset sampling is a network-structure-exploiting application of the Rao-Blackwellisation principle to sampling in Bayesian networks. It improves convergence by exploiting memory-based inference algorithms, and it can also be viewed as an anytime approximation of the exact cutset-conditioning algorithm developed by Pearl. Cutset sampling can be implemented efficiently when the sampled variables constitute a loop-cutset of the Bayesian network and, more generally, when the induced width of the network's graph conditioned on the observed sampled variables is bounded by a constant w. We demonstrate empirically the benefit of this scheme on a range of benchmarks.
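    The Rao-Blackwellisation idea above can be sketched in a few lines: sample only a cutset variable and handle the remaining variables by exact inference. The toy chain network A -> B -> C below, with made-up probability tables and likelihood weighting in place of the paper's sampling schemes, is purely illustrative and is not the paper's algorithm.

```python
import random

# Toy chain A -> B -> C, all binary; probability tables are made up.
p_a1 = 0.6                       # P(A=1)
p_b1 = {0: 0.2, 1: 0.7}          # P(B=1 | A=a)
p_c1 = {0: 0.1, 1: 0.8}          # P(C=1 | B=b)

def estimate_p_b1_given_c1(n=100_000, seed=0):
    """Rao-Blackwellised likelihood weighting: sample only the cutset {A};
    B is handled by exact inference given each sample and evidence C=1."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        a = 1 if rng.random() < p_a1 else 0       # sample the cutset only
        b1 = p_b1[a]
        w = b1 * p_c1[1] + (1 - b1) * p_c1[0]     # P(C=1 | A=a), exact
        post = b1 * p_c1[1] / w                   # P(B=1 | A=a, C=1), exact
        num += w * post
        den += w
    return num / den
```

    Because each sample contributes an exact posterior for B rather than a sampled value of B, the estimator's variance cannot exceed that of plain likelihood weighting over all variables.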

    USING THE MULTI-STRING BURROWS-WHEELER TRANSFORM FOR HIGH-THROUGHPUT SEQUENCE ANALYSIS

    The throughput of sequencing technologies has created a bottleneck in which raw sequence files are stored on disk in an un-indexed format. Alignment to a reference genome is the most common pre-processing method for indexing this data, but alignment requires a priori knowledge of a reference sequence and often loses a significant amount of sequencing data due to biases. Sequencing data can instead be stored in a lossless, compressed, indexed format using the multi-string Burrows-Wheeler Transform (BWT). This dissertation introduces three algorithms that enable faster construction of the BWT for sequencing datasets. The first two are a merge algorithm for merging two or more BWTs into a single BWT and a merge-based divide-and-conquer algorithm that constructs a BWT from any sequencing dataset. The third is an induced-sorting algorithm that constructs the BWT from any string collection and is well suited for building BWTs of long-read sequencing datasets. These algorithms are evaluated based on their efficiency and utility in constructing BWTs of different types of sequencing data. This dissertation also introduces two applications of the BWT: long-read error correction and a set of biologically motivated sequence search tools. The long-read error correction is evaluated based on the accuracy and efficiency of the correction. Our analyses show that the BWT of almost all sequencing datasets can now be efficiently constructed. Once constructed, the BWT offers significant utility in performing fast searches as well as fast and accurate long-read corrections. Additionally, we highlight several use cases of the BWT-based web tools in answering biologically motivated problems.
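    For intuition, a multi-string BWT can be built naively by sorting all rotations of the strings, with each string's end-marker ranked by string index. This quadratic sketch is only illustrative; the dissertation's merge and induced-sorting algorithms exist precisely because approaches like this do not scale to sequencing datasets.

```python
def msbwt(strings):
    """Naive multi-string BWT: give each string its own end-marker '$',
    with markers ordered by string index, sort all rotations, and take
    the last character of each rotation (quadratic; illustration only)."""
    rots = []
    for i, s in enumerate(strings):
        t = s + "$"
        # sort key: this string's '$' sorts below all letters, and '$' of
        # string i sorts below '$' of string i+1
        key = lambda r, i=i: [(0, i) if c == "$" else (1, ord(c)) for c in r]
        for j in range(len(t)):
            rot = t[j:] + t[:j]
            rots.append((key(rot), rot[-1]))
    rots.sort()
    return "".join(last for _, last in rots)
```

    On a single string this reduces to the ordinary BWT, e.g. msbwt(["ACA"]) yields the BWT of "ACA$".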

    Exploiting multilingual lexical resources to predict MWE compositionality

    Semantic idiomaticity is the extent to which the meaning of a multiword expression (MWE) cannot be predicted from the meanings of its component words. Much work on semantic idiomaticity in natural language processing has focused on compositionality prediction, wherein a binary or continuous-valued compositionality score is predicted for an MWE as a whole, or for its individual component words. One source of information for making compositionality predictions is the translation of an MWE into other languages. This chapter extends two previously presented studies, Salehi & Cook (2013) and Salehi et al. (2014), that propose methods for predicting compositionality that exploit translation information provided by multilingual lexical resources and are applicable to many kinds of MWEs in a wide range of languages. These methods make use of the distributional similarity of an MWE and its component words under translation into many languages, as well as string similarity measures applied to definitions of translations of an MWE and its component words. We evaluate these methods over English noun compounds, English verb-particle constructions, and German noun compounds. We show that the estimation of compositionality is improved when using translations into multiple languages, as compared to simply using distributional similarity in the source language. We further find that string similarity complements distributional similarity.
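    As a toy illustration of the string-similarity signal under translation, one can compare an MWE's translation against the concatenated translations of its components across languages. The data, the similarity function, and the averaging scheme below are invented for the example and are not the chapter's method.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """A string similarity ratio in [0, 1] (one of many possible measures)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def compositionality(mwe_translations, component_translations):
    """Average, over languages, of the string similarity between the MWE's
    translation and the concatenated translations of its components.
    Compositional MWEs should score higher than idiomatic ones."""
    scores = []
    for lang, mwe_t in mwe_translations.items():
        comp_t = " ".join(t[lang] for t in component_translations)
        scores.append(string_sim(mwe_t, comp_t))
    return sum(scores) / len(scores)
```

    For instance, a compositional compound like "public service" translates into German as "öffentlicher Dienst", which closely matches the translations of its parts, while an idiom like "kick the bucket" translates as "sterben", which matches the parts' translations poorly.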

    Computing all-vs-all MEMs in grammar-compressed text

    We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among strings of a repetitive collection T. The key concept in our work is the construction of a fully-balanced grammar G from T that meets a property we call fix-free: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of T incrementally over G using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how we can build G from T in linear time and space. We also demonstrate that our MEM algorithm runs on top of G in O(G + occ) time and uses O(log G (G + occ)) bits, where G is the grammar size and occ is the number of MEMs in T. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
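    For reference, MEMs themselves can be defined by a quadratic seed-and-extend sketch over plain strings. The paper's contribution is computing them over the grammar without decompressing nonterminals, which this illustration makes no attempt at.

```python
def mems(s, t, k=3):
    """Maximal exact matches of length >= k between s and t, as triples
    (start in s, start in t, length). Quadratic seed-and-extend sketch."""
    found = set()
    for i in range(len(s) - k + 1):
        for j in range(len(t) - k + 1):
            if s[i:i + k] != t[j:j + k]:
                continue
            # extend the seed left and right while the characters agree
            a, b = i, j
            while a > 0 and b > 0 and s[a - 1] == t[b - 1]:
                a, b = a - 1, b - 1
            e, f = i + k, j + k
            while e < len(s) and f < len(t) and s[e] == t[f]:
                e, f = e + 1, f + 1
            # the extended match cannot be lengthened on either side,
            # so it is maximal; the set collapses duplicate seeds
            found.add((a, b, e - a))
    return sorted(found)
```

    For example, "banana" and "anaconda" share the maximal exact match "ana" at two positions of the first string.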

    Communication avoiding parallel algorithms for amorphous problems

    Parallelizing large problems on parallel systems has always been a challenge for programmers. The difficulty stems from the complexity of existing systems as well as of the target problems, and it is becoming a greater issue as data sizes grow and ever-larger parallel systems are required. Graph algorithms, machine learning problems, and bioinformatics methods are among these ever-growing problems. These problems are amorphous, meaning that memory accesses are unpredictable and the application usually has poor locality. Synchronization in these problems is therefore especially costly, since all-to-all communication is required, and delivering an efficient parallel algorithm becomes more challenging. Another difficulty is that the amount of parallelism in these problems is limited, which naturally makes them hard to parallelize; this is due to complicated data dependences among the data elements in the algorithm. Writing parallel algorithms for these problems is also especially difficult, because an amorphous problem can be expressed in several dramatically different ways: the data dependences are statically unknown, so many distinct parallel approaches exist for a single problem, and programming each approach requires starting from scratch, which is time consuming. This thesis introduces several ways to avoid costly communication in amorphous problems by trading extra computation for it: we increase the total amount of work done by the processors in order to avoid synchronization. This is especially effective in large clusters, which offer massive computing power but very costly communication. These approaches clearly trade computation against communication, and in this thesis we study these trade-offs as well.
    We also propose a new language for expressing the proposed algorithms, which overcomes the programming difficulty of these problems by providing tunable performance parameters.
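    The computation-for-communication trade can be illustrated with a 1-D stencil, a regular problem and thus much simpler than the amorphous ones the thesis targets: each partition copies a halo as wide as the number of steps once, then iterates with no further exchange, redundantly recomputing halo cells instead of synchronizing each step. The code is an illustrative sketch, not from the thesis.

```python
def stencil_step(a):
    """One step of a 3-point averaging stencil with fixed endpoints."""
    return [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3
                     for i in range(1, len(a) - 1)] + [a[-1]]

def run_partitioned(data, size, steps):
    """Each partition of `size` cells copies a halo of width `steps` ONCE,
    then advances `steps` iterations locally. Halo cells are recomputed
    redundantly by neighbouring partitions: extra work, no per-step exchange."""
    n, out = len(data), []
    for start in range(0, n, size):
        end = min(start + size, n)
        lo, hi = max(0, start - steps), min(n, end + steps)
        chunk = data[lo:hi]                        # owned cells plus halo
        for _ in range(steps):
            chunk = stencil_step(chunk)            # no communication here
        out.extend(chunk[start - lo:start - lo + (end - start)])
    return out
```

    After s steps, only cells within distance s of an artificial partition edge are stale, so a halo of width `steps` leaves every owned cell identical to the globally computed result, at the cost of recomputing each halo cell in two partitions.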

    Lagrangian coherent structures and trajectory similarity: two important tools for scientific visualization

    This thesis studies the computation and visualization of Lagrangian coherent structures (LCS), an emerging technique for analyzing time-varying velocity fields (e.g. blood vessels and airflows), and measures of similarity for trajectories (e.g. hurricane paths). LCS surfaces and trajectory-based techniques (e.g. trajectory clustering) are complementary for visualization, while velocity fields and trajectories are two important types of scientific data that are increasingly accessible thanks to advances in both data collection and numerical simulation. A key step in LCS computation is tracing the paths of collections of particles through a flow field. When a flow field is interpolated from the nodes of an unstructured mesh, advecting a particle requires first finding which cell of the unstructured mesh contains it. Since the paths of nearby particles often diverge, parallelizing particle advection quickly leads to incoherent memory accesses of the unstructured mesh. We have developed a new block-advection GPU approach that reorganizes particles into spatially coherent bundles as they follow their advection paths, which greatly improves memory coherence and thus shared-memory GPU performance. This approach works best for flows that meet the CFL criterion on unstructured meshes of uniformly sized elements, small enough to fit at least two timesteps in GPU memory. LCS surfaces provide insight into unsteady fluid flow, but their construction has posed many challenges. These structures can be characterized as ridges of a field, but their local definition relies on an eigenvector direction that can point one of two ways, and this ambiguity can lead to noise and other problems. We overcome these issues by applying a global ridge definition via the hierarchical watershed transformation.
    We show results on a mathematical flow model and a simulated vascular flow dataset indicating that the watershed method produces less noisy structures. Trajectory similarity has been shown to be a powerful tool for visualizing and analyzing trajectories. In this thesis we propose a novel measure of trajectory similarity using both spatial and directional information. The similarity is asymmetric, bounded within [0,1], affine-invariant, and efficiently computed. Asymmetric mappings between a pair of trajectories can be derived from this similarity. Experimental results demonstrate that the measure outperforms existing measures in both similarity scores and trajectory mappings. The measure also inspires a simple similarity-based clustering method for effectively visualizing large numbers of trajectories, which outperforms the state-of-the-art model-based clustering method (VFKM).
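    To illustrate just the asymmetry and the [0,1] bound of such a measure, here is a toy similarity: the mean, over points of one trajectory, of exp(-distance to the nearest point of the other). Unlike the measure proposed in the thesis, this sketch ignores direction and is not affine-invariant.

```python
import math

def traj_sim(a, b):
    """Asymmetric similarity in [0, 1] between point-sequence trajectories:
    mean over points of `a` of exp(-nearest-neighbour distance to `b`).
    A short trajectory lying on a long one scores 1.0 against it, but
    not vice versa, so traj_sim(a, b) != traj_sim(b, a) in general."""
    return sum(math.exp(-min(math.dist(p, q) for q in b)) for p in a) / len(a)
```

    The asymmetry is useful for mapping: a high traj_sim(a, b) with a low traj_sim(b, a) says a is (approximately) a sub-trajectory of b.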

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany
