305 research outputs found

    A Cheat Sheet for Bayesian Prediction

    This paper reviews the growing field of Bayesian prediction. Bayes point and interval prediction are defined, exemplified, and situated within statistical prediction more generally. Four general approaches to Bayes prediction are then defined, and we turn to predictor selection. Selection can be done predictively or non-predictively, and predictors can be based on single models or on multiple models; we call these unitary predictors and model average predictors, respectively. We then turn to the most recent aspect of prediction to emerge, namely prediction in the context of large observational data sets, and discuss three further classes of techniques. We conclude with a summary and a statement of several current open problems. (Comment: 33 pages.)
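
    For orientation, both Bayes point and interval prediction are built from the posterior predictive distribution; in standard notation (the textbook definition, not a formula quoted from this particular paper),

        p(y^{*} \mid y) \;=\; \int p(y^{*} \mid \theta)\, p(\theta \mid y)\, d\theta ,

    a Bayes point predictor summarises this density (for example by its mean or median), and a prediction interval is a region carrying a prescribed amount of its probability.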

    Suffix Tree, Minwise Hashing and Streaming Algorithms for Big Data Analysis in Bioinformatics

    In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise hashing (minhash) and locality-sensitive hashing (LSH). Streaming models are useful for large-data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for the conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for the identification of longer CNSs ( 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity; combined with LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust), which also uses minhash and LSH techniques, was developed for an isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing; because isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve assembly accuracy using ensemble approaches. First, we performed a comprehensive performance analysis of different transcriptome assemblers using simulated benchmark datasets. Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform clustering using the minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods. Adviser: Jitender S. Deogu
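
    To make the minhash idea behind MinCNE and MinIsoClust concrete, the following is a minimal sketch of Jaccard estimation from minhash signatures (an illustrative toy rather than the dissertation's code; the k-mer length, number of hash functions, and example sequences are arbitrary choices):

        import random

        def kmers(seq, k=15):
            """All length-k substrings (k-mers) of a sequence."""
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        def minhash_signature(items, num_hashes=64, seed=0):
            """For each of num_hashes salted hash functions, keep the minimum
            hash value over the item set; the list of minima is the signature."""
            rng = random.Random(seed)
            salts = [rng.getrandbits(64) for _ in range(num_hashes)]
            return [min(hash((salt, x)) for x in items) for salt in salts]

        def estimate_jaccard(sig_a, sig_b):
            """The fraction of matching signature positions estimates the
            Jaccard similarity of the underlying k-mer sets."""
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        # Two isoform-like sequences sharing most of their content should yield
        # signatures that agree in most positions.
        s1 = "ACGTACGTGGGTTTACGTAAACCCGGGTTTAAACCC" * 3
        s2 = "ACGTACGTGGGTTTACGTAAACCCGGGTTTAAACCG" * 3
        print(estimate_jaccard(minhash_signature(kmers(s1)), minhash_signature(kmers(s2))))

    Clustering then amounts to grouping sequences whose signatures agree in enough positions, which LSH banding makes possible without comparing all pairs.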

    A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge

    A vector database is used to store high-dimensional data that cannot be characterized by a traditional DBMS. Although there are not many articles describing existing or introducing new vector database architectures, the approximate nearest neighbor search (ANNS) problem behind vector databases has been studied for a long time, and a considerable number of related algorithmic articles can be found in the literature. This article attempts to comprehensively review the relevant algorithms to provide a general understanding of this booming research area. Our framework categorises these studies by their approach to solving the ANNS problem: hash-based, tree-based, graph-based and quantization-based approaches. We then present an overview of existing challenges for vector databases. Lastly, we sketch how vector databases can be combined with large language models and provide new possibilities.
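
    As a concrete instance of the hash-based family, the following is a minimal random-hyperplane LSH sketch for approximate nearest neighbor search under cosine similarity (an illustrative toy, not drawn from the survey; the dimensionality, number of hyperplanes, and dataset size are arbitrary):

        import numpy as np

        rng = np.random.default_rng(0)
        dim, n_planes = 128, 16
        planes = rng.normal(size=(n_planes, dim))      # random hyperplanes

        def lsh_key(v):
            """Sign pattern of v against the hyperplanes -> hash-bucket key."""
            return tuple((planes @ v > 0).astype(int))

        # Build the index: bucket every stored vector by its LSH key.
        vectors = rng.normal(size=(10_000, dim))
        buckets = {}
        for i, v in enumerate(vectors):
            buckets.setdefault(lsh_key(v), []).append(i)

        def approx_nearest(q):
            """Scan only the query's bucket (all vectors if the bucket is empty)
            and return the index of the best candidate by cosine similarity."""
            candidates = buckets.get(lsh_key(q), range(len(vectors)))
            return max(candidates, key=lambda i: vectors[i] @ q / np.linalg.norm(vectors[i]))

        print(approx_nearest(rng.normal(size=dim)))

    In practice several hash tables with independent hyperplanes are used to trade memory for recall; tree-, graph- and quantization-based methods strike different balances between build time, query latency and accuracy.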

    Outlier Identification in Spatio-Temporal Processes

    This dissertation answers some of the statistical challenges arising in spatio-temporal data from Internet traffic, electricity grids and climate models. It begins with methodological contributions to the problem of anomaly detection in communication networks. Using electricity consumption patterns for the University of Michigan campus, the well-known spatial prediction method kriging has been adapted for the identification of false data injections into the system. Events like Distributed Denial of Service (DDoS) attacks, botnet/malware attacks, and port scanning call for methods which can identify unusual activity in Internet traffic patterns. Storing information on the entire network, though feasible, cannot be done at the time scale at which data arrives. In this work, hashing techniques which can produce summary statistics for the network have been used. The hashed data so obtained preserves the heavy-tailed nature of traffic payloads, thereby providing a platform for the application of extreme value theory (EVT) to identify heavy hitters in volumetric attacks. These EVT-based methods require the estimation of the tail index of a heavy-tailed distribution. Traditional estimators such as the Hill estimator (Hill, 1975) tend to be biased in the presence of outliers. To circumvent this issue, a trimmed version of the classic Hill estimator has been proposed and studied from a theoretical perspective. For the Pareto domain of attraction, the optimality and asymptotic normality of the estimator have been established. Additionally, a data-driven strategy to detect the number of extreme outliers in heavy-tailed data has also been presented. The dissertation concludes with the statistical formulation of m-year return levels of extreme climatic events (heat/cold waves). The Generalized Pareto distribution (GPD) serves as a good fit for modeling peaks over threshold of a distribution. Allowing the parameters of the GPD to vary as a function of covariates such as time of the year, El Niño and location in the US, extremes of the areal impact of heat waves have been well modeled and inferred. PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145789/1/shrijita_1.pd
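
    For reference, the classic Hill estimator that the trimmed variant builds on can be sketched in a few lines (the standard estimator only, with an arbitrary choice of k; the dissertation's trimmed version and its theory are not reproduced here):

        import numpy as np

        def hill_estimator(data, k):
            """Classic Hill (1975) estimator of the tail index alpha, based on the
            k largest order statistics relative to the (k+1)-th largest one."""
            x = np.sort(np.asarray(data, dtype=float))
            tail, threshold = x[-k:], x[-k - 1]
            gamma = np.mean(np.log(tail / threshold))   # extreme-value index, i.e. 1/alpha
            return 1.0 / gamma

        # Example: Pareto(alpha = 2) samples via inverse-CDF sampling; the estimate
        # should be close to 2 for a reasonable choice of k.
        rng = np.random.default_rng(1)
        samples = rng.uniform(size=100_000) ** -0.5
        print(hill_estimator(samples, k=500))

    Roughly speaking, the trimmed variant leaves out the very largest order statistics before averaging, which is what limits the influence of outliers on the estimate.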