
    VerdictDB: Universalizing Approximate Query Processing

    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases---an unrealistic expectation given the infancy of the AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus, can work with all off-the-shelf engines. Operating at the driver-level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171× speedup (18.45× on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under Apache License.
    Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 2018.
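
    To make the driver-level rewriting concrete, the following is a minimal Python sketch of the kind of transformation such a middleware performs. It is not VerdictDB's actual implementation: the `_sample` table-naming convention, the fixed 1% sampling ratio, and the single SUM rewrite rule are all illustrative assumptions, whereas VerdictDB's real rewriter handles general analytical queries over several sample types.

```python
import re

# Illustrative assumption: a 1% uniform-random sample of each fact table
# has been pre-built under the name <table>_sample.
SAMPLE_RATIO = 0.01

def rewrite_sum(query: str) -> str:
    """Rewrite 'SELECT SUM(col) FROM t' to run against t_sample, scaling the
    result and returning the statistics a client needs to build a confidence
    interval. A toy stand-in for driver-level AQP query rewriting."""
    m = re.fullmatch(r"SELECT SUM\((\w+)\) FROM (\w+)", query.strip(), re.I)
    if m is None:
        return query  # pass unsupported queries through unchanged
    col, table = m.groups()
    return (
        f"SELECT SUM({col}) / {SAMPLE_RATIO} AS approx_sum, "
        f"COUNT(*) AS n, VAR_SAMP({col}) AS variance "
        f"FROM {table}_sample"
    )

print(rewrite_sum("SELECT SUM(price) FROM sales"))
```

    The key property, as the abstract notes, is that the rewritten query is plain SQL: any standard relational engine can execute it, and the middleware reconstructs the approximate answer and its error estimate from the returned result set.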

    Database Learning: Toward a Database that Becomes Smarter Every Time

    In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea---learning from past query answers---Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.
    Comment: This manuscript is an extended report of the work published in the ACM SIGMOD conference, 2017.
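
    The core intuition (a past answer and a new answer stem from the same underlying distribution, so combining them analytically can beat reading more raw data) can be illustrated with a toy inverse-variance blend. This is a simplified sketch, not Verdict's actual maximum-entropy inference, which handles many correlated, overlapping aggregates at once; the numbers below are made up.

```python
def combine(estimate, variance, prior, prior_variance):
    """Blend a fresh sample-based estimate with a prior implied by past query
    answers, weighting each by the inverse of its variance. Under a Gaussian
    model this is the minimum-variance unbiased combination."""
    w = prior_variance / (variance + prior_variance)
    blended = w * estimate + (1 - w) * prior
    blended_var = (variance * prior_variance) / (variance + prior_variance)
    return blended, blended_var

# A new sample says AVG(revenue) ~ 102 with variance 25, while past answers
# over overlapping data imply ~ 100 with variance 4.
print(combine(102.0, 25.0, 100.0, 4.0))  # ~ (100.28, 3.45): tighter than either
```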

    Performance prediction for set similarity joins


    Stochastic Consensus-based Control of μGs with Communication Delays and Noises


    BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

    The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.
    Comment: 22 pages, SIGMOD 2019.
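
    As a rough illustration of the sample-then-verify idea, here is a toy Python sketch that trains on a growing sample and stops once an empirical agreement check passes. It is emphatically not BlinkML's method, which bounds the approximate model's disagreement with the full model analytically via the asymptotic covariance of the MLE rather than by retraining and probing; the function name, thresholds, and synthetic data are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sample_train(X, y, eps=0.01, n0=1000, seed=0):
    """Toy quality-checked sampling loop: grow the sample until two models
    trained on disjoint halves of it agree on at least (1 - eps) of probe
    points, then accept. BlinkML instead computes analytically how likely
    the sampled MLE model is to match the full model's predictions."""
    rng = np.random.default_rng(seed)
    n = n0
    while True:
        idx = rng.choice(len(X), size=min(n, len(X)), replace=False)
        half = len(idx) // 2
        m1 = LogisticRegression(max_iter=1000).fit(X[idx[:half]], y[idx[:half]])
        m2 = LogisticRegression(max_iter=1000).fit(X[idx[half:]], y[idx[half:]])
        probe = X[rng.choice(len(X), size=2000)]
        disagree = float(np.mean(m1.predict(probe) != m2.predict(probe)))
        if disagree <= eps or n >= len(X):
            return m1, len(idx), disagree
        n *= 2  # double the sample and try again

# Synthetic demo: 100k rows, but a few thousand usually suffice.
X = np.random.default_rng(1).normal(size=(100_000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model, n_used, disagreement = sample_train(X, y)
print(n_used, disagreement)
```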

    CRISPR-associated (Cas) effectors delivery via microfluidic cell-deformation chip

    Identifying new, more precise technologies for selectively modifying and manipulating specific genes has provided a powerful tool for characterizing gene functions in basic research and for potential therapeutic genome regulation. The rapid development of nuclease-based techniques such as CRISPR/Cas systems has opened up new possibilities in genome engineering and medicine. Appropriate delivery procedures for CRISPR/Cas systems are equally critical, and a large number of previous reviews have focused on delivery methods for CRISPR/Cas9, Cas12, and Cas13. Still, despite all efforts, the in vivo delivery of Cas gene systems remains challenging. The transfection of CRISPR components is often inefficient with conventional delivery tools, including viral elements and chemical vectors, because of their restricted packaging size and poor performance in some cell types. Physical methods such as microfluidic systems are therefore more applicable for in vitro delivery. This review focuses on recent advances in microfluidic systems for delivering CRISPR/Cas systems in clinical and therapeutic investigations.

    Fullerene: biomedical engineers get to revisit an old friend

    In 1985, the serendipitous discovery of fullerene launched carbon-structure research into the world of symmetric nanomaterials. Robert F. Curl, Harold W. Kroto, and Richard E. Smalley were subsequently awarded the Nobel Prize in Chemistry for their discovery of buckminsterfullerene (C60, with a cage-like fused-ring structure). Fullerene, the first symmetric nanostructure in the carbon nanomaterials family, opened up new perspectives in the field, leading to the discovery of other symmetric carbon nanomaterials such as carbon nanotubes and two-dimensional graphene. These newer materials have put fullerenes in the shade, even though fullerene, as one of the most symmetrical molecules known, has remarkable properties that deserve continued attention in nanomaterials studies. The buckyball's unique structure consists of sp2 carbons forming a highly symmetric cage that comes in different sizes (C60, C70, and so on); the most abundant is C60, which possesses 60 carbon atoms. The combination of unique properties in this molecule extends its applications to diverse areas of science, especially those related to biomedical engineering. This review is intended to be of broad interest to the biomedical engineering community, offering a substantial overview of the most recent advances in biomedical applications of fullerenes that have not been exhaustively and critically reviewed in the past few years.

    Using Crowdsourcing for Fine-Grained Entity Type Completion in Knowledge Bases

    Recent years have witnessed the proliferation of large-scale Knowledge Bases (KBs). However, many entities in KBs have incomplete type information, and some are totally untyped. Even worse, fine-grained types (e.g., BasketballPlayer), which carry rich semantic meaning, are more likely to be incomplete because they are harder to obtain. Existing machine-based algorithms use predicates (e.g., birthPlace) of entities to infer their missing types, but these predicates are often insufficient to infer fine-grained types. In this paper, we utilize crowdsourcing to solve the problem and address the challenge of controlling crowdsourcing cost. To this end, we propose a hybrid machine-crowdsourcing approach for fine-grained entity type completion. It first determines the types of some “representative” entities via crowdsourcing and then infers the types of the remaining entities based on the crowdsourcing results. To support this approach, we first propose an embedding-based influence for type inference that considers not only the distance between entity embeddings but also the distances between entity and type embeddings. Second, we propose a new difficulty model for entity selection that better captures the uncertainty of the machine algorithm when identifying entity types. We demonstrate the effectiveness of our approach through experiments on real crowdsourcing platforms. The results show that our method outperforms state-of-the-art algorithms by improving the effectiveness of fine-grained type completion at an affordable crowdsourcing cost.
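
    To sketch what an embedding-based inference step might look like, the toy function below scores candidate types for an unlabeled entity by combining its similarity to crowdsourced representative entities with its similarity to type embeddings. The cosine measure, the max-aggregation over representatives, and the alpha weighting are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def type_scores(entity_vec, rep_vecs, rep_types, type_vecs, alpha=0.5):
    """Score each candidate type by (i) the entity's similarity to the most
    similar crowdsourced representative carrying that type and (ii) the
    entity's similarity to the type's own embedding."""
    rep_sim = {}
    for vec, t in zip(rep_vecs, rep_types):
        rep_sim[t] = max(rep_sim.get(t, -1.0), cos(vec, entity_vec))
    scores = {t: alpha * rep_sim.get(t, 0.0) + (1 - alpha) * cos(tvec, entity_vec)
              for t, tvec in type_vecs.items()}
    return max(scores, key=scores.get), scores

# Toy usage: 3-d embeddings, two crowdsourced representatives, two types.
e = np.array([1.0, 0.2, 0.0])
reps = [np.array([0.9, 0.1, 0.0]), np.array([0.0, 1.0, 0.5])]
best, _ = type_scores(
    e, reps, ["BasketballPlayer", "Politician"],
    {"BasketballPlayer": np.array([1.0, 0.0, 0.1]),
     "Politician": np.array([0.0, 0.8, 0.6])})
print(best)  # -> BasketballPlayer
```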