57 research outputs found

    DBEst : revisiting approximate query processing engines with machine learning models

    In the era of big data, computing exact answers to analytical queries becomes prohibitively expensive. This greatly increases the value of approaches that can efficiently compute approximate, but highly accurate, answers to analytical queries. Alas, the state of the art still suffers from many shortcomings: errors are still high unless large memory investments are made; many important analytics tasks are not supported; and query response times are too long, so approaches rely on parallel execution of queries atop large big data analytics clusters, in-situ or in the cloud, whose acquisition and use are costly. Hence, the following questions are crucial: Can we develop AQP engines that reduce response times by orders of magnitude, ensure high accuracy, and support most aggregate functions? With smaller memory footprints and small overheads to build the state upon which they are based? With this paper, we show that the answers to all questions above can be positive. The paper presents DBEst, a system based on Machine Learning models (regression models and probability density estimators). It will discuss its limitations, promises, and how it can complement existing systems. It will substantiate its advantages using queries and data from the TPC-DS benchmark and real-life datasets, compared against state-of-the-art AQP engines.

    Approximate query processing using machine learning

    In the era of big data, the volume of collected data grows faster than computational power, and it becomes prohibitively expensive to compute exact answers to analytical queries. This greatly increases the value of approaches that can efficiently compute approximate, but highly accurate, answers to analytical queries. Approximate query processing (AQP) aims to reduce query latency and memory footprint at the cost of small quality losses. Previous efforts on AQP largely rely on samples or sketches. However, trade-offs between query response time (or memory footprint) and accuracy are unavoidable. Specifically, to guarantee higher accuracy, a large sample is usually generated and maintained, which leads to increased query response time and space overheads. In this thesis, we aim to overcome the drawbacks of current AQP solutions by applying machine learning models. Instead of accessing data (or samples of it), models are used to make predictions. Our model-based AQP solutions are developed and improved in three stages, described as follows: 1. We first investigate potential regression models for AQP and propose query-centric regression, coined QReg. QReg is an ensemble method based on regression models. It achieves better accuracy than the state-of-the-art regression models and overcomes the generalization-overfit dilemma when employing machine learning models within DBMSs. 2. We introduce the first AQP engine, DBEst, based on classical machine learning models. Specifically, regression models and density estimators are trained over the data/samples, and are further combined to produce the final approximate answers. 3. We further improve DBEst by replacing classical machine learning models with deep learning networks and word embedding. This overcomes the drawbacks of queries with large groups, and query response time and space overheads are further reduced. We conduct experiments against the state-of-the-art AQP engines over various datasets, and show that our method achieves better accuracy while offering orders of magnitude savings in space overheads and query response time.

    Sirtuin 6 maintains epithelial STAT6 activity to support intestinal tuft cell development and type 2 immunity

    Dynamic regulation of intestinal epithelial cell (IEC) differentiation is crucial for both homeostasis and the response to helminth infection. SIRT6 belongs to the NAD+-dependent deacetylases and has established diverse roles in aging, metabolism and disease. Here, we report that IEC Sirt6 deletion leads to impaired tuft cell development and type 2 immunity in response to helminth infection, thereby resulting in compromised worm expulsion. Conversely, after helminth infection, IEC SIRT6 transgenic mice exhibit an enhanced epithelial remodeling process and more efficient worm clearance. Mechanistically, Sirt6 ablation causes elevated Socs3 expression, and subsequently attenuated tyrosine 641 phosphorylation of STAT6 in IECs. Notably, intestinal epithelial overexpression of constitutively activated STAT6 (STAT6vt) in mice is sufficient to induce the expansion of the tuft and goblet cell lineage. Furthermore, epithelial STAT6vt overexpression remarkably reverses the defects in intestinal epithelial remodeling caused by Sirt6 ablation. Our results reveal a novel function of SIRT6 in regulating intestinal epithelial remodeling and mucosal type 2 immunity in response to helminth infection.

    Query-centric regression

    Regression Models (RMs), and Machine Learning (ML) models in general, aim to offer high prediction accuracy, even for unforeseen queries/datasets. This depends on their fundamental ability to generalize. However, overfitting a model, with respect to the current DB state, may be best suited to offer excellent accuracy. This overfit-generalize divide bears many practical implications faced by a data analyst. The paper will reveal, shed light on, and quantify this divide using a large number of real-world datasets and a large number of RMs. It will show that different RMs occupy different positions in this divide, which results in different RMs being better suited to answer queries on different parts of the same dataset (as queries typically target specific data subspaces defined using selection operators on attributes). It will study in detail 8 real-life data sets and from the TPC-DS benchmark and experiment with various dimensionalities therein. It will employ new appropriate metrics that will reveal the performance differences of RMs and will substantiate the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.

    Complete mitochondrial genome sequence of Pseudecheneis Sulcata in the Yarlung Zangbo River, Tibet

    Pseudecheneis sulcata belongs to the genus Pseudecheneis of the family Sisoridae, which is mainly distributed in India and Tibet, China; the species is found in Motuo and Chayu in the lower reaches of the Yarlung Zangbo River in Tibet. In the present study, we obtained the complete mitochondrial genome sequence of Pseudecheneis sulcata, which was 16,535 bp in length. This genome consisted of 13 protein-coding genes, 22 tRNA genes, 2 rRNA genes and a non-coding control region. The protein-coding genes have three start codons (GTG, ATG, and CTA) and four stop codons, including three complete stop codons and one incomplete stop codon. The accuracy and utility of the newly determined mitogenome sequence are verified by constructing a phylogenetic tree of related species, and we expect to use the full mitochondrial genome sequence to interpret related evolutionary events.