4 research outputs found

    Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

    Finding joinable tables in data lakes is a key procedure in many applications, such as data integration, data augmentation, data analysis, and data markets. Traditional approaches that find equi-joinable tables cannot handle misspellings or differing formats, nor do they capture semantic joins. In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. We embed textual values as high-dimensional vectors and join columns under similarity predicates on those vectors, thereby addressing the limitations of equi-join approaches and identifying more meaningful results. To efficiently find joinable tables with similarity, we propose a block-and-verify method that utilizes pivot-based filtering. A partitioning technique is developed to cope with the case where the data lake is large and the index cannot fit in main memory. An experimental evaluation on real datasets shows that our solution identifies substantially more tables than equi-joins and outperforms other similarity-based options, and that the join results are useful in data enrichment for machine learning tasks. The experiments also demonstrate the efficiency of the proposed method. Comment: Full version of paper in ICDE 202
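The abstract names pivot-based filtering inside a block-and-verify scheme but gives no details. The sketch below illustrates only the general idea behind pivot filtering in a metric space, not PEXESO's actual algorithm: distances from every vector to a few pivots are precomputed, and the triangle inequality then gives a lower bound on the true distance that lets many candidates be discarded without an exact comparison. The Euclidean metric, the threshold `tau`, and all function names are assumptions made for illustration.

```python
import math

def build_pivot_table(vectors, pivots):
    """Precompute the distance from every vector to every pivot."""
    return [[math.dist(v, p) for p in pivots] for v in vectors]

def similar_vectors(query, vectors, pivots, pivot_table, tau):
    """Return (indices of vectors within tau of query, number of exact checks).

    Filter step: by the triangle inequality, |d(q, p) - d(x, p)| <= d(q, x)
    for any pivot p, so a lower bound above tau rules the candidate out.
    Verify step: surviving candidates are checked with the exact distance.
    """
    query_to_pivots = [math.dist(query, p) for p in pivots]
    matches, verified = [], 0
    for idx, vec in enumerate(vectors):
        lower_bound = max(
            abs(qd - vd) for qd, vd in zip(query_to_pivots, pivot_table[idx])
        )
        if lower_bound > tau:
            continue  # pruned without computing the exact distance
        verified += 1
        if math.dist(query, vec) <= tau:
            matches.append(idx)
    return matches, verified

# Hypothetical toy data: 2-D points standing in for value embeddings.
vectors = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.5, 0.5)]
pivots = [(0.0, 0.0), (10.0, 0.0)]
table = build_pivot_table(vectors, pivots)
matches, verified = similar_vectors((0.2, 0.1), vectors, pivots, table, tau=1.0)
print(matches, verified)
```

In this toy run the distant point is pruned by the pivot bound alone, so only three of the four vectors reach the exact distance check. Because the bound never exceeds the true distance, the filter cannot discard a real match.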

    Learning with Unsure Responses

    Many annotation systems allow an unsure option to be added to the labels, because annotators differ in expertise and may not be confident enough to choose a label for some assigned instances. However, existing approaches learn only from labels with a clear class name and ignore the unsure responses. Because unsure responses account for a sizable proportion of the dataset (e.g., about 10-30% in real datasets), existing approaches incur high costs, such as paying more money or taking more time to collect enough labeled data. Making use of these unsure responses is therefore a significant issue. In this paper, we make the unsure responses contribute to training classifiers. We found that the instances corresponding to unsure responses always appear close to the decision boundary of classification. We design a loss function, called unsure loss, based on this property. We extend the conventional methods for classification and learning from crowds with this unsure loss. Experimental results on real-world and synthetic data demonstrate the performance of our method and its superiority over baseline methods

    Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables

    Given a large amount of table data, how can we find the tables that contain the contents we want? A naive search fails when the column names are ambiguous, for example when columns containing stock price information are named “Close” in one table and “P” in another. One way of dealing with this problem that has been gaining attention is the semantic annotation of table data columns using canonical knowledge. While previous studies successfully dealt with this problem for specific types of table data such as web tables, it remains open for various other types of table data: (1) most approaches do not handle table data with numerical values, and (2) their predictive performance is not satisfactory. This paper presents a novel approach for table data annotation that combines a latent probabilistic model with multi-label classifiers. It offers three advantages over previous approaches, owing to its use of highly predictive multi-label classifiers in the probabilistic computation of semantic annotation. (1) It is more versatile, because using multi-label classifiers in the probabilistic model enables various types of data, such as numerical values, to be supported. (2) It is more accurate, because the multi-label classifiers and the probabilistic model work together to improve predictive performance. (3) It is more efficient, because potential functions based on multi-label classifiers reduce the computational cost of annotation. Extensive experiments demonstrated the superiority of the proposed approach over state-of-the-art approaches for semantic annotation of real data (183 human-annotated tables obtained from the UCI Machine Learning Repository)

    Empagliflozin in Patients with Chronic Kidney Disease

    Background The effects of empagliflozin in patients with chronic kidney disease who are at risk for disease progression are not well understood. The EMPA-KIDNEY trial was designed to assess the effects of treatment with empagliflozin in a broad range of such patients. Methods We enrolled patients with chronic kidney disease who had an estimated glomerular filtration rate (eGFR) of at least 20 but less than 45 ml per minute per 1.73 m² of body-surface area, or who had an eGFR of at least 45 but less than 90 ml per minute per 1.73 m² with a urinary albumin-to-creatinine ratio (with albumin measured in milligrams and creatinine measured in grams) of at least 200. Patients were randomly assigned to receive empagliflozin (10 mg once daily) or matching placebo. The primary outcome was a composite of progression of kidney disease (defined as end-stage kidney disease, a sustained decrease in eGFR to <10 ml per minute per 1.73 m², a sustained decrease in eGFR of ≥40% from baseline, or death from renal causes) or death from cardiovascular causes. Results A total of 6609 patients underwent randomization. During a median of 2.0 years of follow-up, progression of kidney disease or death from cardiovascular causes occurred in 432 of 3304 patients (13.1%) in the empagliflozin group and in 558 of 3305 patients (16.9%) in the placebo group (hazard ratio, 0.72; 95% confidence interval [CI], 0.64 to 0.82; P<0.001). Results were consistent among patients with or without diabetes and across subgroups defined according to eGFR ranges. The rate of hospitalization from any cause was lower in the empagliflozin group than in the placebo group (hazard ratio, 0.86; 95% CI, 0.78 to 0.95; P=0.003), but there were no significant between-group differences with respect to the composite outcome of hospitalization for heart failure or death from cardiovascular causes (which occurred in 4.0% in the empagliflozin group and 4.6% in the placebo group) or death from any cause (in 4.5% and 5.1%, respectively). The rates of serious adverse events were similar in the two groups. Conclusions Among a wide range of patients with chronic kidney disease who were at risk for disease progression, empagliflozin therapy led to a lower risk of progression of kidney disease or death from cardiovascular causes than placebo
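The primary-outcome percentages quoted in the abstract can be rechecked from the reported counts. The short sketch below does only that arithmetic. The crude risk ratio it prints is a simple ratio of event proportions and is not the trial's reported hazard ratio of 0.72, which comes from a time-to-event analysis that the counts alone cannot reproduce. Variable names are chosen here for illustration.

```python
# Counts reported in the abstract for the primary composite outcome.
empagliflozin_events, empagliflozin_total = 432, 3304
placebo_events, placebo_total = 558, 3305

empagliflozin_rate = empagliflozin_events / empagliflozin_total
placebo_rate = placebo_events / placebo_total

# Absolute difference in event proportions, in percentage points.
absolute_difference = (placebo_rate - empagliflozin_rate) * 100

# Crude ratio of proportions; NOT the hazard ratio reported by the trial.
crude_risk_ratio = empagliflozin_rate / placebo_rate

print(f"empagliflozin: {empagliflozin_rate:.1%}")
print(f"placebo:       {placebo_rate:.1%}")
print(f"absolute difference: {absolute_difference:.1f} percentage points")
print(f"crude risk ratio:    {crude_risk_ratio:.2f}")
```

The two group sizes also sum to the 6609 randomized patients stated in the abstract.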