
    Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

    Data-centric AI is at the center of a fundamental shift in software engineering, where machine learning becomes the new software, powered by big data and computing infrastructure. Software engineering therefore needs to be rethought, with data treated as a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches need less feature engineering but far more data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues have become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.
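    The survey's discussion of fairness measures lends itself to a small worked example. The sketch below is not taken from the survey; all names and data are illustrative. It computes one common measure, the demographic parity difference, with NumPy.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rate between two groups.

    y_pred: binary predictions (0/1); group: binary group membership (0/1).
    A value near 0 means both groups receive positive predictions at a
    similar rate; it says nothing about other fairness criteria.
    """
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_g1 = y_pred[group == 1].mean()
    rate_g0 = y_pred[group == 0].mean()
    return abs(rate_g1 - rate_g0)

# Toy example with made-up predictions and group labels.
preds  = np.array([1, 0, 1, 1, 0, 1, 0, 0])
groups = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(demographic_parity_difference(preds, groups))  # |0.75 - 0.25| = 0.5
```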

    Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy

    Data pruning, which aims to downsize a large training set into a small informative subset, is crucial for reducing the enormous computational costs of modern deep learning. Although large-scale data collections invariably contain annotation noise and numerous robust learning methods have been developed, data pruning for the noise-robust learning scenario has received little attention. With state-of-the-art Re-labeling methods that self-correct erroneous labels during training, it is challenging to identify which subset induces the most accurate re-labeling of erroneous labels in the entire training set. In this paper, we formalize the problem of data pruning with re-labeling. We first show that the likelihood of a training example being correctly re-labeled is proportional to the prediction confidence of its neighborhood in the subset. Therefore, we propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples, thereby maximizing the re-labeling accuracy and generalization performance. Extensive experiments on four real noisy datasets and one synthetic noisy dataset show that Prune4Rel outperforms the baselines with Re-labeling models by up to 9.1% and those with a standard model by up to 21.6%.
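    The paper's objective, selecting a subset that maximizes the total neighborhood confidence of all training examples, can be illustrated with a simplified selection rule. The sketch below is a hypothetical rendering of that idea, not the authors' released code: it assumes a precomputed k-nearest-neighbor graph and per-example prediction confidences, and because each example's contribution is static here, greedy selection reduces to a top-k pick.

```python
import numpy as np

def prune_by_neighborhood_confidence(neighbors, confidence, budget):
    """Pick `budget` examples whose selection contributes the most confidence.

    neighbors : (n, k) int array, indices of each example's k nearest neighbors.
    confidence: (n,) float array, model prediction confidence per example.
    budget    : number of examples to keep in the pruned subset.

    Each example j, if selected, contributes confidence[j] to every example
    that lists j as a neighbor, so its total contribution is
    confidence[j] * (number of examples having j as a neighbor).
    """
    n = len(confidence)
    in_degree = np.bincount(neighbors.ravel(), minlength=n)
    contribution = confidence * in_degree
    return np.sort(np.argsort(-contribution)[:budget])

# Toy usage with random data (purely illustrative).
rng = np.random.default_rng(0)
nbrs = rng.integers(0, 100, size=(100, 5))
conf = rng.random(100)
subset = prune_by_neighborhood_confidence(nbrs, conf, budget=20)
print(subset)
```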

    PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency

    The k-medoids algorithm is one of the best-known clustering algorithms. Despite this, it is not as widely used for big data analytics as the k-means algorithm, mainly because of its high computational complexity. Many studies have attempted to solve the efficiency problem of the k-medoids algorithm, but all of them have improved efficiency at the expense of accuracy. In this paper, we propose a novel parallel k-medoids algorithm, which we call PAMAE, that achieves both high accuracy and high efficiency. We identify two factors, "global search" and "entire data", that are essential to achieving high accuracy but are also very time-consuming if considered simultaneously. Thus, our key idea is to apply them individually through two phases: parallel seeding and parallel refinement, neither of which is costly. The first phase performs a global search over sampled data, and the second phase performs a local search over the entire data. Our theoretical analysis proves that this serial execution of the two phases leads to an accurate solution comparable to the one that would be achieved by a global search over the entire data. To validate the merit of our approach, we implement PAMAE on Spark as well as Hadoop and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). The results show that PAMAE significantly outperforms most recent parallel algorithms and, at the same time, produces clustering quality comparable to the previously most accurate algorithm. The source code and data are available at https://github.com/jaegil/k-Medoid.
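    PAMAE's two phases, global seeding on a sample followed by local refinement over the entire data, can be illustrated on a single machine. The sketch below is a non-parallel, NumPy-only approximation under that assumption (the actual system runs on Spark/Hadoop and differs in detail); all function names are hypothetical.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distance matrix between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def seed_on_sample(X, k, sample_size, rng):
    """Phase 1 (simplified): greedily build k medoids on a random sample."""
    sample = X[rng.choice(len(X), size=sample_size, replace=False)]
    D = pairwise_dist(sample, sample)
    medoids = [int(np.argmin(D.sum(axis=1)))]                 # best single medoid
    while len(medoids) < k:
        cur = D[:, medoids].min(axis=1)                       # current cost per point
        gains = np.maximum(cur[:, None] - D, 0).sum(axis=0)   # cost drop per candidate
        gains[medoids] = -np.inf
        medoids.append(int(np.argmax(gains)))
    return sample[medoids]

def refine_on_full_data(X, medoid_points):
    """Phase 2: assign every point, then move each medoid to its cluster's best point."""
    assign = pairwise_dist(X, medoid_points).argmin(axis=1)
    refined = []
    for c in range(len(medoid_points)):
        members = X[assign == c]
        if len(members) == 0:                                 # keep the seed if cluster is empty
            refined.append(medoid_points[c])
            continue
        D = pairwise_dist(members, members)
        refined.append(members[np.argmin(D.sum(axis=1))])
    return np.vstack(refined), assign

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(200, 2)) for loc in ((0, 0), (3, 3), (0, 3))])
seeds = seed_on_sample(X, k=3, sample_size=100, rng=rng)
medoids, labels = refine_on_full_data(X, seeds)
print(medoids)
```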

    Ada-boundary: accelerating DNN training via adaptive boundary batch selection

    Neural networks converge faster with help from a smart batch selection strategy. In this regard, we propose Ada-Boundary, a novel and simple adaptive batch selection algorithm that constructs an effective mini-batch according to the learning progress of the model. Our key idea is to exploit confusing samples for which the model cannot predict labels with high confidence. Thus, samples near the current decision boundary are considered the most effective for expediting convergence. Taking advantage of this design, Ada-Boundary maintained its dominance across various degrees of training difficulty. We demonstrate the advantage of Ada-Boundary by extensive experimentation using CNNs with five benchmark data sets. Ada-Boundary was shown to produce a relative improvement in test error of up to 31.80% compared with the baseline for a fixed wall-clock training time, thereby achieving a faster convergence speed.
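    The core idea, preferring "confusing" samples near the current decision boundary when forming a mini-batch, can be sketched with a simple rank-weighted sampler. The example below is an illustrative simplification, not the authors' algorithm: it scores each sample by the margin between its top-two softmax probabilities and draws batches that favor low-margin samples.

```python
import numpy as np

def boundary_scores(probs):
    """Margin between top-1 and top-2 class probabilities; smaller = more confusing."""
    s = np.sort(probs, axis=1)
    return s[:, -1] - s[:, -2]

def sample_boundary_batch(probs, batch_size, rng, temperature=1.0):
    """Draw a mini-batch that favors samples close to the decision boundary."""
    scores = boundary_scores(probs)
    ranks = scores.argsort().argsort()              # rank 0 = most confusing
    weights = np.exp(-ranks / (temperature * len(scores)))
    weights /= weights.sum()
    return rng.choice(len(scores), size=batch_size, replace=False, p=weights)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)       # stand-in for softmax outputs
batch_idx = sample_boundary_batch(probs, batch_size=64, rng=rng)
print(batch_idx[:10])
```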

    PREMERE: Meta-Reweighting via Self-Ensembling for Point-of-Interest Recommendation

    Point-of-interest (POI) recommendation has become an important research topic in recent years. The user check-in history used as the input to POI recommendation is very imbalanced and noisy because of sparse and missing check-ins. Although sample reweighting is commonly adopted to address this challenge with the input data, its fixed weighting scheme is often inappropriate for the differing characteristics of users and POIs. Thus, in this paper, we propose PREMERE, an adaptive weighting scheme based on meta-learning. Because meta-data is typically required by meta-learning but is inherently hard to obtain in POI recommendation, we self-generate the meta-data via self-ensembling. Furthermore, the meta-model architecture is extended to deal with the scarcity of check-ins. Thorough experiments show that replacing an existing weighting scheme with PREMERE boosts the performance of state-of-the-art recommender algorithms by 2.36–26.9% on three benchmark datasets.
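    PREMERE's weights are learned through meta-learning with meta-data generated by self-ensembling; a faithful reproduction would require the full bi-level optimization. The toy sketch below shares only one ingredient, a self-ensembled (EMA) prediction target used to derive per-sample weights, and should be read as a loose, hypothetical stand-in rather than the paper's method.

```python
import numpy as np

class SelfEnsembleReweighter:
    """Toy per-sample reweighting driven by a self-ensembled prediction target.

    Maintains an exponential moving average (EMA) of the model's predictions
    and upweights samples whose current prediction agrees with that ensemble.
    """

    def __init__(self, n_samples, n_classes, momentum=0.9):
        self.ema = np.full((n_samples, n_classes), 1.0 / n_classes)
        self.momentum = momentum

    def update(self, indices, probs):
        """Fold the current batch's predictions into the ensemble."""
        self.ema[indices] = self.momentum * self.ema[indices] + (1 - self.momentum) * probs

    def weights(self, indices, probs, eps=1e-8):
        """Weight = normalized agreement between current and ensembled predictions."""
        agreement = (self.ema[indices] * probs).sum(axis=1)
        w = agreement / (agreement.sum() + eps)
        return w * len(indices)                     # keeps the mean weight near 1

rng = np.random.default_rng(0)
reweighter = SelfEnsembleReweighter(n_samples=1000, n_classes=5)
idx = rng.integers(0, 1000, size=32)
probs = rng.dirichlet(np.ones(5), size=32)          # stand-in for model predictions
reweighter.update(idx, probs)
print(reweighter.weights(idx, probs)[:5])
```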