A Survey of Dataset Refinement for Problems in Computer Vision Datasets
Large-scale datasets have played a crucial role in the advancement of
computer vision. However, they often suffer from problems such as class
imbalance, noisy labels, dataset bias, or high resource costs, which can
inhibit model performance and reduce trustworthiness. With the advocacy of
data-centric research, various data-centric solutions have been proposed to
solve the dataset problems mentioned above. They improve the quality of
datasets by re-organizing them, which we call dataset refinement. In this
survey, we provide a comprehensive and structured overview of recent advances
in dataset refinement for problematic computer vision datasets. Firstly, we
summarize and analyze the various problems encountered in large-scale computer
vision datasets. Then, we classify the dataset refinement algorithms into three
categories based on the refinement process: data sampling, data subset
selection, and active learning. In addition, we organize these dataset
refinement methods according to the addressed data problems and provide a
systematic comparative description. We point out that these three types of
dataset refinement have distinct advantages and disadvantages for dataset
problems, which informs the choice of the data-centric method appropriate to a
particular research objective. Finally, we summarize the current literature and
propose potential future research topics.
Comment: 33 pages, 10 figures, to be published in ACM Computing Surveys
Cover tree based dynamization of clustering algorithms
In this work, cover trees are the main focus, and they serve as a data structure to efficiently store metric data. We utilize them for dynamically handling the k-center problem, both with and without outliers. The cover tree data structure is designed to retrieve a coreset, a very succinct summary of the data, which is then fed to an offline clustering algorithm to quickly obtain a solution for the whole dataset.
With respect to the original definition, the implemented cover tree is augmented with new fields to maintain additional information crucial for extracting reasonable coresets. The solutions obtainable for the mentioned problems are (α + ε)-approximations, where α represents the best-known approximation achievable in polynomial time in the standard offline setting, and ε > 0 is a user-provided accuracy parameter.
The main objective in using a dynamic data structure is to obtain a reasonable solution, in comparison to the solution obtained by applying the clustering algorithms from scratch to all the data points. To ascertain the quality of our solution, we conduct a series of experiments to evaluate its performance and to fine-tune the involved parameters.
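The offline clustering step that consumes the coreset is not spelled out in the abstract; a standard choice for k-center is Gonzalez's greedy farthest-point algorithm, which gives a 2-approximation. The sketch below is a generic illustration of that offline step, not code from this thesis:

```python
import math
import random

def gonzalez_k_center(points, k, seed=0):
    """Greedy 2-approximation for k-center (Gonzalez): repeatedly
    add the point farthest from the current set of centers."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        for j, p in enumerate(points):
            dist[j] = min(dist[j], math.dist(p, points[i]))
    radius = max(dist)  # k-center cost: max distance to nearest center
    return centers, radius
```

In the dynamic setting described above, `points` would be the small coreset extracted from the cover tree rather than the full dataset, so this step stays cheap even as the data grows.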
Data-Efficient Training of CNNs and Transformers with Coresets: A Stability Perspective
Coreset selection is among the most effective ways to reduce the training
time of CNNs; however, only little is known about how the resultant models will
behave under variations of the coreset size and the choice of datasets and models.
Moreover, given the recent paradigm shift towards transformer-based models, it
is still an open question how coreset selection would impact their performance.
There are several similar intriguing questions that need to be answered for a
wide acceptance of coreset selection methods, and this paper attempts to answer
some of these. We present a systematic benchmarking setup and perform a
rigorous comparison of different coreset selection methods on CNNs and
transformers. Our investigation reveals that under certain circumstances,
random selection of subsets is more robust and stable when compared with the
SOTA selection methods. We demonstrate that the conventional concept of uniform
subset sampling across the various classes of the data is not the appropriate
choice. Rather, samples should be adaptively chosen based on the complexity of
the data distribution for each class. Transformers are generally pretrained on
large datasets, and we show that for certain target datasets, this pretraining
helps keep their performance stable even at very small coreset sizes. We further show that
when no pretraining is done or when the pretrained transformer models are used
with non-natural images (e.g. medical data), CNNs tend to generalize better
than transformers at even very small coreset sizes. Lastly, we demonstrate that
in the absence of the right pretraining, CNNs are better at learning the
semantic coherence between spatially distant objects within an image, and they
tend to outperform transformers at almost all choices of the coreset size.
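The paper's exact per-class complexity criterion is not given in the abstract; the sketch below only illustrates the general idea it argues for, i.e. splitting a sampling budget across classes in proportion to some difficulty score instead of uniformly. The `difficulty` scores here are a hypothetical stand-in for whatever complexity measure is used:

```python
import random
from collections import defaultdict

def adaptive_class_budget(labels, difficulty, budget, seed=0):
    """Allocate a total sample budget across classes in proportion to a
    per-class difficulty score (a stand-in for distributional complexity),
    then sample that many indices uniformly within each class."""
    total = sum(difficulty.values())
    alloc = {c: max(1, round(budget * d / total)) for c, d in difficulty.items()}
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    rng = random.Random(seed)
    subset = []
    for c, idxs in by_class.items():
        n = min(alloc.get(c, 0), len(idxs))
        subset.extend(rng.sample(idxs, n))
    return sorted(subset)
```

With uniform sampling every class would get `budget / num_classes` samples; here a class judged three times harder receives three times the budget.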
BigFCM: Fast, Precise and Scalable FCM on Hadoop
Clustering plays an important role in mining big data both as a modeling
technique and a preprocessing step in many data mining process implementations.
Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing
each data record to belong to more than one cluster to some degree. However, a
serious challenge in fuzzy clustering is the lack of scalability. Massive
datasets in emerging fields such as geosciences, biology and networking do
require parallel and distributed computations with high performance to solve
real-world problems. Although some clustering methods have already been adapted
to execute on big data platforms, their execution time grows steeply on
large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering method named
BigFCM is proposed and designed for the Hadoop distributed data platform. Based
on the map-reduce programming model, it exploits several mechanisms including
an efficient caching design to achieve several orders of magnitude reduction in
execution time. Extensive evaluation over multi-gigabyte datasets shows that
BigFCM is scalable while it preserves the quality of clustering.
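For reference, the per-record computation that a map task in such a design would perform is the standard FCM update: each point gets a membership degree in every cluster, and centroids are recomputed as membership-weighted means. The sketch below shows one sequential iteration of textbook FCM, not the Hadoop implementation from the paper:

```python
import math

def fcm_step(points, centers, m=2.0):
    """One fuzzy c-means iteration: update memberships u[i][j]
    (degree to which point i belongs to cluster j), then recompute
    centroids as u^m-weighted means. m > 1 is the fuzzifier."""
    eps = 1e-12
    u = []
    for p in points:
        d = [max(math.dist(p, c), eps) for c in centers]
        row = []
        for j in range(len(centers)):
            # standard FCM membership: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
            row.append(1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                 for k in range(len(centers))))
        u.append(row)
    new_centers = []
    for j in range(len(centers)):
        w = [u[i][j] ** m for i in range(len(points))]
        s = sum(w)
        new_centers.append(tuple(
            sum(w[i] * points[i][t] for i in range(len(points))) / s
            for t in range(len(points[0]))))
    return u, new_centers
```

In a map-reduce setting, the membership computation is embarrassingly parallel over points, while the weighted sums for each centroid are combined in the reduce phase.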
Applied Randomized Algorithms for Efficient Genomic Analysis
The scope and scale of biological data continues to grow at an exponential clip, driven by advances in genetic sequencing, annotation and widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, it has outgrown the practical reach of many traditional algorithmic approaches in both time and space.
Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph structured data.
We applied recent advances in randomized algorithms to practical problems. We used MinHash, an instance of locality-sensitive hashing, and HyperLogLog, a cardinality-estimation sketch, as well as coresets, which are compact approximate representations for finite-sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling.
We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
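The core MinHash idea referenced above can be stated in a few lines: hash every set element under several seeded hash functions, keep only the minimum per function, and estimate Jaccard similarity as the fraction of positions where two sketches agree. This is a generic illustration, not code from the libraries named in the thesis:

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    """MinHash sketch: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the set's items."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt per hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for item in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing sketch positions is an unbiased
    estimate of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

For genomic applications such as Dashing, `items` would typically be the k-mers of a genome, so two sequences can be compared in time proportional to the sketch size rather than their lengths.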
You Only Condense Once: Two Rules for Pruning Condensed Datasets
Dataset condensation is a crucial tool for enhancing training efficiency by
reducing the size of the training dataset, particularly in on-device scenarios.
However, these scenarios have two significant challenges: 1) the varying
computational resources available on the devices require a dataset size
different from the pre-defined condensed dataset, and 2) the limited
computational resources often preclude the possibility of conducting additional
condensation processes. We introduce You Only Condense Once (YOCO) to overcome
these limitations. On top of one condensed dataset, YOCO produces smaller
condensed datasets with two embarrassingly simple dataset pruning rules: Low
LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can
flexibly resize the dataset to fit varying computational constraints, and 2) it
eliminates the need for extra condensation processes, which can be
computationally prohibitive. Experiments validate our findings on networks
including ConvNet, ResNet and DenseNet, and datasets including CIFAR-10,
CIFAR-100 and ImageNet. For example, our YOCO surpassed various dataset
condensation and dataset pruning methods on CIFAR-10 with ten Images Per Class
(IPC), achieving 6.98-8.89% and 6.31-23.92% accuracy gains, respectively. The
code is available at: https://github.com/he-y/you-only-condense-once.
Comment: Accepted by NeurIPS 202
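The LBPE score itself is defined in the paper, but the Balanced Construction rule can be sketched generically: to shrink a condensed dataset to a target number of Images Per Class, rank examples within each class by a score and keep the same number from every class. Here `scores` is a hypothetical per-example score standing in for LBPE:

```python
def prune_balanced(indices_by_class, scores, target_ipc):
    """Shrink a condensed dataset to target_ipc examples per class:
    within each class, keep the target_ipc lowest-scoring examples
    (the paper's Low LBPE Score rule; `scores` here is any
    per-example proxy, not the actual LBPE computation)."""
    kept = []
    for cls, idxs in indices_by_class.items():
        ranked = sorted(idxs, key=lambda i: scores[i])
        kept.extend(ranked[:target_ipc])  # balanced: same count per class
    return sorted(kept)
```

Because pruning only re-ranks and truncates the already-condensed set, resizing to any smaller IPC requires no further condensation runs, which is the point of the method.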
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad-hoc fashion. Most practitioners cannot precisely
capture the effect of sampling on the quality of their model, and eventually on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimizations and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model.
Comment: 22 pages, SIGMOD 201
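BlinkML's guarantee machinery is not reproduced here, but the underlying phenomenon it exploits is easy to demonstrate: an MLE model trained on a modest sample usually makes the same predictions as the model trained on all the data. The sketch below trains a tiny gradient-descent logistic regression (a stand-in for any MLE model) on a 10% sample and on the full data, then measures prediction agreement; all names and data are illustrative:

```python
import math
import random

def train_logreg(data, steps=300, lr=0.5):
    """Plain gradient-descent MLE for 1-D logistic regression
    (one weight plus bias)."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x  # gradient of the negative log-likelihood
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

rng = random.Random(1)
full = [(x, 1 if x > 0 else 0) for x in (rng.uniform(-3, 3) for _ in range(1000))]
sample = rng.sample(full, 100)  # train on 10% of the data

wf, bf = train_logreg(full)      # "full model"
ws, bs = train_logreg(sample)    # "approximate model"
# fraction of points where the two models predict the same label
agree = sum((wf * x + bf > 0) == (ws * x + bs > 0) for x, _ in full) / len(full)
```

BlinkML's contribution is turning this empirical agreement into an a priori probabilistic guarantee and choosing the sample size accordingly, rather than sampling ad hoc as described above.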