Large-scale datasets have played a crucial role in the advancement of
computer vision. However, they often suffer from problems such as class
imbalance, noisy labels, dataset bias, or high resource costs, which can
inhibit model performance and reduce trustworthiness. With the rise of
data-centric research, various solutions have been proposed to address
these problems. They improve dataset quality by re-organizing the data,
a process we call dataset refinement. In this
survey, we provide a comprehensive and structured overview of recent advances
in dataset refinement for problematic computer vision datasets. First, we
summarize and analyze the various problems encountered in large-scale computer
vision datasets. Then, we classify the dataset refinement algorithms into three
categories based on the refinement process: data sampling, data subset
selection, and active learning. In addition, we organize these dataset
refinement methods according to the addressed data problems and provide a
systematic comparative description. We point out that these three types of
dataset refinement have distinct advantages and disadvantages for different
dataset problems, which can inform the choice of data-centric method for a
particular research objective. Finally, we summarize the current literature and
propose potential future research topics.

Comment: 33 pages, 10 figures, to be published in ACM Computing Surveys