Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning

Authors: Gürel Nezihe Merve; Kalogerias Dionysis; Karbasi Amin; Mageirakos Vasilis; Nikolakakis Konstantinos E.; Okanovic Patrik; Rekatsinas Theodoros; Waleffe Roger
Publication date: 28/05/2023

Methods for carefully selecting or generating a small set of training data to learn from, i.e., data pruning, coreset selection, and data distillation, have been shown to be effective in reducing the ever-increasing cost of training neural networks. Behind this success are rigorously designed strategies for identifying informative training examples out of large datasets. However, these strategies come with additional computational costs associated with subset selection or data distillation before training begins, and furthermore, many are shown to underperform even random sampling in high data compression regimes. As such, many data pruning, coreset selection, or distillation methods may not reduce 'time-to-accuracy', which has become a critical efficiency measure of training deep neural networks over large datasets. In this work, we revisit a powerful yet overlooked random sampling strategy to address these challenges and introduce an approach called Repeated Sampling of Random Subsets (RSRS or RS2), where we randomly sample the subset of training data for each epoch of model training. We test RS2 against thirty state-of-the-art data pruning and data distillation methods across four datasets including ImageNet. Our results demonstrate that RS2 significantly reduces time-to-accuracy compared to existing techniques. For example, when training on ImageNet in the high-compression regime (using less than 10% of the dataset each epoch), RS2 yields accuracy improvements up to 29% compared to competing pruning methods while offering a runtime reduction of 7x. Beyond the above meta-study, we provide a convergence analysis for RS2 and discuss its generalization capability. The primary goal of our work is to establish RS2 as a competitive baseline for future data selection or distillation techniques aimed at efficient training.

arXiv.org e-Print Archive
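
The per-epoch resampling described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `rs2_epoch_subsets`, its parameters, and the choice to sample without replacement within an epoch and independently across epochs are all assumptions made for the example.

```python
import random
from typing import Iterator, List


def rs2_epoch_subsets(
    num_examples: int,
    fraction: float,
    num_epochs: int,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield a freshly drawn random subset of example indices for each epoch.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    subset_size = max(1, int(num_examples * fraction))
    for _ in range(num_epochs):
        # Assumption: draw without replacement inside an epoch,
        # and redraw independently at the start of every epoch.
        yield rng.sample(range(num_examples), subset_size)


if __name__ == "__main__":
    # Each epoch would train only on the indices yielded for that epoch.
    for epoch, indices in enumerate(rs2_epoch_subsets(1000, 0.1, 3)):
        print(epoch, len(indices), sorted(indices)[:5])
```

Unlike a pruning or coreset method, nothing is computed before training starts, so the only selection cost in this sketch is the draw itself.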

Test Set Diameter: Quantifying the Diversity of Sets of Test Cases

Authors: Clark David; Feldt Robert; Poulding Simon; Yoo Shin
Publication date: 10/06/2015

A common and natural intuition among software testers is that test cases need to differ if a software system is to be tested properly and its quality ensured. Consequently, much research has gone into formulating distance measures for how test cases, their inputs and/or their outputs differ. However, common to these proposals is that they are data type specific and/or calculate the diversity only between pairs of test inputs, traces or outputs. We propose a new metric to measure the diversity of sets of tests: the test set diameter (TSDm). It extends our earlier, pairwise test diversity metrics based on recent advances in information theory regarding the calculation of the normalized compression distance (NCD) for multisets. An advantage is that TSDm can be applied regardless of data type and on any test-related information, not only the test inputs. A downside is the increased computational time compared to competing approaches. Our experiments on four different systems show that the test set diameter can help select test sets with higher structural and fault coverage than random selection even when only applied to test inputs. This can enable early test design and selection, prior to even having a software system to test, and complement other types of test automation and analysis. We argue that this quantification of test set diversity creates a number of opportunities to better understand software quality and provides practical ways to increase it.

arXiv.org e-Print Archive

Crossref
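
For orientation, the pairwise normalized compression distance that the abstract above says TSDm extends can be sketched as below. This shows only the standard two-input NCD, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and not the multiset version used for the test set diameter. The use of zlib as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Pairwise normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Two near-identical test inputs should score lower than two unrelated ones.
    a = b"login(user='alice', password='secret')\n" * 50
    b = b"login(user='bob', password='secret')\n" * 50
    c = bytes(range(256))
    print(ncd(a, b), ncd(a, c))
```

Because the measure only needs a byte representation, the same function applies to test inputs, outputs or traces, which matches the data-type independence the abstract highlights.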

Data-Driven Teaching: Tools and Trends

Publication venue: Rennie Center for Education Research & Policy
Publication date: 02/02/2006

Data-Driven Teaching: Tools and Trends, a policy brief released by the Rennie Center for Education Research and Policy, focuses on three district-based data analysis programs and highlights the policy and practice challenges associated with their use. More importantly, it provides educators and policymakers with guiding questions to assist in the selection of data analysis programs. The brief makes critical recommendations for district and state level policymakers seeking to move toward including data analysis as part of teachers' daily practice. Drawing on research with teachers, principals and superintendents in three urban districts, the Rennie Center's brief recommends that policymakers at both the state and district levels provide teachers with more time and support for the integration of data into their instructional planning.

IssueLab

Towards an Efficient Discovery of the Topological Representative Subgraphs

Authors: Dhifli Wajdi; Moussaoui Mohamed; Nguifo Engelbert Mephu; Saidi Rabie
Publication date: 01/01/2013

With the emergence of graph databases, the task of frequent subgraph discovery has been extensively addressed. Although the proposed approaches in the literature have made this task feasible, the number of discovered frequent subgraphs is still too high to be used efficiently in any further exploration. Feature selection for graph data is a way to reduce the high number of frequent subgraphs based on exact or approximate structural similarity. However, current structural similarity strategies are not efficient enough in many real-world applications; moreover, the combinatorial nature of graphs makes them computationally very costly. In order to select a smaller yet structurally irredundant set of subgraphs, we propose a novel approach that mines the top-k topological representative subgraphs among the frequent ones. Our approach allows detecting hidden structural similarities that existing approaches are unable to detect, such as the density or the diameter of the subgraph. In addition, it can be easily extended using any user-defined structural or topological attributes depending on the sought properties. Empirical studies on real and synthetic graph datasets show that our approach is fast and scalable.

arXiv.org e-Print Archive

HAL Clermont Université