Search CORE

10,396 research outputs found

The Census and social science: third report of Session 2012–13

Author
Publication venue: Stationery Office
Publication date: 01/01/2012
Field of study

Digital Education Resource Archive

Measurements of participation in Scottish higher education report

Author: Cohen Geoff
Fielding Antony
Waterton Jennifer
Publication venue: Scottish Government (Scotland)
Publication date: 01/01/2010
Field of study

Digital Education Resource Archive

Avoiding disclosure of individually identifiable health information: a literature review

Author: Borton Joshua
Fernandes-Huessy Johannes
Gonzalez Claudia
Hair Elizabeth
Holden Craig
Mulcahy Tim
Prada Sergio I
Publication venue
Publication date
Field of study

Achieving data and information dissemination without arming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations for the most common access methods. Although there is abundant theoretical and empirical research, their review reveals lack of consensus on fundamental questions for empirical practice: How to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.public use files, disclosure avoidance, reidentification, de-identification, data utility

Research Papers in Economics

A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition

Author: Delen Dursun
Kasap Nihat
Meesad Phayung
Thammasiri Dech
Publication venue: 'Elsevier BV'
Publication date: 01/08/2013
Field of study

Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is a common in the field of student retention, mainly because a lot of students register but fewer students drop out. Classification techniques for imbalanced dataset can yield deceivingly high prediction accuracy where the overall predictive accuracy is usually driven by the majority class at the expense of having very poor performance on the crucial minority class. In this study, we compared different data balancing techniques to improve the predictive accuracy in minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques—oversampling, under-sampling and synthetic minority over-sampling (SMOTE)—along with four popular classification methods—logistic regression, decision trees, neuron networks and support vector machines. We used a large and feature rich institutional student data (between the years 2005 and 2011) to assess the efficacy of both balancing techniques as well as prediction methods. The results indicated that the support vector machine combined with SMOTE data-balancing technique achieved the best classification performance with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses on developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates

Crossref

Sabanci University Research Database

30 Years of Synthetic Data

Author: Drechsler Joerg
Haensch Anna-Carolina
Publication venue
Publication date: 04/04/2023
Field of study

The idea to generate synthetic data as a tool for broadening access to sensitive microdata has been proposed for the first time three decades ago. While first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in parts by some recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data.Comment: 42 page

arXiv.org e-Print Archive

ILR Faculty Publications 2014-15

Author: ILR School Cornell University
Publication venue: DigitalCommons@ILR
Publication date: 01/01/2015
Field of study

The production of scholarly research continues to be one of the primary missions of the ILR School. During a typical academic year, ILR faculty members published or had accepted for publication over 25 books, edited volumes, and monographs, 170 articles and chapters in edited volumes, numerous book reviews. In addition, a large number of manuscripts were submitted for publication, presented at professional association meetings, or circulated in working paper form. Our faculty's research continues to find its way into the very best industrial relations, social science and statistics journals.FacultyPublications_2014_15_final.pdf: 24 downloads, before Oct. 1, 2020

DigitalCommons@ILR

eCommons@Cornell