
    Data Re-Use and the Problem of Group Identity

    Reusing existing data sets of health information for public health or medical research has much to recommend it. Much data repurposing in medical or public health research or practice involves information that has been stripped of individual identifiers but some does not. In some cases, there may have been consent to the reuse but in other cases consent may be absent and people may be entirely unaware of how the data about them are being used. Data sets are also being combined and may contain information with very different sources, consent histories, and individual identifiers. Much of the ethical and policy discussion about the permissibility of data reuse has centered on two questions: for identifiable data, the scope of the original consent and whether the reuse is permissible in light of that scope, and for de-identified data, whether there are unacceptable risks that the data will be reidentified in a manner that is harmful to any data subjects. Prioritizing these questions rests on a picture of the ethics of data use as primarily about respecting the choices of the data subject. We contend that this picture is mistaken; data repurposing, especially when data sets are combined, raises novel questions about the impacts of research on groups and their implications for individuals regarded as falling within these groups. These impacts suggest that the controversies about de-identification or reconsent for reuse are to some extent beside the point. Serious ethical questions are also raised by the inferences that may be drawn about individuals from the research and resulting risks of stigmatization. These risks may arise even when individuals were not part of the original data set being repurposed. Data reuse, repurposing, and recombination may have damaging effects on others not included within the original data sets. 
These issues of justice for individuals who might be regarded as indirect subjects of research are not even raised by approaches that consider only the implications for, or the agreement of, the original data subject. This chapter argues that health information should be available for reuse, but in a way that does not yield surprises, produce direct harm to individuals, or violate warranted trust.

    Privacy preserving data mining services on the web

    Data mining research deals with extracting useful information from large collections of data. Since data mining is a complex process that requires expertise, it is beneficial to provide it as a service on the web. On the other hand, such use of data mining services, combined with data collection efforts by private and government organizations, leads to increased privacy concerns. In this work, we address the issue of preserving privacy while providing data mining services on the web and present an architecture for privacy-preserving sharing of data mining models on the web. In the proposed architecture, data providers use APPEL to specify their privacy preferences on data mining models, while data collectors use P3P policies to specify their data-usage practices. Both parties use PMML as the standard for specifying data mining queries, constraints and models.
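
    The matching step implied by this architecture, checking a collector's declared practices against a provider's preferences, might be sketched as follows. This is only an illustration in Python and does not use real APPEL, P3P or PMML syntax: the classes `UsagePolicy` and `PrivacyPreference`, their fields, and the function `policy_acceptable` are hypothetical stand-ins introduced here, not part of the described work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical stand-in for a collector's declared data-usage practices."""
    purposes: frozenset
    recipients: frozenset
    retention_days: int


@dataclass(frozen=True)
class PrivacyPreference:
    """Hypothetical stand-in for a provider's preferences on shared models."""
    allowed_purposes: frozenset
    allowed_recipients: frozenset
    max_retention_days: int


def policy_acceptable(pref: PrivacyPreference, policy: UsagePolicy) -> bool:
    """Return True only if every declared practice stays within the preference."""
    return (
        policy.purposes <= pref.allowed_purposes          # subset check
        and policy.recipients <= pref.allowed_recipients  # subset check
        and policy.retention_days <= pref.max_retention_days
    )


# Illustrative values only (assumptions, not taken from the paper).
preference = PrivacyPreference(
    allowed_purposes=frozenset({"research", "service-improvement"}),
    allowed_recipients=frozenset({"collector"}),
    max_retention_days=90,
)
research_policy = UsagePolicy(
    purposes=frozenset({"research"}),
    recipients=frozenset({"collector"}),
    retention_days=30,
)
marketing_policy = UsagePolicy(
    purposes=frozenset({"research", "marketing"}),
    recipients=frozenset({"collector", "third-party"}),
    retention_days=30,
)

if __name__ == "__main__":
    print(policy_acceptable(preference, research_policy))
    print(policy_acceptable(preference, marketing_policy))
```

    In this sketch a model would be shared only when the check succeeds; a real deployment would evaluate the actual preference rule sets against the published policies rather than these simplified sets.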

    Towards a Collaborative Platform for Advanced Meta-Learning in Health care Predictive Analytics

    Modern medical research and clinical practice are more dependent than ever on multi-factorial data sets originating from various sources, such as medical imaging, DNA analysis, patient health records and contextual factors. This data drives research, facilitates correct diagnoses and ultimately helps to develop and select the appropriate treatments. The volume and impact of this data have increased tremendously through technological developments such as high-throughput genomics and high-resolution medical imaging techniques. Additionally, the availability and popularity of different wearable health care devices have allowed the collection and monitoring of fine-grained personal health care data. The fusion and combination of these heterogeneous data sources has already led to many breakthroughs in health research and shows high potential for the development of methods that will push current reactive practices towards predictive, personalized and preventive health care. This potential is recognized and has led to the development of many platforms for the collection and statistical analysis of health care data (e.g. Apple Health, Microsoft HealthVault, Oracle Health Management, Philips HealthSuite, and EMC Health care Analytics). However, the heterogeneity of the data, privacy concerns, and the complexity and multiplicity of health care processes (e.g. diagnoses, therapy control, and risk prediction) create significant challenges for data fusion, algorithm selection and tuning.
These challenges leave a gap between the actual and the potential data usage in health care, which prevents a paradigm shift from delayed generalized medicine to predictive personalized medicine. In this work we present extensions of the OpenML platform that will be addressed in our future work in order to meet the needs of meta-learning in health care predictive analytics: privacy-preserving sharing of data, workflows and evaluations, reproducibility of the results, and rich meta-data spaces about both data and algorithms. OpenML.org [2] is a collaboration platform designed to organize datasets, machine learning workflows, models and their evaluations. Currently, OpenML is not fully distributed but can be installed on local instances which can communicate with the main OpenML database using mirroring techniques. The downside of this approach is that code (machine learning workflows), datasets, and experiments (models and evaluations) are physically kept on local instances, so users cannot communicate and share. We plan to turn OpenML into a fully distributed machine learning platform, which will be accessible from different data mining and machine learning platforms such as RapidMiner, R, WEKA, KNIME, or similar. Such a distributed platform would make it easy to share data and knowledge. Currently, regulations and privacy concerns often prevent hospitals from learning from each other's approaches (e.g. machine learning workflows), reproducing work done by others (data version control, preprocessing and statistical analysis), and building models collaboratively. On the other hand, meta-data such as the type of hospital, the percentage of readmitted patients or an indicator of emergency treatment, as well as the learned models and their evaluations, can be shared and have great potential for the development of a cutting-edge meta-learning system for ranking, selection and tuning of machine learning algorithms.
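
One simple way such shared meta-data and evaluations could feed an algorithm ranking is a nearest-neighbour lookup over meta-features. The sketch below is an assumption made for illustration, not the method proposed in the abstract and not an OpenML API: the function names, the meta-feature keys (`readmitted_pct`, `emergency`) and all numeric values are invented toy data.

```python
import math


def distance(a, b):
    """Euclidean distance between two meta-feature dictionaries with equal keys."""
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in a))


def rank_algorithms(new_meta, history, k=2):
    """Rank algorithms by their mean score on the k most similar past data sets."""
    neighbours = sorted(history, key=lambda rec: distance(new_meta, rec["meta"]))[:k]
    collected = {}
    for record in neighbours:
        for algorithm, score in record["scores"].items():
            collected.setdefault(algorithm, []).append(score)
    means = {name: sum(vals) / len(vals) for name, vals in collected.items()}
    # Best mean score first.
    return sorted(means, key=means.get, reverse=True)


# Toy evaluation history shared by hypothetical hospitals (made-up values).
history = [
    {"meta": {"readmitted_pct": 0.10, "emergency": 1.0},
     "scores": {"random_forest": 0.82, "logistic_regression": 0.78}},
    {"meta": {"readmitted_pct": 0.12, "emergency": 1.0},
     "scores": {"random_forest": 0.80, "logistic_regression": 0.81}},
    {"meta": {"readmitted_pct": 0.40, "emergency": 0.0},
     "scores": {"random_forest": 0.60, "logistic_regression": 0.90}},
]
new_meta = {"readmitted_pct": 0.11, "emergency": 1.0}

if __name__ == "__main__":
    print(rank_algorithms(new_meta, history, k=2))
```

Only meta-features and evaluation scores enter this computation, which is why such a ranking could in principle be built without exchanging patient-level records; richer meta-features would only change the dictionaries, not the procedure.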
The success of meta-learning systems is greatly influenced by the size of problem (data) and algorithm spaces, but also by the quality of the data and algorithm descriptions (meta-features). Thus, we plan to employ domain knowledge provided by expert and formal sources (e.g. ontologies) in order to extend the meta-feature space for meta-learning in health care applications. For example, in meta-analyses of gene expression microarray data, the type of chip is very important in predicting algorithm performance. Further, in fused data sources it would be useful to know which type of data contributed to the performance (electronic health records, laboratory tests, data from wearables etc.). In contrast to data descriptions, algorithm descriptions are much less analyzed and applied in the meta-learning process. Recent result