98,380 research outputs found

    Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation

    Full text link
    The growing expanse of e-commerce and the widespread availability of online databases raise many fears regarding loss of privacy and many statistical challenges. Even with encryption and other nominal forms of protection for individual databases, we still need to protect against the violation of privacy through linkages across multiple databases. These issues parallel those that have arisen and received some attention in the context of homeland security. Following the events of September 11, 2001, there has been heightened attention in the United States and elsewhere to the use of multiple government and private databases for the identification of possible perpetrators of future attacks, as well as an unprecedented expansion of federal government data mining activities, many involving databases containing personal information. We present an overview of some proposals that have surfaced for the search of multiple databases which supposedly do not compromise possible pledges of confidentiality to the individuals whose data are included. We also explore their link to the related literature on privacy-preserving data mining. In particular, we focus on the matching problem across databases and the concept of ``selective revelation'' and their confidentiality implications.Comment: Published at http://dx.doi.org/10.1214/088342306000000240 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Information architecture for a federated health record server

    Get PDF
    This paper describes the information models that have been used to implement a federated health record server and to deploy it in a live clinical setting. The authors, working at the Centre for Health Informatics and Multiprofessional Education (University College London), have built up over a decade of experience within Europe on the requirements and information models that are needed to underpin comprehensive multi-professional electronic health records. This work has involved collaboration with a wide range of health care and informatics organisations and partners in the healthcare computing industry across Europe though the EU Health Telematics projects GEHR, Synapses, EHCR-SupA, SynEx and Medicate. The resulting architecture models have fed into recent European standardisation work in this area, such as CEN TC/251 ENV 13606. UCL has implemented a federated health record server based on these models which is now running in the Department of Cardiovascular Medicine at the Whittington Hospital in North London. The information models described in this paper reflect a refinement based on this implementation experience

    ShenZhen transportation system (SZTS): a novel big data benchmark suite

    Get PDF
    Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose ShenZhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprised of real-life transportation analysis applications with real-life input data sets from Shenzhen in China. SZTS uniquely focuses on a specific and real-life application domain whereas other existing Hadoop benchmark suites, such as HiBench and CloudRank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at the microarchitecture level, the operating system (OS) level, and the job level, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior with respect to input data size, and we propose a methodology for identifying representative input data sets

    Migrants in their family contexts: Application of a methodology

    Get PDF
    The composition of immigrant families is a topic which has attracted considerable public and political attention in recent years. In the late 1980s concern was expressed over the size of some households of Pacific Island peoples in New Zealand. In the 1990s a more persistent concern has been with the incidence of what have been called ‘astronaut’ families – families where one of the partners is persistently absent overseas. The ‘astronaut’ family phenomenon has been most commonly associated with Chinese immigrants from Hong Kong and Taiwan. This report uses a novel methodology to examine the incidence of ‘astronaut’ families in New Zealand at the time of the 1991 and 1996 censuses. The methodology is described in some detail in the first section, and it is hoped that the careful attention to the procedures used to examine migrants, who have been identified in the census, in their family contexts will stimulate further research in this area. The second part of the report presents the findings of an initial exploration of 1991 and 1996 census data using the methodology outlined in the first section. It is clear from results of this inquiry that the ‘astronaut’ family phenomenon is well established amongst some components of New Zealand’s Asian community. However, it is not as widespread as media comment in 1995 and 1996, before the 1996 election, suggested. It is important to develop ways of assessing characteristics of immigrant family structures in order to counter unsubstantiated assertions which promote negative stereotypes of immigrant communities. This research, which builds on a project supported by the Marsden Fund in 1997, suggest one fruitful avenue for making more extensive use of census data on immigrants in New Zealand to provide more objective assessment of “migrants in social context”

    A method and a tool for geocoding and record linkage

    Get PDF
    For many years, researchers have presented the geocoding of postal addresses as a challenge. Several research works have been devoted to achieve the geocoding process. This paper presents theoretical and technical aspects for geolocalization, geocoding, and record linkage. It shows possibilities and limitations of existing methods and commercial software identifying areas for further research. In particular, we present a methodology and a computing tool allowing the correction and the geo-coding of mailing addresses. The paper presents two main steps of the methodology. The first preliminary step is addresses correction (addresses matching), while the second caries geocoding of identified addresses. Additionally, we present some results from the processing of real data sets. Finally, in the discussion, areas for further research are identified.addresses correction; geocodage; matching; data management; record linkage

    Big data and data repurposing – using existing data to answer new questions in vascular dementia research

    Get PDF
    Introduction: Traditional approaches to clinical research have, as yet, failed to provide effective treatments for vascular dementia (VaD). Novel approaches to collation and synthesis of data may allow for time and cost efficient hypothesis generating and testing. These approaches may have particular utility in helping us understand and treat a complex condition such as VaD. Methods: We present an overview of new uses for existing data to progress VaD research. The overview is the result of consultation with various stakeholders, focused literature review and learning from the group’s experience of successful approaches to data repurposing. In particular, we benefitted from the expert discussion and input of delegates at the 9th International Congress on Vascular Dementia (Ljubljana, 16-18th October 2015). Results: We agreed on key areas that could be of relevance to VaD research: systematic review of existing studies; individual patient level analyses of existing trials and cohorts and linking electronic health record data to other datasets. We illustrated each theme with a case-study of an existing project that has utilised this approach. Conclusions: There are many opportunities for the VaD research community to make better use of existing data. The volume of potentially available data is increasing and the opportunities for using these resources to progress the VaD research agenda are exciting. Of course, these approaches come with inherent limitations and biases, as bigger datasets are not necessarily better datasets and maintaining rigour and critical analysis will be key to optimising data use
    • 

    corecore