
    LinkWise: A Modern Record Linkage Software Application

    Introduction Because of a lack of unique identifiers among datasets and differing data collection standards, record linkage is challenging. Thus, despite the importance of record linkage in unleashing the power of data, few software applications have been built for this purpose, and each has unique strengths and weaknesses.
    Objectives and Approach Data linkage comprises various steps: selecting linkage identifiers, data cleaning, data pre-processing, calculating linkage weights for identifiers, and estimating similarity thresholds to decide whether two records are a true match. These steps require expertise and are costly for organizations interested in data sharing. Although data linkage software applications have been developed, they have drawbacks: they are costly, difficult to use, unable to preserve the privacy of individuals, unable to handle big datasets, or perform poorly in terms of specificity and sensitivity. LinkWise is a software application developed to resolve these issues.
    Results LinkWise is a modern probabilistic linkage software application implemented in Microsoft C#/.NET. It offers the following features: automation of all data linkage steps, a simple and user-friendly interface, the ability to link both unencrypted and encrypted data (privacy-preserving record linkage), a transparent linkage algorithm (not a black box), incremental linkage (linking new data to previously linked data), the capacity to handle millions of records, multi-processor execution to reduce run time, and high specificity and sensitivity. The software was tested on many datasets with varied characteristics (e.g., different data fields, data formats, numbers of records, and amounts of noise). Results show that it links data with high specificity and sensitivity in a reasonable time.
    Conclusion/Implications LinkWise addresses many issues arising in the process of data linkage. It automates all steps of data linkage and preserves the privacy of individuals. It is easy to use, and no technical background is required to work with it.
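    The abstract does not spell out LinkWise's weighting scheme, but probabilistic linkage is conventionally formulated in Fellegi-Sunter terms: each identifier contributes an agreement or disagreement weight derived from m- and u-probabilities, and the summed score is compared against an estimated threshold. A minimal sketch, with hypothetical field names and illustrative m/u values:

```python
import math

def field_weight(agree: bool, m: float, u: float) -> float:
    """Log-likelihood weight for one identifier (Fellegi-Sunter style).

    m: P(fields agree | records are a true match)
    u: P(fields agree | records are not a match)
    """
    return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

# Hypothetical m/u probabilities per identifier -- illustrative values only.
params = {"first_name": (0.95, 0.02), "last_name": (0.97, 0.01), "dob": (0.99, 0.003)}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum of per-field weights; classify as a match if above the threshold."""
    return sum(field_weight(rec_a[f] == rec_b[f], m, u) for f, (m, u) in params.items())

a = {"first_name": "ANNA", "last_name": "SMITH", "dob": "1990-04-12"}
b = {"first_name": "ANNA", "last_name": "SMYTH", "dob": "1990-04-12"}
print(match_score(a, b))  # compare against a similarity threshold to decide
```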

    Privacy preserving record linkage meets record linkage using unencrypted data

    Introduction Privacy-preserving record linkage (PPRL) resolves privacy concerns through its ability to link encrypted identifiers. It encrypts identifiers using Bloom filters and matches records on the encrypted data using Dice coefficient similarity. Matching on hashed identifiers degrades linkage performance because of the loss of information.
    Objectives and Approach We propose a technique to optimize the Bloom filter parameters and examine whether the optimal parameters improve linkage performance in terms of precision, recall, and F-measure. Consider a set of string values and calculate the similarity between any two of them using the Jaro-Winkler method; then encrypt the string values using Bloom filters and calculate the similarity between any two of them using the Dice coefficient. Optimal Bloom filter parameters are those that minimize the difference between the similarities calculated with Jaro-Winkler and the similarities calculated with the Dice coefficient.
    Results Using publicly available data, several first-name and last-name datasets, each comprising 1000 unique values, were generated. The following Bloom filter parameter values were considered: q in q-grams (q=1,2,3), bit array length (l=50,100,200,500,1000), and number of hash functions (k=5,10,20,50). The following setups minimized the difference between the similarities calculated on encrypted data using the Dice coefficient and those calculated on unencrypted data using the Jaro-Winkler method: q=1, l=1000, k=50; q=1, l=500, k=20; q=2, l=1000, k=50; q=3, l=500, k=50. These setups were used to perform data linkage over 10 synthetically generated datasets. Results show that PPRL achieved performance similar to data linkage over unencrypted data.
    Conclusion/Implications This study showed that optimal Bloom filter parameters minimized the loss of information resulting from data encryption. Experimental findings indicated that PPRL using optimal Bloom filter parameters achieves almost the same performance as data linkage on unencrypted data in terms of precision, recall, and F-measure.
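    A minimal sketch of the encryption and comparison steps described above: a string's q-grams are hashed into a length-l bit array with k hash functions, and two encodings are compared with the Dice coefficient. The keyed SHA-1 hashing scheme here is an illustrative choice, not necessarily the one used in the study:

```python
import hashlib

def qgrams(s: str, q: int) -> set:
    """Set of padded q-grams of s (padding so word edges form q-grams)."""
    s = f"{'_' * (q - 1)}{s.lower()}{'_' * (q - 1)}"
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom_encode(s: str, q: int = 2, l: int = 500, k: int = 20) -> set:
    """Encode a string's q-grams into a length-l Bloom filter using k hash
    functions; returns the set of set bit positions (a sparse 0/1 array)."""
    bits = set()
    for g in qgrams(s, q):
        for i in range(k):
            h = hashlib.sha1(f"{i}:{g}".encode()).hexdigest()
            bits.add(int(h, 16) % l)
    return bits

def dice(a: set, b: set) -> float:
    """Dice coefficient on set bits: 2|A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

print(dice(bloom_encode("smith"), bloom_encode("smyth")))
```

    Tuning (q, l, k) so that Dice scores on the encodings track Jaro-Winkler scores on the cleartext is exactly the optimization the abstract describes.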

    Fuzzy clustering of time series data: A particle swarm optimization approach

    With the rapid development of information-gathering technologies and access to large amounts of data, methods are needed for analyzing data and extracting useful information from large raw datasets; data mining is an important approach to this problem. Cluster analysis, the most commonly used data mining task, has attracted many researchers in computer science. Driven by diverse applications, the problem of clustering time series data has become highly popular, and many algorithms have been proposed in this field. Recently, Swarm Intelligence (SI), a family of nature-inspired algorithms, has gained great popularity in pattern recognition and clustering. In this paper, a technique for clustering time series data using a particle swarm optimization (PSO) approach is proposed, with the Pearson correlation coefficient, one of the most commonly used distance measures for time series, as the dissimilarity measure. The proposed technique finds (near-)optimal cluster centers during the clustering process. To reduce the dimensionality of the search space and improve performance, a singular value decomposition (SVD) representation of the cluster centers is used. Experimental results on three popular datasets indicate the superiority of the proposed technique compared with the fuzzy C-means and fuzzy K-medoids clustering techniques.
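    To make the distance and membership machinery concrete, here is a small sketch of a Pearson-correlation dissimilarity and the standard fuzzy C-means membership update that a PSO wrapper would score for each particle's candidate centers; the particle encoding and the SVD compression of centers are omitted:

```python
import numpy as np

def pearson_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Dissimilarity 1 - r, where r is the Pearson correlation coefficient."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

def fuzzy_memberships(series: np.ndarray, centers: np.ndarray, m: float = 2.0):
    """Standard fuzzy C-means membership update for fixed candidate centers.
    In the PSO scheme, each particle encodes one such set of centers and is
    scored by the resulting fuzzy clustering objective."""
    d = np.array([[pearson_distance(s, c) + 1e-12 for c in centers] for s in series])
    u = np.zeros_like(d)
    for i in range(d.shape[0]):
        for k in range(d.shape[1]):
            u[i, k] = 1.0 / sum((d[i, k] / d[i, j]) ** (2 / (m - 1))
                                for j in range(d.shape[1]))
    return u

rng = np.random.default_rng(0)
series = rng.standard_normal((10, 50))   # 10 time series of length 50
centers = rng.standard_normal((3, 50))   # 3 candidate cluster centers
print(fuzzy_memberships(series, centers).round(2))
```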

    Trajectory of service use among young Albertans with complex needs

    Introduction Youth with complex needs are vulnerable as a consequence of exposure to social adversity and/or chronic health conditions, and are at high risk of school failure and justice involvement. Little is known about their patterns of service use across the government sectors that influence their life outcomes.
    Objectives and Approach Youth with complex needs often engage with multiple services across multiple government sectors for extended periods of time. Understanding the patterns and trajectories of their service use may inform programs, decision makers, and government in allocating resources optimally to improve life outcomes, and may reveal where and when interventions would be most effective for vulnerable youth. In this study, through a unique approach linking over 20 longitudinal administrative datasets and a novel trajectory clustering technique, the patterns of service use among young Albertans with complex needs are revealed and visualized.
    Results A trajectory clustering technique was applied to reveal patterns of service use among individuals with complex needs. Compared with the general population, higher proportions of youth with complex needs lived in low socio-economic neighborhoods, suffered from mental health issues, were high-cost health service users, and had lower rates of high school completion. Furthermore, youth who had complex needs for longer periods and who required multiple complex services in a given year had the poorest outcomes in terms of high school completion, mental health, and other health problems. The majority of complex-needs youth first came into contact with services via the education system, followed by child services/welfare.
    Conclusion/Implications The trajectories of service use among complex-needs youth reveal that these individuals are primarily identified through education. Consequently, educational supports would be the best avenue for developing effective programs, including mental health supports and services for other needs.
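    The abstract does not describe the trajectory clustering technique itself. As a purely illustrative stand-in, one could encode each youth's service history as a year-by-sector indicator vector and cluster the vectors with k-medoids under Hamming distance; all names and data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
n_youth, n_years, n_sectors = 100, 6, 5
# Hypothetical trajectories: 0/1 flags for "used a service in sector s in year y".
traj = rng.integers(0, 2, size=(n_youth, n_years * n_sectors))

def hamming(a: np.ndarray, b: np.ndarray) -> float:
    """Share of positions where two trajectories disagree."""
    return float(np.mean(a != b))

def k_medoids(X: np.ndarray, k: int = 3, iters: int = 20) -> np.ndarray:
    """Plain k-medoids: assign to nearest medoid, then re-pick each cluster's
    medoid as its minimum-total-distance member."""
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        d = np.array([[hamming(x, X[m]) for m in medoids] for x in X])
        labels = d.argmin(axis=1)
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                costs = [sum(hamming(X[i], X[j]) for j in members) for i in members]
                medoids[c] = members[int(np.argmin(costs))]
    return labels

print(np.bincount(k_medoids(traj)))  # cluster sizes
```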

    Metaheuristic Based Scheduling Meta-Tasks in Distributed Heterogeneous Computing Systems

    Scheduling is a key problem in distributed heterogeneous computing systems, necessary to exploit the large computing capacity of such systems, and it is NP-complete. In this paper, we present a metaheuristic technique, the Particle Swarm Optimization (PSO) algorithm, for this problem. PSO is a population-based search algorithm inspired by the social behavior of bird flocking and fish schooling; particles fly through the problem search space to find optimal or near-optimal solutions. The scheduler aims to minimize the makespan, i.e., the completion time of the last task to finish. Experimental studies show that the proposed method is more efficient than, and surpasses, previously reported PSO and GA approaches for this problem.
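    A small sketch of the two ingredients involved: an expected-time-to-compute (ETC) matrix and the makespan objective, plus one common way of decoding a continuous PSO particle into a discrete task-to-machine assignment. The ETC values and the decoding rule are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tasks, n_machines = 8, 3
# Hypothetical ETC matrix: etc[t, m] = expected run time of task t on machine m.
etc = rng.uniform(5, 50, size=(n_tasks, n_machines))

def makespan(assignment: np.ndarray, etc: np.ndarray) -> float:
    """Makespan = completion time of the most-loaded machine."""
    loads = np.zeros(etc.shape[1])
    for task, machine in enumerate(assignment):
        loads[machine] += etc[task, machine]
    return loads.max()

def decode(position: np.ndarray) -> np.ndarray:
    """Map a particle's continuous position to a schedule: dimension j's
    value (floored, mod n_machines) picks the machine for task j."""
    return (np.floor(position) % n_machines).astype(int)

position = rng.uniform(0, n_machines, size=n_tasks)  # one PSO particle
print(makespan(decode(position), etc))  # fitness to be minimized by the swarm
```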

    Power of Linked Administrative Data

    Introduction Linking administrative data provides valuable information about individuals using government services and can be very useful to policy-makers in improving and developing services and policies. The Child and Youth Data Laboratory (CYDL) links and analyses administrative data from Alberta Government ministries to provide evidence for policy and program development.
    Objectives and Approach Data from 20 programs of six Government of Alberta ministries (Advanced Education, Education, Health, Children's Services, Community and Social Services, and Justice and Solicitor General) were linked anonymously. The data span six years, from 2005/06 to 2010/11, and consist of almost 50 million records corresponding to over 2 million unique Albertans aged 0 to 25 years. A data visualization tool called the Program Overlap Matrix summarises the overlap rates among the programs. It is a matrix of squares in which each cell represents the overlap between two programs.
    Results The Program Overlap Matrix is publicly available at https://visualization.policywise.com/P2matrix/. It presents overlap rates between programs in any study year (2005/06 to 2010/11), in individual years, in the first year vs. future years, and in the last year vs. previous years. These can be used to answer many policy-related questions, such as: other service use (e.g., what other services do ESL students use?), over-represented programs (e.g., in which programs are Child Care Subsidy clients over-represented?), resilience (e.g., what proportion of Child Intervention clients attend post-secondary institutions?), transitions (e.g., what services do students with special needs receive as they transition to adulthood?), and time trends (e.g., what services did Income Support clients receive in the past?).
    Conclusion/Implications The Program Overlap Matrix is a powerful tool for discovering relationships between programs. It is a useful instrument for informing the public and policy-makers about overlap rates between government programs, and it can be used to answer a variety of policy-related questions.
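    Conceptually, each cell of such a matrix is a row-normalised overlap rate: the share of one program's clients who also appear in another program. A minimal sketch with hypothetical linked records (the real CYDL data structure is not described in the abstract):

```python
import pandas as pd

# Hypothetical linked records: one row per (person, program) contact.
records = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 3, 3, 4],
    "program":   ["ESL", "ChildCare", "ESL", "Health", "ChildCare", "Health", "ESL"],
})

def overlap_rate(df: pd.DataFrame, prog_a: str, prog_b: str) -> float:
    """Share of prog_a clients who also appear in prog_b (row-normalised)."""
    clients_a = set(df.loc[df.program == prog_a, "person_id"])
    clients_b = set(df.loc[df.program == prog_b, "person_id"])
    return len(clients_a & clients_b) / len(clients_a)

programs = sorted(records.program.unique())
matrix = pd.DataFrame(
    [[overlap_rate(records, a, b) for b in programs] for a in programs],
    index=programs, columns=programs,
)
print(matrix.round(2))  # e.g., row "ESL" shows where ESL clients also appear
```

    Restricting `records` to particular years before computing the matrix yields the per-year, first-year-vs-future and last-year-vs-previous variants described above.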

    Meta-heuristically seeded genetic algorithm for independent job scheduling in grid computing

    Grid computing is an infrastructure that connects geographically distributed computers owned by various organizations, allowing their resources, such as computational power and storage capacity, to be shared, selected, and aggregated. The job scheduling problem is one of the most difficult tasks in grid computing systems, and solving it efficiently requires new methods. In this paper, a seeded genetic algorithm is proposed that uses a meta-heuristic algorithm to generate its initial population. To evaluate the performance of the proposed method in terms of minimizing the makespan, the Expected Time to Compute (ETC) simulation model is used to carry out a number of experiments. The results show that the proposed algorithm performs better than the other techniques selected for comparison.
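    The abstract does not name the seeding algorithm, so the sketch below uses the classic min-min list-scheduling heuristic purely as an illustrative seed: it supplies one good chromosome for the initial population, and the rest are generated randomly, which is the general idea of a seeded GA:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tasks, n_machines, pop_size = 10, 4, 20
etc = rng.uniform(1, 100, size=(n_tasks, n_machines))  # hypothetical ETC matrix

def min_min_seed(etc: np.ndarray) -> np.ndarray:
    """Min-min heuristic: repeatedly schedule the task with the smallest
    achievable completion time onto the machine achieving it."""
    ready = np.zeros(etc.shape[1])            # per-machine ready times
    schedule = np.empty(etc.shape[0], dtype=int)
    unassigned = set(range(etc.shape[0]))
    while unassigned:
        t, m = min(
            ((t, m) for t in unassigned for m in range(etc.shape[1])),
            key=lambda tm: ready[tm[1]] + etc[tm[0], tm[1]],
        )
        ready[m] += etc[t, m]
        schedule[t] = m
        unassigned.remove(t)
    return schedule

# Seeded initial population: one heuristic chromosome plus random ones.
population = [min_min_seed(etc)] + [
    rng.integers(0, n_machines, size=n_tasks) for _ in range(pop_size - 1)
]
print(population[0])  # task-to-machine assignment supplied by the seed
```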

    Fuzzy clustering with spatial-temporal information

    Clustering geographical units based on a set of quantitative features observed on several occasions over time requires dealing with the complexity of both spatial and temporal information. In particular, one should consider (1) the spatial nature of the units to be clustered, (2) the characteristics of the space of multivariate time trajectories, and (3) the uncertainty in assigning a geographical unit to a given cluster on the basis of these complex features. This paper presents a novel spatially constrained multivariate time series clustering for units characterised by different levels of spatial proximity. In particular, the Fuzzy Partitioning Around Medoids algorithm, with a Dynamic Time Warping dissimilarity measure and spatial penalization terms, is applied to classify multivariate spatial-temporal series. The clustering method is presented theoretically and discussed using both simulated and real data, highlighting its main features: in particular, its capability to embed different levels of proximity among units and its ability to handle time series of different lengths.
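    A minimal sketch of the DTW dissimilarity at the core of the method, which naturally accommodates multivariate series of different lengths; the fuzzy medoid updates and the spatial penalization terms are not shown here:

```python
import numpy as np

def dtw(x: np.ndarray, y: np.ndarray) -> float:
    """Dynamic Time Warping distance between multivariate series x (n, p)
    and y (m, p), via the classic O(n*m) dynamic-programming recursion."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # local point distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(3)
x = rng.standard_normal((20, 2))  # bivariate series of length 20
y = rng.standard_normal((25, 2))  # bivariate series of a different length
print(dtw(x, y))
```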