An Unsupervised Cluster: Learning Water Customer Behavior Using Variation of Information on a Reconstructed Phase Space
The unsupervised clustering algorithm described in this dissertation addresses the need to divide a population of water utility customers into groups based on their similarities and differences, using only the measured flow data collected by water meters. After clustering, the groups represent customers with similar consumption behavior patterns and provide insight into ‘normal’ and ‘unusual’ customer behavior patterns. This research focuses on individually metered water utility customers and includes both residential and commercial customer accounts serviced by utilities within North America. The contributions of this dissertation not only represent a novel academic work, but also solve a practical problem for the utility industry. This dissertation introduces a method of agglomerative clustering using information-theoretic distance measures on Gaussian mixture models within a reconstructed phase space. The clustering method accommodates a utility’s limited human, financial, computational, and environmental resources. The proposed weighted variation of information distance measure for comparing Gaussian mixture models emphasizes behaviors whose statistical distributions are compact over behaviors with large variation, and it contributes a novel addition to existing comparison options.
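As a rough illustration of the "reconstructed phase space" mentioned in the abstract above, the sketch below uses a time-delay embedding, which is one common way to build such a space from a scalar series. The abstract does not say which construction the dissertation uses, so the function name `delay_embed`, the parameters `dim` and `lag`, and the sample readings are all assumptions made for illustration only.

```python
def delay_embed(series, dim, lag):
    """Map a scalar series to points (x[t], x[t+lag], ..., x[t+(dim-1)*lag]).

    `dim` (embedding dimension) and `lag` (delay in samples) are
    hypothetical parameters chosen for this sketch.
    """
    n_points = len(series) - (dim - 1) * lag
    if n_points <= 0:
        raise ValueError("series is too short for this dim and lag")
    return [
        tuple(series[t + i * lag] for i in range(dim))
        for t in range(n_points)
    ]


# Made-up meter readings, used only to show the shape of the output.
flow = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
points = delay_embed(flow, dim=3, lag=2)
```

Each resulting point could then be fed to a density model such as a Gaussian mixture, which is the kind of object the abstract says is compared across customers.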
Privacy in trajectory micro-data publishing : a survey
We survey the literature on the privacy of trajectory micro-data, i.e., spatiotemporal information about the mobility of individuals, whose collection is becoming increasingly simple and frequent thanks to emerging information and communication technologies. The focus of our review is on privacy-preserving data publishing (PPDP), i.e., the publication of databases of trajectory micro-data that preserve the privacy of the monitored individuals. We classify and present the literature of attacks against trajectory micro-data, as well as solutions proposed to date for protecting databases from such attacks. This paper serves as an introductory reading on a critical subject in an era of growing awareness about privacy risks connected to digital services, and provides insights into open problems and future directions for research.
Comment: Accepted for publication at Transactions for Data Privac
Evaluation measure for group-based record linkage
Traditionally, record linkage is concerned with linking pairs of records across data sets and the classification of such pairs into matches (assumed to refer to the same individual) and non-matches (assumed to refer to different individuals). Increasingly, however, more complex data sets are being linked where often the aim is to identify groups, or clusters, of records that refer to the same individual or to a group of related individuals. Examples include finding the records of all births to the same parents or all medical records generated by members of the same family. When ground truth data in the form of known true matches and non-matches are available, linkage quality is traditionally evaluated based on the classified versus the true matches (links) using measures such as precision (also known as the positive predictive value) and recall (also known as sensitivity or the true positive rate). The quality of clusters generated in record linkage is of high importance, since the comparison of different linkage methods is largely based on the values obtained by such evaluation measures. However, minimal research has been conducted thus far to evaluate the suitability of existing evaluation measures in the context of linking groups of records. As we show, evaluation measures such as precision and recall are not suitable for evaluating groups of linked records because they evaluate the quality of individually linked record pairs rather than the quality of records grouped into clusters. We highlight the shortcomings of traditional evaluation measures and then propose a novel approach to evaluate cluster quality in the context of group-based record linkage. We empirically evaluate our proposed approach using real-world data and show that it better reflects the quality of clusters generated by a group-based record linkage technique.
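To make the pair-based view of precision and recall concrete, here is a minimal sketch that derives record pairs from clusters and scores them. It shows only the traditional pairwise measures the abstract criticizes, not the authors' proposed approach, which the abstract does not describe. The function names and the toy record identifiers are hypothetical.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered record pairs implied by a clustering."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Score a predicted clustering against ground truth, pair by pair."""
    pred_pairs = cluster_pairs(predicted)
    true_pairs = cluster_pairs(truth)
    true_positives = len(pred_pairs & true_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Toy data: one true group of four records is split in two by the linkage.
truth = [{"a1", "a2", "a3", "a4"}, {"b1", "b2"}]
predicted = [{"a1", "a2"}, {"a3", "a4"}, {"b1", "b2"}]
precision, recall = pairwise_precision_recall(predicted, truth)
```

In this toy case every predicted pair is a true match, so pairwise precision is perfect even though the four-record group was never recovered as a single cluster. This is the kind of mismatch between pair quality and cluster quality that the abstract points to.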