On The Accuracy and Completeness of The Record Matching Process
Record matching, or linking, is one of the phases of the data quality improvement process, in which records from different sources are cleansed and integrated in a centralized data store to be used for various purposes. Both earlier and recent studies in data quality and record linkage focus on various statistical models, which make strong assumptions about the probabilities of attribute errors. In this study, we evaluate different models for record linkage that are built from the data alone. We use a program that generates data with known error distributions, and we train classification models, which we use to estimate the accuracy and the completeness of the record linking process. The results indicate that automated learning techniques are adequate for this process and that both their accuracy and their completeness are comparable to those of other, mostly manual, processes.
Characterization of greater middle eastern genetic variation for enhanced disease gene discovery
The Greater Middle East (GME) has been a central hub of human migration and population admixture. The tradition of consanguinity, variably practiced in the Persian Gulf region, North Africa, and Central Asia [1-3], has resulted in an elevated burden of recessive disease [4]. Here we generated a whole-exome GME variome from 1,111 unrelated subjects. We detected substantial diversity and admixture in continental and subregional populations, corresponding to several ancient founder populations with little evidence of bottlenecks. Measured consanguinity rates were an order of magnitude above those in other sampled populations, and the GME population exhibited an increased burden of runs of homozygosity (ROHs) but showed no evidence for reduced burden of deleterious variation due to classically theorized ‘genetic purging’. Applying this database to unsolved recessive conditions in the GME population reduced the number of potential disease-causing variants by four- to sevenfold. These results show variegated genetic architecture in GME populations and support future human genetic discoveries in Mendelian and population genetics.
Online periodicity mining
This dissertation addresses the online periodicity mining problem. Periodicity mining is the process of discovering frequent periodic patterns in time series data in order to predict future behavior. The ubiquity of sensor devices that generate real-time, append-only and semi-infinite data streams has revived the need for online processing. We define periodicity mining as a two-step process: discovering potential periodicity rates (Periodicity Detection), and discovering the frequent periodic patterns of each periodicity rate (Mining Periodic Patterns). We propose new algorithms for both online periodicity detection and online mining of periodic patterns. For the latter, the proposed algorithm incrementally maintains an efficient data structure, namely the max-subpattern tree, from which the periodic patterns are discovered. For periodicity detection, we define two types of periodicity: segment periodicity and symbol periodicity. Whereas segment periodicity concerns the periodicity of the entire time series, symbol periodicity concerns the periodicities of the various symbols or values of the time series. For each periodicity type, we propose an efficient convolution-based periodicity detection algorithm. Furthermore, we propose online periodicity mining algorithms that integrate both periodicity mining steps and are thus able to discover the periodic patterns of unknown periods. All the proposed online algorithms require only one pass over the time series and no reprocessing of previously seen data. Finally, we address the inevitable presence of noise in real-world time series data. We propose a new online periodicity detection algorithm that deals efficiently with all types of noise. Based on time warping, the proposed algorithm warps (extends or shrinks) the time axis at various locations to optimally remove the noise.
Experimental studies for all the proposed algorithms are carried out using both synthetic and real-world data. Results show that the proposed algorithms outperform the existing periodicity mining algorithms in terms of time performance, the accuracy of the discovered periodicity rates and periodic patterns, and resilience to noise. Real-data experiments demonstrate the practicality of the discovered periodic patterns.
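As a rough illustration of what detecting a segment periodicity rate means, the sketch below compares a symbol series with shifted copies of itself and reports the shifts at which almost every position matches. This is not the dissertation's algorithm: it is a brute-force, offline stand-in for the quantity a convolution-based method would compute far more efficiently, and the function names, the `max_period` bound, and the `threshold` value are all hypothetical.

```python
def shift_match_scores(series, max_period):
    """For each shift p, return the fraction of positions i with series[i] == series[i + p].

    Brute-force stand-in (an assumption, not the dissertation's method) for the
    match counts that a convolution-based detector would obtain in one pass.
    """
    n = len(series)
    scores = {}
    for p in range(1, min(max_period, n - 1) + 1):
        matches = sum(1 for i in range(n - p) if series[i] == series[i + p])
        scores[p] = matches / (n - p)
    return scores


def candidate_periods(series, max_period, threshold=0.9):
    """Return the shifts whose match fraction reaches the (hypothetical) threshold."""
    scores = shift_match_scores(series, max_period)
    return [p for p, score in scores.items() if score >= threshold]


if __name__ == "__main__":
    # A series that repeats "abc": shifts 3 and 6 match at every position.
    print(candidate_periods("abcabcabcabc", max_period=6))
```

Multiples of the true period also score highly, so a practical detector would still need to pick the smallest qualifying shift or otherwise filter harmonics.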
STAGGER: Periodicity Mining of Data Streams using Expanding Sliding Windows
Sensor devices are becoming ubiquitous, especially in measurement and monitoring applications. Because of the real-time, append-only and semi-infinite nature of the generated sensor data streams, an online incremental approach is a necessity for mining stream data types. In this paper, we propose STAGGER: a one-pass, online and incremental algorithm for mining periodic patterns in data streams. STAGGER does not require that the user pre-specify the periodicity rate of the data. Instead, STAGGER discovers the potential periodicity rates. STAGGER maintains multiple expanding sliding windows staggered over the stream, where computations are shared among the multiple overlapping windows. Small-length sliding windows are imperative for early and real-time output, yet are limited to discovering short periodicity rates. As streamed data arrives continuously, the sliding windows expand in length in order to cover the whole stream. Larger-length sliding windows are able to discover longer periodicity rates. STAGGER incrementally maintains a tree-like data structure for the frequent periodic patterns of each discovered potential periodicity rate. In contrast to the Fourier/Wavelet-based approaches used for discovering periodicity rates, STAGGER not only discovers a wider, more accurate set of periodicities, but also discovers the periodic patterns themselves. In fact, experimental results with real and synthetic data sets show that STAGGER outperforms Fourier/Wavelet-based approaches by an order of magnitude in terms of the accuracy of the discovered periodicity rates. Moreover, real data experiments demonstrate the practicality of the discovered periodic patterns.
A Stream Database Server for Sensor Applications
We present a framework for stream data processing that incorporates a stream database server as a fundamental component. The server operates as the stream control interface between arrays of distributed data stream sources and end-user clients that access and analyze the streams. The underlying framework provides novel stream management and query processing mechanisms to support the online acquisition, management, storage, non-blocking query, and integration of data streams for distributed multi-sensor networks. In this paper, we define our stream model and stream representation for the stream database, and we describe the functionality and implementation of key components of the stream processing framework, including the query processing interface for source streams, the stream manager, the stream buffer manager, non-blocking query execution, and a new class of join algorithms for joining multiple data streams constrained by a sliding time window. We conduct experiments using real data streams to evaluate the performance of the new algorithms against traditional stream join algorithms. The experiments show significant performance improvements and also demonstrate the flexibility of our system in handling data streams. A multi-sensor network application for the intelligent detection of hazardous materials is presented to illustrate the capabilities of our framework.
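To make the idea of a join constrained by a sliding time window concrete, here is a minimal sketch of a symmetric hash join over two streams. It is a generic textbook-style illustration, not the join algorithms proposed in the paper; the `WindowJoin` class, its `insert` method, and the assumption that timestamps arrive in non-decreasing order are all hypothetical.

```python
from collections import defaultdict, deque


class WindowJoin:
    """Symmetric hash join of two streams on a key, limited to a time window.

    Assumption: tuples arrive with non-decreasing timestamps. Two tuples join
    when their keys are equal and their timestamps differ by at most `window`.
    """

    def __init__(self, window):
        self.window = window
        # One key-indexed buffer per stream side (0 = left, 1 = right).
        self.buffers = (defaultdict(deque), defaultdict(deque))

    def insert(self, side, timestamp, key, value):
        """Add a tuple to stream `side` and return the (left, right) pairs it produces."""
        results = []
        bucket = self.buffers[1 - side].get(key)
        if bucket is not None:
            # Lazily evict tuples on the opposite side that fell out of the window.
            while bucket and timestamp - bucket[0][0] > self.window:
                bucket.popleft()
            for _, other_value in bucket:
                pair = (value, other_value) if side == 0 else (other_value, value)
                results.append(pair)
        # Store the new tuple so later arrivals on the other side can match it.
        self.buffers[side][key].append((timestamp, value))
        return results


if __name__ == "__main__":
    join = WindowJoin(window=5)
    join.insert(0, 1, "sensor-a", "L1")
    print(join.insert(1, 3, "sensor-a", "R1"))
```

Each arriving tuple produces its matches immediately, so the operator never blocks waiting for the end of either stream. Eviction here happens only for keys that are probed, so a long-running version would also need a periodic sweep to bound memory.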
Record Linkage: A Machine Learning Approach, A Toolbox, and a Digital Government Web Service
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services.