15 research outputs found

On The Accuracy and Completeness of The Record Matching Process

Author: Ahmed K Elmagarmid
Mohamed G Elfeky
Munir Cochinwala
Sid Dalal
Vassilios S Verykios
Publication date: 01/01/2000

Abstract. Record matching, or linking, is one phase of the data quality improvement process, in which records from different sources are cleansed and integrated in a centralized data store to be used for various purposes. Both earlier and recent studies in data quality and record linkage focus on statistical models that make strong assumptions about the probabilities of attribute errors. In this study, we evaluate different models for record linkage that are built from data only. We use a program that generates data with known error distributions, and we train classification models, which we use to estimate the accuracy and the completeness of the record linking process. The results indicate that automated learning techniques are adequate for this process and that both their accuracy and their completeness are comparable to those of other, mostly manual, processes.

CiteSeerX

Characterization of greater middle eastern genetic variation for enhanced disease gene discovery

Author: Abdel Hadi Sawsan
Abdel Salam Ekram
Abdel Salam Ghada
Abdou Mohammed
Abel Laurent
Abhytankar Avinash
Adimi Parisa
Ahmad Jamil
Akcakus Mustafa
Aksu Guside
Al Aama Jumana
Al Allawi Nasir
Al Baradie Raidah
Al Gazali Lihadh
Al Hajjar Sami
Al Hashem Amal
Al Herz Waleed
Al Jeaid Deema
Al Juamaah Suliman
Al Muhsen Saleh
Al Sannaa Nouriya
Al Tameni Salem
Al Tawari Asma
Alangari Abdullah
Alcais Alexandre
Alfawaz Tariq S.
Alsediq Najla Sameer
Alsum Zobaida
Ammar Khodja Aomar
Amouian Sepideh
Arikan Cigdem
Aryani Omid
Aslanger Ayca
Aydogmus Cigdem
Aytekin Caner
Azab Mostafa Abdellateef
Azam Matloob
Bansagi Boglarka
Barbouche Mohamed Rhida
Bastaki Laila
Belkadi Aziz
Ben Omran Tawfeg
Bindu Parayil Sankaran
Blancas Lizbeth
Boisson Dupuis Stéphanie
Boisson Bertrand
Bonnet Damien
Bousfiha Aziz
Boussafara Lobna
Boutros Jeannette
Bustamante Jacinta
Caksen Huseyin
Camcioglu Yildiz
Catherinot Emilie
Celik Fatma C.
Ciancanelli Michael
Cipe Funda E.
Clark Andrew G.
Clark Gary
Cobat Aurélie
Comu Sinan
Condie Angela
Condino Neto Antonio
Desai Mukesh
Dobyns William
Dogu Figen
Domaia Mohamed
Dorum Meltem
Egritas Odul
El Azbaoui Safa
El Baghdadi Jamila
El Harouni Ashraf
El Ruby Mona
Elfeky Reem A.
Elghazali Gehad
Faqeih Eissa
Fenerci Elif
Fieschi Claire
Funda Cipe
Gabriel Stacey B.
Gamal Iman
Gelik Umit
Genel Fetah
Gezdirici Alper
Girisha Katta M.
Goldstein Amy
Grattan Smith Padraic
Gupta Neerja
Hahn Jin
Halees Anason
Hatipoglu Nevin
He Yupeng
Hennekam Raoul
Houshmand Massoud
Ichai Philippe
Ikinciogullari Aydan
Ismail Samira
Itan Yuval
Jalas Chaim
Jouanguy Emmanuelle
Kabra Madhulika
Kalkan Göknur
Kara Majdi
Karaca Neslihan
Karaer Kadri
Kariminejad Ariana
Kayserili Hulya
Keser Emiroglu Melike
Kilic Sara S.
Kissani Najib
Koc Zeynep Peker
Kokron Cristina
Koul Roshan
Kutukculer Necil
Lanternier Fanny
Mahdaviani Alireza
Mahlaoui Nizar
Mansour Lobna
Mansouri Davood
Margari Lucia
Marzouki Naima
Masri Amira
Megahed Amina
Megahed Hisham
Mekki Najla
Mesdaghi Mehrnaz
Mikati Mohd
Mojahedi Faezeh
Mulley John
Nampoothiri Sheela
Navarrete Carmen
Omar Tarek
Oraby Azza
Pandaluz Ayse
Parvaneh Nima
Patiroglu Turkan
Pellier Isabelle
Picard Capucine
Puel Anne
Raas Rothschild Annick
Rahim Sohair Abdel
Rajab Anna
Raoult Didier
Reisli Ismail
Rezaei Nima
Sabri Ayoub
Sahin Yasin
Saleem Laila
Salem Fadia
Sanal Ozden
Sanger Terry
Scott Eric M.
Shakankiry Hanan
Shang Lei
Shehata Nabil
Shembesh Nuri
Shkalim Vared
Softah Ameen
Sogaty Sameera
Soliman Neveen
Sonmez Aunaci Fatma
Spencer Emily G.
Stambouli Omar Boudghene
Sztriha Laszlo
Taibi Berrah Lynda
Temtamy Samia
Tonekaboni Hasan
Trauner Doris
Tuysuz Beyhan
Valente Enza Maria
Varan Ali
Vogt Guillaume
Walsh Christopher
Woods Geoffrey
Yesil Gozde
Yildiran Alisan
Yildiz Basak
Yuksel Adnan
Zaki Maha
Zhang Shen Ying
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

The Greater Middle East (GME) has been a central hub of human migration and population admixture. The tradition of consanguinity, variably practiced in the Persian Gulf region, North Africa, and Central Asia [1-3], has resulted in an elevated burden of recessive disease [4]. Here we generated a whole-exome GME variome from 1,111 unrelated subjects. We detected substantial diversity and admixture in continental and subregional populations, corresponding to several ancient founder populations with little evidence of bottlenecks. Measured consanguinity rates were an order of magnitude above those in other sampled populations, and the GME population exhibited an increased burden of runs of homozygosity (ROHs) but showed no evidence for reduced burden of deleterious variation due to classically theorized ‘genetic purging’. Applying this database to unsolved recessive conditions in the GME population reduced the number of potential disease-causing variants by four- to sevenfold. These results show variegated genetic architecture in GME populations and support future human genetic discoveries in Mendelian and population genetics.

Archivio istituzionale della ricerca - Università di Bari

Online periodicity mining

Author: Elfeky Mohamed G
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2005

This dissertation addresses the online periodicity mining problem. Periodicity mining is the process of discovering frequent periodic patterns in order to predict future behavior in time series data. The ubiquity of sensor devices that generate real-time, append-only and semi-infinite data streams has revived the need for online processing. We define periodicity mining as a two-step process: discovering potential periodicity rates (Periodicity Detection), and discovering the frequent periodic patterns of each periodicity rate (Mining Periodic Patterns). We propose new algorithms for both online periodicity detection and online mining of periodic patterns. For the latter, the proposed algorithm incrementally maintains an efficient data structure, namely the max-subpattern tree, from which the periodic patterns are discovered. For periodicity detection, we define two types of periodicity: segment periodicity and symbol periodicity. Whereas segment periodicity concerns the periodicity of the entire time series, symbol periodicity concerns the periodicities of the various symbols or values of the time series. For each periodicity type, we propose an efficient convolution-based periodicity detection algorithm. Furthermore, we propose online periodicity mining algorithms that integrate both periodicity mining steps, and thus are able to discover the periodic patterns of unknown periods. All the proposed online algorithms require only one pass over the time series and no reprocessing of previously seen data. Finally, we address the inevitable presence of noise in real-world time series data. We propose a new online periodicity detection algorithm that deals efficiently with all types of noise. Based on time warping, the proposed algorithm warps (extends or shrinks) the time axis at various locations to optimally remove the noise.
Experimental studies for all the proposed algorithms are carried out using both synthetic and real-world data. Results show that the proposed algorithms outperform the existing periodicity mining algorithms in terms of time performance, the accuracy of the discovered periodicity rates and periodic patterns, and resilience to noise. Real-data experiments demonstrate the practicality of the discovered periodic patterns.
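
As a rough illustration of what "discovering potential periodicity rates" means, the sketch below scores each candidate period by how often a symbol reappears exactly that many positions later. It is not the dissertation's convolution-based algorithm: it is a naive, offline comparison loop, and the function names `period_confidence` and `candidate_periods` as well as the `threshold` parameter are assumptions made for this example.

```python
def period_confidence(series, period):
    """Fraction of positions i where series[i] == series[i + period].

    Hypothetical helper: a naive score, not the convolution-based method
    described in the abstract.
    """
    pairs = len(series) - period
    if period <= 0 or pairs <= 0:
        raise ValueError("period must be between 1 and len(series) - 1")
    matches = sum(1 for i in range(pairs) if series[i] == series[i + period])
    return matches / pairs


def candidate_periods(series, max_period, threshold=0.9):
    """Return every period up to max_period whose score reaches threshold."""
    return [
        p
        for p in range(1, max_period + 1)
        if p < len(series) and period_confidence(series, p) >= threshold
    ]


if __name__ == "__main__":
    clean = "abcabcabcabc"
    noisy = "abcabcabxabc"  # one symbol replaced to simulate noise
    print(candidate_periods(clean, 6))
    print(period_confidence(noisy, 3))
```

A single corrupted symbol lowers the score of the true period, which is why a threshold below 1.0 is used; the noise-resilient, time-warping approach mentioned in the abstract addresses this problem far more carefully than a fixed threshold can.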

Purdue E-Pubs

STAGGER: Periodicity Mining of Data Streams using Expanding Sliding Windows

Author: Aref Walid G.
Elfeky Mohamed G.
Elmagarmid Ahmed K.
Publication venue: 'Purdue University (bepress)'
Publication date: 01/04/2005

Sensor devices are becoming ubiquitous, especially in measurement and monitoring applications. Because of the real-time, append-only and semi-infinite nature of the generated sensor data streams, an online incremental approach is a necessity for mining stream data types. In this paper, we propose STAGGER: a one-pass, online and incremental algorithm for mining periodic patterns in data streams. STAGGER does not require that the user pre-specify the periodicity rate of the data. Instead, STAGGER discovers the potential periodicity rates. STAGGER maintains multiple expanding sliding windows staggered over the stream, where computations are shared among the multiple overlapping windows. Small-length sliding windows are imperative for early and real-time output, yet are limited to discovering short periodicity rates. As streamed data arrives continuously, the sliding windows expand in length in order to cover the whole stream. Larger-length sliding windows are able to discover longer periodicity rates. STAGGER incrementally maintains a tree-like data structure for the frequent periodic patterns of each discovered potential periodicity rate. In contrast to the Fourier/Wavelet-based approaches used for discovering periodicity rates, STAGGER not only discovers a wider, more accurate set of periodicities, but also discovers the periodic patterns themselves. In fact, experimental results with real and synthetic data sets show that STAGGER outperforms Fourier/Wavelet-based approaches by an order of magnitude in terms of the accuracy of the discovered periodicity rates. Moreover, real data experiments demonstrate the practicality of the discovered periodic patterns.

Purdue E-Pubs

Periodicity Detection in Time Series Databases

Author: Aref Walid G.
Atallah Mikhail J.
Elfeky Mohamed G.
Publication venue: 'Purdue University (bepress)'
Publication date: 01/12/2002

Purdue E-Pubs

STAGGER: Periodicity Mining of Data Streams Using Expanding Sliding Windows

Author: Aref Walid G.
Elfeky Mohamed
Elmagarmid Ahmed
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/12/2006

Purdue E-Pubs

A Stream Database Server for Sensor Applications

Author: Aref Walid G.
Catlin Ann C.
Elfeky Mohamed G.
Elmagarmid Ahmed K.
Hammad Moustafa A.
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2002

We present a framework for stream data processing that incorporates a stream database server as a fundamental component. The server operates as the stream control interface between arrays of distributed data stream sources and end-user clients that access and analyze the streams. The underlying framework provides novel stream management and query processing mechanisms to support the online acquisition, management, storage, non-blocking query, and integration of data streams for distributed multi-sensor networks. In this paper, we define our stream model and stream representation for the stream database, and we describe the functionality and implementation of key components of the stream processing framework, including the query processing interface for source streams, the stream manager, the stream buffer manager, non-blocking query execution, and a new class of join algorithms for joining multiple data streams constrained by a sliding time window. We conduct experiments using real data streams to evaluate the performance of the new algorithms against traditional stream join algorithms. The experiments show significant performance improvements and also demonstrate the flexibility of our system in handling data streams. A multi-sensor network application for the intelligent detection of hazardous materials is presented to illustrate the capabilities of our framework.
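
To make the idea of a join "constrained by a sliding time window" concrete, here is a minimal sketch of a two-stream, non-blocking window join. It is a generic textbook-style symmetric join, not the new class of join algorithms the paper proposes; the function name `window_join`, the `(timestamp, stream_id, key)` event layout, and the assumption that events arrive in non-decreasing timestamp order are all choices made for this example.

```python
from collections import deque


def window_join(events, window):
    """Join two interleaved streams on equal keys within a time window.

    events: iterable of (timestamp, stream_id, key) tuples, with stream_id
            0 or 1, assumed to arrive in non-decreasing timestamp order.
    Yields (key, ts_stream0, ts_stream1) for each pair of matching keys
    whose timestamps differ by at most `window`.
    """
    buffers = (deque(), deque())  # one buffer of (timestamp, key) per stream
    for ts, sid, key in events:
        # Expire tuples that have slid out of the window in both buffers.
        for buf in buffers:
            while buf and buf[0][0] < ts - window:
                buf.popleft()
        # Probe the other stream's buffer and emit results immediately,
        # so output is produced without waiting for either stream to end.
        for other_ts, other_key in buffers[1 - sid]:
            if other_key == key:
                yield (key, ts, other_ts) if sid == 0 else (key, other_ts, ts)
        buffers[sid].append((ts, key))


if __name__ == "__main__":
    stream = [(1, 0, "a"), (2, 1, "a"), (3, 1, "b"), (10, 0, "b"), (11, 1, "b")]
    for match in window_join(stream, window=5):
        print(match)
```

The linear probe of the opposite buffer keeps the sketch short; a per-key index would be the usual next step when the window holds many tuples.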

CiteSeerX

Purdue E-Pubs

Record Linkage: A Machine Learning Approach, A Toolbox, and a Digital Government Web Service

Author: Elfeky Mohamed G.
Elmagarmid Ahmed K.
Ghanem Thanaa M.
Huwait Ahmed R.
Verykios Vassilios S.
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2003

Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services.

CiteSeerX

Purdue E-Pubs