Automatic transformation of raw clinical data into clean data using decision tree learning combining with string similarity algorithm

Zhang, Jian

Authors: Jian Zhang
Publication date: 1 January 2015
Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH

Abstract

Statistical analysis of complex scientific datasets is challenging. Finding the relationships within data is time-consuming for scientists and statisticians alike. The process involves preprocessing the raw data, selecting appropriate statistics, performing the analysis, and providing correct interpretations; of these steps, data preprocessing is tedious and a particular drain on time. In the large volumes of data provided for analysis, there is no standard for recording information, and the data contain spelling, typing, and transmission errors. As a result, the same meaning is expressed in many different ways, and an analysis system cannot deal with these inconsistencies automatically. What is needed is an automatic method for transforming raw clinical data into data that can be processed automatically. In this paper we propose a method that combines decision tree learning with a string similarity algorithm and is fast and accurate for clinical data cleaning. Experimental results show that it outperforms individual string similarity algorithms and the traditional data cleaning process.
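
The abstract does not say how the decision tree and the string similarity algorithm are combined, so the following is only a minimal sketch of one plausible reading, not the paper's implementation. It assumes that several similarity scores between a raw value and a candidate canonical term serve as features, and that a small decision tree learns from labeled pairs whether the two refer to the same thing. The vocabulary, the training pairs, the choice of similarity measures, and all function names (`features`, `build_tree`, `predict`, `clean`) are hypothetical.

```python
from difflib import SequenceMatcher

def levenshtein_sim(a, b):
    """Edit-distance similarity, normalized to [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def bigram_jaccard(a, b):
    """Jaccard overlap of character bigrams."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return 1.0 if not A and not B else len(A & B) / len(A | B)

def features(raw, canon):
    """Similarity scores used as decision-tree features (assumed feature set)."""
    a, b = raw.strip().lower(), canon.strip().lower()
    return [levenshtein_sim(a, b), bigram_jaccard(a, b),
            SequenceMatcher(None, a, b).ratio()]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(X, y, depth=0, max_depth=3):
    """Tiny CART-style tree: a leaf is 0 or 1, a node is (feature, threshold, left, right)."""
    majority = int(2 * sum(y) >= len(y))
    if depth == max_depth or len(set(y)) == 1:
        return majority
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for t in values[:-1]:  # split as "<= t" versus "> t"
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # no feature can separate the rows
        return majority
    _, f, t = best
    lo = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
    hi = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [lab for _, lab in lo], depth + 1, max_depth),
            build_tree([r for r, _ in hi], [lab for _, lab in hi], depth + 1, max_depth))

def predict(tree, x):
    """Walk the tree until a leaf label (0 or 1) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def clean(raw, vocabulary, tree):
    """Return the best canonical term the tree accepts for `raw`, or None."""
    accepted = [(sum(features(raw, c)), c) for c in vocabulary
                if predict(tree, features(raw, c)) == 1]
    return max(accepted)[1] if accepted else None

# Hypothetical vocabulary and labeled pairs (1 = same term, 0 = different term).
VOCAB = ["diabetes", "hypertension", "asthma"]
PAIRS = [("diabetis", "diabetes", 1), ("Diabetes ", "diabetes", 1),
         ("hypertention", "hypertension", 1), ("asthma", "asthma", 1),
         ("astma", "asthma", 1), ("diabetis", "asthma", 0),
         ("hypertention", "diabetes", 0), ("astma", "hypertension", 0),
         ("none", "asthma", 0), ("n/a", "diabetes", 0)]
tree = build_tree([features(r, c) for r, c, _ in PAIRS], [lab for _, _, lab in PAIRS])
```

With this toy training set, `clean("diabtes", VOCAB, tree)` maps the misspelling to `"diabetes"`, while `clean("xyz", VOCAB, tree)` returns `None` because no candidate is accepted. Letting the tree learn the acceptance thresholds, instead of fixing a cutoff for a single similarity measure by hand, is the design choice this sketch is meant to illustrate.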