Given a noisy dataset, locating erroneous instances and attributes, and ranking suspicious instances by their impact on system performance, is an important research problem. In this paper we propose an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address it. Given a noisy dataset D, we first train a benchmark classifier T from D. Instances that T cannot classify effectively are treated as suspicious and placed in a subset S. For each attribute A_i, we swap the roles of A_i and the class label C and train a classifier AP_i that predicts A_i. Given an instance I_k in S, we use AP_i and the benchmark classifier T to locate erroneous values among its attributes A_i. To rank the instances in S quantitatively, we define an impact measure based on the Information-gain Ratio (IR): we calculate IR_i between attribute A_i and C, and use IR_i as the impact-sensitive weight of A_i. The total impact value of I_k is the sum of the impact-sensitive weights of all its attributes located as erroneous. Experimental results demonstrate the effectiveness of our strategies.
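The ranking step can be illustrated with a minimal sketch. The abstract does not give the exact formula for IR, so the code below assumes the usual gain ratio for discrete attributes (information gain divided by the entropy of the attribute). It also takes the set of erroneous attributes per instance as given rather than deriving it from AP_i and T. The function names `entropy`, `gain_ratio`, `impact_value`, and `rank_by_impact` are hypothetical and not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(attribute, labels):
    """Assumed form of IR_i: information gain of A_i about C, divided by H(A_i)."""
    n = len(labels)
    # Group the class labels by attribute value to compute H(C | A_i).
    groups = {}
    for a, c in zip(attribute, labels):
        groups.setdefault(a, []).append(c)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(attribute)
    # A constant attribute carries no information; give it weight 0.
    return info_gain / split_info if split_info > 0 else 0.0


def impact_value(erroneous_attributes, weights):
    """Total impact of one instance: sum of the weights of its flagged attributes."""
    return sum(weights[i] for i in erroneous_attributes)


def rank_by_impact(suspicious, weights):
    """Sort instance ids by decreasing impact value.

    `suspicious` maps an instance id to the indices of the attributes
    that were located as erroneous for that instance.
    """
    return sorted(
        suspicious,
        key=lambda k: impact_value(suspicious[k], weights),
        reverse=True,
    )
```

With weights computed once per attribute, for example `weights = [gain_ratio(col, labels) for col in columns]`, an instance whose erroneous attributes are strongly tied to the class label ranks ahead of one whose errors fall on weakly informative attributes.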