
    Do unbalanced data have a negative effect on LDA?

    For two-class discrimination, Xie and Qiu [The effect of imbalanced data sets on LDA: a theoretical and empirical analysis, Pattern Recognition 40 (2) (2007) 557–562] claimed that, when the covariance matrices of the two classes are unequal, a (class-)unbalanced data set has a negative effect on the performance of linear discriminant analysis (LDA). By re-balancing 10 real-world data sets, Xie and Qiu provided empirical evidence for this claim, using AUC (Area Under the receiver operating characteristic Curve) as the performance metric. We suggest that such a claim is vague, if not misleading; that no solid theoretical analysis is presented by Xie and Qiu; and that AUC can lead to a conclusion about the discrimination performance of LDA on unbalanced data sets quite different from that reached with the misclassification error rate (ER). Our empirical and simulation studies suggest that, for LDA, the increase in the median AUC (and thus the improvement in the performance of LDA) from re-balancing is relatively small, whereas the increase in the median ER (and thus the decline in the performance of LDA) from re-balancing is relatively large. Therefore, our study finds no reliable empirical evidence to support the claim that a (class-)unbalanced data set has a negative effect on the performance of LDA. In addition, re-balancing affects the performance of LDA for data sets with either equal or unequal covariance matrices, indicating that unequal covariance matrices are not a key reason for the difference in performance between the original and re-balanced data sets.
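    A minimal sketch of the kind of comparison the abstract describes (not the authors' code): fit LDA on a synthetic imbalanced two-class problem with unequal covariance matrices, then again after re-balancing, and report both AUC and ER. The data, the 10:1 imbalance ratio, and the use of random undersampling of the majority class are all illustrative assumptions; scikit-learn is assumed to be available.

```python
# Illustrative sketch: AUC vs. error rate (ER) of LDA before/after re-balancing.
# All data and the undersampling scheme below are assumptions for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two classes with unequal covariance matrices, roughly 10:1 imbalance.
n_maj, n_min = 2000, 200
X_maj = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], n_maj)
X_min = rng.multivariate_normal([1.5, 1.5], [[2.0, -0.4], [-0.4, 0.5]], n_min)
X = np.vstack([X_maj, X_min])
y = np.r_[np.zeros(n_maj), np.ones(n_min)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

def fit_and_score(X_train, y_train):
    """Fit LDA and return (AUC, ER) on the held-out test set."""
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    auc = roc_auc_score(y_te, lda.decision_function(X_te))
    er = 1.0 - accuracy_score(y_te, lda.predict(X_te))
    return auc, er

# Original (imbalanced) training set.
auc_orig, er_orig = fit_and_score(X_tr, y_tr)

# Re-balanced training set: undersample the majority class to the minority size.
maj_idx = np.flatnonzero(y_tr == 0)
min_idx = np.flatnonzero(y_tr == 1)
keep = rng.choice(maj_idx, size=min_idx.size, replace=False)
bal_idx = np.r_[keep, min_idx]
auc_bal, er_bal = fit_and_score(X_tr[bal_idx], y_tr[bal_idx])

print(f"original:    AUC={auc_orig:.3f}  ER={er_orig:.3f}")
print(f"re-balanced: AUC={auc_bal:.3f}  ER={er_bal:.3f}")
```

    Because AUC depends only on the ranking induced by the discriminant scores while ER also depends on the decision threshold, re-balancing typically moves ER far more than AUC, which is the contrast the abstract highlights.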

    Dynamic load balancing in parallel KD-tree k-means

    One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have largely been explored in two main directions. First, the amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree); these techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance, which has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested: two are based on a static partitioning of the data set, and a third incorporates a dynamic load balancing policy.
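    A toy sketch of the load-balancing issue the abstract raises (not the paper's implementation): per-leaf workloads from KD-Tree pruning are irregular, so a static block partition of the leaves across workers can be far more imbalanced than a simple dynamic policy that hands each leaf to the currently least-loaded worker. The workload distribution and the number of workers below are assumptions for illustration only.

```python
# Illustrative sketch: static vs. dynamic assignment of irregular leaf workloads.
import heapq
import random

random.seed(1)
num_workers = 8
# Hypothetical per-leaf costs (e.g. distance computations remaining after pruning);
# a heavy-tailed distribution mimics the irregular load KD-Tree pruning produces.
leaf_costs = [int(random.paretovariate(1.5) * 100) for _ in range(256)]

def imbalance(loads):
    """Ratio of the most loaded worker to the average load (1.0 = perfect balance)."""
    return max(loads) / (sum(loads) / len(loads))

# Static partitioning: contiguous blocks of leaves, one block per worker.
block = len(leaf_costs) // num_workers
static_loads = [sum(leaf_costs[i * block:(i + 1) * block]) for i in range(num_workers)]

# Dynamic policy: give the next (largest-first) leaf to the least-loaded worker.
heap = [(0, w) for w in range(num_workers)]
heapq.heapify(heap)
for cost in sorted(leaf_costs, reverse=True):
    load, w = heapq.heappop(heap)
    heapq.heappush(heap, (load + cost, w))
dynamic_loads = [load for load, _ in heap]

print(f"static  imbalance: {imbalance(static_loads):.2f}")
print(f"dynamic imbalance: {imbalance(dynamic_loads):.2f}")
```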

    ‘Definition of a Balancing Point for Electricity Transmission Contracts’

    Electricity transmission contracts allocate scarce resources, allow hedging against locational price differences, and provide information to guide investment. Liquidity is increased if all transmission contracts are defined relative to one balancing point; a set of two such contracts can then replicate any point-to-point contract. We propose an algorithm and apply it to the European electricity network to identify a well-connected balancing point that exhibits minimal relative cross-price responses and hence reduces the market power exercised by generation companies. Market-level data that is difficult to obtain or model, such as price levels in different regions, or that depends on the time scale of interaction, such as demand elasticity, is not required. The only critical inputs are assumptions on future transmission constraint patterns.
    Keywords: Transmission contract design, Congestion management, Market power, European electricity network
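    A small numerical sketch of the replication property the abstract relies on: a point-to-point contract from node A to node B pays the price difference between B and A, which is exactly what an A-to-hub contract plus a hub-to-B contract pays together. The nodal prices and quantity below are made-up values for illustration, not data from the paper.

```python
# Toy sketch: replicating a point-to-point transmission contract with two hub contracts.
# All prices (EUR/MWh) and the quantity are hypothetical.
prices = {"A": 42.0, "B": 55.0, "HUB": 47.5}
quantity_mw = 100.0

direct = (prices["B"] - prices["A"]) * quantity_mw
via_hub = ((prices["HUB"] - prices["A"]) + (prices["B"] - prices["HUB"])) * quantity_mw

print(direct, via_hub)  # identical payoffs regardless of the hub price
assert abs(direct - via_hub) < 1e-9
```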

    SALBPGen - A systematic data generator for (simple) assembly line balancing

    Assembly line balancing is a well-known and extensively researched decision problem that arises when assembly line production systems are designed and operated. A large variety of real-world problem variations and elaborate solution methods have been developed and presented in the academic literature over the past 60 years. Nevertheless, computational experiments examining and comparing the performance of solution procedures have mostly been based on very limited data sets collected unsystematically from the literature and from a few real-world cases. In particular, the precedence graphs used as the basis of former tests are limited in number and characteristics. As a consequence, former performance analyses suffer from a lack of systematics and statistical evidence. In this article, we propose SALBPGen, a new instance generator for the simple assembly line balancing problem (SALBP) which can also be applied to any other assembly line balancing problem. It is able to systematically create instances with very diverse structures under the full control of the experiment's designer. In particular, based on our analysis of real-world problems from the automotive and related industries, typical substructures of the precedence graph such as chains, bottlenecks and modules can be generated and combined as required, building on a detailed analysis of graph structures and structure measures such as the order strength. We also present a collection of new, challenging benchmark data sets which are suited for comprehensive statistical tests in comparative studies of solution methods for SALBP and generalized problems as well. Researchers are invited to participate in a challenge to solve these new problem instances.
    Keywords: manufacturing, benchmark data set, assembly line balancing, precedence graph, structure analysis, complexity measures
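    A minimal sketch (not SALBPGen itself) of one structure measure the abstract mentions: the order strength of a precedence graph, i.e. the share of all task pairs that are ordered in the transitive closure. The small graph below is a hypothetical example.

```python
# Illustrative sketch: order strength of a small, hypothetical precedence graph.
from itertools import combinations

# Task -> direct successors (a short chain with a two-task module in the middle).
succ = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}

def reachable(start):
    """All tasks reachable from `start` via precedence arcs (one transitive-closure row)."""
    seen, stack = set(), list(succ[start])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(succ[t])
    return seen

n = len(succ)
closure = {t: reachable(t) for t in succ}
ordered_pairs = sum(
    1 for a, b in combinations(succ, 2) if b in closure[a] or a in closure[b]
)
order_strength = ordered_pairs / (n * (n - 1) / 2)
print(f"order strength = {order_strength:.2f}")  # 0.90 for this example
```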