Robust Estimators are Hard to Compute
In modern statistics, the robust estimation of the parameters of a regression hyperplane is a central problem. Robustness means that the estimate is not, or only slightly, affected by outliers in the data. In this paper, it is shown that the following robust estimators are hard to compute: LMS, LQS, LTS, LTA, MCD, MVE, the constrained M estimator, projection depth (PD), and Stahel-Donoho. In addition, a data set is presented on which the ltsReg procedure of R has probability less than 0.0001 of finding a correct answer. Furthermore, it is described how to design new robust estimators.
Keywords: computational statistics, complexity theory, robust statistics, algorithms, search heuristics
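To see where the combinatorial difficulty comes from, consider the least trimmed squares (LTS) estimator: it minimizes the sum of the h smallest squared residuals, and its exact optimum is attained by an ordinary least squares fit on some size-h subset of the n observations. A minimal brute-force sketch in Python (the function name and the simple-regression setting are illustrative assumptions, not the paper's construction) makes the exponential subset enumeration explicit:

```python
# Brute-force exact LTS for simple linear regression y = a + b*x.
# Exponential in n: every size-h subset is examined, which is why
# exact computation is infeasible for realistic n.
from itertools import combinations
import numpy as np

def lts_brute_force(x, y, h):
    """Return (intercept, slope) minimizing the sum of the h smallest
    squared residuals, found by fitting OLS on every h-subset."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    best_obj, best_fit = np.inf, None
    for subset in combinations(range(n), h):
        idx = list(subset)
        X = np.column_stack([np.ones(h), x[idx]])
        beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        residuals = y - (beta[0] + beta[1] * x)
        obj = np.sort(residuals ** 2)[:h].sum()  # LTS objective for this fit
        if obj < best_obj:
            best_obj, best_fit = obj, (beta[0], beta[1])
    return best_fit
```

Even for n = 50 and h = 26 this loop would visit on the order of 10^14 subsets, which is why practical implementations such as Fast-LTS fall back on randomized search heuristics, with the failure probability the abstract alludes to.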
Repeated median and hybrid filters
Standard median filters preserve abrupt shifts (edges) and remove impulsive noise (outliers) from a constant signal, but they deteriorate in trend periods. FIR median hybrid (FMH) filters are more flexible and also preserve shifts, but they are much more vulnerable to outliers. Application of robust regression methods, in particular of the repeated median, has been suggested for removing successive outliers from a signal with trends. A fast algorithm for updating the repeated median in linear time using quadratic space is given in Bernholt and Fried (2003). We construct repeated median hybrid filters to combine the robustness properties of the repeated median with the edge preservation ability of FMH filters. An algorithm for updating the repeated median is presented which needs only linear space. We also investigate analytical properties of these filters and compare their performance via simulations.
Keywords: signal extraction, drifts, jumps, outliers, update algorithm
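For orientation, the repeated median slope underlying these filters is the median, over all points i, of the median of the pairwise slopes from i to every other point. A naive O(n^2) evaluation (a sketch of the estimator itself, not of the update algorithm from the paper) looks like this:

```python
# Naive O(n^2) repeated median slope (Siegel's estimator).
# Assumes distinct x-values so no pairwise slope is undefined.
import numpy as np

def repeated_median_slope(x, y):
    n = len(x)
    outer = []
    for i in range(n):
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for j in range(n) if j != i]
        outer.append(np.median(slopes))  # inner median for point i
    return np.median(outer)              # median of the inner medians
```

In the moving-window setting this would be recomputed from scratch at every window position, which is exactly the cost the update algorithms discussed here avoid.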
Modified repeated median filters
We discuss moving window techniques for the fast extraction of a signal comprising monotonic trends and abrupt shifts from a noisy time series with irrelevant spikes. Running medians remove spikes and preserve shifts, but they deteriorate in trend periods. Modified trimmed mean filters use a robust scale estimate such as the median absolute deviation about the median (MAD) to select an adaptive amount of trimming. Application of robust regression, particularly of the repeated median, has been suggested for improving upon the median in trend periods. We combine these ideas and construct modified filters based on the repeated median that offer better shift preservation. All these filters are compared with respect to fundamental analytical properties and in basic data situations. An algorithm for updating the MAD in time O(log n) for window width n is presented as well.
Keywords: signal extraction, robust filtering, drifts, jumps, outliers, computational geometry, update algorithm
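To make the trimming rule concrete: the MAD of a window is the median of the absolute deviations from the window median, and a modified trimmed mean averages only the observations within some multiple d of the MAD around the median. The following per-window sketch recomputes both from scratch; the constant d = 2.0 is an arbitrary illustrative choice, and the paper's O(log n) incremental update is not implemented here:

```python
# Per-window MAD and modified trimmed mean, recomputed from scratch.
# The paper maintains the MAD incrementally instead; this is only an
# illustrative baseline.
import numpy as np

def mad(window):
    window = np.asarray(window, dtype=float)
    m = np.median(window)
    return np.median(np.abs(window - m))

def modified_trimmed_mean(window, d=2.0):
    """Average the observations within d * MAD of the window median."""
    window = np.asarray(window, dtype=float)
    m, s = np.median(window), mad(window)
    kept = window[np.abs(window - m) <= d * s]
    return kept.mean() if kept.size else m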
Computing the Least Quartile Difference Estimator in the Plane
A common problem in linear regression is that grossly aberrant values can strongly influence the results. The least quartile difference (LQD) regression estimator is highly robust, since it can resist up to almost 50% grossly deviant data values without becoming extremely biased. Additionally, it shows good behavior on Gaussian data, in contrast to many other robust regression methods. However, the LQD is not yet widely used, due to the high computational effort required by common algorithms, e.g. the subset algorithm of Rousseeuw and Leroy. For computing the LQD estimator for n data points in the plane, we propose a randomized algorithm with expected running time O(n² log² n) and an approximation algorithm with a running time of roughly O(n² log n). This should considerably increase the practical relevance of the LQD estimator.
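To make the objective concrete: for a candidate slope b, the LQD criterion is the k-th smallest value among the pairwise absolute residual differences |(y_i − y_j) − b(x_i − x_j)|, with k = h(h−1)/2 and h roughly ⌊n/2⌋ + 1 (the exact coverage constant varies in the literature and is an assumption here); the intercept cancels in the differences, so LQD estimates only the slope. A crude grid-search sketch, nothing like the paper's randomized algorithm, evaluates this criterion directly:

```python
# LQD objective and a crude grid search over candidate slopes.
# The paper's O(n^2 log^2 n) randomized algorithm is far more clever;
# this sketch only makes the criterion itself concrete.
from itertools import combinations
import numpy as np

def lqd_objective(b, x, y, k):
    diffs = [abs((y[i] - y[j]) - b * (x[i] - x[j]))
             for i, j in combinations(range(len(x)), 2)]
    return np.sort(diffs)[k - 1]      # k-th smallest pairwise difference

def lqd_grid_search(x, y, slope_grid):
    n = len(x)
    h = n // 2 + 1                    # assumed coverage constant
    k = h * (h - 1) // 2              # order statistic over pairs
    return min(slope_grid, key=lambda b: lqd_objective(b, x, y, k))
```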
Efficient Algorithms and Complexity in Robust Statistics
A starting point of robust statistics is that the least squares estimator, while easy to compute, has problems with outliers: a single data point lying far away from the others is enough to distort the result severely. Robust statistics aims to find estimators that are insensitive to outliers. These robust estimators, however, often have the drawback that no fast algorithm for computing them is immediately available. This thesis presents new algorithms for several robust estimators: for point sets in the plane, the least quartile difference estimator with a running time of roughly O(n² log n) and the multiresolution criterion with a running time of O(n log n); in the context of time series data, update algorithms for the repeated median estimator with update time O(n) and for the median absolute deviation estimator with update time O(log n); for d-dimensional point sets, exponential-time algorithms for the least median of squares estimator and the minimum covariance determinant estimator. Finally, the NP-hardness of many robust estimators is proven, and a practically relevant, hard input is exhibited on which the search heuristic Fast-LTS finds the optimum only with exponentially small probability. This demonstrates that the theoretical properties of robust estimators are difficult to realize in practice.
Constrained Minkowski Sums: A Geometric Framework for Solving Interval Problems in Computational Biology Efficiently
In this paper, we introduce the notion of a constrained Minkowski sum: for two (finite) point sets P, Q ⊆ ℝ² and a set of k inequalities Ax ≥ b, it is defined as the point set (P ⊕ Q)_{Ax≥b} = {x = p + q | p ∈ P, q ∈ Q, Ax ≥ b}. We show that typical interval problems from computational biology can be solved by computing a set containing the vertices of the convex hull of an appropriately constrained Minkowski sum. We provide an algorithm for computing such a set with running time O(N log N), where N = |P| + |Q|, if k is fixed. For the special case where P and Q consist of points with integer x₁-coordinates whose absolute values are bounded by O(N), we even achieve a linear running time O(N). We thereby obtain a linear running time for many interval problems from the literature and improve upon the best known running times for some of them. The main advantage of the presented approach is that it provides a general framework within which a broad variety of interval problems can be modeled and solved.
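A direct, quadratic-time rendering of the definition helps fix ideas: form all sums p + q, keep those satisfying Ax ≥ b, and take the convex hull. The paper's O(N log N) algorithm avoids materializing the N² sums; this sketch (using scipy's ConvexHull, an assumption about available tooling) is only the brute-force baseline:

```python
# Brute-force constrained Minkowski sum: O(|P|*|Q|) candidate points,
# filtered by the constraints A x >= b, then a convex hull.
import numpy as np
from scipy.spatial import ConvexHull

def constrained_minkowski_hull(P, Q, A, b):
    """P, Q: (m,2) and (l,2) arrays; A: (k,2); b: (k,).
    Returns the hull vertices of the constrained Minkowski sum."""
    sums = (P[:, None, :] + Q[None, :, :]).reshape(-1, 2)
    # Keep the sums satisfying every inequality A x >= b.
    feasible = sums[(sums @ A.T >= b).all(axis=1)]
    # ConvexHull needs at least three non-collinear feasible points.
    hull = ConvexHull(feasible)
    return feasible[hull.vertices]
```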
Detecting high-order interactions of single nucleotide polymorphisms using genetic programming
Motivation: Not individual single nucleotide polymorphisms (SNPs), but high-order interactions of SNPs are assumed to be responsible for complex diseases such as cancer. Therefore, one of the major goals of genetic association studies concerned with such genotype data is the identification of these high-order interactions. This search is additionally impeded by the fact that these interactions are often explanatory only for a relatively small subgroup of patients. Most of the feature selection methods proposed in the literature unfortunately fail at this task, since they can either only identify individual variables or interactions of low order, or they try to find rules that are explanatory for a high percentage of the observations. In this paper, we present a procedure based on genetic programming and multi-valued logic that enables the identification of high-order interactions of categorical variables such as SNPs. This method, called GPAS (Genetic Programming for Association Studies), can be used not only for feature selection but also for discrimination. Results: In an application to the genotype data from the GENICA study, an association study concerned with sporadic breast cancer, GPAS is able to identify high-order interactions of SNPs leading to a considerably increased breast cancer risk for different subsets of patients that are not found by other feature selection methods. As an application to a subset of the HapMap data shows, GPAS is not restricted to association studies comprising several tens of SNPs, but can also be employed to analyze whole-genome data.