91 research outputs found

A Fast Algorithm for Robust Regression with Penalised Trimmed Squares

Author: A Giloni
AC Atkinson
AC Atkinson
AS Hadi
C Agostinelli
CW Coakley
D Gervini
D Peña
D Peña
DM Hawkins
DM Hawkins
DM Hawkins
DM Sebert
G Zioutas
G Zioutas
G. Zioutas
J Agulló
JF Gentleman
L. Pitsoulis
LM Li
LS Pitsoulis
M Salibian-Barrera
MS Bazaraa
N Billor
N Billor
N Billor
O Hössjer
PJ Rousseeuw
PJ Rousseeuw
PJ Rousseeuw
PJ Rousseeuw
RJ Rousseeuw
TA Feo
VJ Yohai
VJ Yohai
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

The presence of groups containing high leverage outliers makes linear regression a difficult problem due to the masking effect. The available high breakdown estimators based on Least Trimmed Squares (LTS) often do not succeed in detecting masked high leverage outliers in finite samples. An alternative to the LTS estimator, called the Penalised Trimmed Squares (PTS) estimator, was introduced by the authors in \cite{ZiouAv:05,ZiAvPi:07} and it appears to be less sensitive to the masking problem. This estimator is defined by a Quadratic Mixed Integer Programming (QMIP) problem whose objective function includes a penalty cost for each observation, which serves as an upper bound on the residual error for any feasible regression line. Since the PTS does not require presetting the number of outliers to delete from the data set, it has better efficiency with respect to other estimators. However, due to the high computational complexity of the resulting QMIP problem, exact solutions for moderately large regression problems are infeasible. In this paper we further establish the theoretical properties of the PTS estimator, such as high breakdown and efficiency, and propose an approximate algorithm called Fast-PTS to compute the PTS estimator for large data sets efficiently. Extensive computational experiments on sets of benchmark instances with varying degrees of outlier contamination indicate that the proposed algorithm performs well in identifying groups of high leverage outliers in reasonable computational time.

Comment: 27 pages
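
As a rough illustration of the penalty idea in the abstract above, the sketch below evaluates a penalised trimmed squares style cost for one fixed candidate fit. Each observation contributes either its squared residual or its penalty, whichever is smaller, so an observation counts as deleted once its squared residual exceeds its penalty. The function names, the uniform penalty value and the toy residuals are assumptions made for illustration. This is not the authors' QMIP formulation or the Fast-PTS algorithm, both of which also search over the regression coefficients.

```python
from typing import List, Sequence


def pts_objective(residuals: Sequence[float], penalties: Sequence[float]) -> float:
    """Penalised trimmed squares style cost for one fixed fit (illustrative)."""
    if len(residuals) != len(penalties):
        raise ValueError("residuals and penalties must have the same length")
    # Keeping observation i costs r_i^2; deleting it costs its penalty,
    # so the cheaper of the two options is charged.
    return sum(min(r * r, p) for r, p in zip(residuals, penalties))


def deleted_observations(
    residuals: Sequence[float], penalties: Sequence[float]
) -> List[int]:
    """Indices whose squared residual exceeds the penalty (treated as outliers)."""
    return [
        i
        for i, (r, p) in enumerate(zip(residuals, penalties))
        if r * r > p
    ]


if __name__ == "__main__":
    # Hypothetical residuals from some candidate regression line.
    residuals = [0.5, -1.0, 4.0]
    # Hypothetical uniform penalty per observation.
    penalties = [4.0, 4.0, 4.0]
    print(pts_objective(residuals, penalties))
    print(deleted_observations(residuals, penalties))
```

In this toy case only the third observation is cheaper to delete than to keep, which is how the penalty avoids fixing the number of trimmed points in advance.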

arXiv.org e-Print Archive

CiteSeerX

Crossref

Robust Fuzzy Clustering via Trimming and Constraints

Author: A Farcomeni
AK Lenstra
DW Hosmer
E Ruspini
E Trauwaert
H Fritz
H Späth
J Kim
J Łeski
JC Bezdek
KL Wu
LA García-Escudero
LA García-Escudero
LA García-Escudero
PJ Rousseeuw
PJ Rousseeuw
PJ Rousseeuw
R Krishnapuram
RJ Hathaway
RN Davé
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

A methodology for robust fuzzy clustering is proposed. This methodology can be widely applied in very different statistical problems given that it is based on probability likelihoods. Robustness is achieved by trimming a fixed proportion of “most outlying” observations, which are themselves determined by the data set at hand. Constraints on the clusters’ scatters are also needed to get mathematically well-defined problems and to avoid the detection of non-interesting spurious clusters. The main lines for computationally feasible algorithms are provided and some simple guidelines about how to choose tuning parameters are briefly outlined. The proposed methodology is illustrated through two applications. The first one is aimed at heterogeneous clustering under multivariate normal assumptions and the second one might be useful in fuzzy clusterwise linear regression problems.

Funding: Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support programme, Ref. VA212U13)

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Documental de la Universidad de Valladolid

Archivio della Ricerca - Università di Roma 3

A global classification of coastal flood hazard climates associated with large-scale oceanographic forcing

Author: A Melet
A Rueda
B Gouldby
BG Reguero
BG Reguero
C Guedes Soares
C Izaguirre
E Ramos
G Wöppelmann
GD Egbert
HF Stockdon
ID Haigh
IJ Losada
IRR Young
JR Hunter
JW Hurrell
KA Serafin
L Li
M Menéndez
M Newman
MA Merrifield
P Camus
PJ Rousseeuw
PL Barnard
R Bürgmann
R Mawdsley
RJ Nicholls
S Brown
S Ghosh
S Hallegatte
T Sunamura
T Wahl
TSO Kohonen
W Köppen
WB White
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Coastal communities throughout the world are exposed to numerous and increasing threats, such as coastal flooding and erosion, saltwater intrusion and wetland degradation. Here, we present the first global-scale analysis of the main drivers of coastal flooding due to large-scale oceanographic factors. Given the large dimensionality of the problem (e.g. spatiotemporal variability in flood magnitude and the relative influence of waves, tides and surge levels), we have performed a computer-based classification to identify geographical areas with homogeneous climates. Results show that 75% of coastal regions around the globe have the potential for very large flooding events with low probabilities (unbounded tails), 82% are tide-dominated, and almost 49% are highly susceptible to increases in flooding frequency due to sea-level rise.

A.R., F.J.M. and P.C. acknowledge the support of the Spanish ‘Ministerio de Economia y Competitividad’ under Grants BIA2014-59643-R and BIA2015-70644-R. This work was critically supported by the US Geological Survey under Grant/Cooperative Agreement G15AC00426 and by the US DOD Strategic Environmental Research and Development Program (SERDP Project RC-2644) through the NOAA National Centers for Environmental Information (NCEI). Dynamic atmospheric corrections (storm surge) are produced by CLS Space Oceanography Division using the Mog2D model from Legos and distributed by Aviso, with support from CNES (http://www.aviso.altimetry.fr/). Marine data from global reanalysis are provided by IHCantabria and are available for research purposes upon request at [email protected]

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UCrea

Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes

Author: A Horzyk
AA Alizadeh
AK Jain
Anirban Mukhopadhyay
AV Lukashin
C Xiang
CA Coello Coello
CW Hsu
D Dembele
DE Goldberg
DJ Lockhart
E Zitzler
I Davidson
J Handl
J Herrero
JC Bezdek
JT Tou
K Crammer
K Deb
M Hollander
MB Eisen
P Reymonda
P Rousseeuw
P Tamayo
R Sharan
RJ Cho
S Bandyopadhyay
S Bandyopadhyay
S Bandyopadhyay
S Bandyopadhyay
S Bandyopadhyay
S Chu
S Tavazoie
Sanghamitra Bandyopadhyay
SY Kim
SZ Selim
U Maulik
U Maulik
Ujjwal Maulik
V Vapnik
VR Iyer
X Wen
XL Xie
Y Xu
ZS Qin
Publication venue: BioMed Central
Publication date: 01/01/2009

Crossref

Springer - Publisher Connector

PubMed Central

Misty Mountain clustering: application to fast unsupervised flow cytometry gating

Author: A Cuevas
A Cuevas
AP Dempster
B Scholkopf
BJ Frey
C Fraley
CJC Burges
CW Morris
D Stauffer
G Celeux
G Cornuejols
G Lizard
G Schwarz
GC Tseng
GEP Box
GJ McLachlan
H Hotelling
István P Sugár
J Hoshen
JA Hartigan
JB MacQueen
K Lo
K Lo
KH Knuth
L Boddy
L Boddy
L Breiman
LJ Heyer
M Fiedler
MB Eisen
MF Wilkins
MP Wand
PJ Rousseeuw
PO Krutzik
R Kothari
RF Murphy
RJ Beckman
RL Boyell
RR Brinkman
S Demers
S Kirkpatrick
S Pyne
Stuart C Sealfon
TC Bakker Schut
W Feller
W Jang
W Jang
WE Donath
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming, and different criteria frequently result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points that are often generated by high throughput experiments.

Results: To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing the cross sections of the histogram only once. The overall run time for the composite steps of the algorithm increases linearly with the number of data points. The clustering of 10^6 data points in 2D data space takes place within about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state of the art automated flow cytometry gating methods indicates that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment.

Conclusions: Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.

Elsevier - Publisher Connector

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Modularization of biochemical networks based on classification of Petri net t-invariants

Author: A Sackmann
A Schrijver
AL Gartel
Andrea Sackmann
Astrid Speer
B Baumgarten
Björn H Junker
C Chaouiya
CH Schilling
CH Schilling
D Steinhausen
DL Davies
E Simão
Eva Grafahrend-Belau
Falk Schreiber
G Nagy
H Ma
H Matsuno
H Matsuno
I Koch
Ina Koch
J Dunn
J Handl
JA Studier
JL Peterson
JS Edwards
K Backhaus
K Lautenbach
Katja Winder
L Bardwell
L Hubert
LJ Steggles
M Chen
M Ederer
M Heiner
M Heiner
Monika Heiner
N Saitou
ND Price
P Legendre
PH Starke
PJ Rousseeuw
R David
R Durbin
R Srivastava
RJ Parikh
S Gunter
S Hardy
S Klamt
S Pérès
S Schuster
S Schuster
Stefanie Grunwald
T Dwyer
T Murata
W Marwan
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Structural analysis of biochemical networks is a growing field in bioinformatics and systems biology. The availability of an increasing amount of biological data from molecular biological networks promises a deeper understanding but confronts researchers with the problem of combinatorial explosion. The amount of qualitative network data is growing much faster than the amount of quantitative data, such as enzyme kinetics. In many cases it is even impossible to measure quantitative data because of limitations of experimental methods, or for ethical reasons. Thus, a huge amount of qualitative data, such as interaction data, is available, but it has not been sufficiently used for modeling purposes until now. New approaches have been developed, but the complexity of data often limits the application of many of the methods. Biochemical Petri nets make it possible to explore static and dynamic qualitative system properties. One Petri net approach is model validation based on the computation of the system's invariant properties, focusing on t-invariants. T-invariants correspond to subnetworks, which describe the basic system behavior. With increasing system complexity, the basic behavior can only be expressed by a huge number of t-invariants. According to our validation criteria for biochemical Petri nets, the necessary verification of the biological meaning, by interpreting each subnetwork (t-invariant) manually, is no longer possible. Thus, an automated, biologically meaningful classification would be helpful in analyzing t-invariants and in supporting the understanding of the basic behavior of the considered biological system.

Methods: Here, we introduce a new approach to automatically classify t-invariants to cope with network complexity. We apply clustering techniques such as UPGMA, Complete Linkage, Single Linkage, and Neighbor Joining in combination with different distance measures to get biologically meaningful clusters (t-clusters), which can be interpreted as modules. To find the optimal number of t-clusters to consider for interpretation, the cluster validity measure, Silhouette Width, is applied.

Results: We considered two different case studies as examples: a small signal transduction pathway (pheromone response pathway in Saccharomyces cerevisiae) and a medium-sized gene regulatory network (gene regulation of Duchenne muscular dystrophy). We automatically classified the t-invariants into functionally distinct t-clusters, which could be interpreted biologically as functional modules in the network. We found differences in the suitability of the various distance measures as well as the clustering methods. In terms of a biologically meaningful classification of t-invariants, the best results are obtained using the Tanimoto distance measure. Considering clustering methods, the obtained results suggest that UPGMA and Complete Linkage are suitable for clustering t-invariants with respect to the biological interpretability.

Conclusion: We propose a new approach for the biological classification of Petri net t-invariants based on cluster analysis. Due to the biologically meaningful data reduction and structuring of network processes, large sets of t-invariants can be evaluated, allowing for model validation of qualitative biochemical Petri nets. This approach can also be applied to elementary mode analysis.
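
To make the Tanimoto distance named in the abstract above more concrete, here is a minimal sketch. It assumes each t-invariant is represented only by its support, meaning the set of transition names it contains, and uses the common set-based form of the measure (one minus shared transitions over all transitions). The representation, the function name and the example transition labels are assumptions for illustration; the paper's exact definition may differ.

```python
from typing import AbstractSet


def tanimoto_distance(a: AbstractSet[str], b: AbstractSet[str]) -> float:
    """Set-based Tanimoto distance: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        # Two empty supports are treated as identical (a convention chosen here).
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical supports of two t-invariants.
    inv1 = {"t1", "t2", "t3"}
    inv2 = {"t2", "t3", "t4"}
    print(tanimoto_distance(inv1, inv2))
```

A pairwise matrix of such distances could then be passed to a hierarchical clustering routine such as UPGMA.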

KOPS - The Institutional Repository of the University of Konstanz

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MPG.PuRe

Solution for GNSS height anomaly fitting of mining area based on robust TLS

Author: AR Amiri-Simkooei
B Schaffrin
B Schaffrin
B Schaffrin
B Schaffrin
B Schaffrin
B Schaffrin
F Neitzel
G Pan
G Xuming
G Xunqiang
GH Golub
J Lu
JH Peter
L Dongfang
P Rousseeuw
RJ Adcock
T Xiaohua
V Mahboub
V Mahboub
V Mahboub
WA Heiskanen
X Fang
X Peiliang
Y Ling
Y Shen
Y Tao
Y Yang
Z Jiangwen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Crossref

Repository of the Academy's Library

A Mathematical Methodology for Determining the Temporal Order of Pathway Alterations Arising during Gliomagenesis

Human cancer is caused by the accumulation of genetic alterations in cells. Of special importance are changes that occur early during malignant transformation because they may result in oncogene addiction and thus represent promising targets for therapeutic intervention. We have previously described a computational approach, called Retracing the Evolutionary Steps in Cancer (RESIC), to determine the temporal sequence of genetic alterations during tumorigenesis from cross-sectional genomic data of tumors at their fully transformed stage. Since alterations within a set of genes belonging to a particular signaling pathway may have similar or equivalent effects, we applied a pathway-based systems biology approach to the RESIC methodology. This method was used to determine whether alterations of specific pathways develop early or late during malignant transformation. When applied to primary glioblastoma (GBM) copy number data from The Cancer Genome Atlas (TCGA) project, RESIC identified a temporal order of pathway alterations consistent with the order of events in secondary GBMs. We then further subdivided the samples into the four main GBM subtypes and determined the relative contributions of each subtype to the overall results: we found that the overall ordering applied for the proneural subtype but differed for mesenchymal samples. The temporal sequence of events could not be identified for neural and classical subtypes, possibly due to a limited number of samples. Moreover, for samples of the proneural subtype, we detected two distinct temporal sequences of events: (i) RAS pathway activation was followed by TP53 inactivation and finally PI3K2 activation, and (ii) RAS activation preceded only AKT activation. This extension of the RESIC methodology provides an evolutionary mathematical approach to identify the temporal sequence of pathway changes driving tumorigenesis and may be useful in guiding the understanding of signaling rearrangements in cancer development.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

FigShare

Natural Disasters and Economic Growth: A Review

Author: AC Harberger
C Raddatz
CRED (Centre for Research on the Epidemiology of Disasters)
D Cass
DC Dacy
E Cavallo
EA Cavallo
F Caselli
F Gourio
FP Ramsey
G Horwich
GD Hansen
I Noy
IM Noy
J Temple
J Temple
JA Schumpeter
JC Cuaresma
JM Albala-Bertrand
JS Mill
M Brückner
M Pelling
M Skidmore
ME Kahn
NG Mankiw
O Galor
O Galor
P Aghion
P Keefer
PJ Rousseeuw
PM Romer
RE Lucas
RE Lucas Jr
RJ Barro
RJ Barro
RJ Caballero
RM Solow
RW Ellson
S Kuznets
SN Durlauf
T Swan
TC Koopmans
World Bank
Publication venue: 'Springer Science and Business Media LLC'
Publication date

Crossref

Integer and Fractional Order Entropy Analysis of Earthquake Data-series

Author: A Clauset
A Levada
A Santis De
A Sornette
AI Khinchin
AJ Seely
AK Jain
AM Lopes
AM Lopes
António M. Lopes
B Gutenberg
C Goltz
C Pinto
CC Aggarwal
CM Ionescu
D Baleanu
D Sornette
D Valério
DL Turcotte
E Scordilis
G Balasis
G Ekström
G Stadler
H Hussein
H Kanamori
J. A. Tenreiro Machado
JA Hartigan
JA Hartigan
JA Tenreiro Machado
JM Carlson
JT Machado
M Ashtari Jafari
M Stucchi
MR Martínez-Torres
MS Mega
N Sarlis
O Sotolongo-Costa
PJ Rousseeuw
R Das
R Hallgass
RJ Geller
RR Sokal
S Stein
S Wiemer
T Utsu
TF Cox
TM Cover
V Keilis-Borok
V Rubeis De
YY Kagan
YY Kagan
Z Peng
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

This paper studies the statistical distributions of worldwide earthquakes from 1963 to 2012. A Cartesian grid, dividing Earth into geographic regions, is considered. Entropy and the Jensen–Shannon divergence are used to analyze and compare real-world data. Hierarchical clustering and multi-dimensional scaling techniques are adopted for data visualization. Entropy-based indices have the advantage of leading to a single parameter expressing the relationships between the seismic data. Classical and generalized (fractional) entropy and Jensen–Shannon divergence are tested. The generalized measures lead to a clear identification of patterns embedded in the data and contribute to a better understanding of earthquake distributions.
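
For readers unfamiliar with the two classical measures named in the abstract above, the sketch below computes Shannon entropy and the Jensen–Shannon divergence for discrete distributions, in bits. It covers only the classical definitions, not the generalized (fractional) variants the paper tests. The function names and the toy distributions are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def shannon_entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0.0)


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2, with M the midpoint of P and Q."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))


if __name__ == "__main__":
    # Hypothetical normalised histograms for two grid regions.
    region_a = [0.7, 0.2, 0.1]
    region_b = [0.1, 0.3, 0.6]
    print(shannon_entropy(region_a))
    print(jensen_shannon_divergence(region_a, region_b))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with disjoint support), which makes it convenient as a pairwise dissimilarity for clustering or multi-dimensional scaling.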

Repositório Científico do Instituto Politécnico do Porto

Crossref