91 research outputs found

    A Fast Algorithm for Robust Regression with Penalised Trimmed Squares

    The presence of groups containing high leverage outliers makes linear regression a difficult problem due to the masking effect. The available high breakdown estimators based on Least Trimmed Squares (LTS) often do not succeed in detecting masked high leverage outliers in finite samples. An alternative to the LTS estimator, called the Penalised Trimmed Squares (PTS) estimator, was introduced by the authors in \cite{ZiouAv:05,ZiAvPi:07} and appears to be less sensitive to the masking problem. This estimator is defined by a Quadratic Mixed Integer Programming (QMIP) problem whose objective function includes a penalty cost for each observation, which serves as an upper bound on the residual error for any feasible regression line. Since the PTS does not require presetting the number of outliers to delete from the data set, it has better efficiency than other estimators. However, due to the high computational complexity of the resulting QMIP problem, exact solutions for moderately large regression problems are infeasible. In this paper we further establish the theoretical properties of the PTS estimator, such as high breakdown and efficiency, and propose an approximate algorithm called Fast-PTS to compute the PTS estimator for large data sets efficiently. Extensive computational experiments on sets of benchmark instances with varying degrees of outlier contamination indicate that the proposed algorithm performs well in identifying groups of high leverage outliers in reasonable computational time. Comment: 27 pages
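
The role of the per-observation penalty described above can be illustrated with a small sketch. It assumes the regression coefficients are already fixed, so the residuals are known, and that each observation contributes either its squared residual or its penalty, whichever is smaller. The function name `pts_objective`, the sample residuals, and the penalty value 9.0 are hypothetical and are not taken from the paper; this is not the Fast-PTS algorithm itself.

```python
def pts_objective(residuals, penalties):
    """Evaluate a penalised trimmed squares objective for a fixed fit.

    Assumption (not from the paper): an observation is deleted, and its
    penalty paid, exactly when its squared residual exceeds its penalty.
    Returns the objective value and the indices of deleted observations.
    """
    total = 0.0
    flagged = []
    for i, (r, p) in enumerate(zip(residuals, penalties)):
        squared = r * r
        if squared > p:
            # Paying the penalty is cheaper than keeping the observation.
            total += p
            flagged.append(i)
        else:
            total += squared
    return total, flagged


# Hypothetical residuals from some candidate regression line.
residuals = [0.5, -1.0, 6.0, 0.2]
penalties = [9.0] * len(residuals)
value, outliers = pts_objective(residuals, penalties)
print(value, outliers)
```

In this form the number of deleted observations falls out of the penalties rather than being preset, which is the property the abstract contrasts with LTS.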

    Robust Fuzzy Clustering via Trimming and Constraints

    A methodology for robust fuzzy clustering is proposed. This methodology can be widely applied in very different statistical problems given that it is based on probability likelihoods. Robustness is achieved by trimming a fixed proportion of the “most outlying” observations, which are self-determined by the data set at hand. Constraints on the clusters’ scatters are also needed to obtain mathematically well-defined problems and to avoid the detection of non-interesting spurious clusters. The main lines for computationally feasible algorithms are provided, and some simple guidelines on how to choose tuning parameters are briefly outlined. The proposed methodology is illustrated through two applications. The first is aimed at heterogeneous clustering under multivariate normal assumptions, and the second might be useful in fuzzy clusterwise linear regression problems. Funding: Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support programme, Ref. VA212U13)

    A global classification of coastal flood hazard climates associated with large-scale oceanographic forcing

    Coastal communities throughout the world are exposed to numerous and increasing threats, such as coastal flooding and erosion, saltwater intrusion and wetland degradation. Here, we present the first global-scale analysis of the main drivers of coastal flooding due to large-scale oceanographic factors. Given the large dimensionality of the problem (e.g. spatiotemporal variability in flood magnitude and the relative influence of waves, tides and surge levels), we have performed a computer-based classification to identify geographical areas with homogeneous climates. Results show that 75% of coastal regions around the globe have the potential for very large flooding events with low probabilities (unbounded tails), 82% are tide-dominated, and almost 49% are highly susceptible to increases in flooding frequency due to sea-level rise. A.R., F.J.M. and P.C. acknowledge the support of the Spanish ‘Ministerio de Economia y Competitividad’ under Grants BIA2014-59643-R and BIA2015-70644-R. This work was critically supported by the US Geological Survey under Grant/Cooperative Agreement G15AC00426 and from the US DOD Strategic Environmental Research and Development Program (SERDP Project RC-2644) through the NOAA National Centers for Environmental Information (NCEI). Dynamic atmospheric corrections (storm surge) are produced by CLS Space Oceanography Division using the Mog2D model from Legos and distributed by Aviso, with support from CNES (http://www.aviso.altimetry.fr/). Marine data from global reanalysis are provided by IHCantabria and are available for research purposes upon request at [email protected]

    Misty Mountain clustering: application to fast unsupervised flow cytometry gating

    Background: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima, or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming, and different criteria frequently result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points, which are often generated by high-throughput experiments. Results: To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing the cross sections of the histogram only once. The overall run time for the composite steps of the algorithm increases linearly with the number of data points. The clustering of 10^6 data points in 2D data space takes about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state-of-the-art automated flow cytometry gating methods indicates that Misty Mountain provides substantial improvements in both run time and the accuracy of cluster assignment. Conclusions: Misty Mountain is fast, unbiased for cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.

    Modularization of biochemical networks based on classification of Petri net t-invariants

    Background: Structural analysis of biochemical networks is a growing field in bioinformatics and systems biology. The availability of an increasing amount of biological data from molecular biological networks promises a deeper understanding but confronts researchers with the problem of combinatorial explosion. The amount of qualitative network data is growing much faster than the amount of quantitative data, such as enzyme kinetics. In many cases it is even impossible to measure quantitative data because of limitations of experimental methods, or for ethical reasons. Thus, a huge amount of qualitative data, such as interaction data, is available, but it has not been sufficiently used for modeling purposes until now. New approaches have been developed, but the complexity of data often limits the application of many of the methods. Biochemical Petri nets make it possible to explore static and dynamic qualitative system properties. One Petri net approach is model validation based on the computation of the system's invariant properties, focusing on t-invariants. T-invariants correspond to subnetworks, which describe the basic system behavior. With increasing system complexity, the basic behavior can only be expressed by a huge number of t-invariants. According to our validation criteria for biochemical Petri nets, the necessary verification of the biological meaning, by interpreting each subnetwork (t-invariant) manually, is no longer possible. Thus, an automated, biologically meaningful classification would be helpful in analyzing t-invariants and in supporting the understanding of the basic behavior of the considered biological system. Methods: Here, we introduce a new approach to automatically classify t-invariants to cope with network complexity. We apply clustering techniques such as UPGMA, Complete Linkage, Single Linkage, and Neighbor Joining in combination with different distance measures to get biologically meaningful clusters (t-clusters), which can be interpreted as modules. To find the optimal number of t-clusters to consider for interpretation, the cluster validity measure Silhouette Width is applied. Results: We considered two different case studies as examples: a small signal transduction pathway (pheromone response pathway in Saccharomyces cerevisiae) and a medium-sized gene regulatory network (gene regulation of Duchenne muscular dystrophy). We automatically classified the t-invariants into functionally distinct t-clusters, which could be interpreted biologically as functional modules in the network. We found differences in the suitability of the various distance measures as well as the clustering methods. In terms of a biologically meaningful classification of t-invariants, the best results are obtained using the Tanimoto distance measure. Considering clustering methods, the obtained results suggest that UPGMA and Complete Linkage are suitable for clustering t-invariants with respect to biological interpretability. Conclusion: We propose a new approach for the biological classification of Petri net t-invariants based on cluster analysis. Due to the biologically meaningful data reduction and structuring of network processes, large sets of t-invariants can be evaluated, allowing for model validation of qualitative biochemical Petri nets. This approach can also be applied to elementary mode analysis.
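
As a rough illustration of the Tanimoto distance mentioned in the abstract above, the sketch below treats each t-invariant as the set of transitions in its support and computes one minus the ratio of shared to total transitions. Whether the paper compares supports in exactly this way is an assumption, and the transition names `t1` to `t4` are hypothetical.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two sets: 1 - |A & B| / |A | B|.

    Assumption: each t-invariant is represented by the set of
    transitions in its support. Two empty sets are given distance 0.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical t-invariant supports.
inv_1 = {"t1", "t2", "t3"}
inv_2 = {"t2", "t3", "t4"}
print(tanimoto_distance(inv_1, inv_2))  # two shared transitions out of four
```

A matrix of such pairwise distances is the kind of input that hierarchical methods such as UPGMA or Complete Linkage operate on.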

    A Mathematical Methodology for Determining the Temporal Order of Pathway Alterations Arising during Gliomagenesis

    Human cancer is caused by the accumulation of genetic alterations in cells. Of special importance are changes that occur early during malignant transformation because they may result in oncogene addiction and thus represent promising targets for therapeutic intervention. We have previously described a computational approach, called Retracing the Evolutionary Steps in Cancer (RESIC), to determine the temporal sequence of genetic alterations during tumorigenesis from cross-sectional genomic data of tumors at their fully transformed stage. Since alterations within a set of genes belonging to a particular signaling pathway may have similar or equivalent effects, we applied a pathway-based systems biology approach to the RESIC methodology. This method was used to determine whether alterations of specific pathways develop early or late during malignant transformation. When applied to primary glioblastoma (GBM) copy number data from The Cancer Genome Atlas (TCGA) project, RESIC identified a temporal order of pathway alterations consistent with the order of events in secondary GBMs. We then further subdivided the samples into the four main GBM subtypes and determined the relative contributions of each subtype to the overall results: we found that the overall ordering applied for the proneural subtype but differed for mesenchymal samples. The temporal sequence of events could not be identified for neural and classical subtypes, possibly due to a limited number of samples. Moreover, for samples of the proneural subtype, we detected two distinct temporal sequences of events: (i) RAS pathway activation was followed by TP53 inactivation and finally PI3K2 activation, and (ii) RAS activation preceded only AKT activation. This extension of the RESIC methodology provides an evolutionary mathematical approach to identify the temporal sequence of pathway changes driving tumorigenesis and may be useful in guiding the understanding of signaling rearrangements in cancer development

    Integer and Fractional Order Entropy Analysis of Earthquake Data-series

    This paper studies the statistical distributions of worldwide earthquakes from 1963 to 2012. A Cartesian grid, dividing Earth into geographic regions, is considered. Entropy and the Jensen–Shannon divergence are used to analyze and compare real-world data. Hierarchical clustering and multi-dimensional scaling techniques are adopted for data visualization. Entropy-based indices have the advantage of leading to a single parameter expressing the relationships between the seismic data. Classical and generalized (fractional) entropy and Jensen–Shannon divergence are tested. The generalized measures lead to a clear identification of patterns embedded in the data and contribute to a better understanding of earthquake distributions.
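
For readers unfamiliar with the classical Jensen–Shannon divergence named in the abstract above, a minimal sketch follows. It covers only the classical (Shannon) case with base-2 logarithms, not the fractional generalization studied in the paper, and the two sample distributions are hypothetical rather than seismic data.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jensen_shannon_divergence(p, q):
    """Classical JSD: entropy of the mixture minus the mean of the entropies."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(mixture) - (shannon_entropy(p) + shannon_entropy(q)) / 2


# Hypothetical event-frequency distributions for two grid regions.
region_a = [0.7, 0.2, 0.1]
region_b = [0.1, 0.3, 0.6]
print(jensen_shannon_divergence(region_a, region_b))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with disjoint support), which makes it convenient as a single index for comparing regions.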