
    A comparative evaluation of nonlinear dynamics methods for time series prediction

    A key problem in time series prediction with autoregressive models is fixing the model order, namely the number of past samples required to model the time series adequately. Estimating the model order by cross-validation can be a lengthy process. In this paper, we investigate alternatives to cross-validation based on nonlinear dynamics methods, namely the Grassberger-Procaccia, Kégl, Levina-Bickel and False Nearest Neighbors algorithms. The experiments have been performed in two ways. In the first, the estimated model order was used to carry out the prediction, performed by an SVM for regression on three real-world time series, showing that the nonlinear dynamics methods perform very closely to cross-validation. In the second, we tested the accuracy of the nonlinear dynamics methods in recovering the known model order of synthetic time series; most of the methods yielded a correct estimate, and when the estimate was not correct it was very close to the true value.
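
    The following is a minimal sketch of this idea, not the paper's pipeline: a simplified False Nearest Neighbors test (one of the nonlinear dynamics methods listed above) picks the model order, and an SVM for regression then predicts from that many lagged samples. The toy series, the tolerance r_tol, the 5% false-neighbor cut-off and the SVR hyperparameters are assumptions made for illustration.

```python
# Hedged sketch: simplified False Nearest Neighbors (FNN) order selection
# followed by SVR prediction on lagged samples. The toy series, r_tol and
# the cut-off are illustrative assumptions, not the paper's settings.
import numpy as np
from sklearn.svm import SVR

def delay_embed(s, m):
    """Stack m consecutive samples into rows of delay vectors."""
    n = len(s) - m + 1
    return np.column_stack([s[i:i + n] for i in range(m)])

def fnn_fraction(s, m, r_tol=10.0):
    """Fraction of nearest neighbors in m dimensions that become 'false'
    when the (m+1)-th coordinate is added (distance grows by > r_tol)."""
    X = delay_embed(s, m + 1)
    E, extra = X[:, :m], X[:, m]
    n_false = 0
    for i in range(len(E)):
        d = np.linalg.norm(E - E[i], axis=1)
        d[i] = np.inf                              # ignore the point itself
        j = int(np.argmin(d))
        if d[j] > 0 and abs(extra[i] - extra[j]) / d[j] > r_tol:
            n_false += 1
    return n_false / len(E)

def estimate_order(s, max_m=10, threshold=0.05):
    """Smallest embedding dimension whose FNN fraction drops below the threshold."""
    for m in range(1, max_m + 1):
        if fnn_fraction(s, m) < threshold:
            return m
    return max_m

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(2000)
    s = np.sin(0.07 * t) + 0.02 * rng.standard_normal(len(t))   # toy series
    p = estimate_order(s)
    Z = delay_embed(s, p + 1)
    X, y = Z[:, :p], Z[:, p]                       # p lagged inputs, next sample as target
    model = SVR(C=10.0, epsilon=0.01).fit(X[:-200], y[:-200])
    mse = np.mean((model.predict(X[-200:]) - y[-200:]) ** 2)
    print(f"estimated model order: {p}, test MSE: {mse:.4f}")
```

    Here the model order of the autoregressive predictor is identified with the smallest delay-embedding dimension at which almost no neighbors are false; the full FNN test adds noise-robustness criteria that this sketch omits.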

    Novel Techniques for Intrinsic Dimension Estimation

    Since the 1950s, the rapid pace of technological advances has made it possible to measure and record ever-increasing amounts of data, motivating the urgent need for dimensionality reduction systems applicable to datasets of high-dimensional points. To this aim, a fundamental piece of information is the intrinsic dimension (id), defined by Bennett [1] as the minimum number of parameters needed to generate a description of the data while maintaining the "intrinsic" structure characterizing the dataset, so that the information loss is minimized. A more intuitive definition, employed by several authors, is reported by Bishop in [2], who writes that "a set in D dimensions is said to have an id equal to d if the data lies entirely within a d-dimensional subspace of D". Although different and more specific id definitions have been proposed in different research fields, the presently prevailing definition in the pattern recognition literature views a point set as a sample drawn uniformly from an unknown smooth (or locally smooth) manifold, possibly embedded in a higher dimensional space through a non-linear smooth mapping; in this case, the id to be estimated is the manifold's topological dimension.

    Due to the importance of the id in several theoretical and practical application fields, in the last two decades a great deal of research effort has been devoted to the development of effective id estimators. Although several techniques have been proposed in the literature, the problem is still open for the following main reasons. First, although Lebesgue's definition of topological dimension (reported in [5]) is quite clear, in practice its estimation is difficult when only a finite set of points is available. Therefore, the id estimation techniques proposed in the literature are founded either on different notions of dimension (e.g. fractal dimensions) that approximate the topological one, or on techniques aimed at preserving the characteristics of data-neighborhood distributions, which reflect the topology of the underlying manifold. Moreover, the estimated id value changes markedly with the scale used to analyze the input dataset, and since the number of available points is limited in practice, several methods underestimate the id when its value is sufficiently high (namely id ≥ 10). Other serious problems arise when the dataset is embedded in a higher dimensional space through a non-linear map. Finally, the high computational complexity of most estimators makes them impractical for processing datasets comprising huge amounts of high-dimensional data.

    The main subject of this thesis is the development of efficient and effective id estimators. Specifically, two novel estimators are described: MiND (Minimum Neighbor Distance estimators of intrinsic dimension, [6]) and DANCo (Dimensionality from Angle and Norm Concentration, [4]). These techniques exploit statistics characterizing the hidden structure of high dimensional spaces, such as the distributions of norms and angles, which are informative of the id and can therefore be exploited for its estimation.
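
    As a self-contained illustration of how nearest-neighbor statistics carry id information, the following is a minimal sketch of the classic Levina-Bickel maximum likelihood estimator; it is not MiND or DANCo, only a simple representative of the same family of neighbor-distance-based estimators. The neighborhood size k and the toy dataset are assumptions.

```python
# Hedged sketch: Levina-Bickel MLE of intrinsic dimension, shown only as a
# simple neighbor-distance-based estimator; not the thesis's MiND or DANCo.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def levina_bickel_id(X, k=10):
    """Average of the local MLE estimates
    m_k(x) = [ (1/(k-1)) * sum_{j=1}^{k-1} log(T_k(x)/T_j(x)) ]^{-1},
    where T_j(x) is the distance from x to its j-th nearest neighbor."""
    # k+1 neighbors because the first returned neighbor is the point itself
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    T = dists[:, 1:]                      # T[:, j-1] = distance to j-th neighbor
    local = 1.0 / np.mean(np.log(T[:, -1:] / T[:, :-1]), axis=1)
    return float(np.mean(local))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 5-dimensional uniform data linearly embedded in 20 ambient dimensions (assumed toy case)
    Z = rng.uniform(size=(2000, 5))
    A = rng.standard_normal((5, 20))
    print("estimated id:", levina_bickel_id(Z @ A))
```

    On this linearly embedded 5-dimensional sample the estimate should come out close to 5; the known weakness of this family, underestimation at high id, is one of the issues the thesis targets.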
A simple practical example showing the informative power of these features is the clustering system proposed in [3]: based on the assumption that each class is represented by one manifold, the clustering procedure codes the input data by means of local id estimates and related features. This coding makes it possible to obtain reliable results with classic, basic clustering algorithms.

To evaluate the proposed estimators by objectively comparing them with relevant state-of-the-art techniques, a benchmark framework is also proposed. The need for such a framework is highlighted by the fact that, in the literature, each method has been assessed on different datasets and with different evaluation measures; it is therefore difficult to provide an objective comparison solely by analyzing the results reported by the authors. Based on this observation, the proposed benchmark employs publicly available synthetic and real datasets that have been used by several authors in the literature for their interesting and challenging peculiarities. Moreover, some synthetic datasets have been added to test the estimators more thoroughly on high dimensional datasets characterized by a similarly high id. The application of this benchmark provides an objective comparative assessment in terms of robustness w.r.t. parameter settings, high dimensional datasets, datasets characterized by a high intrinsic dimension, and noisy datasets (a toy illustration of such a synthetic benchmark is sketched after the reference list below). The results show that DANCo provides the most reliable estimates on both synthetic and real datasets.

The thesis is organized as follows: Chapter 1 presents a brief theoretical description of the various definitions of dimension, together with the problems related to id estimation and application domains that profitably exploit knowledge of the id; Chapter 2 surveys notable state-of-the-art id estimators, grouped according to the employed approach; Chapter 3 describes MiND and DANCo; Chapter 4, after summarizing the most commonly used experimental settings, proposes the benchmark framework and employs it to objectively assess and compare relevant intrinsic dimensionality estimators; Chapter 5 briefly reports conclusions and open research problems.

References
[1] R. S. Bennett. The intrinsic dimensionality of signal collections. IEEE Transactions on Information Theory, IT-15(5):517-525, 1969.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
[3] P. Campadelli, E. Casiraghi, C. Ceruti, G. Lombardi, and A. Rozza. Local intrinsic dimensionality based features for clustering. In A. Petrosino, editor, ICIAP (1), volume 8156 of Lecture Notes in Computer Science, pages 41-50. Springer, 2013.
[4] C. Ceruti, S. Bassis, A. Rozza, G. Lombardi, E. Casiraghi, and P. Campadelli. DANCo: an intrinsic dimensionality estimator exploiting angle and norm concentration. Pattern Recognition, 2014.
[5] M. Katetov and P. Simon. Origins of dimension theory. Handbook of the History of General Topology, 1997.
[6] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli. Novel high intrinsic dimensionality estimators. Machine Learning Journal, 89(1-2):37-65, May 2012.
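
The sketch below is a hypothetical mock-up of the benchmark idea described above: synthetic datasets with known id (here, hyperspheres embedded in higher dimensional spaces), a fixed estimator as a stand-in, and the mean percentage error as the comparison measure. The specific manifolds, sample sizes and the use of the Levina-Bickel estimator are assumptions for illustration only; the thesis framework also covers real and noisy datasets and robustness to parameter settings.

```python
# Hedged mock-up of a synthetic id-estimation benchmark; not the thesis framework.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lb_id(X, k=10):
    """Levina-Bickel MLE, used here only as a stand-in estimator (same as the sketch above)."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    T = dists[:, 1:]
    return float(np.mean(1.0 / np.mean(np.log(T[:, -1:] / T[:, :-1]), axis=1)))

def hypersphere(n, d, D, rng):
    """n points uniformly on a d-sphere (id = d) padded with zeros to D ambient dimensions."""
    Z = rng.standard_normal((n, d + 1))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return np.hstack([Z, np.zeros((n, D - d - 1))])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cases = [(2, 10), (5, 20), (10, 30)]          # (true id, ambient dimension), assumed cases
    errors = []
    for d, D in cases:
        est = lb_id(hypersphere(2500, d, D, rng))
        errors.append(abs(est - d) / d * 100)     # percentage error on this dataset
        print(f"true id {d:2d}  ambient dim {D:2d}  estimated id {est:5.2f}")
    print("mean percentage error:", round(float(np.mean(errors)), 1))
```

The highest-dimensional case typically exhibits the underestimation at high id discussed in the abstract, which is precisely the regime such a benchmark is meant to probe.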