112 research outputs found

    The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

    Current classification approaches usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Such approaches also ignore the possibility of different penalty costs for the false-positive, false-negative, and unclassifiable error types. Thus, their performance may not be optimal or may even be coincidental. This dissertation analyzes the above issues in depth. It also proposes two new approaches, the Homogeneity-Based Algorithm (HBA) and the Convexity-Based Algorithm (CBA), to address these issues. These new approaches aim at optimally balancing the data fitting and generalization behaviors of models produced by traditional classification approaches. The approaches first define the total misclassification cost (TC) as a weighted function of the three penalty costs and their corresponding error rates. The approaches then partition the training data into regions. In the HBA, the partitioning is done according to certain homogeneity properties derivable from the training data, whereas the CBA employs convexity properties to derive the regions. A traditional classification method is then used in conjunction with the HBA and CBA. Finally, the approaches apply a genetic approach to determine the optimal levels of fitting and generalization, with the TC serving as the fitness function. Real-life datasets from a wide spectrum of domains were used to assess the effectiveness of the HBA and CBA. The computational results indicate that both the HBA and CBA may fill a critical gap in the implementation of current and future classification approaches. Furthermore, the results show that when the penalty cost of an error type was changed, the corresponding error rate followed a stepwise pattern. The finding of stepwise patterns of classification errors can assist researchers in determining applicable penalties for classification errors. Thus, the dissertation also proposes a binary search approach (BSA) to produce those patterns. Real-life datasets were utilized to demonstrate the BSA.
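
    The abstract states only that the TC is a weighted function of the three penalty costs and their corresponding error rates. The following is a minimal sketch assuming a simple weighted sum with illustrative penalty values; it is not the dissertation's exact formulation, and the function name is hypothetical.

# Hypothetical sketch: total misclassification cost (TC) as a weighted sum of the
# false-positive, false-negative, and unclassifiable error rates. Penalty values
# are illustrative; the dissertation's exact form may differ.
def total_misclassification_cost(rate_fp, rate_fn, rate_unc,
                                 cost_fp=1.0, cost_fn=5.0, cost_unc=0.5):
    """Weighted cost used as the fitness value minimized by the genetic search
    over candidate levels of fitting and generalization."""
    return cost_fp * rate_fp + cost_fn * rate_fn + cost_unc * rate_unc

# Example: evaluate one candidate model's error rates.
print(total_misclassification_cost(rate_fp=0.08, rate_fn=0.03, rate_unc=0.10))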

    Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework

    When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some structure in the data. A fundamental piece of information needed for this purpose is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called the intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension that are nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators.
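
    The survey does not single out one method, but the flavour of a typical nearest-neighbor estimator can be illustrated with the well-known Levina-Bickel maximum-likelihood estimator. The sketch below, with an arbitrary neighborhood size k, is only an illustration of the class of techniques being compared, not one of the paper's own contributions.

import numpy as np

def mle_intrinsic_dimension(X, k=10):
    """Levina-Bickel maximum-likelihood intrinsic dimension estimate,
    averaged over all points (illustrative only)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                  # exclude self-distances
    knn = np.sqrt(np.sort(d2, axis=1)[:, :k])     # distances to the k nearest neighbors
    logs = np.log(knn[:, k - 1:k] / knn[:, :k - 1])
    return float(np.mean((k - 2) / np.sum(logs, axis=1)))  # MacKay-Ghahramani correction

# Example: a 2-dimensional plane linearly embedded in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 10))
print(mle_intrinsic_dimension(X))                 # expected to be close to 2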

    Distributed multi-label learning on Apache Spark

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest-neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing the execution times in all the scenarios. The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a multi-label feature selection criterion is discussed, and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset that maximizes the Euclidean norm of the individual information measures, and a method that selects the subset of features that maximizes the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to bigger data than the state-of-the-art methods used for comparison.
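
    As a rough illustration of the exact pair-wise approach mentioned above, the PySpark sketch below broadcasts a small set of query instances, finds the k best candidates inside each data partition, and then merges the partial results. It is a simplified illustration with synthetic data and assumed names, not the thesis' implementation.

import numpy as np
from pyspark.sql import SparkSession

# Simplified sketch of a distributed exact k-NN: per-partition candidate search
# followed by a global merge. Data, dimensions, and names are illustrative.
spark = SparkSession.builder.appName("ml-knn-sketch").getOrCreate()
sc = spark.sparkContext
k = 3

rng = np.random.default_rng(0)
train = sc.parallelize([(i, rng.random(8), rng.integers(0, 2, 5)) for i in range(1000)])
queries = sc.broadcast([(q, rng.random(8)) for q in range(10)])    # (query id, features)

def local_neighbors(partition):
    rows = list(partition)
    for qid, q in queries.value:
        cands = [(float(np.linalg.norm(x - q)), labels) for _, x, labels in rows]
        yield qid, sorted(cands, key=lambda c: c[0])[:k]           # k best in this partition

global_knn = (train.mapPartitions(local_neighbors)
                   .reduceByKey(lambda a, b: sorted(a + b, key=lambda c: c[0])[:k]))
print(global_knn.take(2))                                          # (query id, k nearest neighbors)
spark.stop()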

    Fouling prediction using neural network model for membrane bioreactor system

    Membrane bioreactor (MBR) technology is a new method for water and wastewater treatment, owing to its ability to produce high-quality effluent that meets water quality regulations. MBR is also an advanced alternative to the conventional activated sludge (CAS) process. Although MBR performs better than CAS, it has a few drawbacks, such as high maintenance cost and membrane fouling. To overcome these problems, optimal operation of the MBR plant needs to be developed. This can be achieved through an accurate model that predicts the fouling behaviour and thereby allows the membrane operation to be optimised. This paper presents the application of an artificial neural network technique to predict the filtration behaviour of a membrane bioreactor system. A Radial Basis Function Neural Network (RBFNN) is applied to model the developed submerged MBR filtration system. The RBFNN model is expected to give a good predictive model of the filtration system for estimating the fouling formed during the filtration process.
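
    The abstract does not specify the network inputs or the fouling indicator, so the sketch below fits a generic RBF network (random centers, Gaussian activations, least-squares output weights) to synthetic data purely to illustrate the model class; all names and values are assumptions, not the paper's setup.

import numpy as np

def fit_rbfnn(X, y, n_centers=20, width=1.0, seed=0):
    """Fit a simple RBF network: centers sampled from the data, Gaussian
    hidden activations, and output weights obtained by least squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_centers, replace=False)]
    def hidden(A):
        d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * width ** 2))
    w, *_ = np.linalg.lstsq(hidden(X), y, rcond=None)   # output-layer weights
    return lambda A: hidden(A) @ w

# Synthetic example: predict a scalar fouling-like signal from two operating variables.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(200)
model = fit_rbfnn(X, y)
print(np.mean((model(X) - y) ** 2))                     # in-sample mean squared error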

    Automated supervised classification of variable stars I. Methodology

    The fast classification of new variable stars is an important step in making them available for further research. Selection of science targets from large databases is much more efficient if they have been classified first. Defining the classes in terms of physical parameters is also important to obtain an unbiased statistical view of the variability mechanisms and the borders of instability strips. Our goal is twofold: to provide an overview of the stellar variability classes that are presently known, in terms of some relevant stellar parameters, and to use the class descriptions obtained as the basis for an automated 'supervised classification' of large databases. Such automated classification will compare new objects with a set of pre-defined variability training classes and assign them accordingly. For every variability class, a literature search was performed to find as many well-known member stars as possible, or a considerable subset if too many were present. Next, we searched on-line and private databases for their light curves in the visible band and performed period analysis and harmonic fitting. The derived light-curve parameters are used to describe the classes and define the training classifiers. We compared the performance of different classifiers in terms of the percentage of correct identifications, the confusion among classes, and the computation time. We describe how well the classes can be separated using the proposed set of parameters and how future improvements can be made, based on new large databases such as the light curves to be assembled by the CoRoT and Kepler space missions. (Accepted for publication in Astronomy and Astrophysics, reference AA/2007/7638; 27 pages, 1 figure.)
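
    As a rough illustration of the per-star processing described above (period search followed by harmonic fitting), the sketch below finds a dominant frequency with a Lomb-Scargle periodogram and fits a few harmonics by least squares. The frequency grid, the number of harmonics, and the synthetic light curve are assumptions, not the paper's settings.

import numpy as np
from scipy.signal import lombscargle

def light_curve_features(t, mag, n_harmonics=3, freqs=None):
    """Dominant frequency and harmonic amplitudes of an irregularly sampled
    light curve; such parameters can feed a supervised classifier."""
    if freqs is None:
        freqs = np.linspace(0.01, 10.0, 5000)           # cycles per day (assumed grid)
    power = lombscargle(t, mag - mag.mean(), 2 * np.pi * freqs)
    f0 = freqs[np.argmax(power)]                        # dominant frequency
    # design matrix with a sine/cosine pair for each harmonic of f0
    cols = [np.ones_like(t)]
    for h in range(1, n_harmonics + 1):
        cols += [np.sin(2 * np.pi * h * f0 * t), np.cos(2 * np.pi * h * f0 * t)]
    coeffs, *_ = np.linalg.lstsq(np.column_stack(cols), mag, rcond=None)
    amplitudes = np.hypot(coeffs[1::2], coeffs[2::2])   # amplitude of each harmonic
    return f0, amplitudes

# Synthetic example: a sinusoidal variable observed at irregular epochs.
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 100.0, 300))
mag = 12.0 + 0.3 * np.sin(2 * np.pi * 0.5 * t) + 0.01 * rng.standard_normal(300)
print(light_curve_features(t, mag))                     # dominant frequency should be near 0.5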

    NOVEL TECHNIQUES FOR INTRINSIC DIMENSION ESTIMATION

    Since the 1950s, the rapid pace of technological advances has made it possible to measure and record increasing amounts of data, motivating the urgent need to develop dimensionality reduction systems to be applied to datasets comprising high-dimensional points. To this end, a fundamental piece of information is provided by the intrinsic dimension (id), defined by Bennett [1] as the minimum number of parameters needed to generate a data description while maintaining the "intrinsic" structure characterizing the dataset, so that the information loss is minimized. More recently, a quite intuitive definition employed by several authors in the past has been reported by Bishop in [2], where the author writes that "a set in D dimensions is said to have an id equal to d if the data lies entirely within a d-dimensional subspace of D". Though more specific and different id definitions have been proposed in different research fields, throughout the pattern recognition literature the presently prevailing id definition views a point set as a sample uniformly drawn from an unknown smooth (or locally smooth) manifold, possibly embedded in a higher dimensional space through a non-linear smooth mapping; in this case, the id to be estimated is the manifold's topological dimension. Due to the importance of id in several theoretical and practical application fields, in the last two decades a great deal of research effort has been devoted to the development of effective id estimators. Though several techniques have been proposed in the literature, the problem is still open for the following main reasons. First, although Lebesgue's definition of topological dimension (reported by [5]) is quite clear, in practice its estimation is difficult if only a finite set of points is available. Therefore, id estimation techniques proposed in the literature are either founded on different notions of dimension (e.g. fractal dimensions) approximating the topological one, or on various techniques aimed at preserving the characteristics of data-neighborhood distributions, which reflect the topology of the underlying manifold. Besides, the estimated id value markedly changes with the scale used to analyze the input dataset, and since the number of available points is limited in practice, several methods underestimate the id when its value is sufficiently high (namely id ≥ 10). Other serious problems arise when the dataset is embedded in a higher dimensional space through a non-linear map. Finally, the excessive computational complexity of most estimators makes them impractical when the need is to process datasets comprising huge amounts of high-dimensional data. The main subject of this thesis is the development of efficient and effective id estimators. Precisely, two novel estimators, named MiND (Minimum Neighbor Distance estimators of intrinsic dimension, [6]) and DANCo (Dimensionality from Angle and Norm Concentration, [4]), are described. These techniques are based on the exploitation of statistics characterizing the hidden structure of high-dimensional spaces, such as the distribution of norms and angles, which are informative of the id and can therefore be exploited for its estimation.
    A simple practical example showing the informative power of these features is the clustering system proposed in [3]: based on the assumption that each class is represented by one manifold, the clustering procedure codes the input data by means of local id estimates and features related to them, and this coding allows reliable results to be obtained with classic, basic clustering algorithms. To evaluate the proposed estimators by objectively comparing them with relevant state-of-the-art techniques, a benchmark framework is proposed. The need for this framework is highlighted by the fact that in the literature each method has been assessed on different datasets and with different evaluation measures; it is therefore difficult to provide an objective comparison by solely analyzing the results reported by the authors. Based on this observation, the proposed benchmark employs publicly available synthetic and real datasets that have been used by several authors in the literature for their interesting and challenging peculiarities. Moreover, some synthetic datasets have been added to more deeply test the estimators' performance on high-dimensional datasets characterized by a similarly high id. The application of this benchmark provides an objective comparative assessment in terms of robustness with respect to parameter settings, high-dimensional datasets, datasets characterized by a high intrinsic dimension, and noisy datasets. The achieved results show that DANCo provides the most reliable estimates on both synthetic and real datasets. The thesis is organized as follows: in Chapter 1 a brief theoretical description of the various definitions of dimension is presented, along with the problems related to id estimation and interesting application domains that profitably exploit knowledge of the id; in Chapter 2 notable state-of-the-art intrinsic dimension estimators are surveyed and grouped according to the employed methods; in Chapter 3 MiND and DANCo are described; in Chapter 4, after summarizing the most commonly used experimental settings, we propose a benchmark framework and employ it to objectively assess and compare relevant intrinsic dimensionality estimators; in Chapter 5 conclusions and open research problems are briefly reported.
    References:
    [1] R. S. Bennett. The Intrinsic Dimensionality of Signal Collections. IEEE Trans. on Information Theory, IT-15(5):517-525, 1969.
    [2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
    [3] P. Campadelli, E. Casiraghi, C. Ceruti, G. Lombardi, and A. Rozza. Local intrinsic dimensionality based features for clustering. In Alfredo Petrosino, editor, ICIAP (1), volume 8156 of Lecture Notes in Computer Science, pages 41-50. Springer, 2013.
    [4] C. Ceruti, S. Bassis, A. Rozza, G. Lombardi, E. Casiraghi, and P. Campadelli. DANCo: an intrinsic Dimensionality estimator exploiting Angle and Norm Concentration. Pattern Recognition, 2014.
    [5] M. Katetov and P. Simon. Origins of dimension theory. Handbook of the History of General Topology, 1997.
    [6] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli. Novel high intrinsic dimensionality estimators. Machine Learning Journal, 89(1-2):37-65, May 2012.
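
    Although the full MiND and DANCo procedures are more involved, the flavour of a nearest-neighbor-distance statistic can be conveyed with the ratio-based maximum-likelihood estimator sketched below: under a local-uniformity assumption, the ratio between the first and second neighbor distances has density d*rho^(d-1), which depends only on the intrinsic dimension d. This is an illustration in that spirit, not the thesis' estimators.

import numpy as np

def nn_ratio_id_estimate(X):
    """Maximum-likelihood intrinsic dimension estimate based on the ratio of
    first to second nearest-neighbor distances (illustrative only)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-distances
    nn = np.sqrt(np.sort(d2, axis=1)[:, :2])     # first and second neighbor distances
    rho = nn[:, 0] / nn[:, 1]                    # ratios in (0, 1]
    rho = rho[rho < 1.0]                         # drop degenerate ties
    return -len(rho) / float(np.sum(np.log(rho)))

# Example: a 5-dimensional set linearly embedded in 20 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 20))
print(nn_ratio_id_estimate(X))                   # expected to be roughly 5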

    Geometric Structure Extraction and Reconstruction

    Geometric structure extraction and reconstruction is a long-standing problem in research communities including computer graphics, computer vision, and machine learning. Within different communities, it can be interpreted as different subproblems, such as skeleton extraction from point clouds, surface reconstruction from multi-view images, or manifold learning from high-dimensional data. All these subproblems are building blocks of many modern applications, such as scene reconstruction for AR/VR, object recognition for robotic vision, and structural analysis for big data. Despite its importance, the extraction and reconstruction of geometric structure from real-world data is ill-posed, and the main challenges lie in the incompleteness, noise, and inconsistency of the raw input data. To address these challenges, three studies are conducted in this thesis: i) a new point set representation for shape completion, ii) a structure-aware data consolidation method, and iii) a data-driven deep learning technique for multi-view consistency. In addition to the theoretical contributions, the proposed algorithms significantly improve the performance of several state-of-the-art geometric structure extraction and reconstruction approaches, as validated by extensive experimental results.
