
    Large-scale nonlinear dimensionality reduction for network intrusion detection

    Network intrusion detection (NID) is a complex classification problem. In this paper, we combine classification with recent, scalable nonlinear dimensionality reduction (NLDR) methods. Classification and DR are not necessarily adversarial, provided adequate cluster magnification occurs in NLDR methods like t-SNE: DR mitigates the curse of dimensionality, while cluster magnification can maintain class separability. We demonstrate the effectiveness of the approach experimentally by analyzing and comparing results on the large KDD99 dataset, using both NLDR quality assessment and classification rates for SVMs and random forests. Since the data involves features of mixed types (numerical and categorical), using Gower's similarity coefficient as the metric further improves the results over the classical similarity metric.
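
    A minimal sketch of Gower's similarity coefficient for mixed-type records, assuming range-normalized absolute differences for numerical features and exact-match scoring for categorical ones; the function and the toy record layout are illustrative, not taken from the paper:

        import numpy as np

        def gower_similarity(x, y, num_idx, cat_idx, ranges):
            """Gower similarity between two mixed-type records.

            num_idx -- indices of numerical features
            cat_idx -- indices of categorical features
            ranges  -- value range of each numerical feature over the dataset
            """
            parts = []
            for j, r in zip(num_idx, ranges):
                # numerical part: 1 minus the range-normalized absolute difference
                parts.append(1.0 - abs(x[j] - y[j]) / r if r > 0 else 1.0)
            for j in cat_idx:
                # categorical part: 1 for an exact match, 0 otherwise
                parts.append(1.0 if x[j] == y[j] else 0.0)
            return float(np.mean(parts))

        # toy record pair: (duration, protocol, service)
        a = [0.3, "tcp", "http"]
        b = [0.7, "udp", "http"]
        sim = gower_similarity(a, b, num_idx=[0], cat_idx=[1, 2], ranges=[1.0])

    The dissimilarity 1 - sim can then serve as the input metric for t-SNE or for a classifier working on precomputed distances.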

    Deploying a Quantum Annealing Processor to Detect Tree Cover in Aerial Imagery of California

    Quantum annealing is an experimental and potentially breakthrough computational technology for handling hard optimization problems, including problems of computer vision. We present a case study in training a production-scale classifier of tree cover in remote sensing imagery, using early-generation quantum annealing hardware built by D-Wave Systems, Inc. Beginning within a known boosting framework, we train decision stumps on texture features and vegetation indices extracted from four-band, one-meter-resolution aerial imagery from the state of California. We then impose a regularized quadratic training objective to select an optimal voting subset from among these stumps; the votes of the subset define the classifier. For optimization, the logical variables in the objective function map to quantum bits in the hardware device, while quadratic couplings encode as the strength of physical interactions between the quantum bits. Hardware design limits the number of couplings per quantum bit to five or six. To account for this limitation when mapping large problems to the hardware architecture, we propose a truncation and rescaling of the training objective through a trainable metaparameter. The boosting process on our basic 108- and 508-variable problems, thus constituted, returns classifiers that incorporate a diverse range of color- and texture-based metrics and discriminate tree cover with accuracies as high as 92% in validation and 90% on a test scene encompassing the open space preserves and dense suburban build of Mill Valley, CA.
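
    The subset-selection step can be read as a quadratic unconstrained binary optimization (QUBO) over one indicator variable per stump: a quadratic data-fit term plus a per-stump regularization penalty. A minimal sketch under that reading, with the normalization and regularization form being assumptions rather than the paper's exact specification:

        import numpy as np

        def build_qubo(H, y, lam):
            """QUBO for selecting a voting subset of weak classifiers (stumps).

            H   -- (n_stumps, n_samples) array of stump outputs in {-1, +1}
            y   -- (n_samples,) label vector in {-1, +1}
            lam -- regularization weight penalizing each included stump
            Returns Q with linear terms folded into the diagonal, so the
            energy of a binary selection vector w is w @ Q @ w.
            """
            corr = H @ H.T   # stump-stump correlations (quadratic couplings)
            fit = H @ y      # stump-label correlations (data-fit reward)
            Q = corr.astype(float)
            # since w_i^2 = w_i for binary w, move linear terms onto the diagonal
            np.fill_diagonal(Q, np.diag(corr) + lam - 2.0 * fit)
            return Q

        # toy problem: 3 stumps, 4 samples
        H = np.array([[1, 1, -1, 1], [1, -1, -1, 1], [-1, 1, 1, -1]])
        y = np.array([1, 1, -1, 1])
        Q = build_qubo(H, y, lam=0.5)
        # brute-force the best subset (the annealer's job on large problems)
        ws = [np.array([(b >> i) & 1 for i in range(3)]) for b in range(8)]
        best = min(ws, key=lambda w: w @ Q @ w)

    On hardware, small off-diagonal entries of Q would be truncated and the remainder rescaled to fit the limited coupling budget, which is the role the paper assigns to the trainable metaparameter.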

    An Automatic Representation Optimization and Model Selection Framework for Machine Learning

    The classification problem is an important part of machine learning and occurs in many application fields such as image-based object recognition or industrial quality inspection. In the ideal case, only a training dataset consisting of feature data and true class labels has to be obtained in order to learn the connection between features and class labels. This connection is represented by a so-called classifier model. However, even today the development of a well-performing classifier for a given task is difficult and requires a lot of expertise. Numerous challenges occur in real-world classification problems that can degrade generalization performance; typical ones are too few training samples, noisy feature data, and suboptimal choices of algorithms or hyperparameters. Many solutions exist to tackle these challenges, such as automatic feature and model selection algorithms, hyperparameter tuning, and data preprocessing methods. Furthermore, representation learning, which is connected to the recently evolving field of deep learning, is a promising approach that aims at automatically learning more useful features from low-level data. Due to the lack of a holistic framework that considers all of these aspects, this work proposes the Automatic Representation Optimization and Model Selection Framework, abbreviated AROMS-Framework. The central classification pipeline contains feature selection and portfolios of preprocessing, representation learning, and classification methods. An optimization algorithm based on evolutionary algorithms is developed to automatically adapt the pipeline configuration to a given learning task. Additionally, two kinds of extended analyses are proposed that exploit the optimization trajectory. The first aims at a better understanding of the complex interplay of the pipeline components using a suitable visualization technique. The second is a multi-pipeline classifier that improves generalization performance by fusing the decisions of several classification pipelines. Finally, experiments are conducted to evaluate all aspects of the proposed framework regarding generalization performance, optimization runtime, and classification speed. The goal is to show the benefits and limitations of the framework across a large variety of datasets from different real-world applications.
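
    To illustrate the kind of search such an optimizer performs (a generic sketch, not the AROMS implementation), an evolutionary loop can mutate pipeline configurations drawn from small method portfolios and keep those with the best cross-validated accuracy; the portfolios below are invented for the example:

        import random
        from sklearn.datasets import load_iris
        from sklearn.decomposition import PCA
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import MinMaxScaler, StandardScaler
        from sklearn.svm import SVC

        # toy portfolios; the actual framework also covers representation learning
        PORTFOLIOS = [
            [StandardScaler, MinMaxScaler],                              # preprocessing
            [lambda: PCA(n_components=2), lambda: PCA(n_components=3)],  # representation
            [SVC, lambda: RandomForestClassifier(n_estimators=50)],      # classifier
        ]

        def fitness(cfg, X, y):
            # a configuration is one index per portfolio slot
            steps = [PORTFOLIOS[slot][idx]() for slot, idx in enumerate(cfg)]
            return cross_val_score(make_pipeline(*steps), X, y, cv=3).mean()

        def evolve(X, y, pop_size=6, generations=5):
            rng = random.Random(0)
            pop = [[rng.randrange(len(p)) for p in PORTFOLIOS] for _ in range(pop_size)]
            for _ in range(generations):
                pop.sort(key=lambda c: fitness(c, X, y), reverse=True)
                parents = pop[: pop_size // 2]             # truncation selection
                children = []
                for parent in parents:
                    child = parent[:]
                    slot = rng.randrange(len(PORTFOLIOS))  # mutate one pipeline slot
                    child[slot] = rng.randrange(len(PORTFOLIOS[slot]))
                    children.append(child)
                pop = parents + children
            return max(pop, key=lambda c: fitness(c, X, y))

        X, y = load_iris(return_X_y=True)
        best_cfg = evolve(X, y)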

    Feature-based Time Series Analytics

    Time series analytics is a fundamental prerequisite for decision-making as well as automation and occurs in several applications such as energy load control, weather research, and consumer behavior analysis. It encompasses time series engineering, i.e., the representation of time series by their important characteristics, and data mining, i.e., the application of the representation to a specific task. Due to the exhaustive data gathering that results from the "Industry 4.0" vision and its shift towards automation and digitalization, time series analytics is undergoing a revolution: big datasets with very long time series are gathered, which is challenging for engineering techniques. Traditionally, one focus has been on raw-data-based or shape-based engineering, which assesses the time series' similarity in shape and is only suitable for short time series. Another focus has been on model-based engineering, which assesses the time series' similarity in structure and is suitable for long time series but requires larger models or time-consuming modeling. Feature-based engineering tackles these challenges by efficiently representing time series and comparing their similarity in structure. However, current feature-based techniques are unsatisfactory as they are designed for specific data-mining tasks. In this work, we introduce a novel feature-based engineering technique. It efficiently provides a short representation of time series, focusing on their structural similarity. Based on a design rationale, we derive important time series characteristics such as long-term and cyclically repeated characteristics as well as distribution and correlation characteristics. Moreover, we define a feature-based distance measure for their comparison. Both the representation technique and the distance measure provide desirable properties regarding storage and runtime. Subsequently, we introduce techniques based on our feature-based engineering and apply them to important data-mining tasks such as time series generation, matching, classification, and clustering. First, our feature-based generation technique outperforms state-of-the-art techniques regarding the accuracy of the evolved datasets. Second, with our features, a matching method retrieves a match for a time series query much faster than with current representations. Third, our features provide discriminative characteristics that classify datasets as accurately as state-of-the-art techniques, but orders of magnitude faster. Finally, our features recommend an appropriate clustering of time series, which is crucial for subsequent data-mining tasks. All these techniques are assessed on datasets from the energy, weather, and economic domains, demonstrating their applicability to real-world use cases. The findings demonstrate the versatility of our feature-based engineering and suggest several courses of action for designing and improving analytical systems for the paradigm shift of Industry 4.0.
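
    A minimal sketch of the feature-based idea, using a hand-picked set of structural features (trend slope, a seasonal autocorrelation, distribution moments) that stands in for the thesis's actual feature set:

        import numpy as np
        from scipy.stats import kurtosis, skew

        def ts_features(x, season_lag=24):
            """Compact structural representation of a (long) time series."""
            x = np.asarray(x, dtype=float)
            t = np.arange(len(x))
            slope = np.polyfit(t, x, 1)[0]   # long-term (trend) characteristic
            xc = x - x.mean()
            # lag autocorrelation as a cyclically repeated characteristic
            acf = (xc[:-season_lag] * xc[season_lag:]).sum() / (xc * xc).sum()
            # distribution characteristics
            return np.array([slope, acf, x.mean(), x.std(), skew(x), kurtosis(x)])

        def feature_distance(a, b, season_lag=24):
            """Distance in feature space; cost is independent of series length."""
            return float(np.linalg.norm(
                ts_features(a, season_lag) - ts_features(b, season_lag)))

    Once the features are extracted, comparing two series costs the same regardless of their lengths, which is what makes this style of representation attractive for very long time series.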

    Prediction and identification of physical systems by means of physically-guided neural networks with meaningful internal layers

    Substituting well-grounded theoretical models with data-driven predictions is not as simple in engineering and the sciences as it is in social and economic fields. Scientific problems often suffer from a paucity of data, while they may involve a large number of variables and parameters that interact in complex and non-stationary ways, obeying certain physical laws. Moreover, a physically-based model is useful not only for making predictions but also for gaining knowledge through the interpretation of its structure, parameters, and mathematical properties. The solution to these shortcomings seems to be the seamless blending of the tremendous predictive power of the data-driven approach with the scientific consistency and interpretability of physically-based models. We use here the concept of Physically-Guided Neural Networks (PGNN) to predict the input-output relation in a physical system while, at the same time, fulfilling the physical constraints. With this goal, the internal hidden state variables of the system are associated with a set of internal neuron layers whose values are constrained by known physical relations, as well as by any additional knowledge of the system. Furthermore, when enough data are available, it is possible to infer knowledge about the internal structure of the system and, if it is parameterized, to predict the state parameters for a particular input-output relation. We show that this approach, besides yielding physically-based predictions, accelerates the training process, reduces the amount of data required to reach a given accuracy, partly filters the intrinsic noise in the experimental data, and improves the extrapolation capacity of the model.
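
    A minimal PyTorch-style sketch of the idea, assuming a one-dimensional toy system in which an internal layer is read as a flux q that must obey a known constitutive law q = -k * du/dx; the architecture and the constraint are illustrative, not the paper's:

        import torch
        import torch.nn as nn

        class PGNN(nn.Module):
            """Network whose internal layer is read as a physical state variable."""
            def __init__(self, hidden=32):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(1, hidden), nn.Tanh())
                self.state = nn.Linear(hidden, 1)   # internal layer: flux q
                self.decoder = nn.Linear(1, 1)      # maps the state to the output u

            def forward(self, x):
                q = self.state(self.encoder(x))     # physically interpretable value
                return self.decoder(q), q

        def pgnn_loss(model, x, u_obs, k=0.5, w_phys=1.0):
            x = x.clone().requires_grad_(True)
            u_pred, q = model(x)
            data_loss = ((u_pred - u_obs) ** 2).mean()
            # physical constraint on the internal layer: q = -k * du/dx
            du_dx = torch.autograd.grad(u_pred.sum(), x, create_graph=True)[0]
            phys_loss = ((q + k * du_dx) ** 2).mean()
            return data_loss + w_phys * phys_loss

    The physics term penalizes internal states that violate the known relation, which is how such a constraint can filter noise and extend extrapolation beyond the training data.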

    The Matching-Graph

    The increasing amount of available data and the rate at which it is collected are driving the rapid development of intelligent information processing and pattern recognition systems. The underlying data is often inherently complex, making it difficult to represent using linear, vectorial data structures. Graphs offer a versatile alternative for formal data representation. Quite a number of graph-based pattern recognition methods have been proposed, and a considerable part of them rely on graph matching. This thesis introduces a novel method for encoding specific graph matching information into a meta-graph, termed a matching-graph. The basic idea is to formalize the stable cores of individual classes of graphs, discovered during intra-class matching. This meta-graph is useful in several applications, ranging from the analysis of inherent patterns, over graph classification, to graph augmentation. The benefits of matching-graphs are evaluated in three parts. First, their usefulness in classification scenarios is evaluated with two approaches. The first is a distance-based classifier that focuses on the matching-graphs during dissimilarity computation. The second uses sets of matching-graphs to embed input graphs into a vector space: the basic idea is to first generate hundreds of matching-graphs and then represent each graph g as a vector recording the occurrence of, or the distance to, each matching-graph. A thorough experimental evaluation on real-world data sets empirically confirms that these novel approaches improve the classification accuracy of systems that rely on comparable information, as well as of state-of-the-art methods. The second part of the research targets a prevalent challenge in graph-based pattern recognition, viz. computing the maximum common subgraph (MCS). Current exact algorithms compute the MCS with exponential time complexity. This part investigates whether matching-graphs, computable in polynomial time, provide a suitable approximation of the MCS. Results show that, for specific graphs, a matching-graph equals the maximum common edge subgraph, thereby establishing an upper bound on the size of the maximum common induced subgraph. The experimental evaluation further confirms that matching-graphs outperform existing algorithms in terms of computation time and classification accuracy. The third part of this thesis addresses the problem of graph augmentation. Regardless of the representation formalism used, supervised pattern recognition algorithms inevitably need access to large sets of labeled training samples. In some cases, however, this requirement cannot be met because the set of labeled samples is inherently limited. The last part shows that matching-graphs can be used to augment graph training sets in order to make the training of a classifier more robust. The benefit of this approach is empirically validated in two different experiments: first on very small graph data sets in conjunction with a graph kernel classifier, and second on data sets of reasonable size in conjunction with a graph neural network classifier.
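
    A minimal sketch of the vector-space embedding described above, assuming a set of precomputed matching-graphs and using networkx's exact (and therefore toy-sized) graph edit distance as the dissimilarity; the thesis's graphs and distance computation would differ:

        import networkx as nx
        import numpy as np

        def embed(graph, matching_graphs):
            """Represent a graph as the vector of distances to each matching-graph."""
            return np.array([nx.graph_edit_distance(graph, m) for m in matching_graphs])

        # toy example with two stand-in "matching-graphs"
        m1, m2 = nx.path_graph(3), nx.cycle_graph(4)
        g = nx.path_graph(4)
        vec = embed(g, [m1, m2])   # this vector can feed any vectorial classifier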

    Relevant data representation by a Kernel-based framework

    Nowadays, the analysis of large amounts of data has emerged as an issue of great interest in the scientific community, especially in automation, signal processing, pattern recognition, and machine learning. In this sense, the identification, description, classification, visualization, and clustering of events or patterns are important problems for engineering developments and scientific fields such as biology, medicine, economics, artificial vision, artificial intelligence, and industrial production. Nonetheless, the available information is difficult to interpret due to its complexity and the large number of extracted features. In addition, the analysis of the input data requires methodologies that reveal the relevant behaviors of the studied process, particularly when the signals contain hidden structures varying over a given domain, e.g., space and/or time. When the analyzed signal has such properties, directly applying signal processing and machine learning procedures without a model that accounts for both the statistical distribution and the structure of the data can lead to unstable performance. In this regard, kernel functions appear as an alternative approach to address these issues by providing flexible mathematical tools that enhance data representation in support of signal processing and machine learning systems. Moreover, kernel-based methods are powerful tools for developing better-performing solutions by adapting the kernel to a given problem, instead of learning data relationships from explicit raw vector representations. However, building suitable kernels requires some prior user knowledge about the input data, which is not available in most practical cases. Furthermore, directly using the definitions of traditional kernel methods poses a challenging estimation problem that often leads to strong simplifications restricting the kind of representation that can be used on the data. In this study, we propose a data representation framework based on kernel methods to automatically learn relevant sample relationships in learning systems. The proposed framework is divided into five kernel-based approaches, which compute relevant data representations by adapting them according to both the imposed sample-relationship constraints and the learning scenario (unsupervised or supervised). First, we develop a kernel-based representation approach that reveals the main input sample relations by including relevant data structures through graph-based sparse constraints; salient data structures are highlighted in order to favor subsequent unsupervised clustering stages. This approach can be viewed as a graph pruning strategy within a spectral clustering framework that enhances both the local and global data consistencies for a given input similarity matrix. Second, we introduce a kernel-based representation methodology that captures meaningful data relations in terms of their statistical distribution: an information theoretic learning (ITL) based penalty function is introduced to estimate a kernel-based similarity that maximizes the overall information potential variability. We thus seek a reproducing kernel Hilbert space (RKHS) that spans the widest information force magnitudes among data points to support further clustering stages. Third, an entropy-like functional on positive definite matrices based on Renyi's definition is adapted to develop a kernel-based representation approach that considers both the statistical distribution and the salient data structures; relevant input patterns are thereby highlighted in unsupervised learning tasks. In particular, the introduced approach is tested as a tool to encode relevant local and global input data relationships in dimensionality reduction applications. Fourth, a supervised kernel-based representation is introduced via a metric learning procedure in an RKHS that takes advantage of prior user knowledge, when available, regarding the studied learning task. This approach incorporates the proposed ITL-based kernel estimation strategy to automatically adapt the representation using both the supervised information and the statistical distribution of the input data. As a result, relevant sample dependencies are highlighted by weighting the input features that most strongly encode the supervised learning task. Finally, a new generalized kernel-based measure is proposed that takes advantage of different RKHSs; relevant dependencies are highlighted automatically by considering the domain-varying behavior of the input data and prior user knowledge (supervised information) when available. The proposed measure is an extension of the well-known cross-correntropy function based on Hilbert space embeddings. Throughout the study, the proposed kernel-based framework is applied to biosignal and image data as an alternative to support computer-aided diagnosis systems and image-based object analysis. Indeed, the introduced kernel-based framework improves, in most cases, unsupervised and supervised learning performance, aiding researchers in processing and understanding complex data.
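
    A minimal sketch of the information potential that underlies the ITL-based penalty, using a Gaussian kernel; the quadratic Renyi entropy estimate then follows the standard Parzen-window form H2(X) = -log((1/n^2) * sum_ij k(x_i, x_j)):

        import numpy as np
        from scipy.spatial.distance import pdist, squareform

        def information_potential(X, sigma=1.0):
            """Parzen estimate of the information potential with a Gaussian kernel."""
            d2 = squareform(pdist(X, "sqeuclidean"))
            K = np.exp(-d2 / (2.0 * sigma ** 2))   # pairwise kernel (Gram) matrix
            return K.mean()                        # (1/n^2) * sum_ij k(x_i, x_j)

        def renyi_quadratic_entropy(X, sigma=1.0):
            # H2(X) = -log of the information potential
            return -np.log(information_potential(X, sigma))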

    Proceedings - 30th Workshop Computational Intelligence: Berlin, 26-27 November 2020

    These proceedings contain the contributions of the 30th Workshop Computational Intelligence. The focal topics are methods, applications, and tools for fuzzy systems, artificial neural networks, evolutionary algorithms, and data-mining techniques, as well as the comparison of methods on industrial and benchmark problems.