11 research outputs found

    A Vertical and Horizontal Intelligent Dataset Reduction Approach for Cyber-Physical Power Aware Intrusion Detection Systems

    Cyber-Physical Power Systems (CPPS) have become vital targets for intruders because of the large volume of high-speed heterogeneous data provided by Wide Area Measurement Systems (WAMS). The Nonnested Generalized Exemplars (NNGE) algorithm is one of the most accurate classification techniques that can work with such CPPS data. However, the NNGE algorithm tends to produce rules that test a large number of input features. This poses problems for large-volume data and hinders the scalability of any detection system. In this paper, we introduce VHDRA, a Vertical and Horizontal Data Reduction Approach, to improve the classification accuracy and speed of the NNGE algorithm and reduce its computational resource consumption. VHDRA provides the following functionalities: (1) it vertically reduces the dataset features by selecting the most significant features and by reducing the NNGE's hyperrectangles; (2) it horizontally reduces the size of the data while preserving the original key events and patterns within the datasets, using an approach called STEM (State Tracking and Extraction Method). The experiments show that VHDRA, using both the vertical and the horizontal reduction, reduces the NNGE hyperrectangles by 29.06%, 37.34%, and 26.76% and improves the accuracy of the NNGE by 8.57%, 4.19%, and 3.78% on the Multi-, Binary, and Triple class datasets, respectively. This work was made possible by NPRP Grant # NPRP9-005-1-002 from the Qatar National Research Fund (a member of Qatar Foundation).

    Modelos híbridos de aprendizaje basados en instancias y reglas para Clasificación Monotónica

    In supervised prediction problems, the response attribute depends on certain explanatory attributes. Some real problems require the response attribute to represent ordinal values that should increase with some of the explanatory attributes. These are called classification problems with monotonicity constraints. In this thesis, we have reviewed the monotonic classifiers proposed in the literature and we have formalized the nested generalized exemplar learning theory to tackle monotonic classification. Two algorithms were proposed: a first, greedy one, which requires monotonic data, and an evolutionary-based algorithm, which is able to address imperfect data with monotonic violations present among the instances. Both improve the accuracy, the non-monotonic index of predictions and the simplicity of models over the state of the art. Tesis Univ. Jaén. Departamento INFORMÁTIC
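A monotonic violation between instances can be made concrete with a small sketch. It assumes the usual pairwise definition (one instance is no larger than another on every attribute, yet carries a strictly larger class label). The function names `dominated_by` and `count_violations` are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def dominated_by(a, b):
    """Return True if a <= b on every attribute."""
    return all(x <= y for x, y in zip(a, b))


def count_violations(X, y):
    """Count pairs of instances that break the monotonicity constraint.

    A pair (i, j) is a violation when X[i] <= X[j] attribute-wise
    but the ordinal label y[i] is strictly greater than y[j].
    """
    violations = 0
    for i, j in combinations(range(len(X)), 2):
        if dominated_by(X[i], X[j]) and y[i] > y[j]:
            violations += 1
        elif dominated_by(X[j], X[i]) and y[j] > y[i]:
            violations += 1
    return violations


# Toy data (made up for illustration): only the first two instances conflict.
X = [(1, 1), (2, 2), (3, 1)]
y = [2, 1, 3]
print(count_violations(X, y))
```

A data set with zero violations is the kind of fully monotonic input that the greedy algorithm is described as requiring.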

    Time series forecasting using a weighted cross-validation evolutionary artificial neural network ensemble

    The ability to forecast the future based on past data is a key tool to support individual and organizational decision making. In particular, the goal of Time Series Forecasting (TSF) is to predict the behavior of complex systems by looking only at past patterns of the same phenomenon. In recent years, several works in the literature have adopted Evolutionary Artificial Neural Networks (EANNs) for TSF. In this work, we propose a novel EANN approach, where a weighted n-fold validation fitness scheme is used to build an ensemble of neural networks, under four different combination methods: mean, median, softmax and rank-based. Several experiments were conducted using six real-world time series with different characteristics and from distinct domains. Overall, the proposed approach achieved competitive results when compared with a non-weighted n-fold EANN ensemble, the simpler 0-fold EANN and also the popular Holt–Winters statistical method. This work was supported by University Carlos III of Madrid and by Community of Madrid under project CCG10-UC3M/TIC-5174. The work of P. Cortez was funded by FEDER (program COMPETE and FCT) under project FCOMP-01-0124-FEDER-022674
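The four combination methods named in the abstract can be illustrated with a rough sketch. The abstract does not say how the softmax and rank-based weights are computed, so the choices below (softmax over negated validation errors, and linear weights by error rank) are assumptions for illustration only. All function names are hypothetical.

```python
import math
from statistics import mean, median


def softmax_weights(errors):
    """Assumed scheme: lower validation error gives a larger weight."""
    scores = [-e for e in errors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def rank_weights(errors):
    """Assumed scheme: best member gets weight n, worst gets 1, normalized."""
    n = len(errors)
    order = sorted(range(n), key=lambda i: errors[i])  # best member first
    weights = [0.0] * n
    for rank, i in enumerate(order):
        weights[i] = n - rank
    total = n * (n + 1) / 2
    return [w / total for w in weights]


def combine(forecasts, errors, method):
    """Combine one-step forecasts from the ensemble members."""
    if method == "mean":
        return mean(forecasts)
    if method == "median":
        return median(forecasts)
    if method == "softmax":
        weights = softmax_weights(errors)
    elif method == "rank":
        weights = rank_weights(errors)
    else:
        raise ValueError(f"unknown method: {method}")
    return sum(w * f for w, f in zip(weights, forecasts))


# Made-up forecasts and validation errors for three ensemble members.
forecasts = [10.0, 12.0, 20.0]
errors = [1.0, 2.0, 3.0]
for method in ("mean", "median", "softmax", "rank"):
    print(method, combine(forecasts, errors, method))
```

The mean and median ignore the validation errors entirely, while the two weighted variants pull the combined forecast toward the members that validated best.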

    IRDDS: Instance reduction based on Distance-based decision surface

    In instance-based learning, a training set is given to a classifier for classifying new instances. In practice, not all information in the training set is useful for classifiers. Therefore, it is convenient to discard irrelevant instances from the training set. This process is known as instance reduction, which is an important task for classifiers since it can reduce the time needed for classification or training. Instance-based learning methods are often confronted with the difficulty of choosing which instances must be stored for use during an actual test. Storing too many instances may result in large memory requirements and slow execution speed. In this paper, a Distance-based Decision Surface (DDS) is first proposed as a separating surface between the classes; then an instance reduction method based on the DDS is proposed, namely IRDDS (Instance Reduction based on Distance-based Decision Surface). The DDS is used together with a genetic algorithm to select a reference set for classification. IRDDS selects the most representative instances, satisfying both of the following objectives: high accuracy and high reduction rates. The performance of IRDDS has been evaluated on real-world data sets from the UCI repository using 10-fold cross-validation. The experimental results are compared with some state-of-the-art methods and show the superiority of the proposed method over the surveyed literature in terms of both classification accuracy and reduction percentage

    Development and evaluation of optimization based data mining techniques analysis of brain data

    Neuroscience is an interdisciplinary science which deals with the study of the structure and function of the brain and nervous system. Neuroscience encompasses disciplines such as computer science, mathematics, engineering, and linguistics. The structure of the healthy brain and the representation of information by neural activity are among the most challenging problems in neuroscience. Neuroscience is experiencing exponentially growing volumes of data obtained by using different technologies. The investigation of such data has a tremendous impact on developing new models, and improving existing models, of both healthy and diseased brains. Various techniques have been used for collecting brain data sets for addressing neuroscience problems. These data sets can be categorized into two main groups: resting-state and state-dependent data sets. Resting-state data is based on recording brain activity when a subject does not think about any specific concept, while state-dependent data is based on recording brain activity related to specific tasks. In general, brain data sets contain a large number of features (e.g. tens of thousands) and significantly fewer samples (e.g. several hundred). Such data sets are sparse and noisy. In addition to these problems, brain data sets include only a small number of subjects. Brains are very complex systems, and data about any brain activity reflects very complex relationships between neurons as well as between different parts of the brain. Such relationships are highly nonlinear, and general-purpose data mining algorithms are not always efficient for their study. The development of machine learning techniques for brain data sets is an emerging research area in neuroscience. Over the last decade, various machine learning techniques have been developed for application to brain data sets. In the meantime, some well-known algorithms, such as feature selection and supervised classification, have been modified for the analysis of brain data sets. 
Support vector machines, logistic regression, and Gaussian Naive Bayes classifiers are widely used for application to brain data sets. However, support vector machines and logistic regression algorithms are not efficient for sparse and noisy data sets, and Gaussian Naive Bayes classifiers do not give high accuracy. The aim of this study is to develop new data mining algorithms, and modify existing ones, for the analysis of brain data sets. Our contributions in this thesis can be listed as follows: 1. Development of new algorithms: 1.1. Development of new voxel (feature) selection algorithms for functional magnetic resonance imaging (fMRI) data sets, and evaluation of these algorithms on the Haxby and Science 2008 data sets. 1.2. Development of a new feature selection algorithm based on the catastrophe model for regression analysis problems. 2. Development and evaluation of different versions of the adaptive neuro-fuzzy model for the analysis of the spike discharge as a function of other neuronal parameters. 3. Development and evaluation of the modified global k-means clustering algorithm for investigation of the structure of the healthy brain. 4. Development and evaluation of a region of interest (ROI) method for analysis of brain functional connectivity in healthy subjects and schizophrenia patients. Doctor of Philosoph

    Neuroengineering of Clustering Algorithms

    Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques to perform all three unsupervised learning tasks. Particularly, the first paper provides a comprehensive review on adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both of which are capable of detecting complex cluster structures), merge ART (aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (features a robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper consists of enhancements to data visualization using self-organizing maps (SOMs) by depicting in the reduced dimension and topology-preserving SOM grid information-theoretic similarity measures between neighboring neurons. This visualization's parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. 
Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature --Abstract, page iv

    Hybridization of machine learning for advanced manufacturing

    Thesis by compendium of publications. In industry today, the terms "Advanced Manufacturing", "Industry 4.0" and "Smart Factory" are becoming a reality. Industrial companies seek to be more competitive, whether in cost, time, raw material consumption, energy, etc. The aim is to be efficient in every area and also to be sustainable. The future of many companies depends on how well they adapt to change and on their capacity for innovation. Consumers are increasingly demanding, looking for customized, specific products of high quality, at low cost, and non-polluting. For all these reasons, industrial companies adopt technological innovations. Among these technological innovations are the aforementioned Advanced Manufacturing and Machine Learning (ML). This research work falls within these fields: hybrid intelligent solutions that combine several ML techniques have been conceived and applied to solve problems in the manufacturing industry. Intelligent techniques such as Artificial Neural Networks (ANNs), multi-objective genetic algorithms, projection methods for dimensionality reduction, clustering techniques, etc. have been applied. System Identification techniques have also been used in order to obtain the mathematical model that best represents the real system under study. Several techniques have been hybridized in order to build more robust and reliable solutions. Combining specific ML techniques yields more complex systems with greater representation and problem-solving capacity. These systems use data, and knowledge about the data, to solve problems. 
The proposed solutions aim to solve complex, wide-ranging real-world problems, handling aspects such as uncertainty, lack of precision, high dimensionality, etc. This thesis covers several real case studies in which various ML techniques have been applied to different problems in the manufacturing industry. The real industrial case studies, which involve four different data sets, correspond to: • A high-precision dental milling process, from the company Estudio Previo SL. • Data analysis for predictive maintenance at a company in the automotive sector, the multinational Grupo Antolin. In addition, there has been collaboration with the GICAP research group of the University of Burgos and with the ITCL technology centre on the case studies that form part of this thesis and on other related ones. The different hybridizations of ML techniques developed have been applied and validated with real, original data sets, in collaboration with industrial companies or milling centres, making it possible to solve current, complex problems. In this way, the work has not had a purely theoretical focus; it has also been applied in practice, allowing industrial companies to improve their processes, save cost and time, pollute less, etc. The satisfactory results obtained point to the usefulness and contribution that ML techniques can make in the field of Advanced Manufacturing

    Automatic parts of speech determination in a morphologically complex language

    The study was aimed at testing the extent to which our cognitive system can rely on phonotactic information, i.e., possible/permissible combinations of phonemes/graphemes, in the tasks of automatic processing and production of words in languages with rich inflectional morphology. In order to answer this question, three studies were conducted. In the first study, support vector machines (SVM) were applied to discriminate parts of speech (PoS) with more than one possible meaning (i.e., ambiguous PoS). In the second study, the production of inflected word forms was done with memory-based learning (MBL). Based on the results from the second study, a behavioral experiment was conducted as the third study, to test the cognitive plausibility of the MBL performance. The discrimination of ambiguous PoS was performed using permissible sequences of two and three characters/sounds (i.e., bigrams and trigrams), whose frequency of occurrence within individual grammatical types was calculated depending on their position in a word: at the beginning, at the end, and irrespective of position in a word. 
The maximum accuracy achieved was approximately 95%. It was obtained when bigrams irrespective of position in a word were used, with an SVM model using the RBF kernel function. Such high accuracy suggests that the bigrams' probability distribution is informative about the types of inflected words. Interestingly, the least informative were bigrams at the end and at the beginning of words. The MBL model was used in the task of automatic production of inflected forms, utilizing phonotactic information from the last four syllables. On a sample of 89024 inflected words, taken from the Frequency dictionary of Serbian language (daily press), the achieved accuracy was 92%. For this result the MBL used the leave-one-out method and a nearest-neighborhood size of 7 (k = 7). We identified several factors that contributed to the accuracy, in particular part of speech, grammatical type, formation method and number of examples within one grammatical type, number of exceptions, number of phonological alternations, etc. The visual lexical decision experiment revealed that words that the MBL model produced incorrectly also induced longer reaction time latencies. Thus, we concluded that the MBL model might be cognitively plausible. In addition, we reconfirmed the informativeness of phonotactic information, this time in a human comprehension task. Overall, the findings from the three studies support the role of phonotactic information in both processing and production of morphologically complex words. The results also suggest the necessity of taking this information into account when discussing the emergence of larger units and language patterns
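The position-dependent bigram features described above can be sketched as follows. This is only an illustration of the idea of counting character bigrams by position (word-initial, word-final, or anywhere in the word). The function names and the ASCII example words are hypothetical, and the study's actual preprocessing is not specified in the abstract.

```python
from collections import Counter


def positional_bigrams(word):
    """Split a word into character bigrams, grouped by position."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    return {
        "initial": bigrams[:1],   # first bigram of the word, if any
        "final": bigrams[-1:],    # last bigram of the word, if any
        "all": bigrams,           # every bigram, irrespective of position
    }


def bigram_counts(words, position="all"):
    """Count bigram frequencies over a word list for one position type."""
    counts = Counter()
    for word in words:
        counts.update(positional_bigrams(word.lower())[position])
    return counts


# Made-up word list for illustration.
words = ["kuca", "kula", "ruka"]
print(bigram_counts(words, "initial"))
print(bigram_counts(words, "all"))
```

Counts like these, computed separately for each grammatical type, would form the kind of frequency vectors a classifier could be trained on.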

    Hyperrectangles Selection for Monotonic Classification by Using Evolutionary Algorithms

    In supervised learning, some real problems require the response attribute to represent ordinal values that should increase with some of the explanatory attributes. These are called classification problems with monotonicity constraints. Hyperrectangles can be viewed as storing objects in R^n, which can be used to learn concepts by combining instance-based classification with the axis-parallel rectangles mainly used in rule induction systems. This hybrid paradigm is known as nested generalized exemplar learning. In this paper, we propose the selection of the most effective hyperrectangles by means of evolutionary algorithms to tackle monotonic classification. The proposed model is compared through an exhaustive experimental analysis involving a large number of data sets coming from real classification and regression problems. The reported results show that our evolutionary proposal outperforms other instance-based and rule learning models, such as OLM, OSDL, k-NN and MID, in accuracy and mean absolute error, while requiring fewer hyperrectangles. TIN2014-57251-
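How an axis-parallel hyperrectangle can act as a generalized instance is easiest to see through a distance computation. The sketch below uses a common convention for exemplar-based learners (distance zero inside the box, Euclidean distance to the nearest face outside it). This is an assumption, since the abstract does not give the paper's exact distance. The function name `distance_to_hyperrectangle` is hypothetical.

```python
import math


def distance_to_hyperrectangle(point, lower, upper):
    """Euclidean distance from a point to an axis-parallel hyperrectangle.

    The hyperrectangle is given by its per-attribute lower and upper bounds.
    A point inside (or on the border of) the box has distance 0.
    """
    squared = 0.0
    for x, lo, hi in zip(point, lower, upper):
        if x < lo:
            gap = lo - x
        elif x > hi:
            gap = x - hi
        else:
            gap = 0.0  # within the bounds on this attribute
        squared += gap * gap
    return math.sqrt(squared)


# Made-up box in R^2 with corners (1, 1) and (2, 3).
lower, upper = (1.0, 1.0), (2.0, 3.0)
print(distance_to_hyperrectangle((1.5, 2.0), lower, upper))  # point inside the box
print(distance_to_hyperrectangle((4.0, 2.0), lower, upper))  # point to the right of the box
```

A new instance would then be assigned the class of its nearest hyperrectangle, which is how a single box can stand in for many stored instances.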