    Incremental Cluster Validity Indices for Online Learning of Hard Partitions: Extensions and Comparative Study

    Validation is one of the most important aspects of clustering, particularly when the user is designing a trustworthy or explainable system. However, most clustering validation approaches require batch calculation, which is an important gap given the value of clustering in real-time data streaming and other online learning applications. Interest has therefore grown in online alternatives for validation. This paper extends the incremental cluster validity index (iCVI) family by presenting incremental versions of Calinski-Harabasz (iCH), Pakhira-Bandyopadhyay-Maulik (iPBM), WB index (iWB), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP), Representative Cross Entropy (irH), and Conn_Index (iConn_Index). It also provides a thorough comparative study of how correct, under-, and over-partitioning affect the behavior of these iCVIs, of the Partition Separation (PS) index, and of four recently introduced iCVIs: incremental Xie-Beni (iXB), incremental Davies-Bouldin (iDB), and incremental generalized Dunn's indices 43 and 53 (iGD43 and iGD53). Experiments were carried out using a framework designed to be as agnostic as possible to the clustering algorithms. The results on synthetic benchmark data sets showed that while evidence of most under-partitioning cases could be inferred from the behavior of the majority of these iCVIs, over-partitioning was a more challenging problem, detected by fewer of them. Interestingly, over-partitioning, rather than under-partitioning, was more prominently detected in the real-world experiments within this study. The expansion of iCVIs provides significant novel opportunities for assessing and interpreting the results of unsupervised lifelong learning in real time, wherein samples cannot be reprocessed due to memory and/or application constraints.
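
    The core idea behind such iCVIs is that batch indices like Calinski-Harabasz can be rewritten in terms of running cluster statistics (counts, means, and scatter terms) that are cheap to update per sample. The following minimal sketch maintains those statistics with Welford-style updates and recomputes the index after every sample; it illustrates the general recipe only and is not the paper's exact iCH formulation (the class and parameter names here are hypothetical).

        import numpy as np

        class IncrementalCH:
            """Sketch of an incremental Calinski-Harabasz index: per-cluster
            counts, means, and within-cluster scatters are updated online,
            so no sample from the stream is ever revisited."""

            def __init__(self, dim):
                self.dim = dim
                self.n = 0                       # total samples seen
                self.mu = np.zeros(dim)          # global mean
                self.counts, self.means, self.wgss = {}, {}, {}

            def update(self, x, label):
                x = np.asarray(x, dtype=float)
                self.n += 1
                self.mu += (x - self.mu) / self.n
                if label not in self.counts:
                    self.counts[label] = 0
                    self.means[label] = np.zeros(self.dim)
                    self.wgss[label] = 0.0
                self.counts[label] += 1
                delta = x - self.means[label]          # Welford update
                self.means[label] += delta / self.counts[label]
                self.wgss[label] += float(delta @ (x - self.means[label]))

            def value(self):
                k = len(self.counts)
                if k < 2 or self.n <= k:
                    return float("nan")          # CH undefined for k < 2
                ss_w = sum(self.wgss.values())
                ss_b = sum(c * float((m - self.mu) @ (m - self.mu))
                           for c, m in zip(self.counts.values(),
                                           self.means.values()))
                return float("inf") if ss_w == 0 else \
                    (ss_b / (k - 1)) / (ss_w / (self.n - k))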

    An Optimization Approach to Selecting Anomaly Detection Methods for Homogeneous Text Collections

    The problem of detecting anomalous documents in text collections is considered. Existing anomaly detection methods are not universal and do not show stable results across different data sets. The accuracy of the results depends on the choice of parameters at each step of the algorithm, and different parameter sets are optimal for different collections. Moreover, not all existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity. The problem is considered in the following setting: a new document uploaded to an applied intelligent information system must be checked for conformance with the homogeneous collection of documents stored in it. In such systems, which process legally significant documents, the following requirements are imposed on anomaly detection methods: high accuracy, computational efficiency, reproducibility of results, and explainability of the decision. Methods satisfying these conditions are investigated. The paper examines the possibility of scoring text documents on an anomaly scale by deliberately introducing a foreign document into the collection. A strategy for detecting a document's novelty relative to the collection is proposed, based on a justified selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods, and parameters of the anomaly detection algorithms. The experiment was conducted on two homogeneous collections of normative-technical documents: standards in the field of information technology and in the railway domain. Two approaches were used: computing an anomaly index as the Hellinger distance between the distributions of document proximity to the center of the collection and to the foreign document, and optimizing the anomaly detection algorithms depending on the vectorization and dimensionality reduction methods. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The Isolation Forest, Local Outlier Factor, and One-Class SVM (a variant of the Support Vector Machine) algorithms were tested. The experiment confirmed the effectiveness of the proposed optimization strategy for determining a suitable anomaly detection method for a given text collection. When searching for anomalies in the context of topic clustering of legally significant documents, the Isolation Forest method proved effective. When vectorizing documents with TF-IDF, it is advisable to select optimal dictionary parameters and to use the One-Class SVM method with an appropriate feature-space transformation (kernel) function.
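
    As a rough illustration of the TF-IDF branch of such a pipeline, the sketch below fits detectors on a homogeneous collection and scores a new document using scikit-learn's TfidfVectorizer, TruncatedSVD, IsolationForest, and OneClassSVM. The parameter values are placeholders, and the ARTM topic-modeling branch and the Hellinger-distance anomaly index are omitted; this is a sketch of the setup described above, not the authors' code.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.ensemble import IsolationForest
        from sklearn.svm import OneClassSVM

        def score_new_document(collection, new_doc,
                               max_features=5000, n_components=100):
            # TF-IDF vectorization of the homogeneous collection
            vec = TfidfVectorizer(max_features=max_features)
            X = vec.fit_transform(collection)
            # dimensionality reduction to tame the sparse, high-dimensional space
            svd = TruncatedSVD(n_components=min(n_components, X.shape[1] - 1),
                               random_state=0)
            X_red = svd.fit_transform(X)
            # two of the tested detector families, fit on the collection only
            detectors = {
                "isolation_forest": IsolationForest(random_state=0).fit(X_red),
                "one_class_svm": OneClassSVM(kernel="rbf", nu=0.05).fit(X_red),
            }
            x_new = svd.transform(vec.transform([new_doc]))
            # decision_function < 0 flags the new document as anomalous
            return {name: float(det.decision_function(x_new)[0])
                    for name, det in detectors.items()}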

    A New Derivative of Midimew-Connected Mesh Network

    In this paper, we present a derivative of the Midimew-connected Mesh Network (MMN), called the Derived MMN (DMMN), obtained by reassigning the free links of the higher-level interconnection for optimum performance. We present the architecture of the DMMN, its node addressing and message routing, and evaluate its static network performance. It is shown that the proposed DMMN possesses several attractive features, including constant degree, small diameter, low cost, small average distance, moderate bisection width, and the same fault-tolerant performance as other conventional and hierarchical interconnection networks. With the same node degree, arc connectivity, bisection width, and wiring complexity, the average distance of the DMMN is lower than that of other networks.
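
    The static metrics quoted here (diameter, average distance, and so on) are typically measured by running breadth-first search from every node of the topology. The sketch below computes them for a plain 2D mesh as a stand-in; the higher-level midimew links that distinguish the MMN and DMMN are omitted, since their exact assignment is not specified in this abstract.

        from collections import deque
        from itertools import product

        def mesh_neighbors(node, rows, cols):
            # 4-neighbor 2D mesh; the basic level of an MMN-style network
            r, c = node
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    yield (nr, nc)

        def static_metrics(rows, cols):
            # diameter and average distance via BFS from every node
            nodes = list(product(range(rows), range(cols)))
            total, diameter = 0, 0
            for src in nodes:
                dist = {src: 0}
                queue = deque([src])
                while queue:
                    u = queue.popleft()
                    for v in mesh_neighbors(u, rows, cols):
                        if v not in dist:
                            dist[v] = dist[u] + 1
                            queue.append(v)
                total += sum(dist.values())
                diameter = max(diameter, max(dist.values()))
            n = len(nodes)
            return diameter, total / (n * (n - 1))

        # e.g. static_metrics(8, 8) -> (14, ~5.33) for a plain 8x8 mesh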

    Concept drift learning and its application to adaptive information filtering

    Tracking the evolution of user interests is an instance of concept drift learning. Keeping track of multiple interest categories is both a natural phenomenon and an interesting tracking problem, because interests can emerge and diminish over different time frames. The first part of this dissertation presents the Multiple Three-Descriptor Representation (MTDR) algorithm, a novel concept drift learning algorithm built specifically for tracking the dynamics of multiple target concepts in the information filtering domain. Its learning process combines long-term and short-term interest (concept) models in an attempt to benefit from the strengths of both, and the MTDR algorithm improves over existing concept drift learning algorithms in the domain. Tracking multiple target concepts from only a few examples poses an even more important and challenging problem, because casual users tend to be reluctant to provide the examples needed, and learning from few labeled data is generally difficult. The second part presents a computational Framework for Extending Incomplete Labeled Data Stream (FEILDS). The system modularly extends the capability of an existing concept drift learner to deal with an incompletely labeled data stream: it expands the learner's original input stream with relevant unlabeled data, generating a new stream with improved learnability. FEILDS employs a concept formation system to organize its input stream into a concept (cluster) hierarchy, which it uses to identify each instance's concept and the unlabeled data relevant to a concept. It also adopts the persistence assumption from temporal reasoning to infer the relevance of concepts. Empirical evaluation indicates that FEILDS improves the performance of existing learners, particularly when learning from a stream with few labeled data. Lastly, a new concept formation algorithm, one of the key components of the FEILDS architecture, is presented. Its main idea is to discover intrinsic hierarchical structures regardless of the class distribution and the shape of the input stream. Experimental evaluation shows that the algorithm is relatively robust to input ordering, consistently producing hierarchy structures of high quality.
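
    As a toy illustration of the long-term/short-term combination mentioned above, the sketch below keeps two exponentially weighted interest centroids that forget at different rates and scores a document against both. It is a hypothetical simplification of the idea, not the MTDR algorithm itself (the class name and decay rates are invented for illustration).

        import numpy as np

        class DualMemoryInterestModel:
            """Two interest memories over document vectors: the short-term
            one adapts quickly to drift, the long-term one is stable."""

            def __init__(self, dim, short_rate=0.5, long_rate=0.05):
                self.short = np.zeros(dim)
                self.long = np.zeros(dim)
                self.short_rate = short_rate
                self.long_rate = long_rate

            def learn(self, doc_vec):
                # exponentially weighted centroid updates
                doc_vec = np.asarray(doc_vec, dtype=float)
                self.short += self.short_rate * (doc_vec - self.short)
                self.long += self.long_rate * (doc_vec - self.long)

            def score(self, doc_vec):
                def cos(a, b):
                    na, nb = np.linalg.norm(a), np.linalg.norm(b)
                    return float(a @ b / (na * nb)) if na and nb else 0.0
                # a document is relevant if it matches either memory
                doc_vec = np.asarray(doc_vec, dtype=float)
                return max(cos(self.short, doc_vec), cos(self.long, doc_vec))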

    Incorporating Reconfigurability, Error Detection and Recovery into the Chameleon ARMOR Architecture

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. Jet Propulsion Lab / NASA JPL 96134.

    Neuroengineering of Clustering Algorithms

    Cluster analysis can be broadly divided into multivariate data visualization, clustering algorithms, and cluster validation. This dissertation contributes neural network-based techniques for all three unsupervised learning tasks. The first paper provides a comprehensive review of adaptive resonance theory (ART) models for engineering applications and provides context for the four subsequent papers. These papers are devoted to enhancements of ART-based clustering algorithms from (a) a practical perspective, by exploiting the visual assessment of cluster tendency (VAT) sorting algorithm as a preprocessor for ART offline training, thus mitigating ordering effects; and (b) an engineering perspective, by designing a family of multi-criteria ART models: dual vigilance fuzzy ART and distributed dual vigilance fuzzy ART (both capable of detecting complex cluster structures), merge ART (which aggregates partitions and lessens ordering effects in online learning), and cluster validity index vigilance in fuzzy ART (which features robust vigilance parameter selection and alleviates ordering effects in offline learning). The sixth paper enhances data visualization with self-organizing maps (SOMs) by depicting information-theoretic similarity measures between neighboring neurons on the reduced-dimension, topology-preserving SOM grid. This visualization's parameters are estimated using samples selected via a single-linkage procedure, thereby generating heatmaps that portray more homogeneous within-cluster similarities and crisper between-cluster boundaries. The seventh paper presents incremental cluster validity indices (iCVIs) realized by (a) incorporating existing formulations of online computations for clusters' descriptors, or (b) modifying an existing ART-based model and incrementally updating local density counts between prototypes. Moreover, this last paper provides the first comprehensive comparison of iCVIs in the computational intelligence literature.
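
    For context, the vigilance mechanism at the heart of these ART variants works as follows: a complement-coded sample resonates with a category only if their fuzzy intersection is large enough relative to the sample; otherwise a new category is created. The sketch below shows one textbook fuzzy ART presentation with a single vigilance threshold; the dual vigilance models in this dissertation add a second, lower threshold that maps categories onto clusters. This is a generic illustration, not the dissertation's code.

        import numpy as np

        def complement_code(x):
            # fuzzy ART input encoding: [x, 1 - x] for x in [0, 1]^d
            x = np.asarray(x, dtype=float)
            return np.concatenate([x, 1.0 - x])

        def fuzzy_art_step(x, weights, rho=0.7, alpha=1e-3, beta=1.0):
            # visit categories in order of the choice (activation) function
            order = sorted(range(len(weights)),
                           key=lambda j: -np.minimum(x, weights[j]).sum()
                                         / (alpha + weights[j].sum()))
            for j in order:
                match = np.minimum(x, weights[j]).sum() / x.sum()
                if match >= rho:                 # vigilance test passed
                    weights[j] = (beta * np.minimum(x, weights[j])
                                  + (1 - beta) * weights[j])
                    return j
            weights.append(x.copy())             # no resonance: new category
            return len(weights) - 1

        # usage: weights = []; label = fuzzy_art_step(complement_code(s), weights)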

    Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision

    Named Entity Recognition (NER) aims to extract and classify rigid designators in text, such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules and towards machine learning approaches. Still, recent machine learning approaches suffer from limited annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems. In this thesis, we present an NER system built with very little supervision: human supervision is limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprising 100 NE types. Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution. We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems on the task of annotating NEs in texts.
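
    Of the contributions listed, acronym detection is the most self-contained, and a common heuristic is to test whether the acronym's letters occur, in order, as initials of the words in a candidate expansion. The sketch below implements that heuristic as a toy approximation; it is not the algorithm developed in the thesis.

        import re

        def acronym_matches(acronym, phrase):
            # greedy pass: each acronym letter must appear, in order,
            # as the first letter of some word (other words may be skipped)
            letters = [ch for ch in acronym.lower() if ch.isalpha()]
            words = re.findall(r"[a-z]+", phrase.lower())
            i = 0
            for w in words:
                if i < len(letters) and w[0] == letters[i]:
                    i += 1
            return i == len(letters)

        # e.g. acronym_matches("NER", "named entity recognition") -> True
        #      acronym_matches("NER", "neural networks")          -> False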