6 research outputs found

    A Semantic Graph-Based Approach for Mining Common Topics From Multiple Asynchronous Text Streams

    In the age of Web 2.0, a substantial amount of unstructured content is distributed through multiple text streams in an asynchronous fashion, which makes it increasingly difficult to glean and distill useful information. An effective way to explore the information in text streams is topic modelling, which can further facilitate other applications such as search, information browsing, and pattern mining. In this paper, we propose a semantic graph-based topic modelling approach for structuring asynchronous text streams. Our model integrates topic mining and time synchronization, two core modules for addressing the problem, into a unified model. Specifically, to handle lexical gap issues, we use global semantic graphs of each timestamp to capture the hidden interactions among entities from all the text streams. To deal with the source asynchronism problem, local semantic graphs are employed to discover similar topics of different entities that may be separated by time gaps. Our experiments on two real-world datasets show that the proposed model significantly outperforms existing ones.

    Orthonormal Explicit Topic Analysis for Cross-lingual Document Matching

    McCrae J, Cimiano P, Klinger R. Orthonormal Explicit Topic Analysis for Cross-lingual Document Matching. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 1732-1740.

    Neuromorphic Learning Systems for Supervised and Unsupervised Applications

    Advances in high-performance computing (HPC) have enabled the large-scale implementation of neuromorphic learning models and pushed research on computational intelligence into a new era. These bio-inspired models are constructed from unified building blocks, i.e. neurons, and have shown potential for learning complex information. Two major challenges remain in neuromorphic computing. First, sophisticated structuring methods are needed to determine the connectivity of the neurons in order to model various problems accurately. Second, the models need to adapt to non-traditional architectures for improved computation speed and energy efficiency. In this thesis, we address these two problems and apply our techniques to different cognitive applications. This thesis first presents the self-structured confabulation network for anomaly detection. Among machine learning applications, unsupervised detection of anomalous streams is especially challenging because it requires both detection accuracy and real-time performance. Designing a computing framework that harnesses the growing computing power of multicore systems while maintaining high sensitivity and specificity to anomalies is an urgent research need. We present AnRAD (Anomaly Recognition And Detection), a bio-inspired detection framework that performs probabilistic inference. We leverage the mutual information between features and develop a self-structuring procedure that learns a succinct confabulation network from unlabeled data. This network is capable of fast incremental learning, which continuously refines the knowledge base from the data streams. Compared to several existing anomaly detection methods, the proposed approach provides competitive detection accuracy as well as insight into the reasoning behind its decisions. Furthermore, we exploit the massively parallel structure of the AnRAD framework.
Our implementations of the recall algorithms on the graphics processing unit (GPU) and the Xeon Phi co-processor both obtain substantial speedups over the sequential implementation on a general-purpose microprocessor (GPP). The implementation enables real-time service to concurrent data streams with diversified contexts, and can be applied to large problems with multiple local patterns. Experimental results demonstrate high computing performance and memory efficiency. For vehicle abnormal behavior detection, the framework is able to monitor up to 16000 vehicles and their interactions in real time with a single commodity co-processor, and uses less than 0.2ms for each testing subject. When adapting our streaming anomaly detection model to mobile devices or unmanned systems, the key challenge is to deliver the required performance under a stringent power constraint. To address the tension between performance and power consumption, brain-inspired hardware, such as the IBM Neurosynaptic System, has been developed to enable low-power implementation of neural models. As a follow-up to the AnRAD framework, we propose to port the detection network to the TrueNorth architecture. Implementing inference-based anomaly detection on a neurosynaptic processor is not straightforward due to hardware limitations. A design flow and a supporting component library are developed to flexibly map the learned detection networks to the neurosynaptic cores. Instead of the popular rate code, burst code is adopted in the design, which represents a numerical value using the phase of a burst of spike trains. This not only reduces the hardware complexity, but also increases the accuracy of the results. A Corelet library, NeoInfer-TN, is implemented for basic operations in burst code, and two-phase pipelines are constructed from the library components. The design can be configured for different tradeoffs between detection accuracy, hardware resource consumption, throughput and energy.
We evaluate the system using network intrusion detection data streams. The results show a higher detection rate than some conventional approaches and real-time performance, with only 50mW power consumption. Overall, it achieves 10^8 operations per Joule. In addition to the modeling and implementation of unsupervised anomaly detection, we also investigate a supervised learning model based on neural networks and deep fragment embedding and apply it to text-image retrieval. The study aims at bridging the gap between image and natural language. It further improves bidirectional retrieval performance across the modalities. Unlike existing works that target single sentences densely describing the image objects, we extend the problem to associating deep image representations with noisy texts that are only loosely correlated. Based on text-image fragment embedding, our model employs a sequential configuration that connects two embedding stages. The first stage learns the relevancy of the text fragments, and the second stage uses the filtered output from the first one to improve the matching results. The model also integrates multiple convolutional neural networks (CNNs) to construct the image fragments, in which rich context information such as human faces can be extracted to increase the alignment accuracy. The proposed method is evaluated with both a synthetic dataset and a real-world dataset collected from a picture news website. The results show up to 50% ranking performance improvement over the comparison models.

    Learning Algorithm to Automate Fast Author Name Disambiguation

    Worldwide scientific production represents a massive number of records that can be accessed through numerous databases.
Because of the presence of ambiguous records, a time-efficient disambiguation process is required as an essential step in extracting correct information and generating publication statistics. However, the disambiguation task is exhaustive and complex due to the large volume of the databases and missing data. Currently there is no complete automatic method that is able to produce satisfactory results for the disambiguation process. Previously, an efficient entity disambiguation application was developed: a supervised cascade algorithm that gives promising results on large bibliographic databases. Although the existing work produces high-quality results within a reasonable processing time, it lacks an efficient choice of metrics, and the structure of the classifiers is determined heuristically by analyzing precision and recall errors. Clearly, an automated approach that makes the application flexible and adjustable would directly enhance its usability. Such an approach would help to understand the importance of each feature classification in the disambiguation process and to select the most efficient ones. In this research, we propose a learning algorithm for automating the disambiguation process of this application. The aim of this work is to help employ the most appropriate phonetic algorithm and similarity measures, and to introduce an automatic approach in place of a heuristic one. To achieve our goals, we conduct three major steps. First, we address the problem of evaluating phonetic encoding algorithms that can be used in blocking. Six commonly used phonetic encoding algorithms were selected, and specific quantitative evaluation metrics were developed in order to assess their limitations and advantages and select the best one. Second, we test different string similarity measures and analyze the advantages and disadvantages of each technique.
In other words, our second goal is to build an efficient disambiguation method by comparing several edit- and token-based algorithms to improve the blocking method. Finally, using bootstrap aggregating (Bagging) and AdaBoost methods, an algorithm has been developed that employs particle swarm and set cover optimization techniques to design a learning framework that enables automatic ordering of the weak classifiers and determination of their thresholds. Performance comparisons were carried out on real data extracted from the Web of Science (WoS) and SCOPUS bibliographic databases. In summary, this work allows us to draw conclusions about the strengths and weaknesses of each phonetic algorithm and similarity measure from the perspective of our application. We have shown that the NYSIIS phonetic algorithm is a better choice for the blocking step of the disambiguation application. In addition, the Weighting Table-based algorithm outperforms some of the commonly used similarity algorithms in terms of time efficiency, while producing satisfactory results. Moreover, we propose a learning method to determine the structure of the disambiguation algorithm automatically.
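
The distinction the abstract draws between edit-based and token-based string similarity can be illustrated with a minimal sketch. It is not the thesis's implementation: Levenshtein distance and Jaccard overlap are chosen here only as common representatives of each family, and the function names and sample author names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit-based: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance normalised to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def jaccard_similarity(a: str, b: str) -> float:
    """Token-based: overlap of whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Hypothetical author-name variants, for illustration only.
    print(edit_similarity("smith", "smyth"))
    print(jaccard_similarity("John A Smith", "Smith John"))
```

Edit-based measures tolerate spelling variants within a single token, while token-based measures tolerate reordered or missing name parts; a disambiguation pipeline may need both kinds of tolerance.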

    Understanding and Enhancing the Use of Context for Machine Translation

    To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language which is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult with neural networks. Meaning is often determined from context. With context, languages allow meaning to be conveyed even when the specific words used are not known by the reader. To model this learning process, a system has to learn from a few instances in context and be able to generalize well to unseen cases. The learning process is hindered when training data is scarce for a task. Even with sufficient data, learning patterns for the long tail of the lexical distribution is challenging. In this thesis, we focus on understanding certain potentials of contexts in neural models and design augmentation models to benefit from them. We focus on machine translation as an important instance of the more general language understanding problem. To translate from a source language to a target language, a neural model has to understand the meaning of constituents in the provided context and generate constituents with the same meanings in the target language. This task accentuates the value of capturing nuances of language and the necessity of generalization from few observations. The main problem we study in this thesis is what neural machine translation models learn from data and how we can devise more focused contexts to enhance this learning. Looking more in-depth into the role of context and the impact of data on learning models is essential to advance the NLP field. Moreover, it helps highlight the vulnerabilities of current neural networks and provides insights into designing more robust models. Comment: PhD dissertation defended on November 10th, 202