413 research outputs found

    Representation Learning for Words and Entities

    Get PDF
    This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning representations of words called Multiview Latent Semantic Analysis (MVLSA). By incorporating up to 46 different types of co-occurrence statistics for the same vocabulary of english words, I show that MVLSA outperforms other state-of-the-art word embedding models. Next, I focus on learning entity representations for search and recommendation and present the second method of this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move from unstructured data and focus on structured knowledge graphs. I present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints.Comment: phd thesis, Machine Learning, Natural Language Processing, Representation Learning, Knowledge Graphs, Entities, Word Embeddings, Entity Embedding

    Representation Learning for Words and Entities

    Get PDF
    This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning representations of words called Multiview LSA (MVLSA). Through experiments on close to 50 different views, I show that MVLSA outperforms other state-of-the-art word embedding models. After that, I focus on learning entity representations for search and recommendation and present the second algorithm of this thesis called Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move from unstructured data and focus on structured knowledge graphs. Moreover, I present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints

    A Graph-Based Approach for the Summarization of Scientific Articles

    Get PDF
    Automatic text summarization is one of the eminent applications in the field of Natural Language Processing. Text summarization is the process of generating a gist from text documents. The task is to produce a summary which contains important, diverse and coherent information, i.e., a summary should be self-contained. The approaches for text summarization are conventionally extractive. The extractive approaches select a subset of sentences from an input document for a summary. In this thesis, we introduce a novel graph-based extractive summarization approach. With the progressive advancement of research in the various fields of science, the summarization of scientific articles has become an essential requirement for researchers. This is our prime motivation in selecting scientific articles as our dataset. This newly formed dataset contains scientific articles from the PLOS Medicine journal, which is a high impact journal in the field of biomedicine. The summarization of scientific articles is a single-document summarization task. It is a complex task due to various reasons, one of it being, the important information in the scientific article is scattered all over it and another reason being, scientific articles contain numerous redundant information. In our approach, we deal with the three important factors of summarization: importance, non-redundancy and coherence. To deal with these factors, we use graphs as they solve data sparsity problems and are computationally less complex. We employ bipartite graphical representation for the summarization task, exclusively. We represent input documents through a bipartite graph that consists of sentence nodes and entity nodes. This bipartite graph representation contains entity transition information which is beneficial for selecting the relevant sentences for a summary. We use a graph-based ranking algorithm to rank the sentences in a document. The ranks are considered as relevance scores of the sentences which are further used in our approach. Scientific articles contain reasonable amount of redundant information, for example, Introduction and Methodology sections contain similar information regarding the motivation and approach. In our approach, we ensure that the summary contains sentences which are non-redundant. Though the summary should contain important and non-redundant information of the input document, its sentences should be connected to one another such that it becomes coherent, understandable and simple to read. If we do not ensure that a summary is coherent, its sentences may not be properly connected. This leads to an obscure summary. Until now, only few summarization approaches take care of coherence. In our approach, we take care of coherence in two different ways: by using the graph measure and by using the structural information. We employ outdegree as the graph measure and coherence patterns for the structural information, in our approach. We use integer programming as an optimization technique, to select the best subset of sentences for a summary. The sentences are selected on the basis of relevance, diversity and coherence measure. The computation of these measures is tightly integrated and taken care of simultaneously. We use human judgements to evaluate coherence of summaries. We compare ROUGE scores and human judgements of different systems on the PLOS Medicine dataset. Our approach performs considerably better than other systems on this dataset. Also, we apply our approach on the standard DUC 2002 dataset to compare the results with the recent state-of-the-art systems. The results show that our graph-based approach outperforms other systems on DUC 2002. In conclusion, our approach is robust, i.e., it works on both scientific and news articles. Our approach has the further advantage of being semi-supervised

    Web knowledge bases

    Get PDF
    Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems — enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated kbs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in terms of the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, musical and sporting domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding node in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia are able to complement and in some cases completely replace the role of resources typically derived from the KB. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim to first discover, distill and finally transform this knowledge into forms which will ultimately be useful in downstream language understanding tasks

    Behavioral analysis in cybersecurity using machine learning: a study based on graph representation, class imbalance and temporal dissection

    Get PDF
    The main goal of this thesis is to improve behavioral cybersecurity analysis using machine learning, exploiting graph structures, temporal dissection, and addressing imbalance problems.This main objective is divided into four specific goals: OBJ1: To study the influence of the temporal resolution on highlighting micro-dynamics in the entity behavior classification problem. In real use cases, time-series information could be not enough for describing the entity behavior classification. For this reason, we plan to exploit graph structures for integrating both structured and unstructured data in a representation of entities and their relationships. In this way, it will be possible to appreciate not only the single temporal communication but the whole behavior of these entities. Nevertheless, entity behaviors evolve over time and therefore, a static graph may not be enoughto describe all these changes. For this reason, we propose to use a temporal dissection for creating temporal subgraphs and therefore, analyze the influence of the temporal resolution on the graph creation and the entity behaviors within. Furthermore, we propose to study how the temporal granularity should be used for highlighting network micro-dynamics and short-term behavioral changes which can be a hint of suspicious activities. OBJ2: To develop novel sampling methods that work with disconnected graphs for addressing imbalanced problems avoiding component topology changes. Graph imbalance problem is a very common and challenging task and traditional graph sampling techniques that work directly on these structures cannot be used without modifying the graph’s intrinsic information or introducing bias. Furthermore, existing techniques have shown to be limited when disconnected graphs are used. For this reason, novel resampling methods for balancing the number of nodes that can be directly applied over disconnected graphs, without altering component topologies, need to be introduced. In particular, we propose to take advantage of the existence of disconnected graphs to detect and replicate the most relevant graph components without changing their topology, while considering traditional data-level strategies for handling the entity behaviors within. OBJ3: To study the usefulness of the generative adversarial networks for addressing the class imbalance problem in cybersecurity applications. Although traditional data-level pre-processing techniques have shown to be effective for addressing class imbalance problems, they have also shown downside effects when highly variable datasets are used, as it happens in cybersecurity. For this reason, new techniques that can exploit the overall data distribution for learning highly variable behaviors should be investigated. In this sense, GANs have shown promising results in the image and video domain, however, their extension to tabular data is not trivial. For this reason, we propose to adapt GANs for working with cybersecurity data and exploit their ability in learning and reproducing the input distribution for addressing the class imbalance problem (as an oversampling technique). Furthermore, since it is not possible to find a unique GAN solution that works for every scenario, we propose to study several GAN architectures with several training configurations to detect which is the best option for a cybersecurity application. OBJ4: To analyze temporal data trends and performance drift for enhancing cyber threat analysis. Temporal dynamics and incoming new data can affect the quality of the predictions compromising the model reliability. This phenomenon makes models get outdated without noticing. In this sense, it is very important to be able to extract more insightful information from the application domain analyzing data trends, learning processes, and performance drifts over time. For this reason, we propose to develop a systematic approach for analyzing how the data quality and their amount affect the learning process. Moreover, in the contextof CTI, we propose to study the relations between temporal performance drifts and the input data distribution for detecting possible model limitations, enhancing cyber threat analysis.Programa de Doctorado en Ciencias y Tecnologías Industriales (RD 99/2011) Industria Zientzietako eta Teknologietako Doktoretza Programa (ED 99/2011

    Lifelong learning of concepts in CRAFT

    Full text link
    La planification à des niveaux d’abstraction plus élevés est essentielle lorsqu’il s’agit de résoudre des tâches à long horizon avec des complexités hiérarchiques. Pour planifier avec succès à un niveau d’abstraction donné, un agent doit comprendre le fonctionnement de l’environnement à ce niveau particulier. Cette compréhension peut être implicite en termes de politiques, de fonctions de valeur et de modèles, ou elle peut être définie explicitement. Dans ce travail, nous introduisons les concepts comme un moyen de représenter et d’accumuler explicitement des informations sur l’environnement. Les concepts sont définis en termes de transition d’état et des conditions requises pour que cette transition ait lieu. La simplicité de cette définition offre flexibilité et contrôle sur le processus d’apprentissage. Étant donné que les concepts sont de nature hautement interprétable, il est facile d’encoder les connaissances antérieures et d’intervenir au cours du processus d’apprentissage si nécessaire. Cette définition facilite également le transfert de concepts entre différents domaines. Les concepts, à un niveau d’abstraction donné, sont intimement liés aux compétences, ou actions temporellement abstraites. Toutes les transitions d’état suffisamment importantes pour être représentées par un concept se produisent après l’exécution réussie d’une compétence. En exploitant cette relation, nous introduisons un cadre qui facilite l’apprentissage tout au long de la vie et le raffinement des concepts à différents niveaux d’abstraction. Le cadre comporte trois volets: Le sytème 1 segmente un flux d’expérience (par exemple une démonstration) en une séquence de compétences. Cette segmentation peut se faire à différents niveaux d’abstraction. Le sytème 2 analyse ces segments pour affiner et mettre à niveau son ensemble de concepts, lorsqu’applicable. Le sytème 3 utilise les concepts disponibles pour générer un graphe de dépendance de sous-tâches. Ce graphe peut être utilisé pour planifier à différents niveaux d’abstraction. Nous démontrons l’applicabilité de ce cadre dans l’environnement hiérarchique 2D CRAFT. Nous effectuons des expériences pour explorer comment les concepts peuvent être appris de différents flux d’expérience et comment la qualité de la base de concepts affecte l’optimalité du plan général. Dans les tâches avec des dépendances de sous-tâches complexes, où la plupart des algorithmes ne parviennent pas à se généraliser ou prennent un temps impraticable à converger, nous démontrons que les concepts peuvent être utilisés pour simplifier considérablement la planification. Ce cadre peut également être utilisé pour comprendre l’intention d’une démonstration donnée en termes de concepts. Cela permet à l’agent de répliquer facilement la démonstration dans différents environnements. Nous montrons que cette méthode d’imitation est beaucoup plus robuste aux changements de configuration de l’environnement que les méthodes traditionnelles. Dans notre formulation du problème, nous faisons deux hypothèses: 1) que nous avons accès à un ensemble de compétences suffisamment exhaustif, et 2) que notre agent a accès à des environnements de pratique, qui peuvent être utilisés pour affiner les concepts en cas de besoin. L’objectif de ce travail est d’explorer l’aspect pratique des concepts d’apprentissage comme moyen d’améliorer la compréhension de l’environnement. Dans l’ensemble, nous démontrons que les concepts d’apprentissagePlanning at higher levels of abstraction is critical when it comes to solving long horizon tasks with hierarchical complexities. To plan successfully at a given level of abstraction, an agent must have an understanding of how the environment functions at that particular level. This understanding may be implicit in terms of policies, value functions, and world models, or it can be defined explicitly. In this work, we introduce concepts as a means to explicitly represent and accumulate information about the environment. Concepts are defined in terms of a state transition and the conditions required for that transition to take place. The simplicity of this definition offers flexibility and control over the learning process. Since concepts are highly interpretable in nature, it is easy to encode prior knowledge and intervene during the learning process if necessary. This definition also makes it relatively straightforward to transfer concepts across different domains wherever applicable. Concepts, at a given level of abstraction, are intricately linked to skills, or temporally abstracted actions. All the state transitions significant enough to be represented by a concept occur only after the successful execution of a skill. Exploiting this relationship, we introduce a framework that aids in lifelong learning and refining of concepts across different levels of abstraction. The framework has three components: - System 1 segments a stream of experience (e.g. a demonstration) into a sequence of skills. This segmentation can be done at different levels of abstraction. - System 2 analyses these segments to refine and upgrade its set of concepts, whenever applicable. - System 3 utilises the available concepts to generate a sub-task dependency graph. This graph can be used for planning at different levels of abstraction We demonstrate the applicability of this framework in the 2D hierarchical environment CRAFT. We perform experiments to explore how concepts can be learned from different streams of experience, and how the quality of the concept base affects the optimality of the overall plan. In tasks with complex sub-task dependencies, where most algorithms fail to generalise or take an impractical amount of time to converge, we demonstrate that concepts can be used to significantly simplify planning. This framework can also be used to understand the intention of a given demonstration in terms of concepts. This makes it easy for the agent to replicate a demonstration in different environments. We show that this method of imitation is much more robust to changes in the environment configurations than traditional methods. In our problem formulation, we make two assumptions: 1) that we have access to a sufficiently exhaustive set of skills, and 2) that our agent has access to practice environments, which can be used to refine concepts when needed. The objective behind this work is to explore the practicality of learning concepts as a means to improve one’s understanding about the environment. Overall, we demonstrate that learning concepts can be a light-weight yet efficient way to increase the capability of a system

    Representation Learning: A Review and New Perspectives

    Full text link
    The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning
    • …
    corecore