873 research outputs found

    Open Domain Event Extraction Using Neural Latent Variable Models

    We consider open domain event extraction, the task of extracting unconstrained types of events from news clusters. A novel latent variable neural model is constructed, which is scalable to very large corpora. A dataset is collected and manually annotated, and task-specific evaluation metrics are designed. Results show that the proposed unsupervised model outperforms the state-of-the-art method for event schema induction. Comment: accepted by ACL 201

    Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art

    Information extraction, data integration, and uncertain data management are distinct research areas that have received extensive attention over the last two decades. Many studies have tackled these areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information. Handling uncertainty in the extraction and integration process is important for improving the quality of the data in such integrated systems. This article presents the state of the art in these research areas, identifies their common ground, and shows how to integrate information extraction and data integration under uncertainty management cover

    Low-rank regularization for high-dimensional sparse conjunctive feature spaces in information extraction

    Version with two sections cut, due to the publisher's rights. One of the challenges in Natural Language Processing (NLP) is the unstructured nature of texts, in which useful information is not easily identifiable. Information Extraction (IE) aims to alleviate this by enabling automatic extraction of structured information from such text sources. The resulting structured information facilitates easier querying, organizing, and analyzing of data from texts. In this thesis, we are interested in two IE-related tasks: (i) named entity classification and (ii) template filling. Specifically, this thesis examines the problem of learning classifiers of text spans and explores its application to extracting named entities and template slot-fillers. In general, our goal is to construct a method to learn classifiers that: (i) require less supervision, (ii) work well with high-dimensional sparse feature spaces and (iii) are able to classify unseen items (i.e. named entities/slot-fillers not observed in training data). The key idea of our contribution is the utilization of unseen conjunctive features. A conjunctive feature is a combination of features from different feature sets. For example, to classify a phrase, one might have one feature set for the context and another set for the phrase itself. When learning a classifier, only a fraction of these conjunctive features will be observed in the training set, leaving the rest (i.e. unseen features) unusable for predicting items at test time. We hypothesize that utilizing such unseen conjunctions is useful for addressing all aspects of the goal. We develop a general regularization framework specifically designed for sparse conjunctive feature spaces. Our strategy is based on employing tensors to represent the conjunctive feature space, and forcing the model to induce low-dimensional embeddings of the feature vectors via low-rank regularization on the tensor parameters. Such a compressed representation helps prediction by generalizing to novel examples where most of the conjunctions are unseen in the training set. We conduct experiments on learning named entity classifiers and template filling, focusing on extracting unseen items. We show that when learning classifiers under minimal supervision, our approach is more effective in controlling model capacity than standard techniques for linear classification. Postprint (published version

    A Diffusion Model for Event Skeleton Generation

    Event skeleton generation, which aims to induce an event schema skeleton graph with abstracted event nodes and their temporal relations from a set of event instance graphs, is a critical step in the temporal complex event schema induction task. Existing methods effectively address this task from a graph generation perspective but suffer from noise sensitivity and error accumulation, e.g., the inability to correct errors while generating a schema. We therefore propose a novel Diffusion Event Graph Model (DEGM) to address these issues. Our DEGM is the first workable diffusion model for event skeleton generation, in which embedding and rounding techniques with a custom edge-based loss are introduced to transform a discrete event graph into a learnable latent representation. Furthermore, we propose a denoising training process to maintain the model's robustness. Consequently, DEGM derives the final schema, where error correction is guaranteed by iteratively refining the latent representation during the schema generation process. Experimental results on three IED bombing datasets demonstrate that our DEGM achieves better results than other state-of-the-art baselines. Our code and data are available at https://github.com/zhufq00/EventSkeletonGeneration

    Unsupervised Induction of Frame-Based Linguistic Forms

    This thesis studies the use of bulk, structured, linguistic annotations in order to perform unsupervised induction of meaning for three kinds of linguistic forms: words, sentences, and documents. The primary linguistic annotations I consider throughout this thesis are frames, which encode core linguistic, background or societal knowledge necessary to understand abstract concepts and real-world situations. I begin with an overview of linguistically-based structured meaning representation; I then analyze available large-scale natural language processing (NLP) and linguistic resources and corpora for their abilities to accommodate bulk, automatically-obtained frame annotations. I then proceed to induce meanings of the different forms, progressing from the word level, to the sentence level, and finally to the document level. I first show how to use these bulk annotations in order to better encode semantic expectations, backed by linguistics and cognitive science, within word forms. I then demonstrate a straightforward approach for learning large lexicalized and refined syntactic fragments, which encode and memoize commonly used phrases and linguistic constructions. Next, I consider two unsupervised models for document and discourse understanding; one is a purely generative approach that naturally accommodates layer annotations and is the first to capture and unify a complete frame hierarchy. The other conditions on limited amounts of external annotations, imputing missing values when necessary, and can more readily scale to large corpora. These discourse models help improve document understanding and type-level understanding

    Drafting Event Schemas using Language Models

    Past work has studied event prediction and event language modeling, sometimes mediated through structured representations of knowledge in the form of event schemas. Such schemas can lead to explainable predictions and forecasting of unseen events given incomplete information. In this work, we look at the process of creating such schemas to describe complex events. We use large language models (LLMs) to draft schemas directly in natural language, which can be further refined by human curators as necessary. Our focus is on whether we can achieve sufficient diversity and recall of key events and whether we can produce the schemas in a sufficiently descriptive style. We show that large language models are able to achieve moderate recall against schemas taken from two different datasets, with even better results when multiple prompts and multiple samples are combined. Moreover, we show that textual entailment methods can be used both for matching schemas to instances of events and for evaluating overlap between gold and predicted schemas. Our method paves the way for easier distillation of event knowledge from large language models into schemas

    Zero-Shot On-the-Fly Event Schema Induction

    What are the events involved in a pandemic outbreak? What steps should be taken when planning a wedding? The answers to these questions can be found by collecting many documents on the complex event of interest, extracting relevant information, and analyzing it. We present a new approach in which large language models are used to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them, in order to construct a schema that describes the complex event in its entirety. Using our model, complete schemas on any topic can be generated on-the-fly without any manual data collection, i.e., in a zero-shot manner. Moreover, we develop efficient methods to extract pertinent information from texts and demonstrate in a series of experiments that these schemas are considered to be more complete than human-curated ones in the majority of examined scenarios. Finally, we show that this framework is comparable in performance to previous supervised schema induction methods that rely on collecting real texts, while being more general and flexible and requiring no predefined ontology