16 research outputs found

    Low-Rank Tensors for Scoring Dependency Structures

    Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations. A small subset of such features is often selected manually. This is problematic when features lack clear linguistic meaning, as in embeddings, or when the information is blended across features. In this paper, we use tensors to map high-dimensional feature vectors into low-dimensional representations. We explicitly maintain the parameters as a low-rank tensor to obtain low-dimensional representations of words in their syntactic roles, and to leverage modularity in the tensor for easy training with online algorithms. Our parser consistently outperforms the Turbo and MST parsers across 14 different languages. We also obtain the best published UAS results on 5 languages.
    Funding: United States. Multidisciplinary University Research Initiative (W911NF-10-1-0533); United States. Defense Advanced Research Projects Agency, Broad Operational Language Translation
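    As a rough illustration of the low-rank idea (a minimal numpy sketch, not the paper's exact parameterization; dimensions, feature vectors, and factor names are invented here), the score tensor can be kept as a sum of rank-r components, so an arc is scored without ever materializing the full tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_m, d_a, r = 50, 50, 20, 8      # toy feature dims and tensor rank
U = rng.normal(size=(r, d_h))          # factor matrices; the full 3-way
V = rng.normal(size=(r, d_m))          # tensor T = sum_i U[i] x V[i] x W[i]
W = rng.normal(size=(r, d_a))          # is never built explicitly

def arc_score(phi_head, phi_mod, phi_arc):
    # Rank-r contraction: sum_i (U phi_head)_i (V phi_mod)_i (W phi_arc)_i,
    # costing O(r * d) instead of O(d_h * d_m * d_a).
    return float(np.sum((U @ phi_head) * (V @ phi_mod) * (W @ phi_arc)))

print(arc_score(rng.normal(size=d_h), rng.normal(size=d_m), rng.normal(size=d_a)))
```

    In this sketch, U @ phi_head plays the role of the low-dimensional representation of the head word in its syntactic role, and each factor matrix can be updated independently by an online algorithm.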

    A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing

    We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared between the POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments on 19 languages from the Universal Dependencies project show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models at https://github.com/datquocnguyen/jPTDP
    Comment: v2: also includes universal POS tagging, UAS, and LAS accuracies w.r.t. gold-standard segmentation on Universal Dependencies 2.0 - CoNLL 2017 shared task test data; in CoNLL 2017
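    A minimal PyTorch sketch of the shared-representation idea (layer sizes and names are illustrative assumptions, not the paper's architecture): one BiLSTM encoder feeds both a tagging head and an arc scorer, so both tasks train the same features:

```python
import torch
import torch.nn as nn

class JointTaggerParser(nn.Module):
    def __init__(self, vocab_size, n_tags, emb=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True,
                               bidirectional=True)        # shared BiLSTM
        self.tag_head = nn.Linear(2 * hidden, n_tags)     # POS logits
        self.arc_proj = nn.Linear(2 * hidden, 2 * hidden) # arc scorer half

    def forward(self, tokens):                   # tokens: (batch, seq)
        h, _ = self.encoder(self.embed(tokens))  # shared states (batch, seq, 2H)
        tag_scores = self.tag_head(h)            # per-token POS logits
        # arc_scores[b, i, j]: plausibility of an arc between tokens i and j
        arc_scores = self.arc_proj(h) @ h.transpose(1, 2)
        return tag_scores, arc_scores

model = JointTaggerParser(vocab_size=1000, n_tags=17)
tags, arcs = model(torch.randint(0, 1000, (2, 6)))  # toy batch of 2 sentences
```

    Because both losses backpropagate into the same encoder, improvements in tagging features directly benefit the parser, which is the point of joint training.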

    Efficient Correlated Topic Modeling with Topic Embedding

    Correlated topic modeling has been limited to small model and problem sizes due to its high computational cost and poor scaling. In this paper, we propose a new model which learns compact topic embeddings and captures topic correlations through the closeness between the topic vectors. Our method enables efficient inference in the low-dimensional embedding space, reducing the cubic or quadratic time complexity of previous approaches to linear w.r.t. the topic size. We further speed up variational inference with a fast sampler that exploits the sparsity of topic occurrence. Extensive experiments show that our approach handles model and data scales several orders of magnitude larger than existing correlated topic models without sacrificing modeling quality, providing competitive or superior performance in document classification and retrieval.
    Comment: KDD 2017 oral. The first two authors contributed equally
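    A toy numpy sketch of why inference becomes linear in the number of topics (all names and sizes here are assumptions for illustration): correlation lives in the geometry of the K topic vectors, so scoring a document against all topics is an O(K*d) matrix-vector product rather than an operation on a K-by-K covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 1000, 32                 # number of topics, embedding dimension (toy)
T = rng.normal(size=(K, d))     # topic embedding vectors

def doc_topic_weights(doc_vec):
    # Softmax over topic/document dot products: O(K * d), linear in K.
    # Topics with nearby embeddings receive similar logits, so their
    # weights rise and fall together -- correlation captured by vector
    # closeness, with no K x K covariance to estimate or invert.
    logits = T @ doc_vec
    logits -= logits.max()      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

print(doc_topic_weights(rng.normal(size=d))[:5])
```

    Classical correlated topic models pay cubic time for the covariance over topics; here that structure is implicit in the embedding space.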

    Low-rank regularization for high-dimensional sparse conjunctive feature spaces in information extraction

    One of the challenges in Natural Language Processing (NLP) is the unstructured nature of texts, in which useful information is not easily identifiable. Information Extraction (IE) aims to alleviate this by enabling the automatic extraction of structured information from such text sources. The resulting structured information facilitates easier querying, organizing, and analysis of textual data. In this thesis, we are interested in two IE-related tasks: (i) named entity classification and (ii) template filling. Specifically, this thesis examines the problem of learning classifiers of text spans and explores its application to extracting named entities and template slot-fillers. In general, our goal is to construct a method to learn classifiers that: (i) require less supervision, (ii) work well with high-dimensional sparse feature spaces, and (iii) are able to classify unseen items (i.e., named entities/slot-fillers not observed in training data).

    The key idea of our contribution is the utilization of unseen conjunctive features. A conjunctive feature is a combination of features from different feature sets. For example, to classify a phrase, one might have one feature set for the context and another set for the phrase itself. When learning a classifier, only a fraction of these conjunctive features will be observed in the training set, leaving the rest (i.e., unseen features) unusable for predicting items at test time. We hypothesize that utilizing such unseen conjunctions is useful to address all of the aspects of the goal, especially for recognizing new entities.

    We develop a general regularization framework specifically designed for sparse conjunctive feature spaces. Our strategy is based on employing tensors to represent the conjunctive feature space, and forcing the model to induce low-dimensional embeddings of the feature vectors via low-rank regularization on the tensor parameters. Such a compressed representation helps prediction by generalizing to novel examples where most of the conjunctions are unseen in the training set. We conduct experiments on learning named entity classifiers and template filling, focusing on extracting unseen items. We show that when learning classifiers under minimal supervision, our approach is more effective in controlling model capacity than standard techniques for linear classification. A sketch of the idea appears below.
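    A small numpy sketch of the low-rank strategy (feature dimensions, factor names, and the rank are invented for illustration, not the thesis's exact formulation): the weight matrix over context-by-phrase conjunctions is kept in factored form, so a conjunction absent from training still receives a score through the shared embedding space:

```python
import numpy as np

rng = np.random.default_rng(2)
d_ctx, d_phr, r = 500, 400, 8             # sparse feature dims, low rank (toy)
A = 0.1 * rng.normal(size=(r, d_ctx))     # W = A^T B is the full weight
B = 0.1 * rng.normal(size=(r, d_phr))     # matrix over conjunctions,
                                          # never materialized explicitly

def conjunction_score(phi_ctx, phi_phr):
    # Bilinear score phi_ctx^T W phi_phr with W kept at rank r.
    # A embeds contexts and B embeds phrases into a shared r-dim space,
    # so a (context, phrase) pair never co-observed in training still
    # gets an informed score from the learned embeddings.
    return float((A @ phi_ctx) @ (B @ phi_phr))

# Penalizing ||A||_F^2 + ||B||_F^2 during training acts as the
# low-rank (nuclear-norm-style) regularizer on W.
print(conjunction_score(rng.normal(size=d_ctx), rng.normal(size=d_phr)))
```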
