
    Contributions to language understanding: n-gram attention and alignments for interpretable similarity and inference

    Ph.D. thesis entitled “Hizkuntza-ulermenari Ekarpenak: N-gramen arteko Atentzio eta Lerrokatzeak Antzekotasun eta Inferentzia Interpretagarrirako / Contributions to Language Understanding: N-gram Attention and Alignments for Interpretable Similarity and Inference”, written by Iñigo Lopez-Gazpio at the University of the Basque Country (UPV/EHU) under the supervision of Dr. Eneko Agirre and Dr. Montse Maritxalar (both of the Languages and Computer Systems Department). The viva voce was held on October 30, 2018; the members of the committee were Dr. Kepa Sarasola (President, University of the Basque Country (UPV/EHU)), Dr. Gorka Azkune (Secretary, University of Deusto) and Dr. David Martínez (Member, IBM). The thesis was awarded the grade of Excellent Cum Laude with international mention. This doctoral thesis was carried out with a predoctoral grant from the Spanish Ministry of Education, Culture and Sport. Reference: MINECO FPU13/00501

    Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation

    Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than directly apparent. A linear transformation that adjusts the similarity order of the model without any external resource can tailor it to achieve better results in those aspects, providing a new perspective on how embeddings encode divergent linguistic information. In addition, we explore the relation between intrinsic and extrinsic evaluation, as the effect of our transformations in downstream tasks is higher for unsupervised systems than for supervised ones. Comment: CoNLL 201
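    The abstract describes a linear post-processing transformation that adjusts an embedding model's similarity order without any external resource. A minimal sketch of one way such a transformation can be realized, assuming it takes the form of a matrix power of the second-moment matrix X^T X (the function name and the `alpha` parameter are illustrative, not taken from the paper):

```python
import numpy as np

def similarity_order_transform(X, alpha=0.5):
    """Post-process an embedding matrix X (n_words x dim) with a linear
    map derived from X itself: raise the dim x dim second-moment matrix
    X^T X to the power alpha and apply it to X. Varying alpha changes
    which order of similarity the resulting space emphasizes, using no
    resource beyond the embeddings themselves."""
    M = X.T @ X                          # dim x dim, symmetric PSD
    vals, vecs = np.linalg.eigh(M)       # eigendecomposition of M
    vals = np.clip(vals, 0.0, None)      # guard against tiny negative eigenvalues
    M_alpha = vecs @ np.diag(vals ** alpha) @ vecs.T
    return X @ M_alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))          # stand-in for trained embeddings
X2 = similarity_order_transform(X, alpha=0.5)
print(X2.shape)                          # (1000, 50)
```

    Because the map is linear and built only from X, it can be tuned per downstream task, which is consistent with the paper's point that a single fixed space need not be ideal for all aspects at once.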

    SemEval-2017 Task 1: semantic textual similarity - multilingual and cross-lingual focused evaluation

    Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state of the art. The 2017 task focuses on multilingual and cross-lingual pairs, with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well-performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

    A resource-light method for cross-lingual semantic textual similarity

    Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to those of complex resource-intensive state-of-the-art models for the respective tasks.
    (C) 2017 Published by Elsevier B.V. Part of the work presented in this article was performed during the second author's research visit to the University of Mannheim, supported by a Contact Fellowship awarded by the DAAD scholarship program "STIBET Doktoranden". The research of the last author has been carried out in the framework of the SomEMBED project (TIN2015-71147-C2-1-P). Furthermore, this work was partially funded by the Junior-professor funding programme of the Ministry of Science, Research and the Arts of the state of Baden-Württemberg (project "Deep semantic models for high-end NLP application"). Glavas, G.; Franco-Salvador, M.; Ponzetto, S.P.; Rosso, P. (2018). A resource-light method for cross-lingual semantic textual similarity. Knowledge-Based Systems, 143:1-9. https://doi.org/10.1016/j.knosys.2017.11.041
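    The pipeline described in this abstract — learn a linear translation model from a small seed dictionary, project one language's embeddings into the other's space, align words by vector similarity, and aggregate — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the least-squares fit, the greedy best-match alignment, and the symmetrized mean of cosines are one simple instantiation of each step, and all names and the toy vectors are ours.

```python
import numpy as np

def learn_projection(src_mat, tgt_mat):
    """Linear translation model: least-squares W such that src_mat @ W
    approximates tgt_mat, fitted on a small seed dictionary of word
    translation pairs (the only bilingual resource required)."""
    W, *_ = np.linalg.lstsq(src_mat, tgt_mat, rcond=None)
    return W

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cross_lingual_similarity(sent_a, sent_b, emb_a, emb_b, W):
    """Project sentence A's word vectors into B's space, greedily align
    each word to its most similar counterpart, and average the best-match
    cosines in both directions for a symmetric unsupervised score."""
    A = [emb_a[w] @ W for w in sent_a if w in emb_a]
    B = [emb_b[w] for w in sent_b if w in emb_b]
    if not A or not B:
        return 0.0
    ab = sum(max(cosine(a, b) for b in B) for a in A) / len(A)
    ba = sum(max(cosine(b, a) for a in A) for b in B) / len(B)
    return (ab + ba) / 2

# Toy seed dictionary with hand-made 3-d vectors, for illustration only.
emb_es = {"gato": np.array([1.0, 0.0, 0.0]), "perro": np.array([0.0, 1.0, 0.0])}
emb_en = {"cat": np.array([1.0, 0.0, 0.0]), "dog": np.array([0.0, 1.0, 0.0])}
W = learn_projection(np.stack([emb_es["gato"], emb_es["perro"]]),
                     np.stack([emb_en["cat"], emb_en["dog"]]))
print(round(cross_lingual_similarity(["gato"], ["cat"], emb_es, emb_en, W), 3))
# 1.0
```

    Because the only bilingual input is the seed dictionary used to fit W, the sketch mirrors the paper's key property: everything else is learned monolingually.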

    SemEval-2016 Task 2: Interpretable semantic textual similarity

    Paper presented at the 10th International Workshop on Semantic Evaluation (SemEval-2016), held June 16-17, 2016 in San Diego, California. The final goal of Interpretable Semantic Textual Similarity (iSTS) is to build systems that explain the differences and commonalities between two sentences. The task adds an explanatory level on top of STS, formalized as an alignment between the chunks in the two input sentences, indicating the relation and similarity score of each alignment. The task provides training and test data on three datasets: news headlines, image captions and student answers. It attracted nine teams, totaling 20 runs. All datasets and the annotation guidelines are freely available. This material is based in part upon work supported by a MINECO grant to the University of the Basque Country (TUNER project TIN2015-65308-C5-1-R). Aitor Gonzalez-Agirre and Inigo Lopez-Gazpio are supported by doctoral grants from MINECO. The IXA group is funded by the Basque Government (A type Research Group)

    Interpretable semantic textual similarity: Finding and explaining differences between sentences

    User acceptance of artificial intelligence agents might depend on their ability to explain their reasoning to the users. We focus on a specific text processing task, the Semantic Textual Similarity (STS) task, where systems need to measure the degree of semantic equivalence between two sentences. We propose to add an interpretability layer (iSTS for short) formalized as the alignment between pairs of segments across the two sentences, where the relation between the segments is labeled with a relation type and a similarity score. This way, a system performing STS could use the interpretability layer to explain to users why it returned that specific score for the given sentence pair. We present a publicly available dataset of sentence pairs annotated following the formalization. We then develop an iSTS system trained on this dataset, which given a sentence pair finds what is similar and what is different, in the form of graded and typed segment alignments. When evaluated on the dataset, the system performs better than an informed baseline, showing that the dataset and task are well-defined and feasible. Most importantly, two user studies show how the iSTS system output can be used to automatically produce explanations in natural language. Users performed the two tasks better when having access to the explanations, providing preliminary evidence that our dataset and method to automatically produce explanations do help users understand the output of STS systems better. Aitor Gonzalez-Agirre and Inigo Lopez-Gazpio are supported by doctoral grants from MINECO. The work described in this project has been partially funded by MINECO through the projects MUSTER (PCIN-2015-226) and TUNER (TIN2015-65308-C5-1-R), as well as by the Basque Government (A-type research group, IT344-10)
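    The interpretability layer described above — typed, graded segment alignments that can be verbalized for users — can be sketched as a small data structure plus a trivial verbalizer. The class, helper, label strings and example sentences here are illustrative assumptions for exposition; the actual label inventory and scoring scale are defined by the iSTS annotation guidelines:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ChunkAlignment:
    """One edge of the interpretability layer: a pair of aligned
    segments, a relation type and a graded similarity score."""
    chunk_a: str
    chunk_b: Optional[str]   # None marks a chunk with no counterpart
    relation: str            # relation type, e.g. equivalent vs. similar
    score: float             # graded similarity of the aligned pair

def explain(alignments: List[ChunkAlignment]) -> List[str]:
    """Turn typed, graded alignments into simple natural-language notes,
    in the spirit of the paper's automatically produced explanations."""
    notes = []
    for al in alignments:
        if al.chunk_b is None:
            notes.append(f"'{al.chunk_a}' has no counterpart")
        elif al.relation == "EQUI":
            notes.append(f"'{al.chunk_a}' and '{al.chunk_b}' are equivalent")
        else:
            notes.append(f"'{al.chunk_a}' relates to '{al.chunk_b}' "
                         f"({al.relation}, score {al.score})")
    return notes

example = [
    ChunkAlignment("12 killed", "12 dead", "EQUI", 5.0),
    ChunkAlignment("in bus accident", "in traffic crash", "SIMI", 3.0),
    ChunkAlignment("in Pakistan", None, "NOALI", 0.0),
]
for note in explain(example):
    print(note)
```

    The point of the sketch is that once a system emits this structure rather than a single scalar score, explanations become a straightforward rendering step, which is what the paper's user studies evaluate.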


    Analysis of protamine peptides in insulin pharmaceutical formulations by capillary electrophoresis.

    Protamines are a group of highly basic peptides that are sometimes added to insulin formulations to prolong the pharmacological action of the insulin. In this study, different methods were investigated to identify protamine in insulin formulations. Capillary electrophoresis in aqueous and non-aqueous media was tested to separate these peptides, which have very close amino acid sequences. Different buffers (phosphate or formate, both acidified) and various additives (principally negatively charged and neutral surfactants) were investigated to optimize peptide separation. Finally, a micellar electrokinetic capillary chromatography method using a capillary of 120 cm effective length and an aqueous background electrolyte made up of 100 mM phosphate buffer (pH 2) and 50 mM Thesit® gave the best results, providing separation of the four major protamine peptides within 25 min.