
    Transfer Learning for Multi-language Twitter Election Classification

    Both politicians and citizens are increasingly embracing social media as a means to disseminate information and comment on various topics, particularly during significant political events, such as elections. Such commentary during elections is also of interest to social scientists and pollsters. To facilitate the study of social media during elections, there is a need to automatically identify posts that are topically related to those elections. However, current studies have focused on elections within English-speaking regions, and hence the resultant election content classifiers are only applicable for elections in countries where the predominant language is English. On the other hand, as social media is becoming more prevalent worldwide, there is an increasing need for election classifiers that can be generalised across different languages, without building a training dataset for each election. In this paper, based upon transfer learning, we study the development of effective and reusable election classifiers for use on social media across multiple languages. We combine transfer learning with different classifiers such as Support Vector Machines (SVM) and state-of-the-art Convolutional Neural Networks (CNN), which make use of word embedding representations for each social media post. We generalise the learned classifier models for cross-language classification by using a linear translation approach to map the word embedding vectors from one language into another. Experiments conducted over two election datasets in different languages show that without using any training data from the target language, linear translations outperform a classical transfer learning approach, namely Transfer Component Analysis (TCA), by 80% in recall and 25% in F1 measure
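    As a minimal sketch of the linear translation step described above (not the paper's exact implementation), the mapping between the two monolingual embedding spaces can be fit by least squares over a small seed dictionary of aligned word pairs; the function names and the seed dictionary below are assumptions made for illustration.

        import numpy as np

        def fit_linear_translation(vecs_a, vecs_b):
            # Fit a linear map W so that vecs_a @ W approximates vecs_b.
            # vecs_a, vecs_b: (n_pairs, dim) arrays holding the embeddings of word
            # pairs that are translations of each other (a small seed dictionary).
            W, *_ = np.linalg.lstsq(vecs_a, vecs_b, rcond=None)
            return W

        def translate(vec, W):
            # Map an embedding from the first language's space into the second's,
            # so posts in both languages share a common representation space for
            # the downstream SVM/CNN classifier.
            return vec @ W

        # Hypothetical usage with embedding lookup tables emb_a and emb_b and a
        # seed dictionary of aligned word pairs:
        # pairs = [("vote", "vote"), ("election", "eleccion")]
        # A = np.stack([emb_a[w] for w, _ in pairs])
        # B = np.stack([emb_b[w] for _, w in pairs])
        # W = fit_linear_translation(A, B)
        # mapped = translate(emb_a["candidate"], W)  # now comparable to emb_b vectors

    Once the two vocabularies share a space, a classifier trained on one election's labelled posts can, in principle, be applied to posts in the other language without new training data, which is the reuse scenario the abstract describes.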

    Selecting and Generating Computational Meaning Representations for Short Texts

    Language conveys meaning, so natural language processing (NLP) requires representations of meaning. This work addresses two broad questions: (1) What meaning representation should we use? and (2) How can we transform text to our chosen meaning representation? In the first part, we explore different meaning representations (MRs) of short texts, ranging from surface forms to deep-learning-based models. We show the advantages and disadvantages of a variety of MRs for summarization, paraphrase detection, and clustering. In the second part, we use SQL as a running example for an in-depth look at how we can parse text into our chosen MR. We examine the text-to-SQL problem from three perspectives (methodology, systems, and applications) and show how each contributes to a fuller understanding of the task.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/143967/1/cfdollak_1.pd
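    To make the text-to-SQL task concrete, the snippet below sketches a hypothetical input/output pair: the natural-language question is the parser's input and the SQL query is the meaning representation it should produce. The question, table, and columns are invented for illustration and are not drawn from the thesis.

        # Hypothetical text-to-SQL pair; the "papers" table and its columns are
        # invented for illustration.
        question = "Which authors published more than five papers in 2015?"
        sql = (
            "SELECT author "
            "FROM papers "
            "WHERE year = 2015 "
            "GROUP BY author "
            "HAVING COUNT(*) > 5;"
        )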

    Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset

    The increasing role of machine learning in the construction of cultural heritage and humanities datasets necessitates critical examination of the myriad biases introduced by machines, algorithms, and the humans who build and deploy them. From image classification to OCR, the effects of decisions ostensibly made by machines compound through the digitization pipeline and redouble in each step, mediating our interactions with digitally-rendered artifacts through the search and discovery process. Here, I consider the Library of Congress’s Newspaper Navigator dataset, which I created as part of the Library of Congress’s Innovator-in-Residence program. The dataset consists of visual content extracted from 16 million historic newspaper pages in the Chronicling America database using machine learning. In this data archaeology, I examine the ways in which a Chronicling America newspaper page is transmuted and decontextualized during its journey from a physical artifact to a series of probabilistic photographs, illustrations, maps, comics, cartoons, headlines, and advertisements in the Newspaper Navigator dataset. I consider the digitization journeys of four different pages in Black newspapers in Chronicling America that reproduce the same photograph of W.E.B. Du Bois. In tracing the pages’ journeys, I unpack how each step in the pipelines, such as the imaging process and the construction of training data, not only imprints bias on the resulting Newspaper Navigator dataset but also propagates the bias via the machine learning algorithms employed. I investigate the limitations of the Newspaper Navigator dataset and machine learning as it relates to cultural heritage, from marginalization and erasure via algorithmic bias to unfair labor practices in the construction of commonly-used datasets. I argue that any use of machine learning with cultural heritage must be done with an understanding of the broader socio-technical ecosystems in which the algorithms have been utilized

    Multi-Word Terminology Extraction and Its Role in Document Embedding

    Automated terminology extraction is a crucial task in natural language processing and ontology construction. Termhood can be inferred using linguistic and statistical techniques; this thesis focuses on the statistical methods. Inspired by feature selection techniques in document classification, we experiment with a variety of metrics including PMI (point-wise mutual information), MI (mutual information), and Chi-squared. We find that PMI is better at identifying the top keywords in a domain, but Chi-squared can recognize more keywords overall. Based on this observation, we propose a hybrid approach, called HMI, that combines the best of PMI and Chi-squared. HMI outperforms both PMI and Chi-squared. The result is verified by comparing the overlap between the extracted keywords and the author-identified keywords in arXiv data. When the corpora are computer science and physics papers, the top-100 hit rate can reach 0.96 for HMI. We also demonstrate that terminologies can improve document embeddings: in this experiment, we replace each machine-identified multi-word terminology with a single token and use the transformed text as input for the document embedding. Compared with representations learnt from unigrams only, we observe an improvement of over 9.41% in F1 score on document classification tasks on arXiv data.
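    For reference, a minimal sketch of the two association scores discussed above is given below, computed from raw unigram and bigram counts; the function names and the 2x2 contingency layout are assumptions, and the abstract does not spell out how HMI combines the two scores, so that step is not reproduced.

        import math

        def pmi(c_xy, c_x, c_y, n):
            # Point-wise mutual information of a candidate bigram (x, y):
            # log p(x, y) / (p(x) p(y)), estimated from counts, where n is the
            # total number of bigram positions in the corpus.
            return math.log((c_xy * n) / (c_x * c_y))

        def chi_squared(c_xy, c_x, c_y, n):
            # Chi-squared statistic over the 2x2 contingency table contrasting
            # the bigram's co-occurrence count with the counts of x and y alone.
            o11 = c_xy
            o12 = c_x - c_xy
            o21 = c_y - c_xy
            o22 = n - c_x - c_y + c_xy
            num = n * (o11 * o22 - o12 * o21) ** 2
            den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
            return num / den if den else 0.0

        # Replacing an identified multi-word term with a single token before
        # computing document embeddings, as the abstract describes (the term
        # list here is hypothetical):
        def merge_terms(text, terms=("machine learning", "neural network")):
            for t in terms:
                text = text.replace(t, t.replace(" ", "_"))
            return text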

    Neural models for information retrieval: towards asymmetry sensitive approaches based on attention models

    This work is situated in the context of information retrieval (IR) using machine learning (ML) and deep learning (DL) techniques. It concerns different tasks requiring text matching, such as ad-hoc retrieval, question answering and paraphrase identification. The objective of this thesis is to propose new approaches, using DL methods, to construct semantic-based models for text matching, and to overcome the problems of vocabulary mismatch related to the classical bag-of-words (BoW) representations used in traditional IR models. Indeed, traditional text matching methods are based on the BoW representation, which considers a given text as a set of independent words. The process of matching two sequences of text is based on exact matching between words. The main limitation of this approach is the vocabulary mismatch. This problem occurs when the text sequences to be matched do not use the same vocabulary, even if their subjects are related. For example, the query may contain several words that are not necessarily used in the documents of the collection, including the relevant documents. BoW representations ignore several aspects of a text sequence, such as the structure of the text and the context of words. These characteristics are important and make it possible to differentiate between two texts that use the same words but express different information. Another problem in text matching is related to the length of documents. The relevant parts can be distributed in different ways across the documents of a collection. This is especially true for long documents, which tend to cover a large number of topics and include a variable vocabulary. A long document could thus contain several relevant passages that a matching model must capture. Unlike long documents, short documents are likely to concern a specific subject and tend to contain a more restricted vocabulary; assessing their relevance is in principle simpler than assessing that of longer documents.
    In this thesis, we have proposed different contributions, each addressing one of the above-mentioned issues. First, in order to solve the problem of vocabulary mismatch, we used distributed representations of words (word embeddings) to allow a semantic matching between different words. These representations have been used in IR applications where document/query similarity is computed by comparing all the term vectors of the query with all the term vectors of the document, indiscriminately. Unlike the models proposed in the state of the art, we studied the impact of query terms with regard to their presence or absence in a document, and adopted different document/query matching strategies. The intuition is that the absence of query terms from the relevant documents is in itself a useful signal to take into account in the matching process. Indeed, these terms do not appear in the documents of the collection for two possible reasons: either their synonyms have been used, or they are not part of the context of the documents in question. The methods we have proposed make it possible, on the one hand, to perform an inexact matching between the document and the query, and on the other hand, to evaluate the impact of the different terms of a query in the matching process.
    Although the use of word embeddings allows a semantic-based matching between different text sequences, these representations combined with classical matching models still consider the text as a list of independent elements (a bag of vectors instead of a bag of words). However, the structure of the text as well as the order of the words is important: any change in the structure of the text and/or the order of words alters the information expressed. In order to address this problem, neural models have been used for text matching. In our case, we first studied different state-of-the-art neural models for text matching, then proposed two main approaches. First, we built a model that takes into account the structure of a text and the importance of its words. Specifically, we combined a position-based model with an attention-based model to build a text matching approach that uses position-based representations together with attention-based weights of words. We believe that when the model is aware of the position and importance of words, the learned representations provide more relevant features for the comparison process. We concluded that combining position with the attention given to a word in a sequence, in an asymmetric configuration, significantly improves the results. In a second step, we analyzed different neural text matching applications and grouped them into two main categories: (1) symmetric matching problems, which consist in identifying whether two texts of the same nature are semantically similar; and (2) asymmetric matching problems, which consist in evaluating whether an input text provides the information sought in another text of a different nature. By studying the various existing neural models, we found that all the proposed models are based on a global Siamese architecture in which the different inputs of the model undergo the same processing, whatever the nature of the task, (1) or (2). In order to take the nature of the matching task into consideration, we proposed an asymmetry-sensitive architecture for neural text matching. In particular, we used an attention model to build a general architecture that extends different state-of-the-art neural models.
    Finally, to address problems related to document length in ad-hoc retrieval with neural networks, we proposed an approach to extract relevance signals at different levels in a long document: at the level of words, of passages, and of the complete document. More precisely, we proposed a global multi-layer architecture that measures relevance at these different levels using attention models. This architecture is then used to extend several state-of-the-art models and to examine the contribution of relevance measured at different levels. Based on this general architecture, we also proposed a model that uses a recurrent layer to perform a kind of competitive interaction between previously selected passages that are likely to be relevant in a document.
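    As an illustration only (not one of the thesis's models), the sketch below shows the general idea of combining a cosine interaction matrix between query and document term embeddings with attention-based weighting of the query terms; the embeddings and attention scores are assumed to be given here, whereas in the thesis such weights are learned.

        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def attention_weighted_relevance(query_vecs, doc_vecs, attn_scores):
            # query_vecs: (m, d) query term embeddings
            # doc_vecs: (n, d) document term embeddings
            # attn_scores: (m,) unnormalised importance scores for the query terms
            q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
            d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
            sim = q @ d.T                    # (m, n) cosine interaction matrix
            best = sim.max(axis=1)           # best document match per query term
            weights = softmax(attn_scores)   # attention weights over query terms
            return float(weights @ best)     # attention-weighted relevance score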

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it)

    Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021

    The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the 2020 edition, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 was the first occasion for the Italian Computational Linguistics research community to meet in person after more than one year of full or partial lockdown.