27 research outputs found
Reversing uncertainty sampling to improve active learning schemes
Active learning provides promising methods to reduce the cost of manually annotating a dataset. However, practitioners in many areas do not widely adopt such methods because they present technical difficulties and offer no guarantee of good performance, especially on skewed distributions with scarcely populated minority classes and an undefined, catch-all majority class, which are very common in human-related phenomena like natural language.
In this paper we present a comparison of the simplest active learning technique, pool-based uncertainty sampling, and its opposite, which we call reversed uncertainty sampling. We show that both obtain results comparable to random sampling, arguing for a more insightful approach to active learning.
Sociedad Argentina de Informática e Investigación Operativa (SADIO)
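As a rough illustration of the two strategies being compared, here is a minimal pool-based selection step in Python; the least-confidence criterion, the scikit-learn model, and the names select_queries, X_seed, and X_pool are our own illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_queries(model, X_pool, k=10, reverse=False):
    """Pool-based uncertainty sampling: pick the k pool instances whose
    top predicted class has the lowest probability (least confident).
    With reverse=True, pick the most confident instances instead,
    i.e. the reversed variant."""
    probs = model.predict_proba(X_pool)   # shape: (n_pool, n_classes)
    confidence = probs.max(axis=1)        # confidence in the top class
    order = np.argsort(confidence)        # least confident first
    if reverse:
        order = order[::-1]               # most confident first
    return order[:k]

# Usage sketch: fit on the annotated seed set, then send the selected
# pool instances to a human annotator.
# model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
# query_idx = select_queries(model, X_pool, k=10, reverse=True)
```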
Disjoint Semi-supervised Spanish Verb Sense Disambiguation Using Word Embeddings
This work explores the use of word embeddings, also known as word vectors, trained on Spanish corpora as features for Spanish verb sense disambiguation (VSD).
This type of learning technique is called disjoint semi-supervised learning [1]: an unsupervised algorithm is first trained on unlabeled data separately, and its results (i.e. the word embeddings) are then fed to a supervised classifier. Throughout this paper we test two hypotheses: (i) representations of training instances based on word embeddings improve the performance of supervised models for VSD, in contrast to more standard feature engineering techniques based on information taken from the training data; (ii) using word embeddings trained on a specific domain, in this case the same domain the labeled data is gathered from, has a positive impact on the model's performance when compared to general-domain word embeddings. The performance of a model over the data is measured not only with standard metrics (e.g. accuracy or precision/recall) but also by analyzing the learning curve to gauge the model's tendency to overfit the available data. Measuring this tendency is important because little labeled data is available, so we need models that generalize better on the VSD problem. For the task we use SenSem [2], a corpus and lexicon of Spanish and Catalan disambiguated verbs, as our base resource for experimentation.
Sociedad Argentina de Informática e Investigación Operativa (SADIO)
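A minimal sketch of the disjoint pipeline described above, assuming gensim word2vec embeddings averaged over a verb's context window and a linear scikit-learn classifier; the averaging scheme, the toy data, and all variable names are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

# Step 1 (unsupervised): train embeddings on unlabeled text. The toy
# corpus below stands in for a large tokenized Spanish corpus.
unlabeled_sentences = [["el", "niño", "come", "pan", "con", "queso"],
                       ["la", "niña", "bebe", "agua", "con", "gas"]]
emb = Word2Vec(sentences=unlabeled_sentences, vector_size=50,
               window=5, min_count=1)

def featurize(context_tokens):
    """Represent a verb occurrence by the average embedding of the
    words in its context window; zeros if none are in vocabulary."""
    vecs = [emb.wv[t] for t in context_tokens if t in emb.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(emb.vector_size)

# Step 2 (supervised): feed the embedding-based features to a
# classifier trained on the sense-labeled instances.
# X = np.vstack([featurize(ctx) for ctx in labeled_contexts])
# clf = LinearSVC().fit(X, sense_labels)
```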
SuFLexQA: an approach to Question Answering from the lexicon
We present SuFLexQA, a system for Question Answering that integrates deep linguistic information from verbal lexica into Quepy, a generic framework for translating natural language questions into a query language. We are participating in the QALD-3 contest to assess the main achievements and shortcomings of the system.
Sociedad Argentina de Informática e Investigación Operativa (SADIO)
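For context, a minimal sketch of Quepy's entry point as shown in its own documentation; the example question is our own, and the `dbpedia` application is the sample app distributed with Quepy, not SuFLexQA itself.

```python
import quepy

# Load the DBpedia sample application bundled with Quepy; it compiles
# question templates into SPARQL queries.
dbpedia = quepy.install("dbpedia")

# get_query returns the query target, the generated query string, and
# template metadata (Nones when no template matches the question).
target, query, metadata = dbpedia.get_query("Who wrote Moby Dick?")
print(query)
```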
Learning the costs for a string edit distance-based similarity measure for abbreviated language
We present work in progress on word normalization for user-generated content. The approach is simple and helps reduce the amount of manual annotation required by more classical approaches. First, orthographic variants of a word, mostly abbreviations, are grouped together. From these manually grouped examples, we learn an automated classifier that, given a previously unseen word, determines whether it is an orthographic variant of a known word or an entirely new word. To do that, we calculate the similarity between the unseen word and all known words, and classify the new word as an orthographic variant of its most similar word. The classifier applies a string similarity measure based on the Levenshtein edit distance. To improve the accuracy of this measure, we assign each edit operation an error-based cost. This cost assignment scheme aims to maximize the distance between similar strings that are variants of different words. This custom similarity measure achieves an accuracy of .68, a substantial improvement over the .54 obtained with the plain Levenshtein distance.
Sociedad Argentina de Informática e Investigación Operativa (SADIO)
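To make the cost scheme concrete, here is a minimal weighted Levenshtein distance in Python; the per-operation cost functions are hypothetical hooks where the learned, error-based costs would plug in (with all costs fixed at 1 it reduces to the classic edit distance).

```python
def weighted_levenshtein(a, b, sub_cost, ins_cost, del_cost):
    """Levenshtein distance with per-operation costs supplied as
    functions of the characters involved."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost(a[i - 1], b[j - 1])
            d[i][j] = min(d[i - 1][j] + del_cost(a[i - 1]),   # deletion
                          d[i][j - 1] + ins_cost(b[j - 1]),   # insertion
                          d[i - 1][j - 1] + sub)              # substitution
    return d[m][n]

# Usage with uniform costs (classic Levenshtein); in the paper the
# costs would instead be learned from annotation errors.
dist = weighted_levenshtein("tmrw", "tomorrow",
                            sub_cost=lambda x, y: 1.0,
                            ins_cost=lambda c: 1.0,
                            del_cost=lambda c: 1.0)
```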
Combining semi-supervised and active learning to recognize minority senses in a new corpus
Paper presented at the 24th International Joint Conference on Artificial Intelligence, Workshop on Replicability and Reproducibility in Natural Language Processing: adaptive methods, resources and software. Buenos Aires, Argentina, July 25-31, 2015.
In this paper we study the impact of combining active learning with bootstrapping to grow a small annotated corpus from a different, unannotated corpus. The intuition underlying our approach is that bootstrapping includes instances that are closer to the generative centers of the data, while discriminative approaches to active learning include instances that are closer to the decision boundaries of classifiers. We build an initial model from the original annotated corpus, which is then iteratively enlarged by including both manually annotated examples and automatically labelled examples as training examples for the following iteration. Examples to be annotated are selected in each iteration by applying active learning techniques. We show that intertwining an active learning component in a bootstrapping approach helps to overcome an initial bias towards a majority class, thus facilitating adaptation of a starting dataset towards the real distribution of a different, unannotated corpus.
Fil: Cardellino, Cristian Adrián. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física; Argentina.
Fil: Teruel, Milagro. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física; Argentina.
Fil: Alonso i Alemany, Laura. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física; Argentina.
Otras Ciencias de la Computación e Información
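A minimal sketch of how such a combined loop could look in Python with scikit-learn; the confidence threshold tau, batch size k, and the oracle callback are our own illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def grow_corpus(X_seed, y_seed, X_pool, oracle, rounds=5, k=10, tau=0.95):
    """Per iteration: (a) self-label the pool instances the model is
    most confident about (bootstrapping) and (b) send the least
    confident ones to a human oracle (active learning), then retrain
    on the enlarged corpus."""
    X_train, y_train = X_seed, y_seed
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for _ in range(rounds):
        probs = model.predict_proba(X_pool)
        conf = probs.max(axis=1)
        sure = np.where(conf >= tau)[0]                    # bootstrap these
        unsure = np.setdiff1d(np.argsort(conf)[:k], sure)  # query these
        y_boot = model.classes_[probs[sure].argmax(axis=1)]
        y_human = oracle(X_pool[unsure])                   # manual labels
        X_train = np.vstack([X_train, X_pool[sure], X_pool[unsure]])
        y_train = np.concatenate([y_train, y_boot, y_human])
        X_pool = np.delete(X_pool, np.union1d(sure, unsure), axis=0)
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model
```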
Reversing uncertainty sampling to improve active learning schemes
Paper presented at the 16º Simposio Argentino de Inteligencia Artificial, 44 Jornadas Argentinas de Informática. Rosario, Argentina, August 31 to September 4, 2015.
Active learning provides promising methods to reduce the cost of manually annotating a dataset. However, practitioners in many areas do not widely adopt such methods because they present technical difficulties and offer no guarantee of good performance, especially on skewed distributions with scarcely populated minority classes and an undefined, catch-all majority class, which are very common in human-related phenomena like natural language. In this paper we present a comparison of the simplest active learning technique, pool-based uncertainty sampling, and its opposite, which we call reversed uncertainty sampling. We show that both obtain results comparable to random sampling, arguing for a more insightful approach to active learning.
http://44jaiio.sadio.org.ar/asai
Fil: Cardellino, Cristian Adrián. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física; Argentina.
Fil: Teruel, Milagro. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física; Argentina.
Fil: Alonso i Alemany, Laura. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física; Argentina.
Ciencias de la Computación