795 research outputs found
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages: two feature-based systems (MEMMs and CRFs) and two neural systems (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet our feature-based models obtain better performance on lexically richer datasets (e.g. for morphologically rich languages), whereas neural results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.
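The abstract above contrasts feature-based taggers enriched with lexicon information against neural ones. As a purely illustrative sketch (not MElt's actual feature set), external lexicon features can be folded into a MEMM/CRF-style feature function like this; the lexicon maps word forms to the set of POS tags they can bear, and all names here are assumptions:

```python
def extract_features(sentence, i, lexicon):
    """Feature dict for the token at position i, including lexicon features."""
    word = sentence[i]
    feats = {
        "word": word.lower(),
        "suffix3": word[-3:],
        "is_capitalized": word[0].isupper(),
    }
    # Lexicon features: which tags the lexicon licenses for this word
    # and for its immediate neighbours (contextual lexicon features).
    for offset, name in [(-1, "prev"), (0, "cur"), (1, "next")]:
        j = i + offset
        if 0 <= j < len(sentence):
            for tag in lexicon.get(sentence[j].lower(), {"UNK"}):
                feats[f"lex_{name}={tag}"] = True
    return feats

lexicon = {"the": {"DET"}, "dog": {"NOUN"}, "barks": {"VERB", "NOUN"}}
feats = extract_features(["The", "dog", "barks"], 1, lexicon)
# feats contains e.g. lex_prev=DET, lex_cur=NOUN, lex_next=VERB
```

Such binary features are what a linear model (MEMM or CRF) would weight during training; an out-of-lexicon word simply triggers an UNK lexicon feature.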
Creating language resources for under-resourced languages: methodologies, and experiments with Arabic
Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
Overlaps in Maltese conversational and task-oriented dialogues
This paper deals with overlaps in spoken Maltese. Overlaps are studied in two different corpora recorded in different communicative situations. One is a multimodal corpus involving first-acquaintance conversations; the other consists of Map Task dialogues. The results show that the number of overlaps is larger in the free conversations, where it varies depending on specific aspects of the interaction. They also show that overlaps in the Map Task dialogues tend to be longer, serving the function of establishing common understanding to achieve optimal task completion.
Crowdsourcing for Language Resource Development: Criticisms About Amazon Mechanical Turk Overpowering Use
This article is a position paper about Amazon Mechanical Turk, the use of which has been steadily growing in language processing in the past few years. According to the mainstream opinion expressed in articles of the domain, this type of online working platform makes it possible to develop all sorts of quality language resources quickly, at a very low price, by people doing it as a hobby. We shall demonstrate here that the situation is far from being that ideal. Our goal here is manifold: (1) to inform researchers, so that they can make their own choices; (2) to develop alternatives with the help of funding agencies and scientific associations; (3) to propose practical and organizational solutions in order to improve language resource development, while limiting the risks of ethical and legal issues without letting go of price or quality; and (4) to introduce an Ethics and Big Data Charter for the documentation of language resources.
A context-based model for sentiment analysis in Twitter for the Italian language
Recent works on Sentiment Analysis over Twitter leverage the idea that the sentiment depends on a single incoming tweet. However, tweets are plunged into streams of posts, thus making a wider context available. The contribution of this information has recently been investigated for the English language by modeling polarity detection as a sequential classification task over streams of tweets (Vanzo et al., 2014). Here, we verify the applicability of this method to a morphologically richer language, i.e. Italian.
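The sequential setting the abstract describes classifies each tweet together with the posts that precede it in the stream. A minimal sketch of how such context windows might be built (the window size and representation are assumptions, not the authors' implementation):

```python
def context_windows(stream, size=2):
    """For each tweet in a stream, pair it with its preceding context."""
    windows = []
    for i, tweet in enumerate(stream):
        context = stream[max(0, i - size):i]  # up to `size` earlier tweets
        windows.append((context, tweet))
    return windows

stream = ["t1", "t2", "t3", "t4"]
w = context_windows(stream)
# w[3] == (["t2", "t3"], "t4"); w[0] == ([], "t1")
```

Each (context, target) pair would then be fed to a sequence classifier so that the polarity decision for the target can draw on the conversational context.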
UT-DB: an experimental study on sentiment analysis in Twitter
This paper describes our system for participating in SemEval-2013 Task 2-B (Kozareva et al., 2013): Sentiment Analysis in Twitter. Given a message, our system classifies whether the message carries positive, negative or neutral sentiment. It uses a co-occurrence rate model. The training data are constrained to the data provided by the task organizers (no other tweet data are used). We consider 9 types of features and use a subset of them in our submitted system. To see the contribution of each type of feature, we conduct an experimental study by leaving one type of feature out each time. Results suggest that unigrams are the most important features, bigrams and POS tags seem unhelpful, and stopwords should be retained to achieve the best results. The overall results of our system are promising given the constrained features and data we use.
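The leave-one-out ablation described above can be sketched as a simple loop: score the full system once, then re-score with each feature type removed in turn. `evaluate` here is a stand-in for a real train-and-score routine, and the toy scorer is purely illustrative:

```python
FEATURE_TYPES = ["unigrams", "bigrams", "pos_tags", "stopword_removal"]

def ablation(evaluate, feature_types):
    """Score the full system, then each variant with one feature type left out."""
    results = {"all": evaluate(feature_types)}
    for ft in feature_types:
        kept = [f for f in feature_types if f != ft]
        results[f"-{ft}"] = evaluate(kept)
    return results

# Toy evaluator (integer score): pretend only unigrams add real value.
toy = lambda feats: len(feats) + (1 if "unigrams" in feats else 0)
scores = ablation(toy, FEATURE_TYPES)
# The biggest drop relative to scores["all"] flags the most important feature type.
```

A feature type whose removal barely moves the score (here, bigrams or POS tags) is a candidate for dropping from the final system.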
Learning languages from parallel corpora
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, and we discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we detail what the structure of parallel corpora implies for that selection. Secondly, we consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. And thirdly, we highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
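One concrete way word alignments can drive exercise generation, as the abstract outlines, is a gap-fill (cloze) exercise where the aligned source words serve as the hint. This is a hedged sketch; the alignment format (a list of (source index, target index) pairs) and function names are assumptions, not the application's actual data model:

```python
def make_cloze(src_tokens, tgt_tokens, alignment, tgt_gap_index):
    """Blank one target token; its aligned source tokens become the hint."""
    hint = [src_tokens[s] for s, t in alignment if t == tgt_gap_index]
    answer = tgt_tokens[tgt_gap_index]
    exercise = [w if i != tgt_gap_index else "____"
                for i, w in enumerate(tgt_tokens)]
    return " ".join(exercise), answer, hint

src = ["Jag", "älskar", "språk"]   # Swedish: "I love languages"
tgt = ["I", "love", "languages"]
alignment = [(0, 0), (1, 1), (2, 2)]
ex, ans, hint = make_cloze(src, tgt, alignment, 1)
# ex == "I ____ languages", ans == "love", hint == ["älskar"]
```

With multiparallel corpora, the same alignment triangulation would let the answer key be projected to a third language without re-annotating the pair.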
A multilingual collection of CoNLL-U-compatible morphological lexicons
We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.
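Lexicons following Universal Dependencies conventions pair each word form with a lemma, a UPOS tag and attribute=value morphological features. A minimal sketch of reading one such tab-separated entry; the exact column layout here is an assumption for illustration, not the published UDLexicons schema:

```python
def parse_entry(line):
    """Parse 'form<TAB>lemma<TAB>UPOS<TAB>feats' into a dict."""
    form, lemma, upos, feats = line.rstrip("\n").split("\t")
    # UD-style features: 'Gender=Masc|Number=Plur', or '_' when empty.
    features = {} if feats == "_" else dict(
        kv.split("=") for kv in feats.split("|"))
    return {"form": form, "lemma": lemma, "upos": upos, "feats": features}

entry = parse_entry("chats\tchat\tNOUN\tGender=Masc|Number=Plur")
# entry["feats"]["Number"] == "Plur"
```

Because the tag and feature inventories match Universal Dependencies, entries parsed this way can feed a tagger's lexicon features without any tagset mapping.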