Language identification of multilingual posts from Twitter: a case study
The final publication is available at Springer via http://dx.doi.org/10.1007/s10115-016-0997-x
This paper describes a method for handling multi-class and multi-label classification problems based on the support vector machine formalism. The method has been applied to the language identification problem on Twitter. The system was evaluated mainly on a Twitter data set developed for the TweetLID workshop. This data set contains bilingual tweets written in the most commonly used Iberian languages (i.e., Spanish, Portuguese, Catalan, Basque, and Galician) as well as English. We address the following problems: (1) social media texts: we propose a suitable tokenization that handles the peculiarities of Twitter; (2) multilingual tweets: since a tweet can belong to more than one language, we need a multi-class and multi-label classifier; (3) similar languages: we study the main confusions among similar languages; and (4) unbalanced classes: we propose a threshold-based strategy to favor classes with less data. We have also studied the use of Wikipedia and the addition of new tweets to enlarge the training data set. Additionally, we have tested our system on the Bergsma corpus, a collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari alphabets. To our knowledge, we obtained the best results published on the TweetLID data set and results in line with the best results published on the Bergsma data set.
This work has been partially funded by the project ASLP-MULAN: Audio, Speech and Language Processing for Multimedia Analytics (MINECO TIN2014-54288-C4-3-R).
Pla Santamaría, F.; Hurtado Oliver, LF. (2016). Language identification of multilingual posts from Twitter: a case study. Knowledge and Information Systems. 51(3):965-989. https://doi.org/10.1007/s10115-016-0997-x
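The threshold-based strategy for unbalanced classes mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the actual decision rule, so everything here is an assumption: the function name `assign_languages`, the score values, the per-class cutoffs, and the fall-back to the best-scoring language are all hypothetical. The idea shown is only that a tweet receives every label whose classifier score clears that label's cutoff, and that lowering the cutoff for a class with little training data makes that class easier to assign.

```python
def assign_languages(scores, thresholds, default_threshold=0.0):
    """Return every language whose score clears its cutoff (multi-label).

    scores: dict mapping language code -> classifier decision value.
    thresholds: dict mapping language code -> per-class cutoff (hypothetical).
    Falls back to the single best-scoring language if none clears its cutoff.
    """
    chosen = [lang for lang, score in scores.items()
              if score >= thresholds.get(lang, default_threshold)]
    if not chosen:
        # No class passed: behave like an ordinary multi-class classifier.
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen)


# Hypothetical one-vs-rest decision values for a single tweet.
scores = {"es": 0.8, "ca": -0.1, "gl": -0.3, "eu": -1.2, "pt": -0.9, "en": -1.5}
# Illustrative lower cutoffs for classes assumed to have less training data.
thresholds = {"gl": -0.4, "eu": -0.4}
print(assign_languages(scores, thresholds))
```

With these illustrative values the tweet is labelled as both Spanish and Galician, because the Galician score clears its lowered cutoff even though it is below the default.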
A Deep Network Model for Paraphrase Detection in Short Text Messages
This paper is concerned with paraphrase detection. The ability to detect
similar sentences written in natural language is crucial for several
applications, such as text mining, text summarization, plagiarism detection,
authorship authentication and question answering. Given two sentences, the
objective is to detect whether they are semantically identical. An important
insight from this work is that existing paraphrase systems perform well when
applied to clean texts, but they do not necessarily deliver good performance
on noisy texts. Challenges with paraphrase detection on user-generated
short texts, such as Twitter, include language irregularity and noise. To cope
with these challenges, we propose a novel deep neural network-based approach
that relies on coarse-grained sentence modeling using a convolutional neural
network and a long short-term memory model, combined with a specific
fine-grained word-level similarity matching model. Our experimental results
show that the proposed approach outperforms existing state-of-the-art
approaches on user-generated noisy social media data, such as Twitter texts,
and achieves highly competitive performance on a cleaner corpus.
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Modeling Global Syntactic Variation in English Using Dialect Classification
This paper evaluates global-scale dialect identification for 14 national
varieties of English as a means for studying syntactic variation. The paper
makes three main contributions: (i) introducing data-driven language mapping as
a method for selecting the inventory of national varieties to include in the
task; (ii) producing a large and dynamic set of syntactic features using
grammar induction rather than focusing on a few hand-selected features such as
function words; and (iii) comparing models across both web corpora and social
media corpora in order to measure the robustness of syntactic variation across
registers.
A Semi-automatic Method for Efficient Detection of Stories on Social Media
Twitter has become one of the main sources of news for many people. As
real-world events and emergencies unfold, Twitter is abuzz with hundreds of
thousands of stories about the events. Some of these stories are harmless,
while others could potentially be life-saving or sources of malicious rumors.
Thus, it is critically important to be able to efficiently track stories that
spread on Twitter during these events. In this paper, we present a novel
semi-automatic tool that enables users to efficiently identify and track
stories about real-world events on Twitter. We ran a user study with 25
participants, demonstrating that compared to more conventional methods, our
tool can increase the speed and the accuracy with which users can track stories
about real-world events.
Comment: ICWSM'16, May 17-20, Cologne, Germany. In Proceedings of the 10th
International AAAI Conference on Weblogs and Social Media (ICWSM 2016).
Cologne, Germany
A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In
this paper, we present a new method to collect large-scale sentential
paraphrases from Twitter by linking tweets through shared URLs. The main
advantage of our method is its simplicity: it removes the classifier or
human in the loop that previous work needed to select data before annotation
and the subsequent application of paraphrase identification algorithms. We
present the largest human-labeled paraphrase corpus to date of 51,524 sentence
pairs and the first cross-domain benchmarking for automatic paraphrase
identification. In addition, we show that more than 30,000 new sentential
paraphrases can be easily and continuously captured every month at ~70%
precision, and demonstrate their utility for downstream NLP tasks through
phrasal paraphrase extraction. We make our code and data freely available.
Comment: 11 pages, accepted to EMNLP 201
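The idea of linking tweets through shared URLs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `candidate_pairs`, the URL pattern, and the example tweets are hypothetical, and the abstract does not describe any URL normalization, filtering, or annotation steps. The sketch groups tweet texts by the URLs they contain and emits every pair of distinct texts that share a URL as a candidate paraphrase pair.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simple URL pattern, assumed for illustration only.
URL_RE = re.compile(r"https?://\S+")


def candidate_pairs(tweets):
    """Pair up tweet texts that link to the same URL.

    tweets: iterable of raw tweet strings.
    Returns a list of (text_a, text_b) tuples with the URLs stripped out.
    """
    by_url = defaultdict(list)
    for text in tweets:
        urls = set(URL_RE.findall(text))
        sentence = URL_RE.sub("", text).strip()
        for url in urls:
            # Skip empty texts and exact duplicates such as plain retweets.
            if sentence and sentence not in by_url[url]:
                by_url[url].append(sentence)
    pairs = []
    for url in sorted(by_url):
        pairs.extend(combinations(by_url[url], 2))
    return pairs


tweets = [
    "Storm hits the coast http://t.co/a",
    "Coast battered by storm http://t.co/a",
    "Unrelated post http://t.co/b",
]
print(candidate_pairs(tweets))
```

Pairs produced this way are only candidates. According to the abstract, the collected pairs still go through annotation and paraphrase identification.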