288 research outputs found

    Fine-Grained Analysis of Language Varieties and Demographics

    Full text link
    [EN] The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the anonymous users and this could be useful in several domains beyond security and forensics such as marketing, for example. In this paper, we focus on a fine-grained analysis of language varieties while considering also the authors¿ demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with the best performing teams in the Author Profiling task at PAN 2017. We obtained an average accuracy of 92.08% versus 91.84% for the best performing team at PAN 2017. We also analyse the relationship of the language variety identification with the authors¿ gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We have also investigated the effect of the authors¿ age and gender on the identification of the different Arabic varieties, as well as the effect of the corpus size on the performance of our method.This publication was made possible by NPRP grant 9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.Rangel, F.; Rosso, P.; Zaghouani, W.; Charfi, A. (2020). Fine-Grained Analysis of Language Varieties and Demographics. Natural Language Engineering. 26(6):641-661. https://doi.org/10.1017/S1351324920000108S641661266Kestemont, M. , Tschuggnall, M. , Stamatatos, E. , Daelemans, W. , Specht, G. , Stein, B. and Potthast, M. (2018). Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153-157. doi:10.1007/bf02295996Lui, M. and Cook, P. (2013). Classifying english documents by national dialect. In Proceedings of the Australasian Language Technology Association Workshop, Citeseer pp. 5–15.Basile, A. , Dwyer, G. , Medvedeva, M. , Rawee, J. , Haagsma, H. and Nissim, M. (2017). Is there life beyond n-grams? A simple SVM-based author profiling system. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Elfardy, H. and Diab, M.T. (2013). Sentence level dialect identification in arabic. In Association for Computational Linguistics (ACL), pp. 456–461.Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523. doi:10.1016/0306-4573(88)90021-0Zaghouani, W. and Charfi, A. (2018a). ArapTweet: A large MultiDialect Twitter corpus for gender, age and language variety identification. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.Zampieri, M. , Tan, L. , Ljubešić, N. , Tiedemann, J. and Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 1–9.Huang, C.-R. and Lee, L.-H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In PACLIC, pp. 404–410.Zaidan, O. F., & Callison-Burch, C. (2014). Arabic Dialect Identification. Computational Linguistics, 40(1), 171-202. doi:10.1162/coli_a_00169Grouin, C. , Forest, D. , Paroubek, P. and Zweigenbaum, P. (2011). Présentation et résultats du défi fouille de texte DEFT2011 Quand un article de presse a t-il été écrit? À quel article scientifique correspond ce résumé? Actes du septième Défi Fouille de Textes, p. 3.Martinc, M. , Skrjanec, I. , Zupan, K. and Pollak, S. Pan (2017). Author profiling – gender and language variety prediction. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Rangel, F. , Rosso, P. and Franco-Salvador, M. (2016b). A low dimensionality representation for language variety identification. In 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, LNCS. Springer-Verlag, arxiv:1705.10754.Hagen, M. , Potthast, M. and Stein, B. (2018). Overview of the Author Obfuscation Task at PAN 2018. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.Zampieri, M. and Gebre, B.G. (2012). Automatic identification of language varieties: The case of portuguese. In The 11th Conference on Natural Language Processing (KONVENS), pp. 233–237 (2012)Rangel, F. , Rosso, P. , Montes-y-Gómez, M. , Potthast, M. and Stein, B. (2018). Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.Heitele, D. (1975). An epistemological view on fundamental stochastic ideas. Educational Studies in Mathematics, 6(2), 187-205. doi:10.1007/bf00302543Inches, G. and Crestani, F. (2012). Overview of the International Sexual Predator Identification Competition at PAN-2012. CLEF Online working notes/labs/workshop, vol. 30.Rosso, P. , Rangel Pardo, F.M. , Ghanem, B. and Charfi, A. (2018b). ARAP: Arabic Author Profiling Project for Cyber-Security. Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN).Agić, Ž. , Tiedemann, J. , Dobrovoljc, K. , Krek, S. , Merkler, D. , Može, S. , Nakov, P. , Osenova, P. and Vertan, C. (2014). Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants. Association for Computational Linguistics.Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and Dialects in Social Media. Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP). doi:10.3115/v1/w14-5904Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Antònia Martít, M. (2015). Language Variety Identification Using Distributed Representations of Words and Documents. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 28-40. doi:10.1007/978-3-319-24027-5_3Rosso, P., Rangel, F., Farías, I. H., Cagnina, L., Zaghouani, W., & Charfi, A. (2018). A survey on author profiling, deception, and irony detection for the Arabic language. Language and Linguistics Compass, 12(4), e12275. doi:10.1111/lnc3.12275Malmasi, S. , Zampieri, M. , Ljubešić, N. , Nakov, P. , Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 1–14.Rangel, F. , Rosso, P. , Potthast, M. and Stein, B. (2017). Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In Cappellato L., Ferro N., Goeuriot, L. and Mandl T. (eds), Working Notes Papers of the CLEF 2017 Evaluation Labs, p. 1613–0073, CLEF and CEUR-WS.org.Zampieri, M. , Malmasi, S. , Ljubešić, N. , Nakov, P. , Ali, A. , Tiedemann, J. , Scherrer, Y. , Aepli, N. (2017). Findings of the vardial evaluation campaign 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1–15.Bogdanova, D., Rosso, P., & Solorio, T. (2014). Exploring high-level features for detecting cyberpedophilia. Computer Speech & Language, 28(1), 108-120. doi:10.1016/j.csl.2013.04.007Maier, W. and Gómez-Rodríguez, C. (2014). Language Variety Identification in Spanish Tweets. LT4CloseLang.Castro, D. , Souza, E. , de Oliveira, A.L.I. (2016). Discriminating between Brazilian and European Portuguese national varieties on Twitter texts. In 5th Brazilian Conference on Intelligent Systems (BRACIS), pp. 265–270.Zaghouani, W. and Charfi, A. (2018b). Guidelines and annotation framework for Arabic author profiling. In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools, 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.Hernández Fusilier, D., Montes-y-Gómez, M., Rosso, P., & Guzmán Cabrera, R. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433-443. doi:10.1016/j.ipm.2014.11.001Tellez, E.S. , Miranda-Jiménez, S. , Graff, M. and Moctezuma, D. (2017). Gender and language variety identification with microtc. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds). CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Kandias, M., Stavrou, V., Bozovic, N., & Gritzalis, D. (2013). Proactive insider threat detection through social media. Proceedings of the 12th ACM workshop on Workshop on privacy in the electronic society. doi:10.1145/2517840.251786

    Natural language processing for similar languages, varieties, and dialects: A survey

    Get PDF
    There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.Non peer reviewe

    Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

    Get PDF
    Peer reviewe

    A survey on author profiling, deception, and irony detection for the Arabic language

    Full text link
    "This is the peer reviewed version of the following article: [FULL CITE], which has been published in final form at [Link to final article using the DOI]. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving."[EN] The possibility of knowing people traits on the basis of what they write is a field of growing interest named author profiling. To infer a user's gender, age, native language, language variety, or even when the user lies, simply by analyzing her texts, opens a wide range of possibilities from the point of view of security. In this paper, we review the state of the art about some of the main author profiling problems, as well as deception and irony detection, especially focusing on the Arabic language.Qatar National Research Fund, Grant/Award Number: NPRP 9-175-1-033Rosso, P.; Rangel-Pardo, FM.; Hernandez-Farias, DI.; Cagnina, L.; Zaghouani, W.; Charfi, A. (2018). A survey on author profiling, deception, and irony detection for the Arabic language. Language and Linguistics Compass. 12(4):1-20. https://doi.org/10.1111/lnc3.12275S120124Abuhakema , G. Faraj , R. Feldman , A. Fitzpatrick , E. 2008 Annotating an arabic learner corpus for error Proceedings of The sixth international conference on Language Resources and Evaluation, LREC 2008Adouane , W. Dobnik , S. 2017 Identification of languages in algerian arabic multilingual documents Proceedings of The Third Arabic Natural Language Processing Workshop (WANLP)Adouane , W. Semmar , N. Johansson , R 2016a Romanized berber and romanized arabic automatic language identification using machine learning Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects; COLING 53 61Adouane , W. Semmar , N. Johansson , R. 2016b ASIREM participation at the discriminating similar languages shared task 2016 Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects; COLING 163 169Adouane , W. Semmar , N. Johansson , R. Bobicev , V. 2016c Automatic detection of arabicized berber and arabic varieties Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects; COLING 63 72Alfaifi , A. Atwell , E. Hedaya , I. 2014 Arabic learner corpus (ALC) v2: A new written and spoken corpus of Arabic learnersAlharbi , K. 2015 The irony volcano explodes black comedyAli , A. Bell , P. Renals , S. 2015 Automatic dialect detection in Arabic broadcast speechAlmeman , K. Lee , M. 2013 Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words 1 6Aloshban , N. Al-Dossari , H. 2016 A new approach for group spam detection in social media for Arabic language (AGSD) 20 23Al-Sabbagh , R. Girju , R. 2012 YADAC: Yet another dialectal Arabic corpusAlsmearat , K. Al-Ayyoub , M. Al-Shalabi , R. 2014 An extensive study of the bag-of-words approach for gender identification of Arabic articlesAlsmearat , K. Shehab , M. Al-Ayyoub , M. Al-Shalabi , R. Kanaan , G. 2015 Emotion analysis of Arabic articles and its impact on identifying the authors genderArfath , P. Al-Badrashiny , M. Diab , M. El Kholy , A. Eskander , R. Habash , N. Pooleery , M. Rambow , O. Roth , R. M. 2014 MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of ArabicBarbieri , F. Basile , V. Croce , D. Nissim , M. Novielli , N. Patti , V. 2016 Overview of the Evalita 2016 sentiment polarity classification taskBarbieri , F. Saggion , H 2014 Modelling irony in twitter 56 64Barbieri , F. Saggion , H. Ronzano , F 2014 Modelling sarcasm in Twitter, a novel approachBasile , V. Bolioli , A. Nissim , M. Patti , V. Rosso , P. 2014 Overview of the Evalita 2014 sentiment polarity classification taskBlanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2013). TOEFL11: A CORPUS OF NON-NATIVE ENGLISH. ETS Research Report Series, 2013(2), i-15. doi:10.1002/j.2333-8504.2013.tb02331.xBosco, C., Patti, V., & Bolioli, A. (2013). Developing Corpora for Sentiment Analysis: The Case of Irony and Senti-TUT. IEEE Intelligent Systems, 28(2), 55-63. doi:10.1109/mis.2013.28Bouamor , H. Habash , N. Salameh , M. Zaghouani , W. Rambow , O. Abdulrahim , D. Oflazer , K. 2018 The MADAR Arabic Dialect Corpus and LexiconBouchlaghem , R. Elkhlifi , A. Faiz , R. 2014 Tunisian dialect Wordnet creation and enrichment using web resources and other Wordnets 104 113 https://doi.org/10.3115/v1/W14-3613Boujelbane , R. BenAyed , S. Belguith , L. H. 2013 Building bilingual lexicon to create dialect Tunisian corpora and adapt language modelCagnina L. Rosso , P 2015 Classification of deceptive opinions using a low dimensionality representationCavalli-Sforza , V. Saddiki , H. Bouzoubaa , K. Abouenour , L. Maamouri , M. Goshey , E. 2013 Bootstrapping a Wordnet for an Arabic dialect from other Wordnets and dictionary resourcesCotterell , R. Callison-Burch , C. 2014 A multi-dialect, multi-genre corpus of informal written ArabicDahlmeier , D. Tou Ng , H. Mei Wu , S. 2013 Building a large annotated corpus of learner English: the NUS corpus of learner English 22 31Darwish , K. Sajjad , H. Mubarak , H. 2014 Verifiably effective Arabic dialect identification 1465 1468Duh , K. Kirchhoff , K. 2006 Lexicon acquisition for dialectal Arabic using transductive learningElfardy , E. Diab , M. T. 2013 Sentence level dialect identification in Arabic 456 461Estival , D. Gaustad , T. Hutchinson , B. Bao-Pham , S. Radford , W. 2008 Author profiling for English and Arabic emailsFitzpatrick, E., Bachenko, J., & Fornaciari, T. (2015). Automatic Detection of Verbal Deception. Synthesis Lectures on Human Language Technologies, 8(3), 1-119. doi:10.2200/s00656ed1v01y201507hlt029Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Antònia Martít, M. (2015). Language Variety Identification Using Distributed Representations of Words and Documents. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 28-40. doi:10.1007/978-3-319-24027-5_3Ghosh , A. Li , G. Veale , T. Rosso , P. Shutova , E. Barnden , J. Reyes , A. 2015 Semeval-2015 task 11: Sentiment analysis of figurative language in twitter 470 478Graff , D. Maamouri , M. 2012 Developing LMF-XML bilingual dictionaries for colloquial Arabic dialects 269 274Habash , N. Khalifa , S. Eryani , F. Rambow , O. Abdulrahim , D. Erdmann , A. Saddiki , H. 2018 Unified Guidelines and Resources for Arabic Dialect OrthographyHabash , N. Rambow , O. Kiraz , G. 2005 Morphological analysis and generation for Arabic dialectsHaggan, M. (1991). Spelling errors in native Arabic-speaking English majors: A comparison between remedial students and fourth year students. System, 19(1-2), 45-61. doi:10.1016/0346-251x(91)90007-cHassan , H. Daud , N. M. 2011 Corpus analysis of conjunctions: Arabic learners difficulties with collocationsHayes-Harb, R. (2006). Native Speakers of Arabic and ESL Texts: Evidence for the Transfer of Written Word Identification Processes. TESOL Quarterly, 40(2), 321. doi:10.2307/40264525Hernández-Farías, I., Benedí, J.-M., & Rosso, P. (2015). Applying Basic Features from Sentiment Analysis for Automatic Irony Detection. Lecture Notes in Computer Science, 337-344. doi:10.1007/978-3-319-19390-8_38Hernández Fusilier, D., Montes-y-Gómez, M., Rosso, P., & Guzmán Cabrera, R. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433-443. doi:10.1016/j.ipm.2014.11.001Karoui , J. Benamara , F. Moriceau , V. Aussenac-Gilles , N. Hadrich Belguith , L. 2015 Towards a contextual pragmatic model to detect irony in tweetsKaroui , J. Zitoune , F. B. Moriceau , V. 2017 SOUKHRIA: Towards an irony detection system for Arabic in social mediaLjubesic , N. Mikelic , N. Boras , D. 2007 Language identification: How to distinguish similar languagesLópez-Monroy, A. P., Montes-y-Gómez, M., Escalante, H. J., Villaseñor-Pineda, L., & Stamatatos, E. (2015). Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems, 89, 134-147. doi:10.1016/j.knosys.2015.06.024Magdy, W., Darwish, K., & Weber, I. (2016). #FailedRevolutions: Using Twitter to study the antecedents of ISIS support. First Monday. doi:10.5210/fm.v21i2.6372Maier , W. Gomez-Rodriguez , C. 2014 Language variety identification in Spanish tweetsMalmasi , S. Dras , M. 2014 Arabic native language identificationMechti , S. Abbassi , A. Belguith , L. H. Faiz , R. 2016 An empirical method using features combination for Arabic native language identificationMukherjee, A., Liu, B., & Glance, N. (2012). Spotting fake reviewer groups in consumer reviews. Proceedings of the 21st international conference on World Wide Web - WWW ’12. doi:10.1145/2187836.2187863Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants. (2014). doi:10.3115/v1/w14-42Pennebaker , J. W. Chung , C. K. Ireland , M. E. Gonzales , A. L. Booth , R. J. 2007 The development and psychometric properties of LIWC2007 http://www.liwc.net/LIWC2007LanguageManual.pdf http://liwc.netPotthast , M. Rangel , F. Tschuggnall , M. Stamatatos , E. Rosso , P. Stein , B. 2017 Overview of PAN'17 G. Jones 10456 Springer, ChamRandall M. Groom , N. 2009 The BUiD Arab learner corpus: a resource for studying the acquisition of l2 English spellingRangel , F. Rosso , P. 2015 On the multilingual and genre robustness of emographs for author profiling in social media 274 280 Springer-Verlag, LNCSRangel, F., & Rosso, P. (2016). On the impact of emotions on author profiling. Information Processing & Management, 52(1), 73-92. doi:10.1016/j.ipm.2015.06.003Rangel , F. Rosso , P. Koppel , M. Stamatatos , E. Inches , G. 2013 Overview of the author profiling task at PAN 2013 P. Forner R. Navigli D. TufisRangel , F. Rosso , P. Potthast , M. Stein , B. Daelemans , W. 2015 Overview of the 3rd author profiling task at PAN 2015 L. Cappellato N. Ferro G. Jones E. San JuanRangel , F. Rosso , P. Verhoeven , B. Daelemans , W. Potthast , M. Stein , B. 2016 Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluationsRefaee , E. Rieser , V. 2014 An Arabic twitter corpus for subjectivity and sentiment analysis 2268 2273Reyes, A., Rosso, P., & Buscaldi, D. (2012). From humor recognition to irony detection: The figurative language of social media. Data & Knowledge Engineering, 74, 1-12. doi:10.1016/j.datak.2012.02.005Reyes, A., Rosso, P., & Veale, T. (2012). A multidimensional approach for detecting irony in Twitter. Language Resources and Evaluation, 47(1), 239-268. doi:10.1007/s10579-012-9196-xRosso, P., & Cagnina, L. C. (2017). Deception Detection and Opinion Spam. Socio-Affective Computing, 155-171. doi:10.1007/978-3-319-55394-8_8Saâdane , H. 2015 Traitement Automatique de L'Arabe Dialectalise: Aspects Methodologiques et AlgorithmiquesSaâdane , H. Nouvel , D. Seffih , H. Fluhr , C. 2017 Une approche linguistique pour la détection des dialectes arabesSadat , F. Kazemi , F. Farzindar , A. 2014 Automatic identification of Arabic language varieties and dialects in social mediaSadhwani , P. 2005 Phonological and orthographic knowledge: An Arab-Emirati perspectiveSchler , J. Koppel , M. Argamon , S. Pennebaker , J. W. 2006 Effects of age and gender on blogging 199 205Shoufan , A. Al-Ameri , S. 2015 Natural language processing for dialectical Arabic: A surveySoliman , T. Elmasry , M. Hedar , A-R. Doss , M. 2013 MINING SOCIAL NETWORKS' ARABIC SLANG COMMENTSSulis, E., Irazú Hernández Farías, D., Rosso, P., Patti, V., & Ruffo, G. (2016). Figurative messages and affect in Twitter: Differences between #irony, #sarcasm and #not. Knowledge-Based Systems, 108, 132-143. doi:10.1016/j.knosys.2016.05.035Tetreault , J. Blanchard , D. Cahill , A. 2013 A report on the first native language identification shared task Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications 48 57Tillmann , C. Mansour , S. Al Onaizan , Y. 2014 Improved sentence-level Arabic dialect classification Proceedings of the VarDia006C Workshop 110 119Tono, Y. (2012). International Corpus of Crosslinguistic Interlanguage: Project overview and a case study on the acquisition of new verb co-occurrence patterns. Tokyo University of Foreign Studies, 27-46. doi:10.1075/tufs.4.07tonWahsheh , H. A. Al-Kabi , M. N. Alsmadi , I. M. 2013b SPAR: A system to detect spam in Arabic opinionsZaghouani , W. Charfi , A. 2018a Arap-Tweet: A Large Multi-Dialect Twitter Corpus for Gender, Age and Language Variety Identification Miyazaki, JapanZaghouani , W. Charfi , A. 2018b Guidelines and Annotation Framework for Arabic Author Profiling Miyazaki, JapanZaghouani , W. Mohit , B. Habash , N. Obeid , O. Tomeh , N. Rozovskaya , A. Farra , N. Alkuhlani , S. Oflazer , K. 2014 Large scale Arabic error annotation: Guidelines and frameworkZaghouani , W. Habash , N. Bouamor , H. Rozovskaya , A. Mohit , B. Heider , A. Oflazer , K. 2015 Correction annotation for non-native Arabic texts: Guidelines and corpus Proceedings of the Association for Computational Linguistics, Fourth Linguistic Annotation Workshop 129 139Zaidan , O. F. Callison-Burch , C 2011 The Arabic online commentary dataset: An annotated dataset of informal Arabic with high dialectal content Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers -Volume 2 Association for Computational Linguistics 37 41Zaidan, O. F., & Callison-Burch, C. (2014). Arabic Dialect Identification. Computational Linguistics, 40(1), 171-202. doi:10.1162/coli_a_00169Zampieri , M. Gebre , B. G. 2012 Automatic identification of language varieties: The case of PortugueseZampieri , M. Tan , L. Ljubesic , N. Tiedemann , J. 2014 A report on the DSL shared task 2014 Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects 58 67Zampieri , M. Tan , L. Ljubesic , N. Tiedemann , J. Nakov , P. 2015 Overview of the DSL shared task 2015 1Zbib , R. Malchiodi , E. Devlin , J. Stallard , D. Matsoukas , S. Schwartz , R. Makhoul , J. Zaidan , O. F. Callison Burch , C. 2012 Machine translation of Arabic dialects Proceedings of the 2012 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies Association for Computational Linguistics 49 5

    Language identification in texts

    Get PDF
    This work investigates the task of identifying the language of digitally encoded text. Automatic methods for language identification have been developed since the 1960s. During the years, the significance of language identification as an important preprocessing element has grown at the same time as other natural language processing systems have become mainstream in day-to-day applications. The methods used for language identification are mostly shared with other text classification tasks as almost any modern machine learning method can be trained to distinguish between different languages. We begin the work by taking a detailed look at the research so far conducted in the field. As part of this work, we provide the largest survey on language identification available so far. Comparing the performance of different language identification methods presented in the literature has been difficult in the past. Before the introduction of a series of language identification shared tasks at the VarDial workshops, there were no widely accepted standard datasets which could be used to compare different methods. The shared tasks mostly concentrated on the issue of distinguishing between similar languages, but other open issues relating to language identification were addressed as well. In this work, we present the methods for language identification we have developed while participating in the shared tasks from 2015 to 2017. Most of the research for this work was accomplished within the Finno-Ugric Languages and the Internet project. In the project, our goal was to find and collect texts written in rare Uralic languages on the Internet. In addition to the open issues addressed at the shared tasks, we dealt with issues concerning domain compatibility and the number of languages. We created an evaluation set-up for addressing short out-of-domain texts in a large number of languages. Using the set-up, we evaluated our own method as well as other promising methods from the literature. The last issue we address in this work is the handling of multilingual documents. We developed a method for language set identification and used a previously published dataset to evaluate its performance.Tässä väitöskirjassa tutkitaan digitaalisessa muodossa olevan tekstin kielen automaattista tunnistamista. Tekstin kielen tunnistamisen automaattisia menetelmiä on kehitetty jo 1960-luvulta lähtien. Kuluneiden vuosikymmenien aikana kielentunnistamisen merkitys osana laajempia tietojärjestelmiä on vähitellen kasvanut. Tekstin kieli on tarpeellista tunnistaa, jotta tekstin jatkokäsittelyssä osataan käyttää sopivia kieliteknologisia menetelmiä. Tekstin kielentunnistus on kieleltään tai kieliltään tuntemattoman tekstin kielen tai kielien määrittämistä. Suurimmaksi osaksi kielentunnistukseen käytettyjä menetelmiä käytetään tai voidaan käyttää tekstin luokitteluun myös tekstin muiden ominaisuuksien, kuten aihealueen, perusteella. Tähän artikkeliväitöskirjaan kuuluvassa katsausartikkelissa esittelemme laajasti kielentunnistuksen tähänastista tutkimusta ja käymme kattavasti lävitse kielentunnistukseen tähän mennessä käytetyt menetelmät. Seuraavat kolme väistöskirjan artikkelia esittelevät ne kielentunnistuksen menetelmät joita käytimme VarDial työpajojen yhteydessä järjestetyissä kansainvälisissä kielentunnistuskilpailuissa vuodesta 2015 vuoteen 2017. Suurin osa tämän väitöskirjan tutkimuksesta on tehty osana Koneen säätiön rahoittamaa suomalais-ugrilaiset kielet ja internet -hanketta. Hankkeen päämääränä oli löytää internetistä tekstejä, jotka olivat kirjoitettu harvinaisemmilla uralilaisilla kielillä ja väitöskirjan viides artikkeli keskittyy projektin alkuvaiheiden kuvaamiseen. Väitöskirjan kuudes artikkeli kertoo miten hankkeen verkkoharavaan liitetty kielentunnistin evaluoitiin vaativasssa testiympäristössä, joka sisälsi tekstejä kirjoitettuna 285 eri kielellä. Seitsemäs ja viimeinen artikkeli käsittelee monikielisten tekstien kielivalikoiman selvittämistä

    Medidas de distância entre línguas baseadas em corpus. Aplicação à linguística histórica do galego, português, espanhol e inglês

    Get PDF
    xiv, 225 p.El objetivo de esta tesis es plantear y verificar una metodología basada en corpus paracuantificar automáticamente la distancia entre lenguas y variantes de lenguas. Para ello se ha partido de las técnicas usadas y contrastadas en identificación de idiomas, buscando aquellasque son más robustas y pueden cuantificar cuánto se acerca un texto a un modelo de lenguaje.También como objetivo secundario hemos investigado el papel que juega la ortografía comofactor de divergencia y convergencia entre lenguas.El método elegido es no-supervisado y puede aplicarse al cálculo de la distancia entre idiomas,entre períodos históricos de lenguas o entre variantes de lenguas

    TweetLID : a benchmark for tweet language identification

    Get PDF
    Language identification, as the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain still unresolved: (1) distinction of similar languages, (2) detection of multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task helped us shed some light on the shortcomings of state-of-the-art language identification systems, and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts exacerbate the performance of language identifiers. Our dataset with nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study the aforementioned issues on language identification within a common setting that enables to compare results with one another
    • …
    corecore