137 research outputs found

    Towards Understanding Egyptian Arabic Dialogues

    Labelling a user's utterances to understand the user's intent, known as Dialogue Act (DA) classification, is considered the key component of the dialogue language understanding layer in automatic dialogue systems. In this paper, we propose a novel approach to labelling user utterances in Egyptian spontaneous dialogues and instant messages using a Machine Learning (ML) approach that does not rely on any special lexicons, cues, or rules. Due to the lack of an Egyptian dialect dialogue corpus, the system is evaluated on a multi-genre corpus of 4725 utterances from three domains, collected and annotated manually from Egyptian call centers. The system achieves an F1 score of 70.36% over all domains. Comment: arXiv admin note: substantial text overlap with arXiv:1505.0308

    Author Profiling in Social Media: The Impact of Emotions on Discourse Analysis

    In this paper we summarise the content of the keynote that will be given at the 5th International Conference on Statistical Language and Speech Processing (SLSP) in Le Mans, France, on October 23–25, 2017. In the keynote we will address the importance of inferring demographic information for marketing and security reasons. The aim is to model how language is shared within gender and age groups, taking into account its statistical usage. We will see how a shallow discourse analysis can be done on the basis of a graph-based representation in order to extract information such as how complicated the discourse is (i.e., how connected the graph is), how interconnected the grammatical categories are, how far a grammatical category is from the others, how different grammatical categories are related to each other, how the discourse is modelled in different structural or stylistic units, which grammatical categories are most central in the discourse of a demographic group, and which connectors are most common in the linguistic structures used. Moreover, we will also see the importance of considering emotions in the shallow discourse analysis and the impact that this has. We carried out experiments on identifying gender and age, in both Spanish and English, using the PAN-AP-13 and PAN-PC-14 corpora, obtaining results comparable to those of the best performing systems of the PAN Lab at CLEF. The research work described in this paper was partially carried out in the framework of the SomEMBED project (TIN2015-71147-C2-1-P), funded by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO). Rosso, P.; Rangel-Pardo, FM. (2017). Author Profiling in Social Media: The Impact of Emotions on Discourse Analysis. Lecture Notes in Computer Science. 10583:3-18. https://doi.org/10.1007/978-3-319-68456-7_1

    A Trie-Structured Bayesian Model for Unsupervised Morphological Segmentation

    In this paper, we introduce a trie-structured Bayesian model for unsupervised morphological segmentation. We adopt prior information from different sources in the model. We use neural word embeddings to discover words that are morphologically derived from each other and are therefore semantically similar. We use letter successor variety counts obtained from tries that are built using neural word embeddings. Our results show that using different information sources, such as neural word embeddings and letter successor variety, as prior information improves morphological segmentation in a Bayesian model. Our model outperforms other unsupervised morphological segmentation models on Turkish and gives promising results on English and German for scarce resources. Comment: 12 pages, accepted and presented at the CICLING 2017 - 18th International Conference on Intelligent Text Processing and Computational Linguistic
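
    As a rough illustration of the letter successor variety idea mentioned in this abstract, the sketch below counts, for each prefix of a word, how many distinct continuations a trie holds. It is not the paper's Bayesian model: the trie here is built from a plain toy word list rather than from word-embedding neighbours, the end-of-word marker is counted as a continuation by assumption, and the names `build_trie`, `successor_variety`, and `split_at_peak` are hypothetical.

```python
def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def successor_variety(trie, word):
    """Return, for each prefix of `word`, the number of distinct continuations.

    The end-of-word marker counts as one continuation (an assumption of
    this sketch). Stops early if the word leaves the trie.
    """
    counts = []
    node = trie
    for ch in word:
        if ch not in node:
            break
        node = node[ch]
        counts.append(len(node))
    return counts


def split_at_peak(trie, word):
    """Split `word` once, after the prefix with the highest successor variety."""
    counts = successor_variety(trie, word)
    if len(counts) < 2:
        return (word, "")
    # Ignore the final position so that the suffix is never empty.
    peak = max(range(len(counts) - 1), key=counts.__getitem__)
    return (word[: peak + 1], word[peak + 1 :])


vocabulary = ["walk", "walks", "walked", "walking", "wall", "talk"]
trie = build_trie(vocabulary)
print(successor_variety(trie, "walking"))  # [1, 1, 2, 4, 1, 1, 1]
print(split_at_peak(trie, "walking"))      # ('walk', 'ing')
```

    A high count after a prefix suggests a morpheme boundary, because many different letters can follow a complete stem. A real system would use such counts as one signal among several rather than as a hard segmentation rule.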

    ACTIV-ES: a novel Spanish-language corpus for linguistic and cultural comparisons between communities of the Hispanic world

    This proposal requests Level 1 funding to develop a novel Spanish-language corpus, ACTIV-ES. This electronic resource will be the first to compile the language of common, everyday life for three linguistically, culturally, and geographically distinct communities: Spain, Mexico, and Argentina. It will provide scholars, instructors, students, and other interested parties with a unique perspective, enabling for the first time a rich cross-linguistic and cross-cultural analysis of current patterns and themes in the Hispanic world. A series of planning sessions among experts in linguistics, pedagogy, computer science, and psychology will guide the technical and theoretical steps to optimize ACTIV-ES for applications in second-language pedagogy and to enable a contemporary humanistic understanding that has so far been impossible. Insights gained from the project will inform a Level 2 proposal aimed at adding size, attributes, and a web interface to enable flexible public and scholarly access to the corpus.

    Preliminary Experiments on Unsupervised Word Discovery in Mboshi

    The necessity to document thousands of endangered languages encourages collaboration between linguists and computer scientists in order to provide the documentary linguistics community with the support of automatic processing tools. The French-German ANR-DFG project Breaking the Unwritten Language Barrier (BULB) aims at developing such tools for three mostly unwritten African languages of the Bantu family. For one of them, Mboshi, a language originating from the "Cuvette" region of the Republic of Congo, we investigate unsupervised word discovery techniques from an unsegmented stream of phonemes. We compare different models and algorithms, both monolingual and bilingual, on a new corpus in Mboshi and French, and discuss various ways to represent the data with suitable granularity. An additional French-English corpus allows us to contrast the results obtained on Mboshi and to experiment with more data.

    Innovative technologies for under-resourced language documentation: The BULB Project

    The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. To achieve this, we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from the area of natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed for this we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and of the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that will support the linguists in their work, taking into account the linguists' needs and the technology's capabilities. The data collection has begun for the three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of different speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.