Projection multilingue d'annotations pour dialogues avancés
For a few years now, a growing number of applications have offered advanced dialogue interaction with the user. However, adapting these systems to new languages quickly becomes discouraging: because they depend heavily on the language for which they were designed, each new language requires its own significant development effort. The problem is compounded by the fact that quality usually depends on the size of the training set.
This project therefore seeks to speed up the process. It presents several methods for generating multilingual versions of an initial working system using statistical machine translation. The information attached to the source data is projected into another language to create related target data, which reduces the subsequent development time accordingly.
Several approaches were tested and analysed. In particular, a method that groups the data into clusters before reordering the associated translation candidates shows promising results.
Multi-word unit processing in machine translation. Developing and using language resources for multi-word unit processing in machine translation
2011 - 2012, XI n.s
Sentiment Analysis for the Low-Resourced Latinised Arabic "Arabizi"
The expansion of digital communication media from private mobile messaging into public social media has given data science researchers and industry an opportunity to mine the resulting big data for automatic information extraction. A popular information extraction task is sentiment analysis, which aims to extract opinion polarity (positive, negative, or neutral) from written natural language. It has helped organisations better understand public opinion towards events, news, public figures, and products.
However, sentiment analysis has advanced further for English than for Arabic. While sentiment analysis for Arabic is developing in the Natural Language Processing (NLP) literature, Arabizi, a popular variety of Arabic, has been overlooked.
Arabizi is an informal transcription of the spoken dialectal Arabic in Latin script used for social texting. It is known to be common among the Arab youth, yet it is overlooked in efforts on Arabic sentiment analysis for its linguistic complexities.
Like Arabic, Arabizi is rich in inflectional morphology, but it is also code-switched with English or French and transcribed without adhering to a standard orthography. Rich morphology, inconsistent orthography, and code-switching compound one another and multiply the lexical sparsity of the language: each Arabizi word can be spelled in many ways, and other languages are mixed into the same text. The resulting high degree of lexical sparsity undermines the most basic step of sentiment analysis, the classification of words as positive or negative. Arabizi also faces a severe shortage of the data resources needed for any sentiment analysis approach.
In this thesis, we address this gap by conducting research on sentiment analysis for Arabizi. We tackle the sparsity challenge by using deep learning to harvest Arabizi data from multilingual social media text and build Arabizi resources for sentiment analysis. We develop six new morphologically and orthographically rich Arabizi sentiment lexicons and set the baseline for Arabizi sentiment analysis on social media.
The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe
Proceedings of the 1st FLaReNet Forum on the European Language Resources and Technologies, held in Vienna, at the Austrian Academy of Sciences, on 12-13 February 2009
The impact of social factors on the use of Arabic-French code-switching in speech and IM in Morocco
The use of French in code-switching (CS) with Moroccan Colloquial Arabic (MCA) has been explored qualitatively in a number of studies, but quantitative methods have rarely been applied to CS in this language pair. Research on CS patterns as a function of extra-linguistic factors has similarly received little attention, despite the implication in many studies that these factors are significant in the use of CS. This dissertation seeks to address these gaps in the literature by quantitatively examining the use of Arabic-French CS by young adult speakers of MCA in spoken and written informal communication. This study examines three extra-linguistic factors in speech and Instant Messaging (IM): Sex, French Proficiency, and Language Attitude. The analysis reveals that male speakers use significantly more French in written IM. A positive attitude toward French and MCA-French CS has a highly significant impact on the rate of French employed in spoken conversation. Meaningful results are also found for the French constituents employed in CS with regard to each of the extra-linguistic factors. Notable differences are found between sexes in the types of French constituents used in both communication modes, as well as between speakers of different French proficiency levels. The categorization of French-origin nouns as instances of CS or borrowing is also explored by considering multiple aspects of the use of these lexical items. A number of French-origin nouns that are absent from dictionaries of MCA are proposed to have now been borrowed into the dialect. The analysis also reveals French-origin words that are used by a number of speakers but remain instances of CS. The results of this investigation highlight the importance of quantification in studies of CS and provide data for comparison with other corpora from this and other language pairs.
The differences identified in CS by communication mode indicate that there is a need for a model of written CS that accounts for the unique characteristics of this mode. Finally, little work has been published on the relationship between extra-linguistic factors and structural patterns in CS, but the current results suggest that the impact of social factors should not be ignored when considering structural aspects of CS.
French and Italia
Alternancia de lenguas en la comunicación mediada por ordenador entre las personas del Congo
Thesis of the Universidad Complutense de Madrid, Facultad de Filología, defended on 12/11/2018.
Language research in computer-mediated communication (hereafter CMC) is a relatively new and dynamically evolving field (Herring et al. 2013). Unlike offline (or face-to-face) communication, CMC is, according to Herring (1996), communication that takes place between human beings via the instrumentality of computers or other devices (e.g. smartphones, tablets, etc.) that allow users to connect to the Internet. CMC implies the use of Web 2.0 as a medium of communication. Understood as an umbrella term covering different phenomena (e.g. social networking communication, netspeak and so on), CMC includes different channels such as instant messaging, email, chatrooms, online forums, social networking services, and so on. CMC is characterised by two fundamental and opposing modes (Crystal 2001, 2003). The synchronous mode (or real-time conversation) takes place when all participants (senders and receivers) are simultaneously online during the text message exchange (e.g. chat rooms). The asynchronous mode, on the other hand, requires the messages to be stored in the addressees' inbox until they can be read (e.g. email). Nevertheless, Facebook, on which the present thesis is based, is a CMC channel that involves both synchronous and asynchronous modes (Pérez-Sabater 2012; Maíz-Arévalo 2015). While the literature on CMC is fast-growing, much evidence from many other languages and cultures is still needed (Herring 2010; Thurlow & Puff 2013). Hundreds of languages used in CMC around the world remain under-investigated.
In the particular case of Congo-Brazzaville, no attempt to investigate the nature of the impact that CMC is having on language(s) has been undertaken so far, though online materials have increasingly penetrated the country...
Code-switching (or language alternation), borrowing, cultural and linguistic transfer, convergence, and calque, generally known as linguistic phenomena, are inherent outcomes of language contact. According to the data, these phenomena occur in face-to-face communication as well as in online communication (Blom and Gumperz 1972; Poplack 2001; Gumperz 1961, 1982a; Myers-Scotton 1992, 1993a, 1993b, 2006; Gardner-Chloros 2009; Bullock & Toribio 2009). The present study therefore focuses on analysing some of these outcomes in a very specific context of online communication (computer-mediated communication, CMC): communication through the social network Facebook. Despite all attempts to investigate languages in online communication, a large number of languages remain under-investigated in the context of CMC. In the particular case of Congo-Brazzaville, no attempt has so far been made to investigate the nature of the impact that CMC is having on language use, even though online communication in these languages takes place daily. The present thesis intends to correct this imbalance by analysing code-switching in online communication in Congo-Brazzaville. According to the data, there were 400,000 active Internet users in 2017, and some sixty languages are spoken within the national borders.
The study of code-switching among Facebook users in Congo is therefore important not only for investigating the phenomenon itself, but also for providing data on the impact of CMC (above all Facebook) on the languages of Congo-Brazzaville. The aim of the present study is thus twofold: (1) to assess the different languages involved in Congolese Facebook discourse, and (2) to examine the sociolinguistic motivations for code-switching, as well as the syntactic structure in which it occurs...
Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference
No abstract available
Zināšanās bāzētu un korpusā bāzētu metožu kombinētā izmantošanas mašīntulkošanā
ABSTRACT.
Machine Translation (MT) systems are built using different methods (knowledge-based and corpus-based). Knowledge-based MT translates text using human-written rules. Corpus-based MT uses models that are automatically built from translation examples. Both methods have their advantages and disadvantages. This work aims to find a combined method that improves MT quality by combining both.
The applicability of these methods to Latvian (a small, morphologically rich, under-resourced language) is investigated. The existing MT methods are analysed and several combined methods are proposed. The methods have been implemented and evaluated using both automatic and human evaluation. Factored statistical MT with a rule-based morphological analyser is proposed as the most promising. The practical application of the method is also described.
Keywords: Machine Translation (MT), Rule-based MT, Statistical MT, Combined approach
A Strategy for Multilingual Spoken Language Understanding Based on Graphs of Linguistic Units
[EN] In this thesis, the problem of multilingual spoken language understanding is addressed using graphs to model and combine the different knowledge sources that take part in the understanding process. As a result of this work, a full multilingual spoken language understanding system has been developed, in which statistical models and graphs of linguistic units are used. One key feature of this system is its ability to combine and process multiple inputs provided by one or more sources such as speech recognizers or machine translators.
A graph-based monolingual spoken language understanding system was developed as a starting point. The input to this system is a set of sentences that is provided by one or more speech recognition systems. First, these sentences are combined by means of a grammatical inference algorithm in order to build a graph of words. Next, the graph of words is processed to construct a graph of concepts by using a dynamic programming algorithm that identifies the lexical structures that represent the different concepts of the task. Finally, the graph of concepts is used to build the best sequence of concepts.
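As an illustration only, the last step described above (finding the best sequence through a graph by dynamic programming) might look like the following minimal sketch. It is not the thesis's algorithm: the function name `best_path`, the edge format, and the toy scores are all assumptions made for this example, and it works on a plain word graph rather than a graph of concepts.

```python
# Hypothetical sketch: best-scoring path through a small word graph.
# Assumption: nodes are numbered 0..num_nodes-1 in topological order,
# and every edge goes from a lower-numbered node to a higher one.

def best_path(num_nodes, edges):
    """Return (labels, score) of the highest-scoring path from node 0
    to node num_nodes - 1.

    edges: list of (src, dst, label, log_prob) tuples with src < dst.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * num_nodes   # best score found for reaching each node
    back = [None] * num_nodes      # (previous node, edge label) on that path
    best[0] = 0.0

    # Sorting by source node guarantees that a node's best score is final
    # before any of its outgoing edges is relaxed.
    for src, dst, label, log_prob in sorted(edges):
        if best[src] == neg_inf:
            continue               # source node is unreachable
        candidate = best[src] + log_prob
        if candidate > best[dst]:
            best[dst] = candidate
            back[dst] = (src, label)

    # Follow back-pointers from the final node to recover the labels.
    labels = []
    node = num_nodes - 1
    while back[node] is not None:
        node, label = back[node]
        labels.append(label)
    labels.reverse()
    return labels, best[num_nodes - 1]


# Toy graph with made-up scores, merging two competing hypotheses.
toy_edges = [
    (0, 1, "a", -0.4), (0, 1, "the", -0.9),
    (1, 2, "flight", -0.3), (1, 2, "fright", -1.5),
    (1, 3, "flights", -1.0), (2, 3, "to", -0.1),
]
path, score = best_path(4, toy_edges)
print(path, score)
```

On this toy graph the path through "a", "flight", "to" wins because its summed log-scores are highest; a real system would derive the graph and scores from recognizer output.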
The multilingual case arises when the user speaks a language different from the one natively supported by the system. In this thesis, a test-on-source approach was followed: the input sentences are translated into the system's language and then processed by the monolingual system. For this purpose, two speech translation systems were developed. The outputs of these speech translation systems are graphs of words, which are then processed by the monolingual graph-based spoken language understanding system.
In both the monolingual and the multilingual case, the experimental results show that combining several inputs improves on the results obtained with a single input. In fact, when several inputs are combined, this approach outperforms the current state of the art in many cases.
Calvo Lance, M. (2016). A Strategy for Multilingual Spoken Language Understanding Based on Graphs of Linguistic Units [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/62407