22 research outputs found

    Maghrebi Arabic dialect processing: an overview

    Natural Language Processing for Arabic dialects has grown considerably in recent years, with work addressing all aspects of the field. However, some Arabic dialect (AD) varieties have received more attention and have a growing collection of resources. Other varieties, such as Maghrebi, still lag behind in that respect. Maghrebi Arabic is the family of Arabic dialects spoken in the Maghreb region (principally Algeria, Tunisia and Morocco). In this work we are interested in the dialects of these three countries. This paper presents a review of natural language processing for Maghrebi Arabic dialects

    The morphosyntactic structure of number in Italian and Albanian : high and low plurals

    I adopt the view that there are two number positions, including a lower Class position also hosting gender and a higher Num position. Italian -a plurals and Albanian neuters are associated with a cluster of properties often thought to characterize low plurals: application to a restricted set of lexical bases, meaning idiosyncrasies, association with (feminine) gender and agreement in the singular with the finite verb. Current analyses associate count Ns (both singular and plural) with a specialized node while treating mass Ns as default. I argue that mass Ns are associated with a specialized feature [aggr] (Albanian neuter) - and that a divisibility feature [part] for plural can attach to both count and mass bases (Italian -a). The properties of low number depend on the properties of the Class position, including the fact that it is low enough to select gender and also to combine with a different Num, yielding mixed agreement

    Morphology and linguistic typology : on-line-proceedings of the Fourth Mediterranean Morphology Meeting (MMM4), 21-23 September 2003

    No full text

    Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Ressources MLR2004

    No full text
    In an ever expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building harmonised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, word-nets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 1990s, most efforts in scaling up these resources have remained the responsibility of the local authorities, usually with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop and to other workshops and conferences dedicated to similar topics shows that dealing with multilingual linguistic resources has become a very active problem in the Natural Language Processing community.
To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange, compare their approaches and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the Coling 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants

    Time, events and temporal relations: an empirical model for temporal processing of Italian texts

    The aim of this work is the elaboration of a computational model for the identification of temporal relations in text/discourse to be used as a component in more complex systems for Open-Domain Question Answering, Information Extraction and Summarization. More specifically, the thesis will concentrate on the relationships between the various elements which signal temporal relations in Italian texts/discourses, on their roles and how they can be exploited. Time is a pervasive element of human life. It is the primary element thanks to which we are able to observe, describe and reason about what surrounds us and the world. The absence of a correct identification of the temporal ordering of what is narrated and/or described may result in poor comprehension, which can lead to misunderstanding. Normally, texts/discourses present situations standing in a particular temporal ordering. Whether these situations precede, or overlap or are included one within the other is inferred during the general process of reading and understanding. Nevertheless, to perform this seemingly easy task, we take into account a set of complex information involving different linguistic entities and sources of knowledge. A wide variety of devices is used in natural languages to convey temporal information. Verb tense, temporal prepositions, subordinate conjunctions and adjectival phrases are some of the most obvious. Nevertheless, even these obvious devices have different degrees of temporal transparency, which may not be as obvious as they appear on a quick and superficial analysis. One of the main shortcomings of previous research on temporal relations is that it concentrated only on a particular discourse segment, namely narrative discourse, disregarding the fact that a text/discourse is composed of different types of discourse segments and relations. A good theory or framework for temporal analysis must take into account all of them.
In this work, we have concentrated on the elaboration of a framework which could be applied to all text/discourse segments, without paying too much attention to their type, since we claim that temporal relations can be recovered in every kind of discourse segment and not only in narrative ones. The model we propose is obtained by mixing together theoretical assumptions and empirical data, collected by means of two tests submitted to a total of 35 subjects with different backgrounds. The main results we have obtained from these empirical studies are: (i.) a general evaluation of the difficulty of the task of recovering temporal relations; (ii.) information on the level of granularity of temporal relations; (iii.) a saliency-based order of application of the linguistic devices used to express the temporal relations between two eventualities; (iv.) the proposal of tense temporal polysemy, as a device to identify the set of preferences which can assign unique values to possibly multiple temporal relations. On the basis of the empirical data, we propose to enlarge the set of classical fine-grained interval relations (Allen, 1983) by including also coarse-grained temporal relations (Freksa, 1992). Moreover, there could be cases in which we are not able to state in a reliable way if there exists a temporal relation or what the particular relation between two entities is. To overcome this issue we have adopted the proposal by Mani (2007) which allows the system to have differentiated levels of temporal representation on the basis of the temporal granularity associated with each discourse segment. The lack of an annotated corpus for eventualities, temporal expressions and temporal relations in Italian represents the biggest shortcoming of this work, which has prevented the implementation of the model and its evaluation. Nevertheless, we have been able to conduct a series of experiments for the validation of procedures for the further realization of a working prototype.
In addition to this, we have been able to implement and validate a working prototype for the spotting of temporal expressions in texts/discourses

    Arabic and contact-induced change

    This volume offers a synthesis of current expertise on contact-induced change in Arabic and its neighbours, with thirty chapters written by many of the leading experts on this topic. Its purpose is to showcase the current state of knowledge regarding the diverse outcomes of contacts between Arabic and other languages, in a format that is both accessible and useful to Arabists, historical linguists, and students of language contact

    The effect of printed word attributes on Arabic reading

    Printed Arabic texts usually contain no short vowels and therefore a single letter string can often be associated with two or more distinct pronunciations and meanings. The high level of homography is believed to present difficulties for the skilled reader. However, this is the first study to gather empirical evidence on what readers know about the different words that can be associated with each homograph. There are few studies of the effects of psycholinguistic variables on Arabic word naming and lexical decision. The present work therefore involved the creation of a database of 1,474 unvowelised letter strings, which was used to undertake four studies. The first study presented lists of unvowelised letter strings and asked participants to produce the one or more word forms (with short vowels) evoked by each target. Responses to 1,474 items were recorded from 445 adult speakers of Arabic. The number of different vowelised forms associated with each letter string and the percentage agreement were calculated. The second study collected subjective Age-of-Acquisition ratings from 89 different participants for the agreed vowelised form of each letter string. The third study asked 38 participants to produce pronunciation responses to 1,474 letter strings. Finally, 40 different participants were asked to produce lexical decisions to 1,352 letter strings and 1,352 matched non-word letter strings. Mixed-effects models showed that orthographic frequency, Age-of-Acquisition and name agreement influenced word naming, while lexical decision was not affected by name agreement. Findings indicate that lexical decision in Arabic requires recognition of a basic shared morphemic structure, whereas word naming requires identification of a unique phonological representation. It takes longer to name a word when there are more possible pronunciations. The Age-of-Acquisition effect is consistent with a developmental theory of reading