2,704 research outputs found
Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Ressources MLR2004
In an ever-expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building harmonised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, word-nets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 1990s, most efforts in scaling up these resources have remained the responsibility of local authorities, usually with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop and to other workshops and conferences dedicated to similar topics shows that dealing with multilingual linguistic resources has become a very hot problem in the Natural Language Processing community.
To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange, compare their approaches and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the Coling 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants
TectoMT – a deep-linguistic core of the combined Chimera MT system
Chimera is a machine translation system that combines the TectoMT deep-linguistic core with the phrase-based MT system Moses. For the English–Czech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platforms and are open source (available from the Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu)
Modeling information structure in a cross-linguistic perspective
This study makes substantial contributions to both the theoretical and computational treatment of information structure, with a specific focus on creating natural language processing applications such as multilingual machine translation systems. The present study first provides cross-linguistic findings with regard to information structure meanings and markings. Building upon such findings, the current model represents information structure within the HPSG/MRS framework using Individual Constraints. The primary goal of the present study is to create a multilingual grammar model of information structure for the LinGO Grammar Matrix system. The present study explores the construction of a grammar library for creating customized grammars incorporating information structure and illustrates how the information structure-based model improves the performance of transfer-based machine translation
Getting Past the Language Gap: Innovations in Machine Translation
In this chapter, we review state-of-the-art machine translation systems and discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge, though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. This also augurs well for the mobile environment, as we look forward to improved technologies, namely speech recognition and speech synthesis, that enable speech-to-speech machine translation on hand-held devices. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT
Feature-based Transfer of Multilingual Sentence Representations to Cross-lingual Tasks
Universal sentence representations and multilingual language modelling are hot topics in language technology, specifically in the area concerned with natural language understanding. A sentence embedding is a numerical representation of a sequence of words corresponding to a whole phrase or sentence, specifically as the output of an encoder in machine learning. These representations are needed for automatic language technology tasks that require understanding the meaning of a whole sentence, as opposed to combinations of the meanings of individual words. Such tasks include, for example, natural language inference (whether a pair of sentences is logically related) and sentiment analysis. Universality refers to encoded meaning that is general enough to benefit other related tasks, such as classification. There is a call for clearer consensus on the strategies used to assess the quality of these embeddings, either by directly examining their linguistic properties or by using them as independent variables (features) in related models.
Because it is costly to create high-quality resources and maintain sophisticated systems for all the languages used in the world, there is also great interest in scaling modern systems up to low-resource languages. The idea behind this is the so-called transfer of knowledge, not only between different tasks but also between different languages. Although the need for cross-lingual transfer methods is recognised in the research community, evaluation tools and benchmarks are still at an early stage.
SentEval is an existing tool for evaluating sentence embeddings, with a special emphasis on their universality. The aim of this thesis project is to attempt to extend this tool to support simultaneous evaluation on new tasks that cover several different languages. The evaluation approach builds on the strategy of letting encoded sentences serve as features in so-called downstream tasks and observing whether the results improve. A modern multilingual model based on the so-called transformer architecture is evaluated on an established inference task as well as a new emotion detection task, both of which comprise data in a variety of languages. Although the practical implementation largely remained experimental, some tentative results are reported in this thesis
Martial arts fiction : translational migrations east and west
This thesis was motivated by Robert Chard's puzzlement over the translational phenomenon of martial arts fiction in the West. It proposes to address how the translational migration of martial arts fiction took place, first to other Asian countries in the 1920s, but to the West only after a lapse of a few decades, beginning in the early 1990s. Adopting a descriptive approach as described by Gideon Toury, the thesis is intended to add further to the limited inventory of case studies in urgent demand to test the polysystem theory propounded by Even-Zohar.
The thesis is made up of two parts. Part I is a macro-level study of martial arts fiction, intended to contribute to testing the limits of the polysystem theory. After examining Chinese fiction as a low form in the Chinese literary polysystem and its weak function as translated literature in the Western literary polysystem, the study explores the translational phenomenon of martial arts fiction in the West as well as the concurrent question of why so little martial arts fiction has been translated into Western languages, compared to the copious amount into other Asian languages, to the extent of stimulating a new literary genre or (re)writing martial arts fiction in indigenous languages in Indonesia, Vietnam and Korea, sinicized countries or countries boasting large overseas Chinese communities. Issues and problems related to these translational activities and cultural phenomena are presented as tools to test the limits of the polysystem theory.
Part II is a micro-level study focussing on the specifics of rendering Fox Volant of the Snowy Mountain by Jin Yong into English. I will argue, in the main, that many difficulties, inherent in both the translating and reading processes, can be constructed within the theoretical framework of Andre Lefevere's concept of "constraint", particularly that of the universe of discourse. Lefevere's connotation of the universe of discourse will be expanded to embrace different cultural presuppositions and literary assumptions underlying two divergent world cultures, hence different reader expectations in the reading process.
It is hoped that the findings and results of this descriptive case history of martial arts fiction as a literary genre in translational migrations will contribute to the accumulation of knowledge
Cross-generational linguistic variation in the Canberra Vietnamese heritage language community: A corpus-centred investigation
This dissertation investigates cross-generational linguistic differences in the Canberra Vietnamese bilingual community, with a particular focus on Vietnamese as the heritage language. Specifically, it documents the vernacular and considers key aspects of this data from different theoretical perspectives. Its main contribution is an insight into a rarely studied heritage language variety in a contact community that has never been examined.
The dissertation consists of five core chapters, organised into two parts. In the first part (Chapters 2–3), I describe how I documented the vernacular and created the Canberra Vietnamese English Corpus (CanVEC), an original corpus compiled specifically for this study that is also the first to be freely available for research purposes. The corpus consists of over ten hours of spontaneous speech produced by 45 Vietnamese-English bilingual speakers across two generations living in Canberra. In the second part of the study (Chapters 4–6), I put the corpus to use and investigate aspects of the cross-generational differences in Vietnamese as the heritage language in this community.
In particular, I first probe the Vietnamese heritage language via its participation in the code-switching discourse (Chapter 4). In doing so, I focus on the applicability of the Matrix Language Framework (MLF) (Myers-Scotton, 1993, 2002) and its associated Matrix Language (ML) Turnover Hypothesis (Myers-Scotton, 1998) to the code-switching data in CanVEC. Since support for this prominent model has mainly come from language pairs that have different clausal word order or vastly different inventories of inflectional morphology, Vietnamese-English as a pair in which both languages are SVO and essentially isolating offers a tantalising testing ground for its application. Results show that the universal claims of this model do not hold so straightforwardly. CanVEC data challenges several assumptions of the MLF, with the model ultimately only being able to account for around half of the CanVEC code-switching data. I further demonstrate that even when the ML is putatively identifiable and a cross-generational ML "turnover" is quantitatively observed, the predictions do not reflect the direction of structural influence that we see in CanVEC. The MLF approach therefore sheds only limited light on cross-generational language shift and variation in this community.
Given that null elements emerge as a distinct area of difficulty in Chapter 4, I take this aspect as the focal point for the next part of the investigation (Chapter 5), where I use the variationist approach (Labov, 1972 et seq.) to explore three cases where null and overt realisation alternates in Vietnamese: subjects, objects, and copulas. In doing so, I move away from the bilingual portion of CanVEC to examine the monolingual heritage Vietnamese subset directly. Results show that Vietnamese null subjects vary significantly across generations, while null objects and copulas remain stable in terms of use. As speakers also overwhelmingly prefer overt forms over null forms (~70:30) across all three of the variables of interest, I next appeal to the generative interface-oriented approach (Sorace & Filiaci, 2006 et seq.) to examine the distribution of overt subjects, objects, and copulas (Chapter 6). These results converge with what was found for null forms: cross-generational effects were observed for pronominal subjects, but not pronominal objects and copulas. This finding also supports the importance of a distinction drawn in previous works between internal (syntax-semantics) and external (syntax-discourse/pragmatics) interface phenomena, with the latter being seemingly more susceptible to change.
Ultimately, this dissertation highlights the empirical and theoretical value of studying rarely considered contact varieties, while deploying an integrated approach that acknowledges the multi-faceted complexity of the contact communities where these varieties are spoken. Cambridge Trust International Scholarshi