702 research outputs found

    Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 26-33. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206

    JTEC panel report on machine translation in Japan

    The goal of this report is to provide an overview of the state of the art of machine translation (MT) in Japan and to provide a comparison between Japanese and Western technology in this area. The term 'machine translation', as used here, includes both the science and technology required for automating the translation of text from one human language to another. Machine translation is viewed in Japan as an important strategic technology that is expected to play a key role in Japan's increasing participation in the world economy. MT is seen in Japan as important both for assimilating information into Japanese and for disseminating Japanese information throughout the world. Most of the MT systems now available in Japan are transfer-based systems. The majority of them exploit a case-frame representation of the source text as the basis of the transfer process. There is a gradual movement toward the use of deeper semantic representations, and some groups are beginning to look at interlingua-based systems.

    Effective techniques for Indonesian text retrieval

    The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information focuses on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to searching Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes fewer than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval. These techniques include stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of target documents. We also address the problem of automatic creation of parallel corpora (collections of documents that are direct translations of each other), which are essential for cross-lingual information retrieval tasks.
Well-curated parallel corpora are rare and, for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms do not make any prior assumptions about the documents, and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed. It can increase the separation value, a measure to discriminate good matches of parallel documents from bad matches, by approximately ten percentage points. We also investigate the applicability of our identification algorithms for other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated. Our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
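
The abstract above names the Needleman-Wunsch algorithm as the basis for identifying parallel documents. As a rough illustration only, the following Python sketch computes a Needleman-Wunsch global alignment score over two token sequences. The function name, the scoring values (match +1, mismatch -1, gap -1), and the use of whitespace-split words as alignment units are all assumptions made for this example; they are not taken from the thesis, whose actual scoring scheme and separation measure are not described here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions, not values
    reported in the thesis.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty sequence costs one gap per item.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] aligned to a gap
                score[i][j - 1] + gap,       # b[j-1] aligned to a gap
            )
    return score[-1][-1]


if __name__ == "__main__":
    # Hypothetical untranslated token sequences: shared names and numbers
    # are the only tokens that can match across the two languages.
    indonesian = "Presiden Jakarta 2005 bertemu".split()
    english = "President Jakarta 2005 met".split()
    print(needleman_wunsch(indonesian, english))
```

Under these assumptions, a higher score suggests that two documents share more content in the same order, which is one plausible way to rank candidate parallel pairs without translating either document.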

    Improved cross-language information retrieval via disambiguation and vocabulary discovery

    Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from segmentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good-quality translation resources, especially bilingual dictionaries, are valuable for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information.
We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high-quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval, but also have wider applications than CLIR.

    Student's Understanding Of English Expletives Words and Phrases

    Expletives are words or phrases that do not add any structural or grammatical meaning to the sentence. These words and phrases are often referred to as empty words, meaningless phrases, or redundant pairs because they do not add any information to the sentence. The aim of this study is to observe students' understanding of English expletive words and phrases. This is descriptive qualitative research. The researcher uses an observation strategy along with a worksheet of English expletive words and phrases given to third-semester students in the pre-advanced structure subject. A questionnaire is also given to the students to gauge their understanding of English expletives. The results of the study show that 'redundant pairs', such as past history and future plans, are more readily understood by students than the other forms of expletive words and phrases (empty words and meaningless phrases). Students' first-language acquisition is regarded as one of the factors that influence their understanding of English expletive words and phrases.

    Papers in Austronesian linguistics No. 1

    From Comparison to Collaboration: Experiments with a New Scholarly and Political Form

    Society and the workplace are two factors that are important for the individual's health status. It is important that individuals have the right skills to take care of their health. For organizations, it is important to strive for the welfare of their employees. This has proven to have a positive impact on work performance, reduced absenteeism and reduced costs for rehabilitation. In 2007, the local authorities in Umeå implemented a wellness offering for all employees working in the municipality administration. They later saw a need to assist employees who needed help getting started with new exercise habits. This study aims to examine how the participants in the "Get Started Programme" succeeded in creating lasting exercise habits 3-4 years after completing the programme. The research questions are: How have the participants increased their knowledge practically and theoretically after the programme has finished? How have the participants succeeded in creating the content of the programme in their daily lives? How do the participants assess their health compared to before they participated in the programme? Are there any beneficial factors highlighted by the participants during the programme? The study was conducted on the basis of semi-structured interviews with eight voluntary participants who previously participated in the Get Started Programme. The results show that six of the eight participants succeeded in getting started with the goals for behavioral change, and still maintain a sufficient physical activity level today. Participants who do not consider themselves to have succeeded in reaching the goals they set at the beginning of the programme point out that they have the tools needed to continue the behavioral change they strive for.

    Perspectives on information structure in Austronesian languages

    Information structure is a relatively new field to linguistics and has only recently been studied for smaller and less described languages. This book is the first of its kind that brings together contributions on information structure in Austronesian languages. Current approaches from formal semantics, discourse studies, and intonational phonology are brought together with language specific and cross-linguistic expertise of Austronesian languages. The 13 chapters in this volume cover all subgroups of the large Austronesian family, including Formosan, Central Malayo-Polynesian, South Halmahera-West New Guinea, and Oceanic. The major focus, though, lies on Western Malayo-Polynesian languages. Some chapters investigate two of the largest languages in the region (Tagalog and different varieties of Malay), others study information-structural phenomena in small, underdescribed languages. The three overarching topics that are covered in this book are NP marking and reference tracking devices, syntactic structures and information-structural categories, and the interaction of information structure and prosody. Various data types build the basis for the different studies compiled in this book. Some chapters investigate written texts, such as modern novels (cf. Djenar’s chapter on modern, standard Indonesian), or compare different text genres, such as, for example, oral narratives and translations of biblical narratives (cf. De Busser’s chapter on Bunun). Most contributions, however, study natural spoken speech and make use of spoken corpora which have been compiled by the authors themselves. The volume comprises a number of different methods and theoretical frameworks. Two chapters make use of the Question Under Discussion approach, developed in formal semantics (cf. the chapters by Latrouite & Riester; Shiohara & Riester). Riesberg et al. apply the recently developed method of Rapid Prosody Transcription (RPT) to investigate native speakers’ perception of prosodic prominences and boundaries in Papuan Malay. Other papers discuss theoretical consequences of their findings. Thus, for example, Himmelmann takes apart the most widespread framework for intonational phonology (ToBI) and argues that the analysis of Indonesian languages requires much simpler assumptions than the ones underlying the standard model. Arka & Sedeng ask the question how fine-grained information structure space should be conceptualized and modelled, e.g. in LFG. Schnell argues that elements that could be analysed as “topic” and “focus” categories, should better be described in terms of ‘packaging’ and do not necessarily reflect any pragmatic roles in the first place.