69 research outputs found

    Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text

    Parallel corpora are a valuable resource for machine translation, but at present their availability and utility are limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all but the most dominant of the world's languages. A parallel corpus resource not yet explored is the World Wide Web, which hosts an abundance of pages in parallel translation, offering a potential solution to some of these problems and unique opportunities of its own. This paper presents the necessary first step in that exploration: a method for automatically finding parallel translated documents on the Web. The technique is conceptually simple, fully language independent, and scalable, and preliminary evaluation results indicate that the method may be accurate enough to apply without human intervention. Comment: LaTeX2e, 11 pages, 7 eps figures; uses psfig, llncs.cls, theapa.sty. An Appendix at http://umiacs.umd.edu/~resnik/amta98/amta98_appendix.html contains test dat

    A Web browser and editor

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996. Includes bibliographical references (leaves 60-61). by Jason A. Wilson. M.Eng

    An Information Extraction Approach to Reorganizing and Summarizing Specifications

    Materials and Process Specifications are complex semi-structured documents containing numeric data, text, and images. This article describes a coarse-grain extraction technique to automatically reorganize and summarize spec content. Specifically, a strategy for semantic-markup, to capture content within a semantic ontology, relevant to semi-automatic extraction, has been developed and experimented with. The working prototypes were built in the context of Cohesia's existing software infrastructure, and use techniques from Information Extraction, XML technology, etc

    A Rules Based System for Named Entity Recognition in Modern Standard Arabic

    The amount of textual information available electronically has made it difficult for many users to find and access the right information within acceptable time. Research communities in the natural language processing (NLP) field are developing tools and techniques to alleviate these problems and help users in exploiting these vast resources. These techniques include Information Retrieval (IR) and Information Extraction (IE). The work described in this thesis concerns IE and more specifically, named entity extraction in Arabic. The Arabic language is of significant interest to the NLP community mainly due to its political and economic significance, but also due to its interesting characteristics. Text usually contains all kinds of names such as person names, company names, city and country names, sports teams, chemicals and lots of other names from specific domains. These names are called Named Entities (NE) and Named Entity Recognition (NER), one of the main tasks of IE systems, seeks to locate and classify automatically these names into predefined categories. NER systems are developed for different applications and can be beneficial to other information management technologies as it can be built over an IR system or can be used as the base module of a Data Mining application. In this thesis we propose an efficient and effective framework for extracting Arabic NEs from text using a rule based approach. Our approach makes use of Arabic contextual and morphological information to extract named entities. The context is represented by means of words that are used as clues for each named entity type. Morphological information is used to detect the part of speech of each word given to the morphological analyzer. Subsequently we developed and implemented our rules in order to recognise each position of the named entity. Finally, our system implementation, evaluation metrics and experimental results are presented