2,859 research outputs found

    Automatic multi-label subject indexing in a multilingual environment

    Get PDF
    This paper presents an approach to automatically subject index fulltext documents with multiple labels based on binary support vector machines(SVM). The aim was to test the applicability of SVMs with a real world dataset. We have also explored the feasibility of incorporating multilingual background knowledge, as represented in thesauri or ontologies, into our text document representation for indexing purposes. The test set for our evaluations has been compiled from an extensive document base maintained by the Food and Agriculture Organization (FAO) of the United Nations (UN). Empirical results show that SVMs are a good method for automatic multi- label classification of documents in multiple languages

    Cross-Lingual Adaptation using Structural Correspondence Learning

    Full text link
    Cross-lingual adaptation, a special case of domain adaptation, refers to the transfer of classification knowledge between two languages. In this article we describe an extension of Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation, for cross-lingual adaptation. The proposed method uses unlabeled documents from both languages, along with a word translation oracle, to induce cross-lingual feature correspondences. From these correspondences a cross-lingual representation is created that enables the transfer of classification knowledge from the source to the target language. The main advantages of this approach over other approaches are its resource efficiency and task specificity. We conduct experiments in the area of cross-language topic and sentiment classification involving English as source language and German, French, and Japanese as target languages. The results show a significant improvement of the proposed method over a machine translation baseline, reducing the relative error due to cross-lingual adaptation by an average of 30% (topic classification) and 59% (sentiment classification). We further report on empirical analyses that reveal insights into the use of unlabeled data, the sensitivity with respect to important hyperparameters, and the nature of the induced cross-lingual correspondences

    Classification of Under-Resourced Language Documents Using English Ontology

    Get PDF
    Automatic documents classification is an important task due to the rapid growth of the number of electronic documents, which aims automatically assign the document to a predefined category based on its contents. The use of automatic document classification has been plays an important role in information extraction, summarization, text retrieval, question answering, e-mail spam detection, web page content filtering, automatic message routing , etc.Most existing methods and techniques in the field of document classification are keyword based, but due to lack of semantic consideration of this technique, it incurs low performance. In contrast, documents also be classified by taking their semantics using ontology as a knowledge base for classification; however, it is very challenging of building ontology with under-resourced language. Hence, this approach is only limited to resourced language (i.e. English) support. As a result, under-resourced language written documents are not benefited such ontology based classification approach. This paper describes the design of automatic document classification of under-resourced language written documents. In this work, we propose an approach that performs classification of under-resourced language written documents on top of English ontology. We used a bilingual dictionary with Part of Speech feature for word-by-word text translation to enable the classification of document without any language barrier. The design has a concept-mapping component, which uses lexical and semantic features to map the translated sense along the ontology concepts. Beside this, the design also has a categorization component, which determines a category of a given document based on weight of mapped concept. To evaluate the performance of the proposed approach 20-test documents for Amharic and Tigrinya and 15-test document for Afaan Oromo in each news category used. In order to observe the effect of incorporated features (i.e. lemma based index term selection, pre-processing strategies during concept mapping, lexical and semantics based concept mapping) five experimental techniques conducted. The experimental result indicated that the proposed approach with incorporation of all features and components achieved an average F-measure of 92.37%, 86.07% and 88.12% for Amharic, Afaan Oromo and Tigrinya documents respectively. Keywords: under-resourced language, Multilingual, Documents or text Classification, knowledge base, Ontology based text categorization, multilingual text classification, Ontology. DOI: 10.7176/CEIS/10-6-02 Publication date:July 31st 201
    • …
    corecore