1,697 research outputs found

    Text segmentation for analysing different languages

    Over the past several years, researchers have applied different methods of text segmentation. Text segmentation is defined as a method of splitting a document into smaller segments, each assumed to have its own relevant meaning. Those segments can be classified as tags, words, sentences, topics, phrases, or any other information unit. This study first reviews the different types of text segmentation methods used in different types of documents, and then discusses the various reasons for utilizing them in opinion mining. The main contribution of this study is a summarisation of research papers from the past 10 years that applied text segmentation as their main approach to text analysis. Results show that word segmentation was successfully and widely used for processing different languages.

    Text segmentation techniques: A critical review

    Text segmentation is widely used for processing text. It is a method of splitting a document into smaller parts, which are usually called segments. Each segment has its own relevant meaning. Those segments are categorized as words, sentences, topics, phrases, or any other information unit, depending on the task of the text analysis. This study presents various reasons for using text segmentation in different analysis approaches. We categorized the types of documents and languages used. The main contribution of this study is a summarization of 50 research papers and an illustration of the past decade (January 2007 to January 2017) of research that applied text segmentation as the main approach for analysing text. Results revealed the popularity of using text segmentation in different languages. In addition, the “word” seems to be the most practical and usable segment, as it is a smaller unit than the phrase, sentence, or line.

    Building Cendana: a Treebank for Informal Indonesian


    Malay Lexical Analysis Through Corpus-Based Approach.

    Due to the growth of electronic documents and the steady increase in the power and capacity of computers, we propose the use of Malay corpora to automatically extract the lexical information required by any Malay text processing system. That lexical information can be used to update machine-readable Malay dictionaries and to help lexicographers in their task of writing Malay dictionaries.

    Tagging narrator’s names in Hadith text

    No abstract. Keywords: tagging; hadith text; nam

    Identifying And Classifying Unknown Words In Malay Texts.

    In this paper, we propose a method based on a chain of filters to handle the problem of identifying and classifying unknown words in Malay texts. A word is identified as unknown when it is not listed in the lexicon.

    Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages

    Dravidian languages, such as Kannada and Tamil, are notoriously difficult for state-of-the-art neural models to translate. This stems from the fact that these languages are morphologically very rich as well as low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger dictionary sizes lead to higher translation quality.