
    From Frequency to Meaning: Vector Space Models of Semantics

    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
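
    As an illustration of the term-document class of matrix described above, here is a minimal sketch in Python using scikit-learn; the three-document corpus is invented for illustration and does not come from the survey:

    ```python
    # Term-document VSM sketch: documents become TF-IDF-weighted term vectors,
    # and semantic relatedness is measured as cosine similarity between them.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "computers process natural language text",
        "vector space models capture word meaning",
        "semantic similarity between documents",
    ]

    vectorizer = TfidfVectorizer()
    term_doc = vectorizer.fit_transform(docs)   # shape: (n_docs, n_terms)

    # Cosine similarity between the first two document vectors.
    print(cosine_similarity(term_doc[0], term_doc[1]))
    ```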

    A stemming algorithm for Latvian

    The thesis covers the construction, application, and evaluation of a stemming algorithm for advanced information searching and retrieval in Latvian databases. Its aim is to examine two questions: Is it possible to apply to Latvian a suffix removal algorithm originally designed for English? Can stemming in Latvian produce the same or better information retrieval results than manual truncation? To this end, the role and importance of automatic word conflation for both document indexing and information retrieval are characterised. A review of the literature, which analyzes and evaluates different types of stemming techniques and the historical development of stemming algorithms, justifies the necessity of applying this advanced IR method to Latvian as well. Comparative analysis of the morphological structure of English and Latvian determined the selection of Porter's suffix removal algorithm as the basis for the Latvian stemmer. An extensive list of Latvian stopwords, including conjunctions, particles, and adverbs, was designed and added to the initial stemmer in order to eliminate insignificant words from further processing. A number of modifications specific to Latvian were made to the structure and rules of the original stemming algorithm. Analysis of word stemming based on a Latvian electronic dictionary and Latvian text fragments confirmed that the suffix removal technique can be successfully applied to Latvian as well. An evaluation study of user search statements revealed that the stemming algorithm can, to a certain extent, improve the effectiveness of information retrieval.
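
    To make the technique concrete, here is a toy Porter-style suffix-removal step in Python; the suffix list and minimum stem length are invented for illustration and are not the actual rules of the Latvian stemmer described in the thesis:

    ```python
    # Toy Porter-style suffix stripping: try the longest matching suffix first,
    # and only strip it if a sufficiently long stem remains.
    SUFFIXES = ["ajam", "iem", "am", "as", "us", "a", "s"]  # hypothetical list

    def strip_suffix(word: str, min_stem: int = 3) -> str:
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                return word[:-len(suf)]
        return word
    ```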

    Persian Text Classification using naive Bayes algorithms and Support Vector Machine algorithm

    Automatically assigning documents to predefined categories is one of the primary steps toward knowledge extraction from raw textual data. In such tasks, words are treated as a set of features. Because traditional feature selection methods produce high-dimensional, sparse feature vectors, many proposed text classification methods lack performance and accuracy. Many algorithms have been applied to the problem of automatic text categorization, drawing on methods from information extraction, natural language processing, and machine learning. This paper proposes an innovative approach to improve the classification performance of Persian text. Naive Bayes classifiers, which are widely used for text classification in machine learning, are based on conditional probability. We compare the Gaussian, Multinomial, and Bernoulli variants of naive Bayes with an SVM. For statistical text representation, TF, TF-IDF, and character-level 3-grams [6,9] were used. Finally, we report experimental results on 10 newsgroups.
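
    The comparison described above can be sketched as follows in Python with scikit-learn; the documents and labels are placeholders, and GaussianNB (also compared in the paper) is omitted because it requires dense feature arrays:

    ```python
    # Naive Bayes variants vs. a linear SVM over TF-IDF character 3-grams.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["sample document one", "sample document two"]  # Persian texts here
    labels = [0, 1]

    for clf in (MultinomialNB(), BernoulliNB(), LinearSVC()):
        model = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),  # char 3-grams
            clf,
        )
        model.fit(texts, labels)
        print(type(clf).__name__, model.score(texts, labels))
    ```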

    First large-scale information retrieval experiments on Turkish texts

    We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, created for this study, contains 95.5 million words and 408,305 documents, has 72 ad hoc queries, and occupies about 800 MB. All documents come from the Turkish newspaper Milliyet. We implement and apply stemmers ranging from simple to sophisticated along with various query-document matching functions, and show that truncating words at a prefix length of 5 creates an effective retrieval environment for Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
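
    The prefix-truncation stemmer the study reports as effective is trivial to state in code; the example words are illustrative:

    ```python
    # "First five" truncation: index and match every word by its first 5 characters.
    def truncate(word: str, prefix_len: int = 5) -> str:
        return word[:prefix_len]

    print([truncate(w) for w in ["bilgisayarlar", "kitaplar", "ev"]])
    # ['bilgi', 'kitap', 'ev']
    ```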

    Word Stemming with N-Grams and a Lexeme-Based Approach for Korean-Language Text

    Korean is an agglutinative language with unique and varied types of morpheme attachment, which makes applying word-stemming techniques rather difficult. Several studies have been carried out, but errors still occur because of the distinctive character of Korean words. This study presents a new technique for word stemming, i.e. finding the base word together with detecting its affixes. It aims to derive the base form of affixed Korean verbs and to identify the type and meaning of the affixes attached to them. The approach combines the N-gram and Lexeme Based methods: verbs inflected under particular grammatical patterns are segmented to yield the corresponding base word and affixes. Segmentation of affixed words is performed with the N-gram method, followed by the Lexeme Based method to find the base word as well as the type and meaning of the affix. The result of this study is the derivation of base words and affixes together with the affix type and meaning. Keywords: word stemming, Korean, N-gram, Lexeme Based, agglutinative
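
    A minimal sketch of the combined approach in Python: candidate split points are scanned as in the N-gram step, and a split is accepted when its left part is a known stem and its right part a known suffix, as in the Lexeme Based step. The two dictionaries are toy stand-ins, not the paper's resources:

    ```python
    # Toy Korean verb splitter: n-gram-style split scanning + lexicon lookup.
    LEXEMES = {"먹", "가"}                           # verb stems: eat, go
    SUFFIXES = {"었다": "past tense", "는다": "present tense"}

    def split_verb(word: str):
        for i in range(1, len(word)):               # every candidate split point
            stem, suffix = word[:i], word[i:]
            if stem in LEXEMES and suffix in SUFFIXES:
                return stem, suffix, SUFFIXES[suffix]
        return word, None, None

    print(split_verb("먹었다"))                      # ('먹', '었다', 'past tense')
    ```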

    Combining Multiple Strategies for Effective Monolingual and Cross-Language Retrieval

    A case study in decompounding for Bengali information retrieval

    Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, and Finnish. No previous studies, however, exist on the effect of decomposing compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. Some unique characteristics of Bengali compounding are: i) only one constituent may be a valid word, in contrast to the stricter requirement that both be so; and ii) the first character of the right constituent can be modified by the rules of sandhi, in contrast to simple concatenation. While the standard approach of decompounding based on maximizing the total frequency of the constituents formed by candidate split positions has proven beneficial for European languages, the experiments reported in this paper show that this standard approach does not work particularly well for Bengali IR. As a solution, we firstly propose a more relaxed decompounding in which a compound word can be decomposed into only one constituent if the other constituent is not a valid word, and secondly we perform selective decompounding by employing a co-occurrence threshold to ensure that a constituent often co-occurs with the compound word, which in this case indicates how related the constituents are to the compound. We perform experiments on Bengali ad hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition, improving MAP by up to 2.72% and recall by up to 1.8%.
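
    The strategy can be sketched as follows in Python; the frequencies, co-occurrence rates, and threshold are invented for illustration and do not reproduce the paper's statistics:

    ```python
    # Frequency-maximizing decompounding with the relaxed one-valid-constituent
    # rule and a co-occurrence threshold for selective decompounding.
    FREQ = {"রাজ": 120, "পথ": 300}                  # toy corpus frequencies
    COOCCUR = {("রাজপথ", "পথ"): 0.6}                # toy co-occurrence rates

    def decompound(word, min_cooccur=0.3):
        best = None
        for i in range(1, len(word)):
            left, right = word[:i], word[i:]
            if left in FREQ or right in FREQ:       # relaxed validity rule
                score = FREQ.get(left, 0) + FREQ.get(right, 0)
                if best is None or score > best[0]:
                    best = (score, left, right)
        if best is None:
            return [word]
        # Keep only constituents that co-occur with the compound often enough.
        keep = [c for c in best[1:] if COOCCUR.get((word, c), 0) >= min_cooccur]
        return keep + [word] if keep else [word]

    print(decompound("রাজপথ"))                      # ['পথ', 'রাজপথ']
    ```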

    Information retrieval on turkish texts

    In this study, we investigate information retrieval (IR) on Turkish texts using a large-scale test collection that contains 408,305 documents and 72 ad hoc queries. We examine the effects of several stemming options and query-document matching functions on retrieval performance. We show that a simple word-truncation approach, a word-truncation approach that uses language-dependent corpus statistics, and an elaborate lemmatizer-based stemmer provide similar retrieval effectiveness in Turkish IR. We also investigate the effects of a range of search conditions on retrieval performance, including scalability issues, query- and document-length effects, and the use of a stop-word list in indexing.
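
    One of the indexing conditions examined above, the use of a stop-word list, reduces to a simple filter; the three stop-words shown are illustrative, not the study's Turkish list:

    ```python
    # Drop stop-words before adding terms to the index.
    STOPWORDS = {"ve", "bir", "bu"}   # "and", "a/one", "this"

    def index_terms(text: str):
        return [t for t in text.lower().split() if t not in STOPWORDS]

    print(index_terms("Bu bir deneme ve test"))   # ['deneme', 'test']
    ```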