
    Investigating sentence weighting components for automatic summarisation

    The work described here initially formed part of a triangulation exercise to establish the effectiveness of the Query Term Order (QTO) algorithm. The methodology it produced subsequently proved to be a reliable indicator of quality for summarising English web documents. We used the human summaries from the Document Understanding Conference data and generated queries automatically for testing the QTO algorithm. Six sentence weighting schemes that made use of Query Term Frequency and QTO were constructed to produce system summaries, and this paper explains the process of combining and balancing the weighting components. We also examined the five automatically generated query terms in their different permutations to check whether the automatic generation of query terms introduced bias. The summaries produced were evaluated with the ROUGE-1 metric, and the results showed that using QTO in a weighting combination gave the best performance. We also found that a combination of several weighting components always outperformed any single weighting component.
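
    The abstract does not give the QTO formula or the balancing coefficients, so the following is only a rough sketch of what a combined sentence score of this kind could look like; the qto_score stand-in (longest run of query terms kept in query order) and the alpha/beta weights are assumptions, not the paper's method.

        # Hedged sketch: combining sentence weighting components for
        # extractive summarisation. query_terms is an ordered list of terms.
        def qtf_score(sentence_tokens, query_terms):
            """Query Term Frequency: occurrences of query terms in the sentence."""
            return sum(1 for tok in sentence_tokens if tok in query_terms)

        def qto_score(sentence_tokens, query_terms):
            """Stand-in order component: longest run of query terms appearing
            in the same relative order as in the query (not the paper's QTO)."""
            positions = [query_terms.index(tok) for tok in sentence_tokens
                         if tok in query_terms]
            if not positions:
                return 0
            best = run = 1
            for prev, cur in zip(positions, positions[1:]):
                run = run + 1 if cur > prev else 1
                best = max(best, run)
            return best

        def sentence_score(sentence_tokens, query_terms, alpha=0.5, beta=0.5):
            # Linear combination; the paper combines and balances several
            # such components across six weighting schemes.
            return (alpha * qtf_score(sentence_tokens, query_terms)
                    + beta * qto_score(sentence_tokens, query_terms))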

    A probabilistic justification for using tf.idf term weighting in information retrieval

    This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but it is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing will be used in this paper to formulate a probabilistic justification for using tf.idf term weighting. The paper shows that the new probabilistic interpretation of tf.idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination-level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
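
    For reference, the classical tf.idf weight that the paper sets out to justify is commonly written as below; this is the textbook form, and the paper's exact variant is not given in the abstract:

        w(t,d) = \mathrm{tf}(t,d) \cdot \log \frac{N}{\mathrm{df}(t)}

    where tf(t,d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents in the collection.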

    Extending weighting models with a term quality measure

    Weighting models use lexical statistics, such as term frequencies, to derive term weights, which are used to estimate the relevance of a document to a query. Apart from the removal of stopwords, there is no other consideration of the quality of the words being ‘weighted’. It is often assumed that term frequency is a good indicator for deciding how relevant a document is to a query. Our intuition is that raw term frequency could be enhanced to better discriminate between terms. To do so, we propose using non-lexical features to predict the ‘quality’ of words before they are weighted for retrieval. Specifically, we show how parts of speech (e.g. nouns, verbs) can help estimate how informative a word generally is, regardless of its relevance to a particular query/document. Experimental results with two standard TREC collections show that integrating the proposed term quality into two established weighting models consistently enhances retrieval performance over a baseline that uses the original weighting models.
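
    As a minimal sketch of the idea (the quality values below are invented for illustration; the paper estimates informativeness from non-lexical features rather than from a fixed lookup table), a part-of-speech-based quality prior could scale the weight that an existing model assigns to a term:

        # Hedged sketch: scale a term's weight from an existing weighting
        # model (e.g. BM25) by an assumed part-of-speech 'quality' prior.
        POS_QUALITY = {"NOUN": 1.0, "PROPN": 1.0, "VERB": 0.8,
                       "ADJ": 0.7, "ADV": 0.5}  # illustrative values only

        def quality_adjusted_weight(base_weight, pos_tag, default=0.3):
            """base_weight: score from the original weighting model;
            pos_tag: the term's part-of-speech tag."""
            return base_weight * POS_QUALITY.get(pos_tag, default)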

    Performance Analysis of Term Weighting with the Supervised Term Weighting Method for Text Categorization

    Text categorization (also known as text classification) is the task of automatically sorting a collection of documents D into a set of predefined categories C, i.e. a function Φ : D × C. One of the steps in text categorization is text preprocessing, which includes feature selection and term weighting.

    One well-known term weighting method is TF-IDF, in which the frequency of each term/word within a document (term frequency) is combined with the frequency with which the term occurs across the document collection (inverse document frequency). A term that appears often in a document but rarely in the rest of the collection receives a high weight: TF-IDF increases with the number of occurrences of a term in a document and decreases with the number of documents in the collection in which the term appears.

    Since text categorization is supervised, with the dataset divided into a training set and a testing set, a weighting method that exploits this supervision is needed. In standard IR contexts the IDF assumption is reasonable, since it encodes the quite plausible intuition that a term that occurs in too many documents is not a good discriminator, i.e. when it occurs in a query q it is not sufficiently helpful in discriminating the documents relevant to q from the irrelevant ones. However, if training data for the query were available (i.e. documents whose relevance or irrelevance to q is known), an even stronger intuition should be brought to bear: the best discriminators are the terms that are distributed most differently in the sets of positive and negative training examples. Training data is not available for queries in standard IR contexts, but it is usually available for categories in TC contexts, where the notion of "relevance to a query" is replaced by the notion of "membership in a category". In these contexts, category-based term evaluation functions such as Chi-square, Information Gain (IG), and Gain Ratio (GR), which score terms according to how differently they are distributed in the sets of positive and negative training examples, are better substitutes for IDF-like functions. This approach is called Supervised Term Weighting, and it is the method used in this thesis.

    The method combines the term evaluation function value of each selected term with the term's frequency in each document; classification is then performed with SVM and evaluated in terms of precision, recall, F-measure, and accuracy. The results show that supervised term weighting gives better performance than TF-IDF, especially under the local threshold policy.

    Keywords: text categorization, term frequency, inverse document frequency, term weighting, supervised term weighting.
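
    A minimal sketch of the TF × Chi-square variant with scikit-learn follows (an assumption: the thesis does not name its toolkit, and Information Gain and Gain Ratio are not built into scikit-learn, so Chi-square stands in for all three category-based functions here):

        # Supervised term weighting: replace IDF with a category-based score
        # (chi-square) and classify with a linear SVM.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import chi2
        from sklearn.svm import LinearSVC

        def fit_tf_chi2_svm(train_docs, train_labels):
            vectorizer = CountVectorizer()              # raw term frequencies (TF)
            X_tf = vectorizer.fit_transform(train_docs)
            chi2_scores, _ = chi2(X_tf, train_labels)   # per-term chi-square score
            X_weighted = X_tf.multiply(chi2_scores).tocsr()  # TF x chi2 in place of TF x IDF
            clf = LinearSVC()
            clf.fit(X_weighted, train_labels)
            return vectorizer, chi2_scores, clf

    At test time, documents would be vectorised with the same vectorizer and multiplied by the stored chi2_scores before prediction.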

    Combining and selecting characteristics of information use

    In this paper we report on a series of experiments designed to investigate the combination of term and document weighting functions in Information Retrieval. We describe a series of weighting functions, each of which is based on how information is used within documents and collections, and use these weighting functions in two types of experiments: one based on combination of evidence for ad-hoc retrieval, the other based on selective combination of evidence within a relevance feedback situation. We discuss the difficulties involved in predicting good combinations of evidence for ad-hoc retrieval, and suggest the factors that may lead to the success or failure of combination. We also demonstrate how, in a relevance feedback situation, the relevance assessments can provide a good indication of how evidence should be selected for query term weighting. The use of relevance information to guide the combination process is shown to reduce the variability inherent in combination of evidence.
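
    As a schematic illustration only (the paper's actual weighting functions and its selection mechanism are not given in the abstract), combination of evidence can be pictured as a weighted sum of component scoring functions:

        # Hedged sketch: combine several term/document weighting functions
        # by a weighted sum; components and mixing weights are placeholders.
        def combined_score(doc, query, components, mixing_weights):
            """components: scoring functions f(doc, query) -> float;
            mixing_weights: one mixing weight per component."""
            return sum(w * f(doc, query)
                       for f, w in zip(components, mixing_weights))

    In the relevance feedback setting the paper describes, selecting evidence from relevance assessments would amount to choosing the mixing_weights from feedback data.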

    Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification

    We provide a simple but novel supervised weighting scheme for adjusting term frequency in tf-idf for sentiment analysis and text classification. We compare our method to baseline weighting schemes and find that it outperforms them on multiple benchmarks. The method is robust and works well on both snippets and longer documents.