21 research outputs found

    Human assessments of document similarity

    Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but the optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views of human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for the future development of document visualization systems.
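    To make the idea of an n-gram similarity score concrete, here is a minimal sketch. The abstract does not say which n-gram measure the studies used, so the character n-grams, the whitespace normalization, and the cosine measure below are all assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def ngram_similarity(doc_a, doc_b, n=3):
    """Cosine similarity between the n-gram count vectors of two documents.

    Assumption: cosine over raw counts; the studies may have used another measure.
    """
    a, b = char_ngrams(doc_a, n), char_ngrams(doc_b, n)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    # Documents shorter than n characters have no n-grams; treat them as dissimilar.
    return dot / norm if norm else 0.0
```

    Varying `n` in such a sketch is one way to probe the abstract's observation that the best n-gram length depends on the kind of text.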

    Probabilistic retrieval of OCR degraded text using N-grams

    Challenges in Short Text Classification: The Case of Online Auction Disclosure

    Text classification is an important research problem in many fields. We examine a special case of textual content, namely short text. Examples of short text appear in a number of contexts, such as online reviews, chat messages, and Twitter feeds. In this research, we examine short text for the purpose of classification in internet auctions. The “ask seller a question” forum of a large horizontal intermediary auction platform is used to conduct this research. We describe our approach to classification by examining various solution methods to the problem. The unsupervised K-Medoids clustering algorithm provides useful but limited insights into keyword extraction, while the supervised Naïve Bayes algorithm achieves around 65% classification accuracy on average. We then present a score-assigning approach that outperforms the other two methods. Finally, we discuss how our approach to short text classification can be used to analyse the effectiveness of internet auctions.

    Text Document Classification: An Approach Based on Indexing

    In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrence in a document. The term sequence is preserved with the help of a novel data structure called ‘Status Matrix’, and a corresponding classification technique is proposed for efficient classification of text documents. In addition, to avoid sequential matching during classification, we propose to index the terms in a B-tree, an efficient index scheme. Each term in the B-tree is associated with a list of class labels of those documents which contain the term. To corroborate the efficacy of the proposed representation and status-matrix-based classification, we have conducted extensive experiments on various datasets. Original Source URL: http://aircconline.com/ijdkp/V2N1/2112ijdkp04.pdf For more details: http://airccse.org/journal/ijdkp/vol2.htm
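    The term index described in this abstract can be sketched roughly as follows. This is not the paper's implementation: a sorted list searched with `bisect` stands in for the B-tree, the Status Matrix (term-sequence) part is left out entirely, and the simple vote over class labels is an assumed decision rule. The class and method names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


class TermIndex:
    """Ordered term index: each term maps to the class labels of documents containing it."""

    def __init__(self, labeled_docs):
        postings = {}
        for text, label in labeled_docs:
            # One label entry per document that contains the term.
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(label)
        # Sorted keys stand in for the ordered keys of a B-tree (assumption).
        self.terms = sorted(postings)
        self.labels = [postings[term] for term in self.terms]

    def lookup(self, term):
        """Binary-search the ordered term list; return the label list or an empty list."""
        i = bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.labels[i]
        return []

    def classify(self, text):
        """Assumed decision rule: every indexed term votes for its class labels."""
        votes = Counter()
        for term in text.lower().split():
            votes.update(self.lookup(term))
        return votes.most_common(1)[0][0] if votes else None
```

    Each query term costs one logarithmic lookup instead of a scan over all training documents, which is the point of indexing the terms.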

    A RE-UNIFICATION OF TWO COMPETING MODELS FOR DOCUMENT RETRIEVAL

    Two competing approaches for document retrieval were first identified by Robertson et al. (Robertson, Maron et al. 1982) for probabilistic retrieval. We point out the corresponding two competing approaches for the Vector Space Model. In both the probabilistic and Vector Space models, only one of the two competing approaches has received significant research attention, because of the unavailability of sufficient data to implement the second approach. Because it is now feasible to collect vast amounts of feedback data from users, both approaches are now possible. We therefore revisit the question of a unification of both approaches, for both probabilistic and Vector Space models. This unification of approaches differs from that originally proposed in (Robertson, Maron et al. 1982), and offers unique advantages. Preliminary results of a simulation experiment are reported, and an outline is provided of an ongoing field study. Information Systems Working Papers Serie

    Classificação automática de documentos usando subespaços aleatórios e conjuntos de classificadores (Automatic document classification using random subspaces and classifier ensembles)

    Nowadays, due to the large volume of text available in digital media, automatic document categorization has become an important modern Information Retrieval task. In this paper we describe a new approach to the problem, based on the classical vector space model for text treatment and on the use of Pattern Recognition techniques. As text collections produce very high-dimensional vector spaces, we attack the problem using several preprocessing techniques and a set of instance-based k-Nearest-Neighbors classifiers, each of them dedicated to a subspace of the original space. The final classification is obtained by combining the results of the individual classifiers. We apply our approach to documents extracted from the TIPSTER and REUTERS databases, which are widely used in the area. The main results and some conclusions and perspectives of the work are presented. Track: V - Workshop on Agents and Intelligent Systems. Red de Universidades con Carreras en Informática (RedUNCI)
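    As a rough illustration of an ensemble of k-Nearest-Neighbors classifiers working on subspaces, consider the sketch below. The abstract does not give the subspace sizes, the distance measure, or the combination rule, so uniform random feature sampling, squared Euclidean distance, and majority voting are assumptions. All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, query, features, k):
    """Majority label among the k nearest training vectors, compared on `features` only."""
    def dist(vec):
        # Squared Euclidean distance restricted to the chosen subspace.
        return sum((vec[f] - query[f]) ** 2 for f in features)

    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def random_subspace_knn(train, query, n_classifiers=5, subspace_size=2, k=1, seed=0):
    """Combine k-NN classifiers, each restricted to a randomly drawn feature subset.

    Assumption: the individual results are combined by simple majority vote.
    """
    rng = random.Random(seed)
    dims = range(len(query))
    votes = Counter()
    for _ in range(n_classifiers):
        features = rng.sample(dims, subspace_size)
        votes[knn_predict(train, query, features, k)] += 1
    return votes.most_common(1)[0][0]
```

    Here `train` is a list of `(vector, label)` pairs, for example term-weight vectors from a vector space model. Each classifier sees only a few dimensions, which keeps the distance computations cheap when the full space is very large.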
