21 research outputs found

Human assessments of document similarity

Author: Belkin
Belz
Cavnar
Damashek
Flesch
Fox
Furnas
Gardenfors
Haenggi
Harman
Hjørland
Johnson-Laird
Järvelin
Landauer
Lee
Lin
Lund
Miller
Morris
Resnik
Salton
Saracevic
Skupin
Vorhees
Westerman
Publication venue: 'Wiley'
Publication date: 01/01/2010

Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for future development of document visualization systems.
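As an illustration of the kind of n-gram comparison the abstract refers to, the following is a minimal sketch only. The abstract does not say which n-gram unit or similarity measure the study used, so the character n-grams, the cosine measure, and the function names `ngram_profile`, `cosine_similarity`, and `ngram_similarity` are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int) -> Counter:
    """Count overlapping character n-grams (assumed unit, not from the paper)."""
    # Lowercase and collapse runs of whitespace so layout does not affect counts.
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text shorter than n characters has an empty profile.
        return 0.0
    return dot / (norm_a * norm_b)


def ngram_similarity(doc_a: str, doc_b: str, n: int = 3) -> float:
    """Similarity of two documents for a given n-gram length."""
    return cosine_similarity(ngram_profile(doc_a, n), ngram_profile(doc_b, n))
```

Varying `n` in such a sketch is one way to explore the abstract's observation that the best n-gram length depends on the type of text.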

Crossref

University of Gloucestershire Research Repository

Brunel University Research Archive

Probabilistic retrieval of OCR degraded text using N-grams

Author: A. Zamora
C. Pierce
D.J. Cohen
E. Ukkonen
H. Turtle
J. Zobel
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Challenges in Short Text Classification: The Case of Online Auction Disclosure

Author: Li Yichen
Srinivasan Ananth
Tripathi Arvind
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2016

Text classification is an important research problem in many fields. We examine a special case of textual content, namely short text. Examples of short text appear in a number of contexts, such as online reviews, chat messages, and Twitter feeds. In this research, we examine short text for the purpose of classification in internet auctions. The “ask seller a question” forum of a large horizontal intermediary auction platform is used to conduct this research. We describe our approach to classification by examining various solution methods to the problem. The unsupervised K-Medoids clustering algorithm provides useful but limited insights into keyword extraction, while the supervised Naïve Bayes algorithm achieves, on average, around 65% classification accuracy. We then present a score-assigning approach to this issue which outperforms the other two methods. Finally, we discuss how our approach to short text classification can be used to analyse the effectiveness of internet auctions.

AIS Electronic Library (AISeL)

Text Document Classification: An Approach Based on Indexing

Author: B S Harish

In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrence in a document. The term sequence is preserved with the help of a novel data structure called the ‘Status Matrix’. A corresponding classification technique is also proposed for efficient classification of text documents. In addition, to avoid sequential matching during classification, we propose to index the terms in a B-tree, an efficient index scheme. Each term in the B-tree is associated with a list of class labels of those documents which contain the term. To corroborate the efficacy of the proposed representation and status-matrix-based classification, we have conducted extensive experiments on various datasets. Original source URL: http://aircconline.com/ijdkp/V2N1/2112ijdkp04.pdf For more details: http://airccse.org/journal/ijdkp/vol2.htm

ZENODO

A RE-UNIFICATION OF TWO COMPETING MODELS FOR DOCUMENT RETRIEVAL

Author: Bodoff David
Publication venue: Stern School of Business, New York University
Publication date: 01/06/1997

Two competing approaches for document retrieval were first identified by Robertson et al. (Robertson, Maron et al. 1982) for probabilistic retrieval. We point out the corresponding two competing approaches for the Vector Space Model. In both the probabilistic and Vector Space models, only one of the two competing approaches has received significant research attention, because of the unavailability of sufficient data to implement the second approach. Because it is now feasible to collect vast amounts of feedback data from users, both approaches are now possible. We therefore revisit the question of a unification of both approaches, for both probabilistic and Vector Space models. This unification of approaches differs from that originally proposed in (Robertson, Maron et al. 1982), and offers unique advantages. Preliminary results of a simulation experiment are reported, and an outline is provided of an ongoing field study. Information Systems Working Papers Series

New York University Faculty Digital Archive

Automatic document classification using random subspaces and classifier ensembles (Classificação automática de documentos usando subespaços aleatórios e conjuntos de classificadores)

Author: Gean Chu Chia
Publication date: 16/10/2012

Nowadays, due to the large volume of text available in digital media, automatic document classification has become an important task in automated information processing. This paper describes a new approach to the problem, based on the classical vector space model for text and on pattern recognition techniques. Because text collections produce very high-dimensional vector spaces, the problem is addressed with several preprocessing techniques and a set of instance-based k-nearest-neighbor classifiers, each dedicated to a subspace of the original space. The final classification is obtained by combining the results of the individual classifiers. The approach is applied to documents from the TIPSTER and REUTERS databases, which are widely used in the field. The main results, some conclusions, and perspectives for the work are presented. Track: V - Workshop on Agents and Intelligent Systems. Red de Universidades con Carreras en Informática (RedUNCI)
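The ensemble idea in this abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation: the abstract does not state how subspaces are drawn, which distance is used, or how results are combined, so the uniform random feature sampling, squared Euclidean distance, majority voting, and the names `knn_predict` and `random_subspace_knn` are all assumptions of this example.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, features, k):
    """k-NN majority vote for x, measuring distance only on the given feature indices."""
    def dist(u):
        # Squared Euclidean distance restricted to the subspace (assumed metric).
        return sum((u[f] - x[f]) ** 2 for f in features)

    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def random_subspace_knn(train_X, train_y, x,
                        n_classifiers=5, subspace_size=2, k=3, seed=0):
    """Combine k-NN classifiers, each restricted to a random subset of features."""
    rng = random.Random(seed)
    n_features = len(train_X[0])
    votes = Counter()
    for _ in range(n_classifiers):
        # Each ensemble member sees its own randomly drawn subspace.
        features = rng.sample(range(n_features), subspace_size)
        votes[knn_predict(train_X, train_y, x, features, k)] += 1
    # Majority vote over ensemble members (assumed combination rule).
    return votes.most_common(1)[0][0]
```

In a text setting, each row of `train_X` would be a document's term-weight vector after preprocessing; the toy interface here takes plain lists of numbers.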

Servicio de Difusión de la Creación Intelectual

Automatic document classification using random subspaces and classifier ensembles (Classificação automática de documentos usando subespaços aleatórios e conjuntos de classificadores)

Author: Gean Chu Chia
Publication date: 16/10/2012

Servicio de Difusión de la Creación Intelectual

Automatic document classification using random subspaces and classifier ensembles (Classificação automática de documentos usando subespaços aleatórios e conjuntos de classificadores)

Author: Gean Chu Chia
Publication date: 01/01/2004