
1,266 research outputs found

Collaborative Filtering-based Context-Aware Document-Clustering (CF-CAC) Technique

Author: Wei Chih-Ping
Yang Chin-Sheng
Publication venue: AIS Electronic Library (AISeL)
Publication date: 03/07/2008

Document clustering is an intentional act that should reflect an individual's preference with regard to the semantic coherency or relevant categorization of documents and should conform to the context of a target task under investigation. Thus, effective document clustering techniques need to take into account a user's categorization context. In response, Yang & Wei (2007) propose a Context-Aware document Clustering (CAC) technique that takes into consideration a user's categorization preference relevant to the context of a target task and subsequently generates a set of document clusters from this specific contextual perspective. However, the CAC technique encounters the problem of small-sized anchoring terms. To overcome this shortcoming, we extend the CAC technique and propose a Collaborative Filtering-based Context-Aware document-Clustering (CF-CAC) technique that considers not only a target user's but also other users' anchoring terms when approximating the categorization context of the target user. Our empirical evaluation results suggest that our proposed CF-CAC technique outperforms the CAC technique.

AIS Electronic Library (AISeL)

Combining Thesaurus Knowledge and Probabilistic Topic Models

Author: A Smith
D Blei
J Lau
K Frantzi
T Griffiths
Y Gao
Publication date: 31/07/2017

In this paper we present an approach for introducing thesaurus knowledge into probabilistic topic models. The approach rests on the assumption that the frequencies of semantically related words and phrases that occur in the same texts should be increased, which raises their contribution to the topics found in these texts. We have conducted experiments with several thesauri and found that domain-specific knowledge is useful for improving topic models. If a general thesaurus, such as WordNet, is used, topic models can be improved by excluding hyponymy relations from the combined topic models. Comment: Accepted to AIST-2017 conference (http://aistconf.ru/). The final publication will be available at link.springer.co
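The core assumption in this abstract, that related words occurring in the same text should have their frequencies raised, can be illustrated with a minimal sketch. The additive weighting scheme, the function name `boost_related_counts`, the `weight` parameter, and the toy thesaurus are all assumptions made for illustration; the abstract does not specify how the paper performs the enhancement.

```python
from collections import Counter


def boost_related_counts(tokens, related, weight=1.0):
    """Return per-word weights for one text, boosting thesaurus-related words.

    tokens:  list of words in a single text
    related: dict mapping a word to a set of thesaurus-related words
             (hypothetical toy thesaurus, not a real resource)
    weight:  how strongly co-occurring related words add to a word's count
             (an assumed parameter, not taken from the paper)
    """
    counts = Counter(tokens)
    boosted = {}
    for word, freq in counts.items():
        # Sum the frequencies of related words that occur in this same text.
        extra = sum(
            counts[r] for r in related.get(word, ()) if r in counts and r != word
        )
        boosted[word] = freq + weight * extra
    return boosted


# Toy example: "car" and "automobile" are related and co-occur in the text.
related = {"car": {"automobile", "vehicle"}, "automobile": {"car"}}
doc = ["car", "automobile", "car", "road"]
weights = boost_related_counts(doc, related)
```

In this toy text, "car" and "automobile" each gain weight from the other's presence, while "road" keeps its raw count. The boosted weights would then replace raw counts as input to a topic model.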

arXiv.org e-Print Archive

Crossref

Query expansion with naive bayes for searching distributed collections

Author: Yang Hui
Zhang Minjie
Publication date: 01/01/2002

The proliferation of online information resources increases the importance of effective and efficient distributed searching. However, the problem of word mismatch seriously hurts the effectiveness of distributed information retrieval. Automatic query expansion has been suggested as a technique for dealing with the fundamental issue of word mismatch. In this paper, we propose a method, query expansion with Naive Bayes, to address the problem, discuss its implementation in the IISS system, and present experimental results demonstrating its effectiveness. This technique not only enhances the discriminatory power of typical queries for choosing the right collections but also significantly improves retrieval results.

CiteSeerX

Open Research Online (The Open University)

Automated subject classification of textual web documents

Author: Koraljka Golub
Publication venue: Emerald

Crossref

Could we automatically reproduce semantic relations of an information retrieval thesaurus?

Author: Panchenko A.
Publication venue: Издательско-полиграфический центр Воронежского государственного университета
Publication date: 01/01/2010

A well-constructed thesaurus is recognized as a valuable source of semantic information for various applications, especially for Information Retrieval. The main hindrances to using thesaurus-oriented approaches are the high complexity and cost of manual thesaurus creation. This paper addresses the problem of automatic thesaurus construction: we study the quality of automatically extracted semantic relations as compared with the semantic relations of a manually crafted thesaurus. A vector-space model based on syntactic contexts was used to reproduce relations between the terms of a manually constructed thesaurus. We propose a simple algorithm for representing both single-word and multiword terms in the distributional space of syntactic contexts. Furthermore, we propose a method for evaluating the quality of the extracted relations. Our experiments show a significant difference between the automatically and manually constructed relations: while many of the automatically generated relations are relevant, only a small part of them could be found in the original thesaurus.

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Machine Learning in Automated Text Categorization

Author: ANDROUTSOPOULOS I.
ATTARDI G.
BAKER L.D.
BIEBRICHER P.
CAROPRESO M.F.
CAVNAR W.B.
CHAKRABARTI S.
CLACK C.
CLEVERDON C.
COHEN W. W.
DAGAN I.
DEERWESTER S.
DENOYER L.
DIAZ ESTEBAN A.
DRUCKER H.
DUMAIS S.T.
ESCUDERO G.
Fabrizio Sebastiani
FIELD B.
FORSYTH R. S.
FUHR N.
FURNKRANZ J.
GALAVOTTI L.
GALE W. A.
GOVERT N.
GRAY W.A.
GUTHRIE L.
HAYES P.J.
HEAPS H.
HERSH W.
HULL D. A.
ITTNER D.J.
IWAYAMA M.
IYER R.D.
JOACHIMS T.
JOHN G. H.
JUNKER M.
KESSLER B.
KIM Y.-H.
KLINKENBERG R.
KNORZ G.
KOLLER D.
LAM S.L.
LAM W.
LANG K.
LARKEY L. S.
LEWIS D. D.
LI H.
LI Y.H.
LIERE R.
LIM J. H.
MASAND B.
MCCALLUM A. K.
MLADENIC D.
MOULINIER I.
MYERS K.
NG H.T.
OH H.-J.
PAZIENZA M. T.
RILOFF E.
ROBERTSON S.E.
ROTH D.
RUIZ M.E.
SABLE C.L.
SARACEVIC T.
SCHAPIRE R. E.
SCHUTZE H.
SCOTT S.
SEBASTIANI F.
SINGHAL A.
SLONIM N.
TAIRA H.
TUMER K.
TZERAS K.
VAN RIJSBERGEN C. J.
WIENER E.D.
YANG Y.
YU K.L.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2001

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys

arXiv.org e-Print Archive

CiteSeerX

Crossref

Text Document Categorization using Enhanced Sentence Vector Space Model and Bi-Gram Text Representation Model Based on Novel Fusion Techniques

Author: Amensisa Abdisa Demissie
Publication venue: International Institute for Science, Technology and Education
Publication date: 02/10/2020

Text document classification falls under the automatic classification (also known as pattern recognition) problem in machine learning and text mining. Large text documents need to be classified into specific classes so that they are clearly organized and simple to search. Classified data are easy for users to browse. The key issue in conventional text document classification is representing the features used to classify an unknown document into predefined categories. Combinations of classifiers are fused to increase classification accuracy for a single text document. This paper presents a novel fusion approach that classifies text documents using the ES-VSM and Bigram representation models. ES-VSM (Enhanced Sentence Vector Space Model), an advanced form of the sentence-based vector space model and an extension of the simple VSM, is used for the constructive representation of text documents. The main objective of the study is to boost the accuracy of text classification by accounting for the features extracted from the text document. The proposed system concatenates two different representation models of the text documents for designing two different classifiers and feeds them as one input to the classifier. An enhanced S-VSM and an interval-valued representation model are considered for the effective representation of text documents. A word-level neural network Bigram representation of text documents is proposed to capture the semantic information present in the text data effectively. The proposed approach improves the overall accuracy of text document classification to a significant extent. Keywords: ES-VSM, Fusion, Text Document Classification, Neural Network, Text Representation, Machine Learning. DOI: 10.7176/NMMC/93-03. Publication date: September 30th 2020

International Institute for Science, Technology and Education (IISTE): E-Journals

HelpfulMed: Intelligent searching for medical information over the internet

Author: Bates
Brin
Chen
Cho
Cimino
Crouch
Deerwester
Eysenbach
Fallis
Furnas
Guntzer
Haveliwala
Hearst
Hopfield
Houston
Janes
Kohonen
Lyman
Mechkour
Roussinov
Salton
Srinivasan
Tolle
van Rijsbergen
Vélez
Woolf
Wu
Zamir
Publication venue: Wiley
Publication date: 01/01/2003

Crossref