Comparing SVM and Naive Bayes classifiers for text categorization with Wikitology as knowledge enrichment

Author: Hassan Sundus
Rafi Muhammad
Shaikh Muhammad Shahid
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/02/2012

Text categorization is the task of labeling documents according to their content. Many experiments have been carried out to enhance text categorization by adding background knowledge to the document using knowledge repositories like WordNet, Open Project Directory (OPD), Wikipedia and Wikitology. In our previous work, we carried out intensive experiments by extracting knowledge from Wikitology and evaluating the experiment on a Support Vector Machine with 10-fold cross-validation. The results clearly indicate that Wikitology is far better than other knowledge bases. In this paper we compare Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers under text enrichment through Wikitology. We validated results with 10-fold cross-validation and showed that NB gives an improvement of +28.78%, whereas SVM gives an improvement of +6.36%, when compared with baseline results. The Naïve Bayes classifier is the better choice when enrichment through an external knowledge base is used.

arXiv.org e-Print Archive

Crossref
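
The evaluation protocol named in the abstract above (a Naïve Bayes classifier scored with k-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses no Wikitology enrichment and no SVM, the toy documents and labels are invented for illustration, and the function names (`k_fold_indices`, `train_nb`, `predict_nb`, `cross_validate`) are hypothetical. It assumes a multinomial Naïve Bayes model with add-one smoothing and contiguous, unshuffled folds.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (not from the paper): tokenized documents and labels.
DOCS = [
    ["goal", "match", "team"],
    ["election", "vote", "party"],
    ["team", "coach", "goal"],
    ["vote", "senate", "party"],
    ["match", "coach", "team"],
    ["senate", "election", "vote"],
]
LABELS = ["sport", "politics", "sport", "politics", "sport", "politics"]


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_nb(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                smoothed = (word_counts[label][tok] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k):
    """Train on k-1 folds, test on the held-out fold, and pool the accuracy."""
    correct = 0
    for fold in k_fold_indices(len(docs), k):
        held_out = set(fold)
        train_docs = [d for i, d in enumerate(docs) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_nb(train_docs, train_labels)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in fold)
    return correct / len(docs)


accuracy = cross_validate(DOCS, LABELS, k=3)
```

With real data the folds would normally be shuffled or stratified first, and k would be 10 as in the abstract; the toy corpus is too small for that, so k=3 is used here.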

Effective pattern discovery for text mining

Author: Li Yuefeng
Wu Sheng-Tang
Zhong Ning
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012

Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than the term-based ones, but many experiments did not support this hypothesis. This paper presents an innovative technique, effective pattern discovery, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.

Queensland University of Technology ePrints Archive

Feature selection, optimization and clustering strategies of text documents

Author: Nikhath A. Kousar
Subrahmanyam K.
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/04/2019

Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The clustering task needs to be studied before it is adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been made to achieve efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent models proposed for clustering along with the pros and cons of each model. In addition, it focuses on the latest developments in the clustering task in social networks and associated environments.

Crossref

ZENODO

Institute of Advanced Engineering and Science

Discovering New Sentiments from the Social Web

Author: Borrego-Díaz Joaquín
Galan-Paez Juan
Publication date: 01/01/2014

A persistent challenge in Complex Systems (CS) research is the phenomenological reconstruction of systems from raw data. To face this problem, the use of sound features to reason about the system from data processing is a key step. In the specific case of complex societal systems, sentiment analysis allows one to mirror (part of) the affective dimension. However, it is not reasonable to think that individual sentiment categorization can encompass the new affective phenomena in digital social networks. The present paper addresses the problem of isolating sentiment concepts which emerge in social networks. In an analogy to Artificial Intelligent Singularity, we propose the study and analysis of these new complex sentiment structures and how they are similar to or diverge from classic conceptual structures associated with sentiment lexicons. The conjecture is that it is highly probable that hypercomplex sentiment structures, not explained by human categorizations, emerge from highly dynamic social information networks. Roughly speaking, new sentiment can emerge from the new global nervous systems as it occurs in humans.

arXiv.org e-Print Archive

idUS. Depósito de Investigación Universidad de Sevilla

SoK: Chasing Accuracy and Privacy, and Catching Both in Differentially Private Histogram Publication

Author: Nelson Boel
Reuben Jenni
Publication date: 01/01/2020

Histograms and synthetic data are of key importance in data analysis. However, researchers have shown that even aggregated data such as histograms, containing no obvious sensitive attributes, can result in privacy leakage. To enable data analysis, a strong notion of privacy is required to avoid risking unintended privacy violations. Such a strong notion of privacy is differential privacy, a statistical notion of privacy that makes privacy leakage quantifiable. The caveat regarding differential privacy is that while it has strong guarantees for privacy, privacy comes at a cost of accuracy. Despite this trade-off being a central and important issue in the adoption of differential privacy, there exists a gap in the literature regarding providing an understanding of the trade-off and how to address it appropriately. Through a systematic literature review (SLR), we investigate the state of the art within accuracy-improving differentially private algorithms for histogram and synthetic data publishing. Our contribution is two-fold: 1) we identify trends and connections in the contributions to the field of differential privacy for histograms and synthetic data and 2) we provide an understanding of the privacy/accuracy trade-off challenge by crystallizing different dimensions of accuracy improvement. Accordingly, we position and visualize the ideas in relation to each other and external work, and deconstruct each algorithm to examine the building blocks separately with the aim of pinpointing which dimension of accuracy improvement each technique/approach is targeting. Hence, this systematization of knowledge (SoK) provides an understanding of in which dimensions and how accuracy improvement can be pursued without sacrificing privacy.

Chalmers Research
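
The privacy/accuracy trade-off described in the abstract above can be made concrete with a minimal sketch of the standard Laplace mechanism applied to a histogram. This is a baseline illustration only, not one of the accuracy-improving algorithms surveyed in the paper. It assumes neighbouring datasets differ by adding or removing one record, so each record affects one bin and the L1 sensitivity is 1; the function names (`laplace_noise`, `private_histogram`, `mean_abs_error`) are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(counts, epsilon, rng, sensitivity=1.0):
    """Add independent Laplace noise of scale sensitivity/epsilon to each bin.

    Assumption: sensitivity=1.0 corresponds to add/remove-one-record
    neighbouring datasets; other neighbour definitions need a different value.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def mean_abs_error(true_counts, noisy_counts):
    """Average absolute difference per bin between true and released counts."""
    total = sum(abs(t - n) for t, n in zip(true_counts, noisy_counts))
    return total / len(true_counts)


rng = random.Random(0)           # seeded only to make the sketch repeatable
true_counts = [100] * 1000       # hypothetical histogram with 1000 bins

strong_privacy = private_histogram(true_counts, epsilon=0.1, rng=rng)
weak_privacy = private_histogram(true_counts, epsilon=10.0, rng=rng)

error_strong = mean_abs_error(true_counts, strong_privacy)
error_weak = mean_abs_error(true_counts, weak_privacy)
```

A smaller epsilon gives stronger privacy but a larger noise scale, so `error_strong` comes out larger than `error_weak`: the expected absolute error per bin equals the noise scale, 1/epsilon under the stated sensitivity assumption.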

Concept graphs: Applications to biomedical text categorization and concept extraction

Author: Bleik Said
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2013

As science advances, the underlying literature grows rapidly providing valuable knowledge mines for researchers and practitioners. The text content that makes up these knowledge collections is often unstructured and, thus, extracting relevant or novel information could be nontrivial and costly. In addition, human knowledge and expertise are being transformed into structured digital information in the form of vocabulary databases and ontologies. These knowledge bases hold substantial hierarchical and semantic relationships of common domain concepts. Consequently, automating learning tasks could be reinforced with those knowledge bases through constructing human-like representations of knowledge. This allows developing algorithms that simulate the human reasoning tasks of content perception, concept identification, and classification. This study explores the representation of text documents using concept graphs that are constructed with the help of a domain ontology. In particular, the target data sets are collections of biomedical text documents, and the domain ontology is a collection of predefined biomedical concepts and relationships among them. The proposed representation preserves those relationships and allows using the structural features of graphs in text mining and learning algorithms. Those features emphasize the significance of the underlying relationship information that exists in the text content behind the interrelated topics and concepts of a text document. The experiments presented in this study include text categorization and concept extraction applied on biomedical data sets. The experimental results demonstrate how the relationships extracted from text and captured in graph structures can be used to improve the performance of the aforementioned applications. 
The discussed techniques can be used in creating and maintaining digital libraries through enhancing indexing, retrieval, and management of documents as well as in a broad range of domain-specific applications such as drug discovery, hypothesis generation, and the analysis of molecular structures in chemoinformatics.

Digital Commons @ New Jersey Institute of Technology (NJIT)

Extracting corpus specific knowledge bases from Wikipedia

Author: Milne David N.
Nichols David M.
Witten Ian H.
Publication venue: University of Waikato, Department of Computer Science
Publication date: 01/06/2007

Thesauri are useful knowledge structures for assisting information retrieval. Yet their production is labor-intensive, and few domains have comprehensive thesauri that cover domain-specific concepts and contemporary usage. One approach, which has been attempted without much success for decades, is to seek statistical natural language processing algorithms that work on free text. Instead, we propose to replace costly professional indexers with thousands of dedicated amateur volunteers, namely those who are producing Wikipedia. This vast, open encyclopedia represents a rich tapestry of topics and semantics and a huge investment of human effort and judgment. We show how this can be directly exploited to provide WikiSauri: manually defined yet inexpensive thesaurus structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We also offer concrete evidence of the effectiveness of WikiSauri for assisting information retrieval.

Research Commons@Waikato

Reviews

Author: Barker Philip
Buckner Kathy
Dwyer Peter
Fayter Debra
Green Steve
Hubbard Bill
Warren Lorraine
Publication venue: 'Informa UK Limited'
Publication date: 01/01/1997

Teaching and Learning Materials and the Internet by Ian Forsyth, London: Kogan Page, 1996. ISBN: 0-7494-2059-6. 181 pages, paperback. £18.99

Crossref

ALT Open Access Repository