133,814 research outputs found
Effective pattern discovery for text mining
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than the term-based ones, but many experiments did not support this hypothesis. This paper presents an innovative technique, effective pattern discovery which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance
Recommended from our members
Application of Natural Language Processing and Evidential Analysis to Web-Based Intelligence Information Acquisition
The quality of decisions made in business and government relates directly to the quality of the information used to formulate the decision. This information may be retrieved from an organization's knowledge base (Intranet) or from the World Wide Web. Intelligence services Intranet held information can be efficiently manipulated by technologies based upon either semantics such as ontologies, or statistics such as meaning-based computing. These technologies require complex processing of large amount of textual information. However, they cannot currently be effectively applied to Web-based search due to various obstacles, such as lack of semantic tagging. A new approach proposed in this paper supports Web-based search for intelligence information utilizing evidence-based natural language processing (NLP). This approach combines traditional NLP methods for filtering of Web-search results, Grounded Theory to test the completeness of the evidence, and Evidential Analysis to test the quality of gathered information. The enriched information derived from the Web-search will be transferred to the intelligence services knowledge base for handling by an effective Intranet search system thus increasing substantially the information for intelligence analysis. The paper will show that the quality of retrieved information is significantly enhanced by the discovery of previously unknown facts derived from known facts
Relation Discovery from Web Data for Competency Management
This paper describes a technique for automatically discovering associations between people and expertise from an analysis of very large data sources (including web pages, blogs and emails), using a family of algorithms that perform accurate named-entity recognition, assign different weights to terms according to an analysis of document structure, and access distances between terms in a document. My contribution is to add a social networking approach called BuddyFinder which relies on associations within a large enterprise-wide "buddy list" to help delimit the search space and also to provide a form of 'social triangulation' whereby the system can discover documents from your colleagues that contain pertinent information about you. This work has been influential in the information retrieval community generally, as it is the basis of a landmark system that achieved overall first place in every category in the Enterprise Search Track of TREC2006
Learning Reputation in an Authorship Network
The problem of searching for experts in a given academic field is hugely
important in both industry and academia. We study exactly this issue with
respect to a database of authors and their publications. The idea is to use
Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) to perform
topic modelling in order to find authors who have worked in a query field. We
then construct a coauthorship graph and motivate the use of influence
maximisation and a variety of graph centrality measures to obtain a ranked list
of experts. The ranked lists are further improved using a Markov Chain-based
rank aggregation approach. The complete method is readily scalable to large
datasets. To demonstrate the efficacy of the approach we report on an extensive
set of computational simulations using the Arnetminer dataset. An improvement
in mean average precision is demonstrated over the baseline case of simply
using the order of authors found by the topic models
- …