2,042 research outputs found
Relation Discovery from Web Data for Competency Management
This paper describes a technique for automatically discovering associations between people and expertise from an analysis of very large data sources (including web pages, blogs and emails), using a family of algorithms that perform accurate named-entity recognition, assign different weights to terms according to an analysis of document structure, and access distances between terms in a document. My contribution is to add a social networking approach called BuddyFinder which relies on associations within a large enterprise-wide "buddy list" to help delimit the search space and also to provide a form of 'social triangulation' whereby the system can discover documents from your colleagues that contain pertinent information about you. This work has been influential in the information retrieval community generally, as it is the basis of a landmark system that achieved overall first place in every category in the Enterprise Search Track of TREC2006
Recommended from our members
DEAR: A New Technique for Information Extraction and Context-Dependent Text Mining
The desire to store and the need to use electronic data has greatly increased as the power, availability, and connectivity of computers has grown. A large portion of this data is in the form of unstructured text documents. Locating specific information within this amorphous mass of documents is an area of active research. Our contribution to this pursuit is the development of the Document Entity and Resolution (DEAR) system. This system combines semantic similarity matching as provided by the open source WordNet database with the ability to recognize named entities through the OpenCalais system. When used in concert, this provides a novel way for users to quickly find relevant content and detect and identify uniquely named entities within that content. The theory behind the system is defined and the working system is described. This system is then applied to a collection of assessment documents as a proof-of-concept test of its viability. The results are promising and indicate that further research is warranted
Applying Wikipedia to Interactive Information Retrieval
There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases—e.g. controlled vocabularies, classification schemes, thesauri and ontologies—to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday webscale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text. Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval
Intelligent User Assistance for Automated Data Mining Method Selection
In any data science and analytics project, the task of mapping a domain-specific problem to an adequate set of data mining methods by experts of the field is a crucial step. However, these experts are not always available and data mining novices may be required to perform the task. While there are several research efforts for automated method selection as a means of support, only a few approaches consider the particularities of problems expressed in the natural and domain-specific language of the novice. The study proposes the design of an intelligent assistance system that takes problem descriptions articulated in natural lan- guage as an input and offers advice regarding the most suitable class of data mining methods. Following a design science research approach, the paper (i) outlines the prob- lem setting with an exemplary scenario from industrial practice, (ii) derives design requirements, (iii) develops design principles and proposes design features, (iv) devel- ops and implements the IT artifact using several methods such as embeddings, keyword extractions, topic models, and text classifiers, (v) demonstrates and evaluates the implemented prototype based on different classification pipelines, and (vi) discusses the results’ practical and the- oretical contributions. The best performing classification pipelines show high accuracies when applied to validation data and are capable of creating a suitable mapping that exceeds the performance of joint novice assessments and simpler means of text mining. The research provides a promising foundation for further enhancements, either as a stand-alone intelligent assistance system or as an add-on to already existing data science and analytics platforms
- …