
    Normalized Information Distance

    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
    Comment: 33 pages, 12 figures; "Normalized information distance", in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New York, to appear
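    The compression-based realization is usually computed as the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. Below is a minimal sketch of that idea using zlib as the compressor; the helper names and the choice of compressor are illustrative assumptions, not prescribed by the chapter.

```python
import zlib

def c(data: bytes) -> int:
    # Approximate Kolmogorov complexity by compressed length (here: zlib).
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog" * 10
    b = b"the quick brown fox jumps over the lazy cat" * 10
    print(round(ncd(a, b), 3))                        # similar inputs -> close to 0
    print(round(ncd(a, bytes(range(256)) * 10), 3))   # dissimilar inputs -> closer to 1
```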

    Machine Learning of User Profiles: Representational Issues

    As more information becomes available electronically, tools for finding information of interest to users become increasingly important. The goal of the research described here is to build a system for generating comprehensible user profiles that accurately capture user interests with minimal user interaction. The work focuses on the importance of a suitable generalization hierarchy and representation for learning profiles that are predictively accurate and comprehensible. In our experiments we evaluated both traditional features based on weighted term vectors and subject features corresponding to categories drawn from a thesaurus. Our experiments, conducted in the context of a content-based profiling system for on-line newspapers on the World Wide Web (the IDD News Browser), demonstrate the importance of a generalization hierarchy and the promise of combining natural language processing techniques with machine learning (ML) to address an information retrieval (IR) problem.
    Comment: 6 pages
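    For readers unfamiliar with the weighted-term-vector representation mentioned above, the sketch below builds TF-IDF-weighted vectors for tokenized articles; a profile could then be formed from the vectors of articles the user marked as interesting. The tokenization and weighting details are illustrative assumptions, not the IDD News Browser's actual features.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    # Weighted term vectors: term frequency scaled by inverse document frequency.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Toy articles, already tokenized; real profiles would use full article text.
articles = [["stocks", "market", "rally"],
            ["market", "crash", "fears"],
            ["soccer", "final", "rally"]]
print(tfidf_vectors(articles)[0])
```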

    Feature Selection Technique for Text Document Classification: An Alternative Approach

    Text classification and feature selection play an important role in correctly assigning documents to particular categories, given the explosive growth of textual information in electronic documents and on the World Wide Web. A central challenge in text mining is to select the important or relevant features from the very large number of features in a data set. The aim of this paper is to improve feature selection for text document classification in machine learning, where a training set is used to build a classifier that is then evaluated on test documents. This is achieved by selecting important terms, i.e. by weighting terms in the text documents, to improve classification in terms of both accuracy and performance.
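    As a hedged illustration of term-weight-based feature selection (the paper's exact weighting scheme is not reproduced here), the sketch below scores terms with the chi-square statistic against class labels and keeps only the top-ranked terms; scikit-learn is an assumed dependency, not one named by the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["free offer winner prize", "meeting agenda minutes",
        "free prize claim now", "project meeting notes"]
labels = [1, 0, 1, 0]  # toy labels: 1 = spam-like, 0 = work-like

vec = CountVectorizer()
X = vec.fit_transform(docs)                      # term-count feature matrix
selector = SelectKBest(chi2, k=4).fit(X, labels)  # score terms against the labels
kept = [t for t, keep in zip(vec.get_feature_names_out(), selector.get_support()) if keep]
print(kept)  # the 4 terms that best separate the two categories
```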

    Distributed human computation framework for linked data co-reference resolution

    Distributed Human Computation (DHC) is a technique used to solve computational problems by incorporating the collaborative effort of a large number of humans. It is also a solution to AI-complete problems such as natural language processing. The Semantic Web, with its roots in AI, is envisioned as a decentralised world-wide information space for sharing machine-readable data with minimal integration costs. Many research problems in the Semantic Web are considered AI-complete. An example is co-reference resolution, which involves determining whether different URIs refer to the same entity; this is considered a significant hurdle to overcome in the realisation of large-scale Semantic Web applications. In this paper, we propose a framework for building a DHC system on top of the Linked Data Cloud to solve various computational problems. To demonstrate the concept, we focus on handling co-reference resolution in the Semantic Web when integrating distributed datasets. The traditional way to solve this problem is to design machine-learning algorithms, but these are often computationally expensive, error-prone and do not scale. We designed a DHC system named iamResearcher, which solves the scientific-publication author identity co-reference problem when integrating distributed bibliographic datasets. In our system, we aggregated 6 million bibliographic records from various publication repositories. Users can sign up to the system to audit and align their own publications, thus solving the co-reference problem in a distributed manner. The aggregated results are published to the Linked Data Cloud.
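    One way to picture the human-computation step is as users asserting that two author URIs from different repositories denote the same person, with the system aggregating those assertions into owl:sameAs links for publication as Linked Data. The sketch below is a minimal, assumed model of that flow; the function names, vote threshold, and example URIs are ours, not iamResearcher's.

```python
from collections import Counter

votes = Counter()  # (uri_a, uri_b) -> number of users confirming the match

def record_alignment(uri_a: str, uri_b: str) -> None:
    # A user auditing their publication list asserts that the two URIs co-refer.
    votes[tuple(sorted((uri_a, uri_b)))] += 1

def accepted_links(min_votes: int = 1) -> list[str]:
    # Emit owl:sameAs triples (N-Triples) for pairs with enough human confirmations.
    return [f"<{a}> <http://www.w3.org/2002/07/owl#sameAs> <{b}> ."
            for (a, b), n in votes.items() if n >= min_votes]

record_alignment("http://repoA.example/author/42", "http://repoB.example/person/jsmith")
print("\n".join(accepted_links()))
```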

    Web Content Extraction Techniques: A survey

    As technology grows every day and the amount of research done in various fields rises exponentially, the amount of information published on the World Wide Web rises in similar fashion. Along with this rise in useful information, an excess of irrelevant information, termed 'noise', is also published in the form of advertisements, links, scrollers, and so on. Systems are therefore being developed for data pre-processing and cleaning in real-time applications. These systems also help analysis systems, such as those for social network mining, web mining and data mining, to analyze data in real time, and support specialized tasks such as false-advertisement detection, demand forecasting, and comment extraction for product and service reviews. For the web content extraction task, researchers have proposed many different methods, such as wrapper-based methods, DOM tree rule-based methods, machine learning-based methods and so on. This paper presents a comparative study of four recently proposed methods for web content extraction. These methods all use the traditional DOM tree rule-based method as their base and add other tools to produce better results.
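    To make the DOM tree rule-based baseline concrete, here is a minimal sketch: parse the page into a DOM, drop subtrees matched by noise rules (scripts, navigation, ad-like classes), and keep the text of what remains. The tag and class rules are illustrative assumptions, not those of the surveyed methods, and BeautifulSoup is an assumed dependency.

```python
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "nav", "aside", "footer", "iframe"]
NOISE_CLASSES = {"ad", "advertisement", "scroller", "share", "comments-widget"}

def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Rule 1: drop whole subtrees for structural noise tags.
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()
    # Rule 2: drop elements whose class attribute matches known noise markers.
    for tag in soup.find_all(class_=lambda c: c and c.lower() in NOISE_CLASSES):
        tag.decompose()
    # Keep the text of the remaining nodes (ideally the main article body).
    return " ".join(soup.get_text(separator=" ").split())

html = "<html><body><nav>menu</nav><div class='ad'>buy now</div><p>Actual article text.</p></body></html>"
print(extract_content(html))  # -> "Actual article text."
```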