Search CORE

85,886 research outputs found

A Robust System for Local Reuse Detection of Arabic Text on the Web

Author: Ahmed Lulu Leena Mahmoud
Publication venue: Scholarworks@UAEU
Publication date: 01/12/2016
Field of study

We developed techniques for finding local text reuse on the Web, with an emphasis on the Arabic language. That is, our objective is to develop text reuse detection methods that can detect alternative versions of the same information and focus on exploring the feasibility of employing text reuse detection methods on the Web. The results of this research can be thought of as rich tools to information analysts for corporate and intelligence applications. Such tools will become essential parts in validating and assessing information coming from uncertain origins. These tools will prove useful for detecting reuse in scientific literature too. It is also the time for ordinary Web users to become Fact Inspectors by providing a tool that allows people to quickly check the validity and originality of statements and their sources, so they will be given the opportunity to perform their own assessment of information quality. Local text reuse detection can be divided into two major subtasks: the first subtask is the retrieval of candidate documents that are likely to be the original sources of a given document in a collection of documents and then performing an extensive pairwise comparison between the given document and each of the possible sources of text reuse that have been retrieved. For this purpose, we develop a new technique to address the challenging problem of candidate documents retrieval from the Web. Given an input document d, the problem of local text reuse detection is to detect from a given documents collection, all the possible reused passages between d and the other documents. Comparing the passages of document d with the passages of every other document in the collection is obviously infeasible especially with large collections such as the Web. Therefore, selecting a subset of the documents that potentially contains reused text with d becomes a major step in the detection problem. In the setting of the Web, the search for such candidate source documents is usually performed through limited query interface. We developed a new efficient approach of query formulation to retrieve Arabic-based candidate source documents from the Web. The candidate documents are then fed to a local text reuse detection system for detailed similarity evaluation with d. We consider the candidate source document retrieval problem as an essential step in the detection of text reuse. Several techniques have been previously proposed for detecting text reuse, however, these techniques have been designed for relatively small and homogeneous collections. Furthermore, we are not aware of any actual previous work on Arabic text reuse detection on the Web. This is due to complexity of the Arabic language as well as the heterogeneity of the information contained on the Web and its large scale that makes the task of text reuse detection on the Web much more difficult than in relatively small and homogeneous collections. We evaluated the work using a collection of documents especially constructed and downloaded from the Web for the evaluation of Web documents retrieval in particular and the detailed text reuse detection in general. Our work to a certain degree is exploratory rather than definitive, in that this problem has not been investigated before for Arabic documents at the Web scale. However, our results show that the methods we described are applicable for Arabic-based reuse detection in practice. The experiments show that around 80% of the Web documents used in the reused cases were successfully retrieved. As for the detailed similarity analysis, the system achieved an overall score of 97.2% based on the precision and recall evaluation metrics

United Arab Emirates University: Scholarworks@UAEU / جامعة الامارات

Query expansion with naive bayes for searching distributed collections

Author: Yang Hui
Zhang Minjie
Publication venue
Publication date: 01/01/2002
Field of study

The proliferation of online information resources increases the importance of effective and efficient distributed searching. However, the problem of word mismatch seriously hurts the effectiveness of distributed information retrieval. Automatic query expansion has been suggested as a technique for dealing with the fundamental issue of word mismatch. In this paper, we propose a method - query expansion with Naive Bayes to address the problem, discuss its implementation in IISS system, and present experimental results demonstrating its effectiveness. Such technique not only enhances the discriminatory power of typical queries for choosing the right collections but also hence significantly improves retrieval results

CiteSeerX

Open Research Online (The Open University)

Combining and selecting characteristics of information use

Author: Lalmas M.
Ruthven I.
van Rijsbergen C.J.
Publication venue
Publication date: 01/01/2002
Field of study

In this paper we report on a series of experiments designed to investigate the combination of term and document weighting functions in Information Retrieval. We describe a series of weighting functions, each of which is based on how information is used within documents and collections, and use these weighting functions in two types of experiments: one based on combination of evidence for ad-hoc retrieval, the other based on selective combination of evidence within a relevance feedback situation. We discuss the difficulties involved in predicting good combinations of evidence for ad-hoc retrieval, and suggest the factors that may lead to the success or failure of combination. We also demonstrate how, in a relevance feedback situation, the relevance assessments can provide a good indication of how evidence should be selected for query term weighting. The use of relevance information to guide the combination process is shown to reduce the variability inherent in combination of evidence

University of Strathclyde Institutional Repository

Selective relevance feedback using term characteristics

Author: Lalmas Mounia
Ruthven Ian
Publication venue
Publication date: 01/01/1999
Field of study

This paper presents a new relevance feedback technique; selectively combining evidence based on the usage of terms within documents. By considering how terms are used within documents, we can better describe the features that might make a document relevant and thus improve retrieval effectiveness. In this paper we present an initial, experimental investigation of this technique, incorporating new and existing measures for describing the information content of a document. The results from these experiments positively support our hypothesis that extending relevance feedback to take into account how terms are used within documents can improve the performance of relevance feedback

CiteSeerX

University of Strathclyde Institutional Repository

How Many Topics? Stability Analysis for Topic Models

Author: C. Lin
D. Greene
D.D. Lee
D.M. Blei
E. Levine
H.W. Kuhn
J.P. Brunet
L.N. Hutchins
M. Kendall
P. Jaccard
R. Fagin
S. Ben-David
T. Lange
W. Webber
Publication venue
Publication date: 01/01/2014
Field of study

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.Comment: Improve readability of plots. Add minor clarification

arXiv.org e-Print Archive

Crossref

Research Repository UCD

Irish Universities

Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents

Author: Jones Gareth J.F.
Lam-Adesina Adenike M.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2006
Field of study

Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and significantly the searching of these documents. The text contents of these document images can be transcribed automatically using OCR systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition which have previously been shown to impact on document retrieval behaviour. In particular relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour, and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data

Irish Universities

DCU Online Research Access Service

Updating collection representations for federated search

Author: Azzopardi L.
Baillie M.
Shokouhi M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2007
Field of study

To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precise each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change, however, a current solution has not yet been proposed. In this study we examine both the implications of out-of-date representation sets on retrieval accuracy, as well as proposing three different policies for managing necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance overtime, however, adopting a suitable update policy can minimise this problem

CiteSeerX

Crossref

University of Strathclyde Institutional Repository

Enlighten

Recommended from our members

An experimental comparison of a genetic algorithm and a hill-climber for term selection

Author: MacFarlane A.
May P.
Secker A.
Timmis J.
Publication venue: 'Emerald'
Publication date: 01/01/2010
Field of study

Purpose – The term selection problem for selecting query terms in information filtering and routing has been investigated using hill-climbers of various kinds, largely through the Okapi experiments in the TREC series of conferences. Although these are simple deterministic approaches which examine the effect of changing the weight of one term at a time, they have been shown to improve the retrieval effectiveness of filtering queries in these TREC experiments. Hill-climbers are, however, likely to get trapped in local optima, and the use of more sophisticated local search techniques for this problem that attempt to break out of these optima are worth investigating. To this end, we apply a genetic algorithm (GA) to the same problem. Design/Methodology/Approach – We use a standard TREC test collection from the TREC-8 filtering track, recording mean average precision and recall measures to allow comparison between the hillclimber and GA algorithms. We also vary elements of the GA, such as probability of a word being included, probability of mutation and population size in order to measure the effect of these variables. Different strategies such as Elitist and Non-Elitist methods are used, as well as Roulette Wheel and Rank selection GA algorithms. Findings – The results of tests suggest that both techniques are, on average, better than the baseline, but the implemented GA does not match the overall performance of a hill-climber. The Rank selection algorithm does better on average than the Roulette Wheel algorithm. There is no evidence in this study that varying word inclusion probability, mutation probability or Elitist method make much difference to the overall results. Small population sizes do not appear to be as effective as larger population sizes. Research limitations/implications – The evidence provided here would suggest that being stuck in a local optima for the term selection optimization problem does not appear to be detrimental to the overall success of the hill-climber. The evidence from term rank order would appear to provide extra useful evidence which hill-climbers can use efficiently and effectively to narrow the search space. Originality/Value – The paper represents the first attempt to compare hill-climbers with GAs on a problem of this type

City Research Online

Crossref

Aberystwyth Research Portal

A survey on the use of relevance feedback for information access systems

Author: Lalmas Mounia
Ruthven Ian
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/06/2003
Field of study

Users of online search engines often find it difficult to express their need for information in the form of a query. However, if the user can identify examples of the kind of documents they require then they can employ a technique known as relevance feedback. Relevance feedback covers a range of techniques intended to improve a user's query and facilitate retrieval of information relevant to a user's information need. In this paper we survey relevance feedback techniques. We study both automatic techniques, in which the system modifies the user's query, and interactive techniques, in which the user has control over query modification. We also consider specific interfaces to relevance feedback systems and characteristics of searchers that can affect the use and success of relevance feedback systems

Crossref

University of Strathclyde Institutional Repository