Search CORE

378 research outputs found

Recommended from our members

Parallel computing in information retrieval - An updated review

The progress of parallel computing in Information Retrieval (IR) is reviewed. In particular we stress the importance of the motivation in using parallel computing for Text Retrieval. We analyse parallel IR systems using a classification due to Rasmussen [1] and describe some parallel IR systems. We give a description of the retrieval models used in parallel Information Processing.. We describe areas of research which we believe are needed

City Research Online

Crossref

A lightweight, graph-theoretic model of class-based similarity to support object-oriented code reuse.

Author: MacLean Angus
Publication venue
Publication date: 31/01/2003
Field of study

The work presented in this thesis is principally concerned with the development of a method and set of tools designed to support the identification of class-based similarity in collections of object-oriented code. Attention is focused on enhancing the potential for software reuse in situations where a reuse process is either absent or informal, and the characteristics of the organisation are unsuitable, or resources unavailable, to promote and sustain a systematic approach to reuse. The approach builds on the definition of a formal, attributed, relational model that captures the inherent structure of class-based, object-oriented code. Based on code-level analysis, it relies solely on the structural characteristics of the code and the peculiarly object-oriented features of the class as an organising principle: classes, those entities comprising a class, and the intra and inter-class relationships existing between them, are significant factors in defining a two-phase similarity measure as a basis for the comparison process. Established graph-theoretic techniques are adapted and applied via this model to the problem of determining similarity between classes. This thesis illustrates a successful transfer of techniques from the domains of molecular chemistry and computer vision. Both domains provide an existing template for the analysis and comparison of structures as graphs. The inspiration for representing classes as attributed relational graphs, and the application of graph-theoretic techniques and algorithms to their comparison, arose out of a well-founded intuition that a common basis in graph-theory was sufficient to enable a reasonable transfer of these techniques to the problem of determining similarity in object-oriented code. The practical application of this work relates to the identification and indexing of instances of recurring, class-based, common structure present in established and evolving collections of object-oriented code. A classification so generated additionally provides a framework for class-based matching over an existing code-base, both from the perspective of newly introduced classes, and search "templates" provided by those incomplete, iteratively constructed and refined classes associated with current and on-going development. The tools and techniques developed here provide support for enabling and improving shared awareness of reuse opportunity, based on analysing structural similarity in past and ongoing development, tools and techniques that can in turn be seen as part of a process of domain analysis, capable of stimulating the evolution of a systematic reuse ethic

Open Access Institutional Repository at Robert Gordon University

Rôle de la matrice d'information et pondération des composantes dans les noyaux de Fisher pour PLSI

Author: Chappelier Jean-Cedric
Eckard Emmanuel
Publication venue
Publication date: 06/07/2009
Field of study

ABSTRACT. An information-geometric approach for document similarities in the framework of “Probabilistic Latent Semantic Indexing” was ﬁrst proposed by T. Hofmann (2000) and later extended (“revisited”) by Nyffenegger et al. (2006). This paper presents an in-depth study and revision of these models by (1) providing a simpler uniﬁed description framework, (2) investigating the role of the Fisher Information Matrix G(θ), and (3) analyzing the impact of latent “topic” parameters in such models. It furthermore provides new experimental results on larger collections coming from the TREC–AP evaluation corpus

Infoscience - École polytechnique fédérale de Lausanne

Exploring the reuse of past search results in information retrieval

Author: Gutiérrez Soto Claudio
Publication venue
Publication date: 17/05/2016
Field of study

Les recherches passées constituent pourtant une source d'information utile pour les nouveaux utilisateurs (nouvelles requêtes). En raison de l'absence de collections ad-hoc de RI, à ce jour il y a un faible intérêt de la communauté RI autour de l'utilisation des recherches passées. En effet, la plupart des collections de RI existantes sont composées de requêtes indépendantes. Ces collections ne sont pas appropriées pour évaluer les approches fondées sur les requêtes passées parce qu'elles ne comportent pas de requêtes similaires ou qu'elles ne fournissent pas de jugements de pertinence. Par conséquent, il n'est pas facile d'évaluer ce type d'approches. En outre, l'élaboration de ces collections est difficile en raison du coût et du temps élevés nécessaires. Une alternative consiste à simuler les collections. Par ailleurs, les documents pertinents de requêtes passées similaires peuvent être utilisées pour répondre à une nouvelle requête. De nombreuses contributions ont été proposées portant sur l'utilisation de techniques probabilistes pour améliorer les résultats de recherche. Des solutions simples à mettre en œuvre pour la réutilisation de résultats de recherches peuvent être proposées au travers d'algorithmes probabilistes. De plus, ce principe peut également bénéficier d'un clustering des recherches antérieures selon leurs similarités. Ainsi, dans cette thèse un cadre pour simuler des collections pour des approches basées sur les résultats de recherche passées est mis en œuvre et évalué. Quatre algorithmes probabilistes pour la réutilisation des résultats de recherches passées sont ensuite proposés et évalués. Enfin, une nouvelle mesure dans un contexte de clustering est proposée.Past searches provide a useful source of information for new users (new queries). Due to the lack of ad-hoc IR collections, to this date there is a weak interest of the IR community on the use of past search results. Indeed, most of the existing IR collections are composed of independent queries. These collections are not appropriate to evaluate approaches rooted in past queries because they do not gather similar queries due to the lack of relevance judgments. Therefore, there is no easy way to evaluate the convenience of these approaches. In addition, elaborating such collections is difficult due to the cost and time needed. Thus a feasible alternative is to simulate such collections. Besides, relevant documents from similar past queries could be used to answer the new query. This principle could benefit from clustering of past searches according to their similarities. Thus, in this thesis a framework to simulate ad-hoc approaches based on past search results is implemented and evaluated. Four randomized algorithms to improve precision are proposed and evaluated, finally a new measure in the clustering context is proposed

Thèses en ligne de l'Université Toulouse III - Paul Sabatier

Utilisation de PLSI en recherche d'information

Author: Chappelier Jean-Cedric
Eckard Emmanuel
Publication venue
Publication date: 06/07/2009
Field of study

The PLSI model (“Probabilistic Latent Semantic Indexing”) offers a document indexing scheme based on probabilistic latent category models. It entailed applications in diverse ﬁelds, notably in information retrieval (IR). Nevertheless, PLSI cannot process documents not seen during parameter inference, a major liability for queries in IR. A method known as “folding-in” allows to circumvent this problem up to a point, but has its own weaknesses. The present paper introduces a new document-query similarity measure for PLSI based on language models that entirely avoids the problem a query projection. We compare this similarity to Fisher kernels, the state of the art similarities for PLSI. Moreover, we present an evaluation of PLSI on a particularly large training set of almost 7500 document and over one million term occurrence large, created from the TREC–AP collection

Infoscience - École polytechnique fédérale de Lausanne

The Effectiveness of Query-Based Hierarchic Clustering of Documents for Information Retrieval

Author: Tombros Anastasios
Publication venue: ProQuest Dissertations & Theses,
Publication date: 01/01/2002
Field of study

Hierarchic document clustering has been applied to Information Retrieval (IR) for over three decades. Its introduction to IR was based on the grounds of its potential to improve the effectiveness of IR systems. Central to the issue of improved effectiveness is the Cluster Hypothesis. The hypothesis states that relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. However, previous research has been inconclusive as to whether document clustering does bring improvements. The main motivation for this work has been to investigate methods for the improvement of the effectiveness of document clustering, by challenging some assumptions that implicitly characterise its application. Such assumptions relate to the static manner in which document clustering is typically performed, and include the static application of document clustering prior to querying, and the static calculation of interdocument associations. The type of clustering that is investigated in this thesis is query-based, that is, it incorporates information from the query into the process of generating clusters of documents. Two approaches for incorporating query information into the clustering process are examined: clustering documents which are returned from an IR system in response to a user query (post-retrieval clustering), and clustering documents by using query-sensitive similarity measures. For the first approach, post-retrieval clustering, an analytical investigation into a number of issues that relate to its retrieval effectiveness is presented in this thesis. This is in contrast to most of the research which has employed post-retrieval clustering in the past, where it is mainly viewed as a convenient and efficient means of presenting documents to users. In this thesis, post-retrieval clustering is employed based on its potential to introduce effectiveness improvements compared both to static clustering and best-match IR systems. The motivation for the second approach, the use of query-sensitive measures, stems from the role of interdocument similarities for the validity of the cluster hypothesis. In this thesis, an axiomatic view of the hypothesis is proposed, by suggesting that documents relevant to the same query (co-relevant documents) display an inherent similarity to each other which is dictated by the query itself. Because of this inherent similarity, the cluster hypothesis should be valid for any document collection. Past research has attributed failure to validate the hypothesis for a document collection to characteristics of the collection. Contrary to this, the view proposed in this thesis suggests that failure of a document set to adhere to the hypothesis is attributed to the assumptions made about interdocument similarity. This thesis argues that the query determines the context and the purpose for which the similarity between documents is judged, and it should therefore be incorporated in the similarity calculations. By taking the query into account when calculating interdocument similarities, co-relevant documents can be "forced" to be more similar to each other. This view challenges the typically static nature of interdocument relationships in IR. Specific formulas for the calculation of query-sensitive similarity are proposed in this thesis. Four hierarchic clustering methods and six document collections are used in the experiments. Three main issues are investigated: the effectiveness of hierarchic post-retrieval clustering which uses static similarity measures, the effectiveness of query-sensitive measures at increasing the similarity of pairs of co-relevant documents, and the effectiveness of hierarchic clustering which uses query-sensitive similarity measures. The results demonstrate the effectiveness improvements that are introduced by the use of both approaches of query-based clustering, compared both to the effectiveness of static clustering and to the effectiveness of best-match IR systems. Query-sensitive similarity measures, in particular, introduce significant improvements over the use of static similarity measures for document clustering, and they also significantly improve the structure of the document space in terms of the similarity of pairs of co-relevant documents. The results provide evidence for the effectiveness of hierarchic query-based clustering of documents, and also challenge findings of previous research which had dismissed the potential of hierarchic document clustering as an effective method for information retrieval

Glasgow Theses Service

CiteSeerX

OpenGrey Repository

Approaches for the clustering of geographic metadata and the automatic detection of quasi-spatial dataset series

Author: Béjar R.
Lacasta J.
López Pellicer F. J.
Nogueras-Iso J.
Zarazaga-Soria J.
Publication venue: 'MDPI AG'
Publication date: 01/01/2022
Field of study

The discrete representation of resources in geospatial catalogues affects their information retrieval performance. The performance could be improved by using automatically generated clusters of related resources, which we name quasi-spatial dataset series. This work evaluates whether a clustering process can create quasi-spatial dataset series using only textual information from metadata elements. We assess the combination of different kinds of text cleaning approaches, word and sentence-embeddings representations (Word2Vec, GloVe, FastText, ELMo, Sentence BERT, and Universal Sentence Encoder), and clustering techniques (K-Means, DBSCAN, OPTICS, and agglomerative clustering) for the task. The results demonstrate that combining word-embeddings representations with an agglomerative-based clustering creates better quasi-spatial dataset series than the other approaches. In addition, we have found that the ELMo representation with agglomerative clustering produces good results without any preprocessing step for text cleaning

Multidisciplinary Digital Publishing Institute

Repositorio Universidad de Zaragoza

Directory of Open Access Journals

AFRANCI : multi-layer architecture for cognitive agents

Author: Reinaldo Francisco Antonio Fernandes
Publication venue
Publication date: 01/01/2010
Field of study

Tese de doutoramento. Engenharia Electrotécnica e de Computadores. Faculdade de Engenharia. Universidade do Porto. 201

Repositório Aberto da Universidade do Porto

Classifying web documents in a hierarchy of categories: a comprehensive study

Author: A. McCallum
A. McCallum
A. S. Weigend
A. Sun
A. Vinokourov
C. Apté
D. D. Lewis
D. D. Lewis
D. Koller
D. Malerba
D. Mladenić
D. Mladenić
D. Mladenić
D. Tikk
Donato Malerba
E.-H. Han
F. Debole
F. Esposito
F. Sebastiani
G. Miller
G. Salton
H. Blockeel
H. T. Ng
J. Rocchio
J. Zhang
L. Galavotti
M. Ceci
M. Craven
M. E. Ruiz
M. F. Porter
M. Sahami
Michelangelo Ceci
P. Domingos
R. E. Schapire
R. E. Schapire
S. Dumais
S. Dumais
S. Kim
T. Hastie
T. Joachims
T. Joachims
T. Mitchell
T. Theeramunkong
V. Lertnattee
V. Vapnik
W. T. Chuang
Y. Yang
Y. Yang
Y. Yang
Z. Zheng
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

Recommended from our members

Distributed Inverted Files and Performance: A Study of Parallelism and Data Distribution Methods in IR

Author: Macfarlane A.
Publication venue
Publication date
Field of study

The study investigates the performance of parallel information retrieval (IR) algorithms on different data distribution methods for Inverted files to identify which is the best for the requirements of specific IR tasks. We define a data distribution method as a way of distributing Inverted file data to local disks on a parallel machine. A data distribution method may be on-the-fly (with one copy of the index held), replication (all nodes have all of the index) or partitioning (data for index is split amongst nodes). Partitioning of inverted file data can be done in many ways but we consider only two: by term (Termld) and by document (Dodd). Termld partitioning is a type of partitioning which distributes unique word data to a single partition, while D odd partitioning distributes unique document data to a single partition. We consider the issue of improving the performance of standard IR algorithms on these data distribution methods by looking at sequential job service not concurrent job service, e.g. we consider the issue of sequential query service not concurrent query service. This methodology rules out some distribution methods for some tasks studied. We consider the following main tasks of IR: indexing, search, passage retrieval, inverted file update and query optimisation for routing /filtering. We produce a synthetic performance model for each of these tasks for the purposes of comparison. We have two subsidiary aims; one was to demonstrate portability of our implemented data structures and algorithms on different parallel machines. Secondly, we also study the possibility of increased retrieval effectiveness by examining a larger section of the search space for both passage retrieval and routing/filtering. We consider the implications of concurrency in updates on Inverted files. Our theoretical and empirical results show that in most cases the D odd partitioning method is the best data distribution method apart from routing/filtering where replication was found to be superior

City Research Online