2,709 research outputs found

    A model for information retrieval driven by conceptual spaces

    A retrieval model describes the transformation of a query into a set of documents. The question is: what drives this transformation? For semantic information retrieval models, the transformation is driven by the content and structure of the semantic models. In this case, Knowledge Organization Systems (KOSs) are the semantic models that encode the meaning employed for monolingual and cross-language retrieval. The focus of this research is the relationship between these meaning representations and their role and potential in augmenting the effectiveness of existing retrieval models. The proposed approach is unique in explicitly interpreting a semantic reference as a pointer to a concept in the semantic model that activates all of its linked neighboring concepts. The formalization of the information retrieval model and the integration of knowledge resources from the Linguistic Linked Open Data cloud are what distinguish it from other approaches. Preprocessing the semantic model with Formal Concept Analysis enables the extraction of conceptual spaces (formal contexts) based on sub-graphs of the original structure of the semantic model. The types of conceptual spaces built in this case are limited to the KOS structural relations relevant to retrieval: exact match, broader, narrower, and related. They capture the definitional and relational aspects of the concepts in the semantic model. Each formal context is also assigned an operational role in the retrieval system's flow of processes, giving a clear path towards implementations of monolingual and cross-lingual systems. A retrieval system constructed by following the model's theoretical description showed statistically significant improvements in both monolingual and bilingual settings when no query expansion methods were used. The test suite was run on the Cross-Language Evaluation Forum Domain Specific 2004-2006 collection, with additional extensions to match the specifics of this model.
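    The activation idea described above can be illustrated with a minimal, hypothetical sketch: a tiny SKOS-like concept graph stands in for the KOS, a query term points at a concept, and that concept's broader/narrower/related neighbours are activated and used to match documents. All concept identifiers, labels and documents below are invented for illustration; this is not the thesis's implementation.

```python
# Toy KOS: each concept has a preferred label and typed links to neighbours.
KOS = {
    "c:economics":        {"label": "economics",        "broader": [], "narrower": ["c:labour-economics"], "related": ["c:sociology"]},
    "c:labour-economics": {"label": "labour economics", "broader": ["c:economics"], "narrower": [], "related": ["c:employment"]},
    "c:employment":       {"label": "employment",       "broader": [], "narrower": [], "related": ["c:labour-economics"]},
    "c:sociology":        {"label": "sociology",        "broader": [], "narrower": [], "related": ["c:economics"]},
}

def activate(concept_id, relations=("broader", "narrower", "related")):
    """Return the concept plus its directly linked neighbours (one hop)."""
    activated = {concept_id}
    for rel in relations:
        activated.update(KOS[concept_id][rel])
    return activated

def score(doc_terms, activated_ids):
    """Count how many activated concept labels occur among a document's terms."""
    labels = {KOS[c]["label"] for c in activated_ids}
    return len(labels & set(doc_terms))

docs = {
    "d1": ["employment", "statistics", "survey"],
    "d2": ["sociology", "theory"],
}

active = activate("c:labour-economics")
ranking = sorted(docs, key=lambda d: score(docs[d], active), reverse=True)
print(ranking)  # documents sharing activated concept labels rank first
```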

    Engines of Order

    Over the last decades, and in particular since the widespread adoption of the Internet, encounters with algorithmic procedures for ‘information retrieval’ – the activity of getting some piece of information out of a collection or repository of some kind – have become everyday experiences for most people in large parts of the world.

    The Symbiotic Relationship Between Information Retrieval and Informetrics

    Informetrics and information retrieval (IR) represent fundamental areas of study within information science. Historically, researchers have not fully capitalized on the potential research synergies that exist between these two areas. Data sources used in traditional informetrics studies have their analogues in IR, with similar types of empirical regularities found in IR system content and use. Methods for data collection and analysis used in informetrics can help to inform IR system development and evaluation. Areas of application have included automatic indexing, index term weighting and understanding user query and session patterns through the quantitative analysis of user transaction logs. Similarly, developments in database technology have made the study of informetric phenomena less cumbersome, and recent innovations used in IR research, such as language models and ranking algorithms, provide new tools that may be applied to research problems of interest to informetricians. Building on the author’s previous work (Wolfram 2003), this paper reviews a sample of relevant literature published primarily since 2000 to highlight how each area of study may help to inform and benefit the other.
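    One of the synergies named above, applying informetric-style rank-frequency analysis to an IR transaction log, can be illustrated with a small hedged sketch. The log entries below are invented; real logs would be parsed from system records.

```python
from collections import Counter

# A toy query log: each entry is one submitted query string.
log = [
    "information retrieval", "bibliometrics", "information retrieval",
    "citation analysis", "information retrieval", "bibliometrics",
    "h-index", "information retrieval",
]

freq = Counter(log)
for rank, (query, count) in enumerate(freq.most_common(), start=1):
    # A roughly constant rank*frequency product hints at a Zipf-like regularity.
    print(f"rank {rank}: {query!r} occurs {count} times (rank*freq = {rank * count})")
```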

    Implementation of an information retrieval system within a central knowledge management system

    Numbered pages: I-XIII, 14-126. Internship carried out at Wipro Portugal SA and supervised by Eng. Hugo Neto. Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201

    Towards an All-Purpose Content-Based Multimedia Information Retrieval System

    The growth of multimedia collections - in terms of size, heterogeneity, and variety of media types - necessitates systems that are able to conjointly deal with several forms of media, especially when it comes to searching for particular objects. However, existing retrieval systems are organized in silos and treat different media types separately. As a consequence, retrieval across media types is either not supported at all or subject to major limitations. In this paper, we present vitrivr, a content-based multimedia information retrieval stack. As opposed to the keyword search approach implemented by most media management systems, vitrivr makes direct use of the object's content to facilitate different types of similarity search, such as Query-by-Example or Query-by-Sketch, for and, most importantly, across different media types - namely, images, audio, videos, and 3D models. Furthermore, we introduce a new web-based user interface that enables easy-to-use, multimodal retrieval from and browsing in mixed media collections. The effectiveness of vitrivr is shown on the basis of a user study that involves different query and media types. To the best of our knowledge, the full vitrivr stack is unique in that it is the first multimedia retrieval system that seamlessly integrates support for four different types of media. As such, it paves the way towards an all-purpose, content-based multimedia information retrieval system.
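    The Query-by-Example idea underlying a content-based stack such as vitrivr can be sketched very roughly: media objects are reduced to feature vectors and ranked by their similarity to the query object's vector. The vectors below are invented stand-ins for real feature extractors (colour histograms, audio fingerprints, 3D shape descriptors); this is not vitrivr's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical feature vectors for objects of different media types.
collection = {
    "image_001": [0.9, 0.1, 0.3],
    "video_017": [0.2, 0.8, 0.5],
    "audio_042": [0.1, 0.2, 0.9],
}

query_vector = [0.85, 0.15, 0.25]  # features extracted from the example object
ranked = sorted(collection.items(),
                key=lambda item: cosine(query_vector, item[1]),
                reverse=True)
for object_id, vector in ranked:
    print(object_id, round(cosine(query_vector, vector), 3))
```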

    Using Search Term Positions for Determining Document Relevance

    The technological advancements in computer networks and the substantial reduction of their production costs have caused a massive explosion of digitally stored information. In particular, textual information is becoming increasingly available in electronic form. Finding text documents dealing with a certain topic is not a simple task. Users need tools to sift through non-relevant information and retrieve only pieces of information relevant to their needs. The traditional methods of information retrieval (IR) based on search term frequency have largely reached their limits, and novel ranking methods based on hyperlink information are not applicable to unlinked documents. Retrieving documents based on the positions of search terms in a document has the potential to yield improvements, because the other terms in the environment where a search term appears (i.e. its neighborhood) are taken into account. That is to say, the grammatical type, position and frequency of other words help to clarify and specify the meaning of a given search term. However, the additional analysis required makes position-based methods slower than methods based on term frequency and demands more storage to save the positions of terms. These drawbacks directly affect the performance of the most user-critical phase of the retrieval process, namely query evaluation, which explains the scarce use of positional information in contemporary retrieval systems. This thesis explores the possibility of extending traditional information retrieval systems with positional information in an efficient manner that permits optimizing retrieval performance by handling term positions at query evaluation time. To achieve this, several abstract representations of term positions are investigated that store and operate on positional data efficiently. In the Gauss model, descriptive statistics are used to estimate term position information, because they minimize outliers and irregularities in the data. The Fourier model represents positional information using Fourier series. In the Hilbert model, functional analysis methods are used to provide reliable term position estimations and simple mathematical operators to handle positional data. The proposed models are experimentally evaluated using standard resources of the IR research community (Text Retrieval Conference). All experiments demonstrate that the use of positional information can enhance the quality of search results, and the suggested models outperform state-of-the-art retrieval utilities. The term position models open new possibilities to analyze and handle textual data. For instance, document clustering and compression of positional data based on these models could be interesting topics for future research.
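    The Gauss model described above can be illustrated with a rough sketch: instead of storing every occurrence position of a term, only the mean and spread of its positions in each document are kept, and documents where two query terms have close position estimates are rewarded. The example document and scoring rule are illustrative, not the thesis's exact formulation.

```python
import statistics

def position_summary(tokens, term):
    """Summarise a term's positions in a document by mean and spread."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return None
    mean = statistics.mean(positions)
    spread = statistics.pstdev(positions) or 1.0  # avoid a zero spread
    return mean, spread

def proximity_score(tokens, term_a, term_b):
    """Reward documents whose estimated positions for two terms sit close together."""
    a, b = position_summary(tokens, term_a), position_summary(tokens, term_b)
    if a is None or b is None:
        return 0.0
    distance = abs(a[0] - b[0]) / (a[1] + b[1])
    return 1.0 / (1.0 + distance)

doc = "search terms near each other make search results better".split()
print(proximity_score(doc, "search", "results"))
print(proximity_score(doc, "search", "terms"))
```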

    A Machine Learning Approach for Plagiarism Detection

    Plagiarism detection is gaining increasing importance due to requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals that there are two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis develops two novel approaches to address both of these methods. Firstly, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), stylometry and Support Vector Machines (SVMs). The LSA application was fine-tuned to take in stylometric features (most common words) in order to characterise document authorship, as described in chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure and predict which author wrote the book being tested. The proposed method successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Secondly, the intrinsic detection method relies on the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The intrinsic method aims to generate a model of author “style” by revealing a set of characteristic features of authorship. The model’s generation procedure focuses on just one author as an attempt to summarise aspects of an author’s style in a definitive and clear-cut manner. The thesis also proposes a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) dataset, but divides it into training and test datasets in a novel manner. Both approaches have been evaluated using the well-known leave-one-out cross-validation method. Results indicated that by integrating deep analysis (LSA) and stylometric analysis, hidden changes can be identified whether or not a reference collection exists.
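    The most-common-words feature idea described above can be illustrated with a simplified, standard-library-only sketch: each text is represented by the relative frequencies of a fixed list of most common words (MCWs), and a disputed text is attributed to the author whose profile is closest. The thesis pairs such features with LSA and an SVM; this sketch substitutes a plain nearest-profile comparison, and the word list and texts are invented.

```python
from collections import Counter

MCWS = ["the", "of", "and", "to", "in"]  # assumed most-common-word list

def mcw_profile(text):
    """Relative frequency of each MCW in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in MCWS]

def distance(p, q):
    """Euclidean distance between two MCW profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

author_profiles = {
    "author_a": mcw_profile("the ship sailed to the port and the crew rested in the hold"),
    "author_b": mcw_profile("of all the letters none spoke of love and none of loss"),
}

disputed = mcw_profile("the storm drove the ship to the rocks and the men to prayer")
best = min(author_profiles, key=lambda a: distance(author_profiles[a], disputed))
print(best)  # the nearest MCW profile decides the attribution
```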

    A modular architecture for systematic text categorisation

    This work examines and attempts to overcome issues caused by the lack of formal standardisation when defining text categorisation techniques and detailing how they might be appropriately integrated with each other. Despite text categorisation’s long history, the concept of automation is relatively new, coinciding with the evolution of computing technology and the subsequent increase in the quantity and availability of electronic textual data. Nevertheless, insufficient descriptions of the diverse algorithms devised have led to an acknowledged ambiguity when trying to replicate methods accurately, which has made reliable comparative evaluations impossible. Existing interpretations of general data mining and text categorisation methodologies are analysed in the first half of the thesis, and common elements are extracted to create a distinct set of significant stages. Their possible interactions are logically determined and a unique universal architecture is generated that encapsulates all complexities and highlights the critical components. A variety of text-related algorithms are also comprehensively surveyed and grouped according to the stage they belong to, in order to demonstrate how they can be mapped. The second part reviews several open-source data mining applications, placing an emphasis on their ability to handle the proposed architecture, potential for expansion and text processing capabilities. Finding these inflexible and too elaborate to be readily adapted, designs for a novel framework are introduced that focus on rapid prototyping through lightweight customisations and reusable atomic components. As a consequence of the inadequacies of existing options, a rudimentary implementation is realised along with a selection of text categorisation modules. Finally, a series of experiments is conducted that validates the feasibility of the outlined methodology and the importance of its composition, whilst also establishing the practicality of the framework for research purposes. The simplicity of the experiments and the results gathered clearly indicate the potential benefits that can be gained when a formalised approach is utilised.
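    The modular, stage-based idea described above can be sketched loosely: each stage of the categorisation pipeline is a small, replaceable component, and the architecture is simply their explicit composition. The stage names and keyword-based classifier below are illustrative placeholders, not the thesis's framework.

```python
def tokenise(text):
    return text.lower().split()

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    return [t for t in tokens if t not in stopwords]

def classify(tokens, keyword_map):
    # Assign the category whose keyword list overlaps the tokens most.
    scores = {cat: len(set(tokens) & set(words)) for cat, words in keyword_map.items()}
    return max(scores, key=scores.get)

def pipeline(text, stages):
    """Run the input through each stage in order; stages are interchangeable."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

keyword_map = {"sport": ["match", "goal"], "finance": ["market", "shares"]}
stages = [tokenise, remove_stopwords, lambda toks: classify(toks, keyword_map)]
print(pipeline("The market rallied and shares rose", stages))  # -> finance
```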

    An automatic approach to weighted subject indexing – An empirical study in the biomedical domain

    Subject indexing is an intellectually intensive process that has many inherent uncertainties. Existing manual subject indexing systems generally produce binary outcomes for whether or not to assign an indexing term. This does not sufficiently reflect the extent to which the indexing terms are associated with the documents. On the other hand, the idea of probabilistic or weighted indexing was proposed a long time ago and has seen success in capturing uncertainties in the automatic indexing process. One hurdle to overcome in implementing weighted indexing in manual subject indexing systems is the practical burden it could add to the already intensive indexing process. This study proposes a method to automatically infer the associations between subject terms and documents through text mining. By uncovering the connections between MeSH descriptors and document text, we are able to derive the weights of the MeSH descriptors manually assigned to documents. Our initial results suggest that the inference method is feasible and promising. The study has practical implications for improving subject indexing practice and providing better support for information retrieval.
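    A minimal sketch of the inference idea described above: the weight of a manually assigned descriptor is derived from how strongly its entry terms are reflected in the document text. The descriptors, entry terms and weighting rule below are illustrative, not the study's exact method.

```python
from collections import Counter

# Hypothetical MeSH-like descriptors and their entry terms.
descriptor_entry_terms = {
    "Neoplasms": ["tumor", "tumors", "cancer", "neoplasm"],
    "Hypertension": ["hypertension", "blood", "pressure"],
}

def descriptor_weights(text, assigned_descriptors):
    """Derive weights for assigned descriptors from their coverage in the text."""
    tokens = Counter(text.lower().split())
    raw = {}
    for descriptor in assigned_descriptors:
        terms = descriptor_entry_terms.get(descriptor, [])
        raw[descriptor] = sum(tokens[t] for t in terms)
    total = sum(raw.values()) or 1
    # Normalise so the assigned descriptors share a weight budget of 1.0.
    return {d: count / total for d, count in raw.items()}

abstract = ("we studied tumor growth and cancer progression in patients "
            "with mild hypertension")
print(descriptor_weights(abstract, ["Neoplasms", "Hypertension"]))
```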