12 research outputs found

    The accessibility dimension for structured document retrieval

    Structured document retrieval aims at retrieving the document components that best satisfy a query, instead of merely retrieving pre-defined document units. This paper reports on an investigation of a tf-idf-acc approach, where tf and idf are the classical term frequency and inverse document frequency, and acc is a new parameter, called accessibility, that captures the structure of documents. The tf-idf-acc approach is defined using a probabilistic relational algebra. To investigate the retrieval quality and estimate the acc values, we developed a method that automatically constructs diverse test collections of structured documents from a standard test collection, with which experiments were carried out. The analysis of the experiments provides estimates of the acc values.

    Systematizing Web Search Through A Meta-Cognitive, Systems-Based, Information Structuring Model (McSIS)

    This paper proposes a meta-cognitive, systems-based, information structuring model (McSIS) to systematize online information search behavior, based on a literature review of information-seeking models. The General Systems Theory’s (GST) propositions serve as its framework. The model incorporates factors that influence information-seekers: the individual learning styles of Field Independence and Field Dependence (FI/FD); the Holist/Serialist (H/S) and Problem-Focused vs. Emotion-Focused (PF/EF) problem-solving approaches; and individual or personal domain knowledge. An example demonstrating the model is presented along with recommendations for educators and researchers.

    What is the influence of genre during the perception of structured text for retrieval and search?

    This thesis presents an investigation into the high value of structured text (or form) in the context of genre within Information Retrieval. In particular, how are these structured texts perceived and why are they not more heavily used within Information Retrieval & Search communities? The main motivation is to show the features in which people can exploit genre within Information Search & Retrieval, in particular, categorisation and search tasks. To do this, it was vital to record and analyse how and why this was done during typical tasks. The literature review highlighted two previous studies (Toms & Campbell 1999a; Watt 2009) which have reported pilot studies consisting of genre categorisation and information searching. Both studies and other findings within the literature review inspired the work contained within this thesis. Genre is notoriously hard to define, but a very useful framework of Purpose and Form, developed by Yates & Orlikowski (1992), was utilised to design two user studies for the research reported within the thesis. The two studies consisted of, first, a categorisation task (e-mails), and second, a set of six simulated situations in Wikipedia, both of which collected quantitative data from eye tracking experiments as well as qualitative user data. The results of both studies showed the extent to which the participants utilised the form features of the stimuli presented, in particular, how these were used, which ocular behaviours (skimming or scanning) and actual features were used, and which were the most important. 
The main contributions to research made by this thesis were, first, that the task-based user evaluations employing simulated search scenarios revealed how and why users make decisions while interacting with the textual features of structure and layout within a discourse community, and, second, that an extensive evaluation of the quantitative data revealed the features that were used by the participants in the user studies and the effects of the interpretation of genre in the search and categorisation process, as well as the perceptual processes used in the various communities. This will be of benefit for the redevelopment of information systems. As far as is known, this is the first detailed and systematic investigation into the types of features, value of form, perception of features, and layout of genre using eye tracking in online communities, such as Wikipedia.

    Formal concept matching and reinforcement learning in adaptive information retrieval

    The superiority of the human brain in information retrieval (IR) tasks seems to come firstly from its ability to read and understand the concepts, ideas or meanings central to documents, in order to reason out the usefulness of documents to information needs, and secondly from its ability to learn from experience and be adaptive to the environment. In this work we attempt to incorporate these properties into the development of an IR model to improve document retrieval. We investigate the applicability of concept lattices, which are based on the theory of Formal Concept Analysis (FCA), to the representation of documents. This allows the use of more elegant representation units, as opposed to keywords, in order to better capture concepts/ideas expressed in natural language text. We also investigate the use of a reinforcement learning strategy to learn and improve document representations, based on the information present in query statements and user relevance feedback. Features or concepts of each document/query, formulated using FCA, are weighted separately with respect to the documents they are in, and organised into separate concept lattices according to a subsumption relation. Furthermore, each concept lattice is encoded in a two-layer neural network structure known as a Bidirectional Associative Memory (BAM), for efficient manipulation of the concepts in the lattice representation. This avoids implementation drawbacks faced by other FCA-based approaches. Retrieval of a document for an information need is based on concept matching between concept lattice representations of a document and a query. The learning strategy works by making the similarity of relevant documents stronger and non-relevant documents weaker for each query, depending on the relevance judgements of the users on retrieved documents.
Our approach is radically different to existing FCA-based approaches in the following respects: concept formulation; weight assignment to object-attribute pairs; the representation of each document in a separate concept lattice; and encoding concept lattices in BAM structures. Furthermore, in contrast to the traditional relevance feedback mechanism, our learning strategy makes use of relevance feedback information to enhance document representations, thus making the document representations dynamic and adaptive to the user interactions. The results obtained on the CISI, CACM and ASLIB Cranfield collections are presented and compared with published results. In particular, the performance of the system is shown to improve significantly as the system learns from experience.
The School of Computing, University of Plymouth, UK
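As background for the abstract above, the following is a minimal sketch of what a formal concept is in standard Formal Concept Analysis: a pair (extent, intent) where the extent is exactly the set of objects sharing every attribute in the intent. It is not the thesis's own concept formulation, which the abstract says differs from existing FCA-based approaches. The function name `formal_concepts` and the toy context (text units as objects, terms as attributes) are hypothetical, and the brute-force enumeration is only suitable for very small contexts.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts of a small formal context.

    context maps each object to the set of attributes it has.
    Returns a set of (extent, intent) pairs of frozensets.
    Brute force over object subsets: exponential, for illustration only.
    """
    objects = list(context)
    all_attrs = set().union(*context.values()) if context else set()
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            # Intent: attributes shared by every object in the subset
            # (the empty subset shares all attributes).
            intent = set(all_attrs)
            for obj in subset:
                intent &= context[obj]
            # Extent: every object that has all attributes of the intent.
            extent = {obj for obj in objects if intent <= context[obj]}
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical toy context: two text units and the terms they contain.
toy_context = {
    "unit1": {"retrieval", "lattice"},
    "unit2": {"retrieval"},
}
toy_concepts = formal_concepts(toy_context)
```

Ordering the resulting concepts by inclusion of their extents gives the concept lattice; the subsumption relation mentioned in the abstract plays that ordering role.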

    Digital life stories: Semi-automatic (auto)biographies within lifelog collections

    Our life stories enable us to reflect upon and share our personal histories. Through emerging digital technologies the possibility of collecting life experiences digitally is increasingly feasible; consequently so is the potential to create a digital counterpart to our personal narratives. In this work, lifelogging tools are used to collect digital artifacts continuously and passively throughout our day. These include images, documents, emails and webpages accessed, text messages and mobile activity. This range of data when brought together is known as a lifelog. Given the complexity, volume and multimodal nature of such collections, it is clear that there are significant challenges to be addressed in order to achieve coherent and meaningful digital narratives of the events from our life histories. This work investigates the construction of personal digital narratives from lifelog collections. It examines the underlying questions, issues and challenges relating to the construction of personal digital narratives from lifelogs. Fundamentally, it addresses how to organize and transform data sampled from an individual’s day-to-day activities into a coherent narrative account. This enquiry is enabled by three 20-month long-term lifelogs collected by participants and produces a narrative system which enables the semi-automatic construction of digital stories from lifelog content. Inspired by probative studies conducted into current practices of curation, from which a set of fundamental requirements is established, this solution employs a 2-dimensional spatial framework for storytelling. It delivers integrated support for the structuring of lifelog content and its distillation into storyform through information retrieval approaches. We describe and contribute flexible algorithmic approaches to achieve both. Finally, this research inquiry yields qualitative and quantitative insights into such digital narratives and their generation, composition and construction.
The opportunities for such personal narrative accounts to enable recollection, reminiscence and reflection with the collection owners are established, and their benefit in sharing past personal experiences is outlined. Finally, in a novel investigation with motivated third parties, we demonstrate the opportunities such narrative accounts may have beyond the scope of the collection owner in personal, societal and cultural explorations, in artistic endeavours and as a generational heirloom.

    Evaluating sources of implicit feedback for web search

    This dissertation investigated several important issues in using implicit feedback techniques to assist searchers with difficulties in formulating effective search strategies. The study focused on examining the relationship between types of behavioral evidence that can be captured from Web searches and searchers’ interests. Web search cases which involved underspecification of information needs at the beginning and modification of search strategies during the search process were collected and reviewed by human analysts (reference librarians) who tried to infer searchers’ interests from behavioral traces. Analysts’ rationales for making the inferences were elicited and analyzed with the focus on understanding what evidence was used to support the inferences and how it was used. The analysis revealed the complexities and nuances in using behavioral evidence for implicit feedback and led to the proposal of an implicit feedback model for Web search that bridged previous studies on behavioral evidence and implicit feedback measures. A new level of analysis termed an analytical lens emerged from the data and provides a road map for future research on this topic. The study also put forward design recommendations for implicit feedback systems based on the signals that analysts identified and the rules that they used in making inferences.

    Content-based retrieval of digital music

    One of the advantages of having information in digital form is that it lends itself readily to content-based access. This applies to information stored in any medium, though content searching through information stored in a structured database or as text is more developed than content searching through information stored in other media such as music. In practice, the most common way to index and provide retrieval on digital music is to use its metadata, such as title, performer, etc., as has been done in Napster. My research has led to the development of a digital music information retrieval system called Ceolaire, which can index monophonic music files. Music files are analysed for notes on the equal-tempered scale, where note changes are observed and recorded as being up (U), down (D) or the same (S) relative to the previous note. These note changes are then indexed in a search engine. At query time, notes are generated by a user using a web-based interface. These notes form the query for the retrieval engine. A user is presented with a ranked list of highly scored documents. This thesis explores the building and evaluation of the Ceolaire system.
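The up/down/same encoding described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Ceolaire implementation: pitches are assumed to be given as MIDI note numbers, and the abstract does not say how the note changes are indexed, so the fixed-length n-gram index and the overlap-count ranking below are hypothetical choices, as are all function names.

```python
from collections import defaultdict


def contour(pitches):
    """Encode a monophonic pitch sequence as U/D/S note changes."""
    steps = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            steps.append("U")   # note went up
        elif cur < prev:
            steps.append("D")   # note went down
        else:
            steps.append("S")   # same note repeated
    return "".join(steps)


def ngrams(code, n=3):
    """Split a contour string into overlapping n-grams (assumed index unit)."""
    return [code[i:i + n] for i in range(len(code) - n + 1)]


def build_index(tunes, n=3):
    """Map each contour n-gram to the tunes containing it, with counts."""
    index = defaultdict(lambda: defaultdict(int))
    for name, pitches in tunes.items():
        for gram in ngrams(contour(pitches), n):
            index[gram][name] += 1
    return index


def search(index, query_pitches, n=3):
    """Rank tunes by the number of n-gram matches with the query contour."""
    scores = defaultdict(int)
    for gram in ngrams(contour(query_pitches), n):
        for name, count in index.get(gram, {}).items():
            scores[name] += count
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because only the direction of each note change is kept, a query hummed or entered in a different key still produces the same contour string as the stored tune.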

    De nouveaux facteurs pour l'exploitation de la sémantique d'un texte en recherche d'information

    The work presented in this thesis falls within the field of information retrieval. More precisely, we propose new factors, "centrality" and "conceptual frequency", which in our view better characterize the semantic dimension of text content, going beyond classical indexing methods based exclusively on statistics. These factors should take advantage of the identification of the different types of relations (such as is-a-part-of, related-to, synonymy, domain, etc.) that exist between the words of a text. The approach we propose for computing the values of our factors has three steps: (1) extraction of the WordNet concepts associated with the terms of the document, followed by disambiguation of their senses; (2) grouping of the concepts into clusters of concepts (these steps build the semantic view of the documents); (3) within each cluster, each term has a degree of "centrality", a function of the number of words in the cluster to which it is directly related, and a "conceptual frequency", estimated as the sum of the frequencies of those words. First, we study potential methods, based on the proposed factors, for extracting semantic views of text content. The objective is to build graph/hierarchy structures that offer a view of the semantic content of documents. These views are then elaborated from our new factors, but also from occurrence frequencies and from the importance of words (in particular in terms of their specificity).
The relative weight of the partial views, and the frequency and specificity of their components, are all indications that should make it possible to identify and build hierarchical subsets of words (present in the text or semantically associated with words of the text) and to reflect the concepts present in the text content. Obtaining a better representation of the semantic content of texts will help to better retrieve the texts relevant to a given query, and to give a synthesized view of the content of the texts proposed to the user in response to the query. Second, we propose a concept disambiguation technique based on centrality. The sense of a term is ambiguous; it depends on the context in which the term is used. In our proposal, we use the WordNet ontology, which is precise in its coverage of term senses and in which a term can be attached to several concepts. The proposed method consists in finding the best WordNet concept to represent the sense of the term as used in the text. The concept chosen is the one that has the most relations with the terms of the document, in other words, the one with the maximum centrality value. Using a disambiguation method is an unavoidable step in conceptual indexing; it allows the semantic content of a document to be better represented. Finally, we use our factors in information retrieval as new factors for measuring the relevance of a document with respect to a query (ad hoc IR task). Our semantic factors are of interest in IR because they let us estimate a degree of relatedness between the terms of a query and those of a document, regardless of whether the query terms are present in the document. In this setting, we propose a new weighting function based on centrality, and we also integrate the new factors into known functions.
In the various experiments carried out, we show that integrating our semantic factors improves precision in an information retrieval search engine. This is a promising direction for more targeted and more effective search.
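Step (3) of the approach described in this abstract can be illustrated with a short sketch. It assumes the simplest reading of the text: centrality is the raw count of cluster words directly related to a term (the abstract only says it is "a function of" that count), and conceptual frequency is the sum of the frequencies of those related words, excluding the term itself. The function name, the data layout and the toy cluster are hypothetical; in the thesis the relations come from WordNet.

```python
def semantic_factors(cluster, relations, freq):
    """Compute (centrality, conceptual frequency) for each term of a cluster.

    cluster:   set of words grouped into one concept cluster
    relations: set of frozenset({a, b}) pairs, one per direct relation
    freq:      dict mapping each word to its occurrence count in the text
    """
    factors = {}
    for term in sorted(cluster):
        # Words of the same cluster that are directly related to the term.
        neighbours = [
            word for word in sorted(cluster)
            if word != term and frozenset((term, word)) in relations
        ]
        centrality = len(neighbours)
        conceptual_frequency = sum(freq.get(word, 0) for word in neighbours)
        factors[term] = (centrality, conceptual_frequency)
    return factors


# Hypothetical toy cluster: "car" is related to both other words.
toy_cluster = {"car", "wheel", "engine"}
toy_relations = {frozenset(("car", "wheel")), frozenset(("car", "engine"))}
toy_freq = {"car": 3, "wheel": 2, "engine": 1}
toy_factors = semantic_factors(toy_cluster, toy_relations, toy_freq)
```

Under the same reading, the centrality-based disambiguation described in the abstract would compute this count for each candidate concept of an ambiguous term and keep the candidate with the largest value.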

    Accès à l'information biomédicale : vers une approche d'indexation et de recherche d'information conceptuelle basée sur la fusion de ressources termino-ontologiques

    Information Retrieval (IR) is a scientific field that aims to provide solutions for selecting, from a corpus of documents, the information that is relevant to a user who has expressed a query. In the context of biomedical IR, the corpora cover different sources of information in the domain: patient records, medical best-practice guidelines, scientific literature of the medical field, etc. The information needs may concern different profiles: medical experts, patients and their families, novice users, etc. Several challenges are specific to biomedical IR: the "specialized" representation of documents, based on the use of the terminological resources of the domain; the handling of the synonyms, acronyms and abbreviations that are widely used in the domain; and access to information guided by the context of the information need and by the user profiles.
Our thesis work falls within the general field of biomedical IR and deals with the challenges of representing biomedical information and of accessing it. Concerning the representation of information, we propose document indexing techniques based on: 1) the recognition of termino-ontological concepts: this recognition amounts to an approximate lookup of relevant candidate concepts associated with a content viewed as a bag of words. The technique exploits two sources of evidence: (a) the content-based similarity between concepts and documents and (b) the semantic similarity between them; 2) the disambiguation of entry terms denoting recognized concepts by exploiting the polyhierarchical structure of a medical thesaurus (MeSH - Medical Subject Headings). More specifically, the domains of each concept are exploited to compute the semantic similarity between ambiguous terms in documents, and the most appropriate domain is detected and associated with each term denoting a particular concept; 3) the exploitation of different termino-ontological resources in an attempt to better cover the semantics of document contents.
Concerning information access, we propose a document-query matching method based on the combination of document and query expansion techniques. This combination is guided by the context of the information need on the one hand and by the semantic context of the document on the other. Our analysis is essentially based on the study of the factors related to document and query expansion that could have an impact on IR performance: the distribution of concepts in termino-ontological resources, fusion techniques for concepts extracted from multiple terminologies, concept weighting models, etc. All of our contributions, in terms of indexing and information access techniques, were evaluated experimentally on test collections dedicated to medical information retrieval, either from the point of view of the task, such as TREC Medical track, CLEF Image and Medical case, or test collections such as TREC Genomics.