54 research outputs found

    Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification

    This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve the classification of text documents into predefined categories. We evaluate different measures of similarity: five derived from the citation structure of the collection and three derived from the structural content, and we determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our empirical experiments using documents from the ACM digital library and the ACM classification scheme show that we can discover similarity functions that work better than any evidence in isolation and whose combined performance through a simple majority voting is comparable to that of Support Vector Machine classifiers.
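    As a rough illustration of the fusion step described above, the sketch below combines the predictions of several similarity-based nearest-neighbour classifiers by simple majority voting. The similarity functions, data structures, and toy data are assumptions for illustration only, not the paper's GP-discovered functions.

```python
# Minimal sketch: majority voting over similarity-based nearest-neighbour
# classifiers, one per evidence (citation-based or content-based similarity).
from collections import Counter

def knn_label(test_doc, train_docs, train_labels, sim, k=5):
    """Label predicted by one similarity function via k-nearest neighbours."""
    ranked = sorted(range(len(train_docs)),
                    key=lambda i: sim(test_doc, train_docs[i]),
                    reverse=True)[:k]
    votes = Counter(train_labels[i] for i in ranked)
    return votes.most_common(1)[0][0]

def majority_vote(test_doc, train_docs, train_labels, evidences, k=5):
    """Fuse the per-evidence predictions through simple majority voting."""
    predictions = [knn_label(test_doc, train_docs, train_labels, sim, k)
                   for sim in evidences]
    return Counter(predictions).most_common(1)[0][0]

# Toy usage: one content-based similarity; citation-based functions
# (e.g. co-citation, bibliographic coupling) would be appended to `evidences`.
train_docs = [{"terms": {"database", "query"}}, {"terms": {"ranking", "index"}}]
train_labels = ["H.2 Database Management", "H.3 Information Storage and Retrieval"]
content_sim = lambda a, b: len(a["terms"] & b["terms"])
evidences = [content_sim]
print(majority_vote({"terms": {"ranking", "query", "index"}},
                    train_docs, train_labels, evidences, k=1))
```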

    Advance of the Access Methods

    The goal of this paper is to outline the advances in access methods over the last ten years, as well as to review all methods available in the accessible bibliography.

    Large Language Models for Information Retrieval: A Survey

    As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions within this expanding field.
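    The pipeline stages the survey enumerates (query rewriter, retriever, reranker, reader) can be pictured with the minimal sketch below. Every component is a placeholder callable standing in for an LLM or retriever; none of the names correspond to a specific model or API.

```python
# Minimal sketch of an LLM-augmented retrieval pipeline:
# rewrite -> retrieve -> rerank -> read.
from typing import Callable, List

def llm_ir_pipeline(query: str,
                    corpus: List[str],
                    llm_rewrite: Callable[[str], str],
                    retrieve: Callable[[str, List[str]], List[str]],
                    llm_score: Callable[[str, str], float],
                    llm_answer: Callable[[str, List[str]], str],
                    top_k: int = 3) -> str:
    rewritten = llm_rewrite(query)                    # query rewriter
    candidates = retrieve(rewritten, corpus)          # sparse or dense retriever
    reranked = sorted(candidates,                     # LLM-based reranker
                      key=lambda doc: llm_score(rewritten, doc),
                      reverse=True)
    return llm_answer(query, reranked[:top_k])        # reader grounds the answer

# Toy usage with trivial stand-ins for every component.
corpus = ["BM25 is a term-based ranking function.",
          "Large language models can rerank retrieved documents."]
print(llm_ir_pipeline("how do LLMs help retrieval?", corpus,
                      llm_rewrite=lambda q: q,
                      retrieve=lambda q, docs: list(docs),
                      llm_score=lambda q, d: float("language models" in d),
                      llm_answer=lambda q, docs: docs[0]))
```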

    Recommend Touring Routes to Travelers According to Their Sequential Wandering Behaviours


    Expert agreement and content based reranking in a meta search environment using Mearf


    Leveraging Semantic Annotations for Event-focused Search & Summarization

    Today, in this Big Data era, overwhelming amounts of textual information spread across different sources with a high degree of redundancy have made it hard for a consumer to look back on past events. A plausible solution is to link semantically similar information contained across the different sources to impose a structure, thereby providing multiple access paths to relevant information. Keeping this larger goal in view, this work uses Wikipedia and online news articles as two prominent yet disparate information sources to address the following three problems:
    ‱ We address a linking problem to connect Wikipedia excerpts to news articles by casting it as an IR task. Our novel approach integrates time, geolocations, and entities with text to identify relevant documents that can be linked to a given excerpt.
    ‱ We address an unsupervised extractive multi-document summarization task to generate a fixed-length event digest that facilitates efficient consumption of the information contained within a large set of documents. Our novel approach proposes an ILP for global inference across text, time, geolocations, and entities associated with the event.
    ‱ To estimate the temporal focus of short event descriptions, we present a semi-supervised approach that leverages redundancy within a longitudinal news collection to estimate accurate probabilistic time models.
    Extensive experimental evaluations demonstrate the effectiveness and viability of our proposed approaches towards achieving the larger goal.
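    A minimal sketch of the linking idea follows, assuming a simple weighted combination of text, time, geolocation, and entity overlap scores. The feature extractors and weights are illustrative placeholders, not the thesis' actual retrieval model.

```python
# Minimal sketch: rank candidate news articles for a Wikipedia excerpt by
# combining four evidence dimensions (text, time, geolocations, entities).
from math import exp

def time_score(excerpt_year, article_year, decay=0.5):
    """Exponentially decaying agreement between the two temporal focuses."""
    return exp(-decay * abs(excerpt_year - article_year))

def overlap(a, b):
    """Jaccard overlap between two sets (terms, geolocations, or entities)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def link_score(excerpt, article, w_text=0.4, w_time=0.2, w_geo=0.2, w_ent=0.2):
    """Weighted combination of the four evidence dimensions."""
    return (w_text * overlap(excerpt["terms"], article["terms"])
            + w_time * time_score(excerpt["year"], article["year"])
            + w_geo * overlap(excerpt["geo"], article["geo"])
            + w_ent * overlap(excerpt["entities"], article["entities"]))

def rank_articles(excerpt, articles):
    """Rank candidate news articles for a given Wikipedia excerpt."""
    return sorted(articles, key=lambda art: link_score(excerpt, art), reverse=True)
```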

    Tailored deep learning techniques for information retrieval

    Information retrieval aims to find documents relevant to a query. Many traditional information retrieval models have been proposed: they either encode the query and the documents as vectors in term space and estimate relevance by computing the similarity of the two vectors, or they estimate relevance with probabilistic models. However, for vector space models, encoding queries and documents in term space has its limitations: for example, it is difficult to identify document terms that carry meanings similar to the exact query terms. It is also difficult to represent text at different levels of abstraction so as to match the information need expressed in the query. With the rapid development of deep learning techniques, it is possible to learn useful representations through a series of neural layers, which paves the way to better representations in a latent dense space rather than in term space; this can help match terms that are not exact matches but carry similar meanings. It also allows us to build different layers of representation for the query and the document, enabling matching between them at different levels of abstraction, which may better serve the information needs of different queries. Finally, deep learning techniques also allow learning a better ranking function. In this thesis, we explore several deep learning techniques to address these problems. First, we study the effectiveness of building multiple abstraction layers between the query and the document, for both representation-based and interaction-based models. Then we propose a model that cross-matches query and document representations at different layers to better serve the need for term-phrase matching. Finally, we propose an integrated learning framework for the ranking function and the query and document representations. Experiments on public datasets show that the methods proposed in this thesis are more effective than existing ones.
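    The cross-matching idea can be sketched as below. The encoder simulates multiple abstraction layers with fixed random projections so the example runs standalone; it is only an illustration of the matching scheme, not the thesis' architecture.

```python
# Minimal sketch: match every query layer against every document layer
# (term-phrase style cross-matching) and aggregate the similarities.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 64
P1 = rng.standard_normal((VOCAB, DIM))   # term layer -> first abstraction
P2 = rng.standard_normal((DIM, DIM))     # second abstraction layer
P3 = rng.standard_normal((DIM, DIM))     # third abstraction layer

def encode_layers(term_ids):
    """Return one representation per abstraction layer for a bag of term ids."""
    bow = np.zeros(VOCAB)
    bow[list(term_ids)] = 1.0
    h1 = np.tanh(bow @ P1)
    h2 = np.tanh(h1 @ P2)
    h3 = np.tanh(h2 @ P3)
    return [h1, h2, h3]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def cross_match_score(query_ids, doc_ids):
    """Cross-match all layer pairs; a learned model would weight these pairs."""
    q_layers = encode_layers(query_ids)
    d_layers = encode_layers(doc_ids)
    return max(cosine(q, d) for q in q_layers for d in d_layers)

print(cross_match_score({1, 42, 87}, {1, 42, 87, 300, 512}))
```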

    A new filtering index for fast processing of SPARQL queries

    Title from PDF of title page, viewed on October 21, 2013. Vita. Thesis advisor: Praveen Rao. Includes bibliographic references (pages 78-82). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2013.
    The Resource Description Framework (RDF) has become a popular data model for representing data on the Web. Using RDF, any assertion can be represented as a (subject, predicate, object) triple. Essentially, RDF datasets can be viewed as directed, labeled graphs. Queries on RDF data are written using the SPARQL query language and contain basic graph patterns (BGPs). We present a new filtering index and query processing technique for processing large BGPs in SPARQL queries. Our approach, called RIS, treats RDF graphs as "first-class citizens." Unlike previous scalable approaches that store RDF data as triples in an RDBMS and process SPARQL queries by executing appropriate SQL queries, RIS aims to speed up query processing by reducing the processing cost of join operations. In RIS, RDF graphs are mapped into signatures, which are multisets. These signatures are grouped based on a similarity metric and indexed using Counting Bloom Filters. During query processing, the Counting Bloom Filters are checked to filter out non-matches, and finally the candidates are verified using Apache Jena. The filtering step prunes away a large portion of the dataset and results in faster processing of queries. We have conducted an in-depth performance evaluation using the Lehigh University Benchmark (LUBM) dataset and SPARQL queries containing large BGPs. We compared RIS with RDF-3X, a state-of-the-art scalable RDF querying engine that uses an RDBMS. RIS can significantly outperform RDF-3X in terms of total execution time for the tested dataset and queries.
    Contents: Introduction -- Motivation and related work -- Background -- Bloom filters and Bloom counters -- System architecture -- Signature tree generation -- Querying the signature tree -- Evaluation -- Experiments -- Conclusion
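    A minimal sketch of the filtering step follows, assuming graph signatures are multisets summarized by Counting Bloom Filters. The signature elements and parameters are illustrative, and the final verification against the actual graphs (done with Apache Jena in RIS) is omitted.

```python
# Minimal sketch: a Counting Bloom Filter used to prune candidate graph
# signatures that cannot contain the query signature.
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter over multiset elements."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item, count=1):
        for pos in self._positions(item):
            self.counters[pos] += count

    def may_contain(self, query_filter):
        """False means the query signature cannot be a sub-multiset of this
        signature, so the candidate is pruned; True still needs verification."""
        return all(q <= c for q, c in zip(query_filter.counters, self.counters))

# Toy usage: index one data-graph signature, then filter with a query signature.
data_cbf = CountingBloomFilter()
data_cbf.add("advisor->Professor", count=2)      # illustrative signature elements
data_cbf.add("takesCourse->GraduateCourse")

query_cbf = CountingBloomFilter()
query_cbf.add("advisor->Professor")

print(data_cbf.may_contain(query_cbf))           # True: keep candidate, verify later
```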
