
    Retrieve-Cluster-Summarize: An Alternative to End-to-End Training for Query-specific Article Generation

    Query-specific article generation is the task of generating, given a search query, a single article that gives an overview of the query's topic. We envision such articles as an alternative to presenting a ranking of search results. While generative Large Language Models (LLMs) such as ChatGPT also address this task, they are known to hallucinate new information, and their models are proprietary, hard to analyze, and hard to control. Some generative LLMs provide supporting references, yet these are often unrelated to the generated content. As an alternative, we propose to study article generation systems that integrate document retrieval, query-specific clustering, and summarization. By design, such systems can provide actual citations as provenance for their generated text. In particular, we contribute an evaluation framework that allows each of these three components to be trained and evaluated separately before combining them into one system. We experimentally demonstrate that a system composed of the best-performing individual components also obtains the best overall system quality in terms of F1. Comment: 5 pages, 1 figure
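    The retrieve-cluster-summarize pipeline described above can be sketched in a few lines. This is a toy stand-in, not the paper's system: the lexical-overlap retriever, most-frequent-word clustering, and first-sentence summarizer are all hypothetical simplifications, but the sketch shows how each stage stays separately testable and how every summary sentence keeps a citation back to its source document.

```python
from collections import Counter

def retrieve(query, corpus, k=3):
    """Rank documents by term overlap with the query (toy stand-in for a real retriever)."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:k]

def cluster(docs):
    """Toy query-specific clustering: group docs by their most frequent content word."""
    clusters = {}
    for d in docs:
        counts = Counter(w for w in d.lower().split() if len(w) > 3)
        key = counts.most_common(1)[0][0] if counts else "misc"
        clusters.setdefault(key, []).append(d)
    return clusters

def summarize(cluster_docs):
    """Extractive stand-in summarizer: first sentence of each doc, with a provenance index."""
    return [(d.split(".")[0], i) for i, d in enumerate(cluster_docs)]

def generate_article(query, corpus):
    """Compose the three stages; each (topic, summary) pair carries citable provenance."""
    return [(topic, summarize(docs))
            for topic, docs in cluster(retrieve(query, corpus)).items()]
```

Because each stage is an ordinary function, the framework's separate-evaluation idea falls out naturally: swap any one stage for a stronger model and re-measure without retraining the others.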

    Search Result Diversification in Short Text Streams

    We consider the problem of search result diversification for streams of short texts. Diversifying search results in short text streams is more challenging than in the case of long documents, as it is difficult to capture the latent topics of short documents. To capture the changes in topics and the probabilities of documents for a given query at a specific time in a short text stream, we propose a dynamic Dirichlet multinomial mixture topic model, called D2M3, together with a Gibbs sampling algorithm for inference. We also propose a streaming diversification algorithm, SDA, that integrates the information captured by D2M3 with our proposed modified version of the PM-2 (Proportionality-based diversification Method, second version) diversification algorithm. We conduct experiments on a Twitter dataset and find that SDA statistically significantly outperforms state-of-the-art non-streaming retrieval methods, plain streaming retrieval methods, as well as streaming diversification methods that use other dynamic topic models.
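    The PM-2 scheme that SDA builds on can be illustrated with a minimal sketch. This is not the authors' modified algorithm: it is the standard PM-2 idea (Sainte-LaguĂ« quotients deciding which topic is most under-served, then picking the document that best covers that topic), with hand-made topic distributions standing in for what D2M3 would estimate.

```python
def pm2(doc_topic, topic_weights, k, lam=0.5):
    """PM-2-style proportional diversification (a sketch, not the paper's modified SDA).
    doc_topic: per-document dicts mapping topic -> P(topic | doc)."""
    seats = {t: 0.0 for t in topic_weights}          # fractional "seats" held per topic
    selected, remaining = [], set(range(len(doc_topic)))
    for _ in range(min(k, len(doc_topic))):
        # Sainte-Lague quotient: topics under-represented so far get priority
        qt = {t: w / (2 * seats[t] + 1) for t, w in topic_weights.items()}
        star = max(qt, key=qt.get)                   # topic to serve at this position
        def score(d):
            p = doc_topic[d]
            return (lam * qt[star] * p.get(star, 0.0)
                    + (1 - lam) * sum(q * p.get(t, 0.0)
                                      for t, q in qt.items() if t != star))
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
        total = sum(doc_topic[best].values()) or 1.0
        for t, p in doc_topic[best].items():         # update seats proportionally
            if t in seats:
                seats[t] += p / total
    return selected
```

With two equally weighted topics, the sketch alternates between them instead of returning two near-duplicate documents about the dominant topic, which is exactly the proportionality property the abstract appeals to.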

    Document Representation for Clustering of Scientific Abstracts

    The key issue of the present paper is the clustering of narrow-domain short texts, such as scientific abstracts. The work is based on observations made while improving the performance of a key-phrase extraction algorithm. An extended stop-words list, built automatically for the purposes of key-phrase extraction, made possible a considerable quality enhancement of the phrases extracted from scientific publications. A description of the procedure for creating this stop-words list is given. The main objective is to investigate whether the performance and/or speed of clustering can be increased by using the above-mentioned stop-words list as well as information about the parts of speech of lexemes. In the latter case, the vocabulary used for document representation contains not all the words that occur in the collection, but only the nouns and adjectives, or sequences of them, encountered in the documents. Two baseline clustering algorithms are applied: k-means and hierarchical clustering (the average agglomerative method). The results show that the use of an extended stop-words list and an adjective-noun document representation makes it possible to improve both the performance and the speed of k-means clustering, whereas for the average agglomerative method a decline in quality may be observed. It is also shown that using adjective-noun sequences for document representation lowers clustering quality for both algorithms and is justified only when a considerable reduction of feature-space dimensionality is necessary.
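    The representation pipeline the abstract describes can be sketched end to end. Everything here is a toy stand-in: the extended stop-word list and the part-of-speech lookup are hand-made (the paper builds the list automatically and would use a real tagger), and the k-means is a minimal textbook implementation rather than the authors' setup.

```python
import random

# Hand-made stand-ins (hypothetical); the paper mines domain stop-words automatically.
EXTENDED_STOPWORDS = {"the", "a", "of", "in", "is", "we", "paper", "method", "results"}
POS = {"spectral": "ADJ", "graph": "NOUN", "clustering": "NOUN", "deep": "ADJ",
       "neural": "ADJ", "network": "NOUN", "image": "NOUN",
       "segmentation": "NOUN", "fast": "ADJ"}

def represent(doc, nouns_and_adjectives_only=True):
    """Filter out stop-words; optionally keep only nouns and adjectives."""
    toks = [w for w in doc.lower().split() if w not in EXTENDED_STOPWORDS]
    if nouns_and_adjectives_only:
        toks = [w for w in toks if POS.get(w) in ("NOUN", "ADJ")]
    return toks

def vectorize(docs):
    """Bag-of-words count vectors over the (now much smaller) vocabulary."""
    vocab = sorted({w for d in docs for w in d})
    return [[d.count(w) for w in vocab] for d in docs], vocab

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(X, k)
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((x - m) ** 2 for x, m in zip(v, centers[c])))
                  for v in X]
        for c in range(k):
            members = [v for v, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

The point of the noun-adjective filter is visible in `vectorize`: dropping other parts of speech shrinks the vocabulary, which is exactly the feature-space reduction the abstract weighs against the loss of clustering quality.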

    Analyzing the Ambiguity of User Queries by Thematic Categorization

    In this article, we seek to identify the nature of the ambiguity of user queries submitted to a news search engine, 2424actu.fr, using a categorization task. We first review the different forms of query ambiguity already described in NLP work, confronting the lexicographic view of ambiguity with the one described by classification techniques applied to information retrieval. We then apply a thematic categorization method to explore query ambiguity; this allows us to conduct a semantic analysis of these queries that integrates the temporal dimension specific to the news context. We propose a typology of ambiguity phenomena based on this semantic analysis. Finally, we compare exploration by categorization against a resource such as Wikipedia, concretely showing the divergences between the two approaches.
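    The core move of exploring ambiguity through categorization can be sketched very simply: score a query against several thematic categories and flag it as ambiguous when more than one category fires. The category term lists below are hypothetical toys, not the article's actual categories or method.

```python
# Toy category profiles (hypothetical); a real system would learn them from a news corpus.
CATEGORY_TERMS = {
    "sports": {"match", "league", "goal", "jaguar"},        # e.g. a team nickname
    "technology": {"software", "release", "jaguar"},        # e.g. a product name
    "politics": {"election", "minister", "vote"},
}

def categorize(query, threshold=1):
    """Return the plausible categories for a query and an ambiguity flag."""
    terms = set(query.lower().split())
    scores = {c: len(terms & vocab) for c, vocab in CATEGORY_TERMS.items()}
    hits = [c for c, s in scores.items() if s >= threshold]
    # Multiple plausible categories -> thematically ambiguous query
    return hits, len(hits) > 1
```

A classic ambiguous query like "jaguar" lands in two categories at once, while "election vote" resolves to a single theme; the temporal dimension discussed in the article would enter by re-weighting the category profiles over time.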

    Automatic population of knowledge bases with multimodal data about named entities

    Knowledge bases are of great importance for Web search, recommendations, and many Information Retrieval tasks. However, maintaining them for less popular entities is often a bottleneck. Typically, such entities have limited textual coverage and only a few ontological facts. Moreover, these entities are not well populated with multimodal data, such as images, videos, or audio recordings. The goals in this thesis are (1) to populate a given knowledge base with multimodal data about entities, such as images or audio recordings, and (2) to ease the task of maintaining and expanding the textual knowledge about a given entity by recommending valuable text excerpts to the contributors of knowledge bases. The thesis makes three main contributions. The first two contributions concentrate on finding images of named entities with high precision, high recall, and high visual diversity. Our main focus is less popular entities, for which image search engines fail to retrieve good results. Our methods utilize background knowledge about the entity, such as ontological facts or a short description, and a visual-based image similarity to rank and diversify a set of candidate images. Our third contribution is an approach for extracting text contents related to a given entity. It leverages a language-model-based similarity between a short description of the entity and the text sources, and solves a budget-constrained optimization program without any assumptions on the text structure. Moreover, our approach is also able to reliably extract entity-related audio excerpts from news podcasts. We derive the time boundaries from the usually very noisy audio transcriptions.
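    The third contribution, selecting entity-related text under a length budget, can be sketched with a unigram language model and a greedy selector. Both pieces are simplifications labeled as such: the thesis solves a budget-constrained optimization program, whereas this sketch uses a plain score-then-greedy heuristic, and the add-one-smoothed unigram model is the simplest possible stand-in for its language-model similarity.

```python
import math
from collections import Counter

def lm_similarity(description, excerpt):
    """Average log-likelihood of the excerpt under a unigram model of the
    entity description, with add-one smoothing (a toy stand-in)."""
    desc = Counter(description.lower().split())
    total, vocab = sum(desc.values()), len(desc) + 1
    toks = excerpt.lower().split()
    return sum(math.log((desc[w] + 1) / (total + vocab)) for w in toks) / max(len(toks), 1)

def select_excerpts(description, excerpts, budget):
    """Greedy budget-constrained selection (heuristic sketch; the thesis
    instead solves an optimization program with the same budget constraint)."""
    ranked = sorted(excerpts, key=lambda e: -lm_similarity(description, e))
    chosen, used = [], 0
    for e in ranked:
        cost = len(e.split())       # budget measured in words here
        if used + cost <= budget:
            chosen.append(e)
            used += cost
    return chosen
```

For the audio case described above, the same selector would run over transcript segments, with the noisy transcription supplying both the text to score and the time boundaries of each candidate excerpt.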