
    Question-based Text Summarization

    In the modern information age, finding the right information at the right time is an art (and a science). However, the abundance of information makes it difficult for people to digest it and make informed choices. In this thesis, we use text summarization to help people who want to quickly capture the main idea of a piece of information before they read the details. In contrast with existing work, which mainly uses declarative sentences to summarize a text document, we aim to use a few questions as a summary. In this way, people learn which questions a given text document can address, and they may read further if they have similar questions in mind. A question-based summary needs to satisfy three goals: relevancy, answerability, and diversity. Relevancy measures whether a few questions can cover the main points discussed in a text document; answerability measures whether answers to the questions are included in the text document; and diversity measures whether the questions carry redundant information. To achieve the three goals, we design a two-stage approach that consists of question selection and question diversification. The question selection component aims to find a set of candidate questions that are relevant to a text document, which in turn can be treated as answers to the questions. Specifically, we explore two lines of approaches that have been developed for traditional text summarization tasks, extractive approaches and abstractive approaches, to achieve the goals of relevancy and answerability, respectively. The question diversification component is designed to re-rank the questions with the goal of rewarding diversity in the final question-based summary. Evaluation on product review summarization tasks for two product categories shows that the proposed approach is effective for discovering meaningful questions that are representative of individual reviews.
This thesis opens up a new direction at the intersection of information retrieval and natural language processing. Although the evaluation is limited to the product review domain, the thesis provides a general solution for question selection for many interesting applications and discusses the possibility of extending the problem to other domain-specific question-based text summarization tasks. Ph.D., Information Science -- Drexel University, 201
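The abstract does not say how the diversification stage re-ranks questions. As a minimal sketch, assuming a greedy maximal marginal relevance (MMR) style re-ranker stands in for that stage, the idea of rewarding diversity can be illustrated as follows. The function names, the word-overlap similarity, the relevance scores, and the `trade_off` parameter are all hypothetical and are not taken from the thesis.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two questions (an assumed stand-in)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def diversify(
    candidates: Sequence[str],
    relevance: Sequence[float],
    k: int,
    trade_off: float = 0.7,
    sim: Callable[[str, str], float] = jaccard,
) -> List[str]:
    """Greedily pick k questions, trading relevance against redundancy.

    trade_off = 1.0 ranks purely by relevance; lower values penalise
    questions that resemble the ones already selected.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy is the highest similarity to any selected question.
            redundancy = max(
                (sim(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance[i] - (1.0 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

With two near-duplicate battery questions and one screen question as candidates, a pure relevance ranking would return both battery questions, while this sketch swaps the second one for the screen question once the redundancy penalty is applied.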

    Recuperação multimodal e interativa de informação orientada por diversidade (Diversity-oriented multimodal and interactive information retrieval)

    Supervisor: Ricardo da Silva Torres. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Information retrieval methods, especially for multimedia data, have evolved towards the integration of multiple sources of evidence when analyzing the relevance of items for a given user search task.
In this context, to attenuate the semantic gap between low-level features extracted from the content of digital objects and high-level semantic concepts (objects, categories, etc.), and to make systems adaptive to different user needs, interactive models have brought the user closer to the retrieval loop, allowing user-system interaction mainly through implicit or explicit relevance feedback. Analogously, diversity promotion has emerged as an alternative for tackling ambiguous or underspecified queries. Additionally, several works have addressed the issue of minimizing the user effort required to provide relevance assessments while keeping an acceptable overall effectiveness. This thesis discusses, proposes, and experimentally analyzes multimodal and interactive diversity-oriented information retrieval methods. This work comprehensively covers the interactive information retrieval literature and also discusses recent advances, major research challenges, and promising research opportunities. We have proposed and evaluated two workflows for improving the relevance-diversity trade-off, which integrate multiple kinds of information from images, such as visual features, textual metadata, geographic information, and user credibility descriptors. In turn, as an integration of interactive retrieval and diversity promotion techniques, aiming to maximize the coverage of multiple query interpretations/aspects and to speed up the information transfer between the user and the system, we have proposed and evaluated a multimodal learning-to-rank method trained with relevance feedback over diversified results. Our experimental analysis shows that the joint use of multiple information sources positively impacted the relevance-diversity balancing algorithms. Our results also suggest that the integration of multimodal-relevance-based filtering and reranking was effective in improving result relevance and also boosted diversity promotion methods.
Beyond that, through a thorough experimental analysis, we have investigated several research questions related to the possibility of improving result diversity while keeping or even improving relevance in interactive search sessions. Moreover, we analyze how much the diversification effort affects overall search session results and how different diversification approaches behave for the different data modalities. By analyzing effectiveness both overall and per feedback iteration, we show that introducing diversity may harm initial results, whereas it significantly enhances overall session effectiveness, considering not only relevance and diversity but also how early the user is exposed to the same amount of relevant items and diversity. Doctorate, Computer Science (Doutor em Ciência da Computação). P-4388/2010, 140977/2012-0. CAPES, CNP

    Approaches to implement and evaluate aggregated search

    Aggregated search, or aggregated retrieval, can be seen as a third paradigm for information retrieval, following the Boolean retrieval paradigm and the ranked retrieval paradigm. The first two return, respectively, sets and ranked lists of search results.
It is up to the time-poor user to scroll through this set or list, scan the different documents, and assemble the information he or she needs. Alternatively, aggregated search aims not only at identifying relevant information nuggets, but also at assembling these nuggets into a coherent answer. In this work, we first analyze work related to aggregated search using a general framework composed of three steps: query dispatching, nugget retrieval, and result aggregation. Existing work is grouped by related domain, such as relational search, federated search, question answering, natural language generation, etc. Among the possible research directions, we then focus on the two we believe are most promising: relational aggregated search and cross-vertical aggregated search. * Relational aggregated search targets not only relevant information, but also the relations between relevant information nuggets, which are used to assemble the final answer sensibly. In particular, three types of queries would easily benefit from this paradigm: attribute queries (e.g. president of France, GDP of Italy, mayor of Glasgow, ...), instance queries (e.g. France, Italy, Glasgow, Nokia e72, ...) and class queries (countries, French cities, Nokia mobile phones, ...). We call these relational queries, and we tackle three important problems concerning information retrieval and aggregation for these types of queries. First, we propose an attribute retrieval approach, after arguing that attribute retrieval is one of the crucial problems to be solved. Our approach relies on HTML tables on the Web. It can identify useful and relevant tables, which are then used to extract relevant attributes for arbitrary queries. The experimental results show that our approach is effective: it can answer many queries with high coverage, and it outperforms state-of-the-art techniques.
Second, we deal with result aggregation, where we are given relevant instances and attributes for a given query. The problem is particularly interesting for class queries, where the final answer is a table with many instances (rows) and attributes (columns). To guarantee the quality of the aggregated result, we propose the use of different weights on instances and attributes to promote the most representative and important ones. The third problem we deal with concerns instances of the same class (e.g. France, Germany, Italy, ... are all instances of the same class). Here, we propose an approach that can extract instances of the same class on a massive scale from HTML lists on the Web. All proposed approaches are applicable at Web scale, and they can play an important role in relational aggregated search. Finally, we propose 4 different prototype applications for relational aggregated search. They can answer different types of queries with relevant and relational information. More precisely, we retrieve not only attributes and their values, but also passages and images, which are assembled into a final focused answer. An example is the query "Nokia e72", which is answered with attributes (e.g. price, weight, battery life, ...), passages (e.g. description, reviews, ...) and images. Results are encouraging, and they illustrate the utility of relational aggregated search. * The second research direction that we pursued concerns cross-vertical aggregated search, which consists of assembling results from different vertical search engines (e.g. image search, video search, traditional Web search, ...) into one single interface. Here, different approaches exist in both research and industry. Our contribution mostly concerns the evaluation and the benefits of this paradigm. We propose 4 different studies that simulate different search situations. Each study is tested with 100 different queries and 9 vertical sources.
Here, we could clearly identify new advantages of this paradigm, as well as different issues with evaluation setups. In particular, we observe that traditional information retrieval evaluation is not the fastest, but it remains the most realistic. To conclude, we propose different studies with respect to two promising research directions. On the one hand, we deal with three important problems of relational aggregated search, followed by real prototype applications with encouraging results. On the other hand, we have investigated the benefits and evaluation of cross-vertical aggregated search, where we could clearly identify some of the advantages and evaluation issues. In the long term, we foresee a possible combination of these two kinds of approaches to provide relational and cross-vertical information retrieval that incorporates more focus, structure, and multimedia in search results.
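The result-aggregation step for class queries (a table of instances by attributes, with weights promoting the most representative ones) can be sketched roughly as below. This is only an illustration under stated assumptions: the abstract does not define the weights, so the instance weights are taken as given inputs and the attribute weight is assumed to be the summed weight of the instances that carry the attribute. All names are hypothetical.

```python
from typing import Dict, List, Tuple

# instance -> {attribute -> value}
Data = Dict[str, Dict[str, str]]


def aggregate_table(
    data: Data,
    instance_weight: Dict[str, float],
    top_rows: int,
    top_cols: int,
) -> Tuple[List[str], List[List[str]]]:
    """Build a small instances-by-attributes table for a class query."""
    # Assumed attribute weight: sum of the weights of the instances
    # that provide a value for the attribute.
    attr_weight: Dict[str, float] = {}
    for inst, attrs in data.items():
        for attr in attrs:
            attr_weight[attr] = attr_weight.get(attr, 0.0) + instance_weight.get(inst, 0.0)

    # Keep the heaviest instances (rows) and attributes (columns);
    # ties are broken alphabetically so the output is deterministic.
    rows = sorted(data, key=lambda i: (-instance_weight.get(i, 0.0), i))[:top_rows]
    cols = sorted(attr_weight, key=lambda a: (-attr_weight[a], a))[:top_cols]

    # Missing values are left as empty cells.
    table = [[r] + [data[r].get(c, "") for c in cols] for r in rows]
    return cols, table
```

For a class query such as "countries", rows would be instances like France or Italy and columns attributes like capital; low-weight instances and rarely shared attributes fall outside the table.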

    Measuring Short Text Semantic Similarity with Deep Learning Models

    Natural language processing (NLP), a subfield of artificial intelligence (AI), is the ability of a computer program to understand human language as it is spoken. The development of NLP applications is challenging because computers traditionally require humans to "speak" to them in a programming language that is precise, unambiguous and highly structured, or through a limited number of clearly enunciated voice commands. We study the use of deep learning models, a state-of-the-art AI method, for the problem of measuring short-text semantic similarity in NLP. In particular, we propose a novel deep neural network architecture to identify semantic similarity for pairs of question sentences. In the proposed network, multiple channels of knowledge for pairs of question texts can be used to improve the text representation. A dense layer is then used to learn a classifier that identifies duplicate question pairs. In extensive experiments on the Quora test collection, our proposed approach shows significant improvement over strong baselines, which verifies the effectiveness of the deep models as well as of the proposed deep multi-channel framework.