26 research outputs found

    A Study of Snippet Length and Informativeness: Behaviour, Performance and User Experience

    Get PDF
    The design and presentation of a Search Engine Results Page (SERP) has been subject to much research. With many contemporary aspects of the SERP now under scrutiny, work still remains in investigating more traditional SERP components, such as the result summary. Prior studies have examined a variety of different aspects of result summaries, but in this paper we investigate the influence of result summary length on search behaviour, performance and user experience. To this end, we designed and conducted a within-subjects experiment using the TREC AQUAINT news collection with 53 participants. Using Kullback-Leibler distance as a measure of information gain, we examined result summaries of different lengths and selected four conditions where the change in information gain was the greatest: (i) title only; (ii) title plus one snippet; (iii) title plus two snippets; and (iv) title plus four snippets. Findings show that participants broadly preferred longer result summaries, as they were perceived to be more informative. However, their performance in terms of correctly identifying relevant documents was similar across all four conditions. Furthermore, while the participants felt that longer summaries were more informative, empirical observations suggest otherwise; while participants were more likely to click on relevant items given longer summaries, they also were more likely to click on non-relevant items. This shows that longer is not necessarily better, though participants perceived that to be the case - and second, they reveal a positive relationship between the length and informativeness of summaries and their attractiveness (i.e. clickthrough rates). These findings show that there are tensions between perception and performance when designing result summaries that need to be taken into account

    Automatic Summarization

    Get PDF
    It has now been 50 years since the publication of Luhn’s seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain and genre specific summarization and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and deeper semantic understanding of language that would be necessary for future advances in the field

    Discourse oriented summarization

    Get PDF
    The meaning of text appears to be tightly related to intentions and circumstances. Context sensitivity of meaning is addressed by theories of discourse structure. Few attempts have been made to exploit text organization in summarization. This thesis is an exploration of what knowledge of discourse structure can do for content selection as a subtask of automatic summarization, and query-based summarization in particular. Query-based summarization is the task of answering an arbitrary user query or question by using content from potentially relevant sources. This thesis presents a general framework for discourse oriented summarization, relying on graphs to represent semantic relations in discourse, and redundancy as a special type of semantic relation. Semantic relations occur on several levels of text analysis (query-relevance, coherence, layout, etc.), and a broad range of textual features may be required to detect them. The graph-based framework facilitates combining multiple features into an integrated semantic model of the documents to summarize. Recognizing redundancy and entailment relations between text passages is particularly important when a summary is generated of multiple documents, e.g. to avoid including redundant content in a summary. For this reason, I pay particular attention to recognizing textual entailment. Within this framework, a three-fold evaluation is performed to evaluate different aspects of discourse oriented summarization. The first is a user study, measuring the effect on user appreciation of using a particular type of knowledge for query-based summarization. In this study, three presentation strategies are compared: summarization using the rhetorical structure of the source, a baseline summarization method which uses the layout of the source, and a baseline presentation method which uses no summarization but just a concise answer to the query. Results show that knowledge of the rhetorical structure not only helps to provide the necessary context for the user to verify that the summary addresses the query adequately, but also to increase the amount of relevant content. The second evaluation is a comparison of implementations of the graph-based framework which are capable of fully automatic summarization. The two variables in the experiment are the set of textual features used to model the source and the algorithm used to search a graph for relevant content. The features are based on cosine similarity, and are realized as graph representations of the source. The graph search algorithms are inspired by existing algorithms in summarization. The quality of summaries is measured using the Rouge evaluation toolkit. The best performer would have ranked first (Rouge-2) or second (Rouge-SU4) if it had participated in the DUC 2005 query-based summarization challenge. The third study is an evaluation in the context of the DUC 2006 summarization challenge, which includes readability measurements as well as various content-based evaluation metrics. The evaluated automatic discourse oriented summarization system is similar to the one described above, but uses additional features, i.e. layout and textual entailment. The system performed well on readability at the cost of content-based scores which were well below the scores of the highest ranking DUC 2006 participant. This indicates a trade-off between readable, coherent content and useful content, an issue yet to be explored. Previous research implies that theories of text organization generalize well to multimedia. This suggests that the discourse oriented summarization framework applies to summarizing multimedia as well, provided sufficient knowledge of the organization of the (multimedia) source documents is available. The last study in this thesis is an investigation of the applicability of structural relations in multimedia for generating picture-illustrated summaries, by relating summary content to picture-associated text (i.e. captions or surrounding paragraphs). Results suggest that captions are the more suitable annotation for selecting appropriate pictures. Compared to manual illustration, results of automatic pictures are similar if the manual picture is mainly decorative

    The challenging task of summary evaluation: an overview

    Get PDF
    Evaluation is crucial in the research and development of automatic summarization applications, in order to determine the appropriateness of a summary based on different criteria, such as the content it contains, and the way it is presented. To perform an adequate evaluation is of great relevance to ensure that automatic summaries can be useful for the context and/or application they are generated for. To this end, researchers must be aware of the evaluation metrics, approaches, and datasets that are available, in order to decide which of them would be the most suitable to use, or to be able to propose new ones, overcoming the possible limitations that existing methods may present. In this article, a critical and historical analysis of evaluation metrics, methods, and datasets for automatic summarization systems is presented, where the strengths and weaknesses of evaluation efforts are discussed and the major challenges to solve are identified. Therefore, a clear up-to-date overview of the evolution and progress of summarization evaluation is provided, giving the reader useful insights into the past, present and latest trends in the automatic evaluation of summaries.This research is partially funded by the European Commission under the Seventh (FP7 - 2007- 2013) Framework Programme for Research and Technological Development through the SAM (FP7-611312) project; by the Spanish Government through the projects VoxPopuli (TIN2013-47090-C3-1-P) and Vemodalen (TIN2015-71785-R), the Generalitat Valenciana through project DIIM2.0 (PROMETEOII/2014/001), and the Universidad Nacional de Educación a Distancia through the project “Modelado y síntesis automática de opiniones de usuario en redes sociales” (2014-001-UNED-PROY)

    Principled Approaches to Automatic Text Summarization

    Get PDF
    Automatic text summarization is a particularly challenging Natural Language Processing (NLP) task involving natural language understanding, content selection and natural language generation. In this thesis, we concentrate on the content selection aspect, the inherent problem of summarization which is controlled by the notion of information Importance. We present a simple and intuitive formulation of the summarization task as two components: a summary scoring function θ measuring how good a text is as a summary of the given sources, and an optimization technique O extracting a summary with a high score according to θ. This perspective offers interesting insights over previous summarization efforts and allows us to pinpoint promising research directions. In particular, we realize that previous works heavily constrained the summary scoring function in order to solve convenient optimization problems (e.g., Integer Linear Programming). We question this assumption and demonstrate that General Purpose Optimization (GPO) techniques like genetic algorithms are practical. These GPOs do not require mathematical properties from the objective function and, thus, the summary scoring function can be relieved from its previously imposed constraints. Additionally, the summary scoring function can be evaluated on its own based on its ability to correlate with humans. This offers a principled way of examining the inner workings of summarization systems and complements the traditional evaluations of the extracted summaries. In fact, evaluation metrics are also summary scoring functions which should correlate well with humans. Thus, the two main challenges of summarization, the evaluation and the development of summarizers, are unified within the same setup: discovering strong summary scoring functions. Hence, we investigated ways of uncovering such functions. First, we conducted an empirical study of learning the summary scoring function from data. The results show that an unconstrained summary scoring function is better able to correlate with humans. Furthermore, an unconstrained summary scoring function optimized approximately with GPO extracts better summaries than a constrained summary scoring function optimized exactly with, e.g., ILP. Along the way, we proposed techniques to leverage the small and biased human judgment datasets. Additionally, we released a new evaluation metric explicitly trained to maximize its correlation with humans. Second, we developed a theoretical formulation of the notion of Importance. In a framework rooted in information theory, we defined the quantities: Redundancy, Relevance and Informativeness. Importance arises as the notion unifying these concepts. More generally, Importance is the measure that guides which choices to make when information must be discarded. Finally, evaluation remains an open-problem with a massive impact on summarization progress. Thus, we conducted experiments on available human judgment datasets commonly used to compare evaluation metrics. We discovered that these datasets do not cover the high-quality range in which summarization systems and evaluation metrics operate. This motivates efforts to collect human judgments for high-scoring summaries as this would be necessary to settle the debate over which metric to use. This would also be greatly beneficial for improving summarization systems and metrics alike

    Composite web search

    Get PDF
    The figure above shows Google’s results page for the query “taylor swift”, captured in March 2016. Assembled around the long-established list of search results is content extracted from various source — news items and tweets merged within the results ranking, images, songs and social media profiles displayed to the right of the ranking, in an interface element that is known as an entity card. Indeed, the entire page seems more like an assembly of content extracted from various sources, rather than just a ranked list of blue links. Search engine result pages have become increasingly diverse over the past few years, with most commercial web search providers responding to user queries with different types of results, merged within a unified page. The primary reason for this diversity on the results page is that the web itself has become more diverse, given the ease with which creating and hosting different types of content on the web is possible today. This thesis investigates the aggregation of web search results retrieved from various document sources (e.g., images, tweets, Wiki pages) within information “objects” to be integrated in the results page assembled in response to user queries. We use the terms “composite objects” or “composite results” to refer to such objects, and throughout this thesis use the terminology of Composite Web Search (e.g., result composition) to distinguish our approach from other methods of aggregating diverse content within a unified results page (e.g., Aggregated Search). In our definition, the aspects that differentiate composite information objects from aggregated search blocks are that composite objects (i) contain results from multiple sources of information, (ii) are specific to a common topic or facet of a topic rather than a grouping of results of the same type, and (iii) are not a uniform ranking of results ordered only by their topical relevance to a query. The most widely used type of composite result in web search today is the entity card. Entity cards have become extremely popular over the past few years, with some informal studies suggesting that entity cards are now shown on the majority of result pages generated by Google. As composite results are used more and more by commercial search engines to address information needs directly on the results page, understanding the properties of such objects and their influence on searchers is an essential aspect of modern web search science. The work presented throughout this thesis attempts the task of studying composite objects by exploring users’ perspectives on accessing and aggregating diverse content manually, by analysing the effect composite objects have on search behaviour and perceived workload, and by investigating different approaches to constructing such objects from diverse results. Overall, our experimental findings suggest that items which play a central role within composite objects are decisive in determining their usefulness, and that the overall properties of composite objects (i.e., relevance, diversity and coherence) play a combined role in mediating object usefulness

    Complex question answering : minimizing the gaps and beyond

    Get PDF
    xi, 192 leaves : ill. ; 29 cmCurrent Question Answering (QA) systems have been significantly advanced in demonstrating finer abilities to answer simple factoid and list questions. Such questions are easier to process as they require small snippets of texts as the answers. However, there is a category of questions that represents a more complex information need, which cannot be satisfied easily by simply extracting a single entity or a single sentence. For example, the question: “How was Japan affected by the earthquake?” suggests that the inquirer is looking for information in the context of a wider perspective. We call these “complex questions” and focus on the task of answering them with the intention to minimize the existing gaps in the literature. The major limitation of the available search and QA systems is that they lack a way of measuring whether a user is satisfied with the information provided. This was our motivation to propose a reinforcement learning formulation to the complex question answering problem. Next, we presented an integer linear programming formulation where sentence compression models were applied for the query-focused multi-document summarization task in order to investigate if sentence compression improves the overall performance. Both compression and summarization were considered as global optimization problems. We also investigated the impact of syntactic and semantic information in a graph-based random walk method for answering complex questions. Decomposing a complex question into a series of simple questions and then reusing the techniques developed for answering simple questions is an effective means of answering complex questions. We proposed a supervised approach for automatically learning good decompositions of complex questions in this work. A complex question often asks about a topic of user’s interest. Therefore, the problem of complex question decomposition closely relates to the problem of topic to question generation. We addressed this challenge and proposed a topic to question generation approach to enhance the scope of our problem domain

    Leveraging GPT-4 for Food Effect Summarization to Enhance Product-Specific Guidance Development via Iterative Prompting

    Full text link
    Food effect summarization from New Drug Application (NDA) is an essential component of product-specific guidance (PSG) development and assessment. However, manual summarization of food effect from extensive drug application review documents is time-consuming, which arouses a need to develop automated methods. Recent advances in large language models (LLMs) such as ChatGPT and GPT-4, have demonstrated great potential in improving the effectiveness of automated text summarization, but its ability regarding the accuracy in summarizing food effect for PSG assessment remains unclear. In this study, we introduce a simple yet effective approach, iterative prompting, which allows one to interact with ChatGPT or GPT-4 more effectively and efficiently through multi-turn interaction. Specifically, we propose a three-turn iterative prompting approach to food effect summarization in which the keyword-focused and length-controlled prompts are respectively provided in consecutive turns to refine the quality of the generated summary. We conduct a series of extensive evaluations, ranging from automated metrics to FDA professionals and even evaluation by GPT-4, on 100 NDA review documents selected over the past five years. We observe that the summary quality is progressively improved throughout the process. Moreover, we find that GPT-4 performs better than ChatGPT, as evaluated by FDA professionals (43% vs. 12%) and GPT-4 (64% vs. 35%). Importantly, all the FDA professionals unanimously rated that 85% of the summaries generated by GPT-4 are factually consistent with the golden reference summary, a finding further supported by GPT-4 rating of 72% consistency. These results strongly suggest a great potential for GPT-4 to draft food effect summaries that could be reviewed by FDA professionals, thereby improving the efficiency of PSG assessment cycle and promoting the generic drug product development.Comment: 22 pages, 6 figure

    A General Machine Reading Comprehension pipeline

    Get PDF
    Savoir lire est une compétence qui va de la capacité à décoder des caractères à la compréhension profonde du sens de textes. Avec l'émergence de l'intelligence artificielle, deux questions se posent : Comment peut-on apprendre à une intelligence artificielle à lire? Qu'est-ce que cela implique? En essayant de répondre à ces questions, une première évidence nous est rappelée : savoir lire ne peut pas se réduire à savoir répondre à des questions sur des textes. Étant donné que les modèles d'apprentissage machine apprennent avec des exemples d'essai erreur, ils vont apprendre à lire en apprenant à répondre correctement à des questions sur des textes. Cependant, il ne faut pas perdre de vue que savoir lire, c'est comprendre différents types de textes et c'est cette compréhension qui permet de répondre à des questions sur un texte. En d'autres termes, répondre à des questions sur des textes est un des moyens d'évaluation de la compétence de lecture plus qu'une fin en soi. Aujourd'hui, il existe différents types de jeux de données qui sont utilisées pour apprendre à des intelligences artificielles à apprendre à lire. Celles ci proposent des textes avec des questions associées qui requièrent différents types de raisonnement : associations lexicales, déductions à partir d'indices disséminés dans le texte, paraphrase, etc. Le problème est que lorsqu'une intelligence artificielle apprend à partir d'un seul de ces jeux de données, elle n'apprend pas à lire mais est plutôt formée à répondre à un type de question, sur un certain type de texte et avec un certain style d'écriture. Outre la problématique de la généralisation des compétences de lecture, les modèles d'intelligence artificielle qui apprennent à lire en apprenant à répondre à des questions retournent des réponses sans systématiquement indiquer sur quelles phrases du texte sources ils se basent. Cela pose un problème d'explicabilité et peut entrainer une mécompréhension des capacités de ces modèles. Dans ce mémoire, nous proposons de résoudre le problème de généralisation de l'apprentissage en proposant une méthodologie générale adaptée à n'importe quel jeu de données. Ainsi, en ayant une méthodologie commune à tous les types de jeux de données pour apprendre à répondre à tout type de question, sur tout type de texte, nous pourrions apprendre aux modèles d'intelligence artificielle à se concentrer sur les compétences générales de lecture plutôt que sur la capacité spécifique à répondre aux questions. Afin de résoudre également le problème de l'explicabilité, la méthodologie que nous proposons impose à tout modèle de compréhension de lecture automatique de renvoyer les extraits du texte source sur lequel ces réponses sont basées.Reading is a skill that ranges from the ability to decode characters to a deep understanding of the meaning of a text. With the emergence of artificial intelligence, two questions arise: How can an artificial intelligence be taught to read? What does this imply? In trying to answer these questions, we are reminded of the obvious: knowing how to read cannot be reduced to knowing how to answer questions about texts. Since machine learning models learn with trial-and-error examples, they will learn to read by learning to answer correctly questions about the text they read. However, one should not forget the fact that knowing how to read means understanding different types of texts sufficiently well, and it is this that enables answering questions about a text. In other words, answering questions about texts is one of the means of assessing reading skills rather than an end in itself. Today, there are different types of datasets that are used to teach artificial intelligences to learn to read. These provide texts with associated questions that require different types of reasoning: lexical associations, deductions from discrete clues in the text, paraphrasing, etc. The problem is that when an artificial intelligence learns from only one of these datasets, it does not learn to read but is instead trained to answer a certain type of question, on a certain type of text and with a certain writing style. In addition to the problem of generalizing reading skills, artificial intelligence models that learn to read by learning to answer questions return answers without systematically indicating which sentences in the source text they are based on. This poses a problem of explicability and can lead to a misunderstanding of the capabilities of these models. In this thesis, we propose to solve the generalization issue of learning from one dataset by proposing a general methodology suiting to any machine reading comprehension dataset. Thus, by having a methodology common to all types of datasets to learn how to answer any type of question, on any type of text, we could teach artificial intelligence models to focus on general reading skills rather than on the specific ability to answer questions. In order to also solve the issue of explanability, the methodology we propose impose any machine reading comprehension model to return the span of the source text its answers are based on
    corecore