
    Comparative Summarization of Document Collections

    Comparing documents is an important task that helps us understand the differences between them. Examples of document comparisons include comparing laws on the same subject matter in different jurisdictions, or comparing the specifications of similar products from different manufacturers. The need for comparison does not stop at individual documents; it extends to large collections of documents: for example, comparing an author's writing style early versus late in their life, identifying linguistic and lexical patterns of different political ideologies, or discovering commonalities among political arguments in disparate events. Comparing large document collections calls for automated algorithms. Every day a huge volume of documents is produced in social and news media. There has been much research on summarizing individual documents, such as a news article, or document collections, such as a set of news articles on a related topic or event. However, comparatively summarizing different document collections -- comparative summarization -- is an under-explored problem in terms of methodology, datasets, evaluation, and applicability across domains. To address this, this thesis makes three types of contributions to comparative summarization: methodology; datasets and evaluation; and empirical measurements in a range of settings where comparative summarization can be applied. We propose a new formulation of comparative summarization as competing binary classifiers. This formulation helps us develop new unsupervised and supervised methods for comparative summarization. Our methods are based on Maximum Mean Discrepancy (MMD), a metric that measures the distance between two sets of data points (or documents). The unsupervised methods incorporate information coverage, information diversity, and discriminativeness of the prototypes based on a global model of sentence-sentence similarity, and can be optimized with greedy and gradient-based methods. We show the efficacy of the approach in summarizing a long-running news topic over time. Our supervised method improves on the unsupervised methods; it can learn the importance of prototypes from surface features (e.g., position, length, presence of cue words) and combine different text feature representations. The supervised method meets or exceeds state-of-the-art performance on benchmark datasets. We design new, scalable automatic and crowd-sourced extrinsic evaluations of comparative summaries for when human-written ground-truth summaries are not available. To evaluate our methods, we develop two new datasets of news articles on controversial topics -- CONTROVNEWS2017 and NEWS2019+BIAS -- which we use in different experiments. We use CONTROVNEWS2017 to evaluate our unsupervised methods in summarizing over time. We use NEWS2019+BIAS, which also consists of news articles on controversial topics, along with media-bias labels, to empirically study the applicability of the methods. Finally, we measure the distinguishability and summarizability of document collections to quantify the applicability of our methods in different domains. We measure these quantities on the newly curated NEWS2019+BIAS dataset, comparing articles over time and across the ideological leanings of media outlets. First, we observe that summarizability is proportional to distinguishability, and we identify groups of articles that are more or less distinguishable. Second, better distinguishability and summarizability can be obtained by choosing document representations suited to the comparison being made, whether over time or across the ideological leanings of media outlets. We also apply the comparative summarization method to the task of comparing stances in the social media domain.
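    As a rough illustration of the MMD backbone of these methods: below is a minimal sketch of the (biased) MMD estimate between two document collections, assuming each document is already embedded as a fixed-size vector and using an RBF kernel with an illustrative bandwidth gamma; the thesis's exact kernel choice and optimization procedure are not reproduced here.

```python
# Minimal sketch: MMD^2 between two document collections, assuming documents
# are pre-embedded as fixed-size vectors. Kernel and gamma are illustrative.
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

def mmd_squared(X, Y, gamma=1.0):
    """Biased estimate of MMD^2 between samples X (n x d) and Y (m x d)."""
    return (
        rbf_kernel(X, X, gamma).mean()
        - 2.0 * rbf_kernel(X, Y, gamma).mean()
        + rbf_kernel(Y, Y, gamma).mean()
    )

# Toy usage: two collections of 5 "documents" in a 3-dimensional embedding space.
collection_a = np.random.randn(5, 3)
collection_b = np.random.randn(5, 3) + 1.0  # shifted, so MMD^2 should be larger
print(mmd_squared(collection_a, collection_b))
```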

    Question-based Text Summarization

    In the modern information age, finding the right information at the right time is an art (and a science). The abundance of information, however, makes it difficult for people to digest it and make informed choices. In this thesis, we aim to help people quickly capture the main idea of a piece of text before reading the details, through text summarization. In contrast with existing work, which mainly uses declarative sentences to summarize a text document, we aim to use a few questions as the summary. This way, people know what questions a given text document can address, and may read further if they have similar questions in mind. A question-based summary needs to satisfy three goals: relevancy, answerability, and diversity. Relevancy measures whether the questions cover the main points discussed in the document; answerability measures whether answers to the questions are contained in the document; and diversity measures whether the questions carry redundant information. To achieve these three goals, we design a two-stage approach consisting of question selection and question diversification. The question selection component finds a set of candidate questions relevant to a document, which in turn can be treated as an answer to those questions. Specifically, we explore two lines of approaches developed for traditional text summarization, extractive and abstractive, to achieve the goals of relevancy and answerability, respectively. The question diversification component re-ranks the questions to reward diversity in the final question-based summary. Evaluation on product review summarization tasks for two product categories shows that the proposed approach is effective at discovering meaningful questions that are representative of individual reviews. This thesis opens up a new direction at the intersection of information retrieval and natural language processing. Although the evaluation is on the product review domain, the thesis provides a general solution for question selection in many interesting applications and discusses extending the problem to other domain-specific question-based text summarization tasks.
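    For the diversification stage, a maximal-marginal-relevance (MMR) style greedy re-ranker is one standard way to trade relevance against redundancy; the sketch below is a minimal illustration under that assumption. The relevance scores, similarity matrix, and trade-off weight lam are illustrative stand-ins, not the thesis's exact formulation.

```python
# Minimal sketch of MMR-style question diversification: greedily pick k
# questions, rewarding relevance and penalizing similarity to picks so far.
def diversify(questions, relevance, similarity, k=3, lam=0.7):
    """questions: list of strings; relevance: length-n list of scores;
    similarity: n x n nested list of pairwise question-question similarities."""
    selected, remaining = [], list(range(len(questions)))
    while remaining and len(selected) < k:
        def mmr(i):
            # Redundancy = similarity to the closest already-selected question.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [questions[i] for i in selected]
```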

    Fast Adaptive Non-Monotone Submodular Maximization Subject to a Knapsack Constraint

    Constrained submodular maximization problems encompass a wide variety of applications, including personalized recommendation, team formation, and revenue maximization via viral marketing. The massive instances occurring in modern-day applications can render existing algorithms prohibitively slow. Moreover, those instances are frequently also inherently stochastic. Focusing on these challenges, we revisit the classic problem of maximizing a (possibly non-monotone) submodular function subject to a knapsack constraint. We present a simple randomized greedy algorithm that achieves a 5.83-approximation and runs in O(n log n) time, i.e., at least a factor n faster than other state-of-the-art algorithms. The versatility of our approach allows us to further transfer it to a stochastic version of the problem. There, we obtain a (9 + ε)-approximation to the best adaptive policy, which is the first constant approximation for non-monotone objectives. Experimental evaluation of our algorithms showcases their improved performance on real and synthetic data.
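    The density-greedy rule underlying such algorithms picks, at each step, the feasible element with the best marginal gain per unit cost. The sketch below illustrates only that deterministic building block; the paper's algorithm adds randomization (and a careful analysis) on top of it to obtain the 5.83 factor for non-monotone objectives, which this sketch does not attempt.

```python
# Minimal sketch of cost-benefit greedy for submodular maximization under a
# knapsack constraint. f is a set function on frozensets; cost maps
# element -> cost; budget is the knapsack capacity.
def greedy_knapsack(elements, f, cost, budget):
    chosen = set()
    remaining = set(elements)
    while remaining:
        spent = sum(cost[e] for e in chosen)
        # Keep only elements that still fit within the budget.
        feasible = [e for e in remaining if spent + cost[e] <= budget]
        if not feasible:
            break
        base = f(frozenset(chosen))
        # Pick the element with the best marginal gain per unit cost.
        best = max(feasible, key=lambda e: (f(frozenset(chosen | {e})) - base) / cost[e])
        if f(frozenset(chosen | {best})) - base <= 0:
            break  # nothing improves the objective (possible when f is non-monotone)
        chosen.add(best)
        remaining.remove(best)
    return chosen

# Toy usage: modular f counting selected elements, unit costs, budget 2.
items = ["a", "b", "c"]
print(greedy_knapsack(items, lambda S: len(S), {e: 1 for e in items}, budget=2))
```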

    Timeline Summarization from Relevant Headlines

    Approximation Algorithms for Submodular Optimization and Applications

    This study proposes approximation algorithms using several strategies, such as streaming, improved greedy, stop-and-stare, and reverse influence sampling (RIS), to solve three variants of the submodular optimization problem, and evaluates these algorithms experimentally on well-known applications of submodular optimization such as Influence Threshold (IT) and Influence Maximization (IM). Specifically, for the first problem we propose two single-pass streaming algorithms (StrA and StrM) for minimizing the cost of the submodular cover problem under multiplicative and additive noise models. StrA and StrM provide bicriteria approximation solutions. These algorithms effectively increase performance when computing the objective function, reduce complexity, and scale to big data. For the second problem, we focus on maximizing a submodular function under fairness constraints. This problem is also known as fair budget distribution for influence maximization. We design three algorithms (FBIM1, FBIM2, and FBIM3) by combining several strategies: a threshold greedy algorithm, a dynamic stop-and-stare technique, sample generation via the reverse influence sampling framework, and seed selection to ensure maximum coverage. FBIM1, FBIM2, and FBIM3 perform effectively on big data, provide a (1/2 − ε)-approximation to the optimal solution, and require lower complexity than the comparison algorithms. Finally, for the third problem, we devise two effective streaming algorithms (StrI and StrII) to maximize a diminishing-returns submodular (DR-submodular) function under a cardinality constraint on the integer lattice. StrI and StrII provide (1/2 − ε)- and (1 − 1/e − ε)-approximation ratios, respectively. At the same time, compared with the state of the art, these two algorithms have reduced complexity, superior runtime, and a negligible difference in objective function values. For each problem, we further investigate the performance of our algorithms through extensive experiments. The experimental results indicate that our approximation algorithms provide high-efficiency solutions, outperform state-of-the-art algorithms in complexity and runtime, and satisfy the specified constraints. Some of the results have been published in five venues: the Scopus-indexed international conferences RIVF 2021 and ICABDE 2021, and the SCIE journals Computer Standards & Interfaces (Elsevier) and Mathematics (MDPI).
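    As a point of reference for the streaming components, the sketch below shows the classic single-pass thresholding rule on which such algorithms are typically built: keep an arriving element if and only if its marginal gain clears a threshold. It is a simplified stand-in, not the thesis's StrA/StrM or StrI/StrII; the fixed threshold tau, the set (rather than integer-lattice) domain, and the cardinality bound k are all simplifying assumptions.

```python
# Minimal sketch of single-pass threshold streaming for submodular
# maximization with a cardinality constraint. f is a set function on
# frozensets; tau is an illustrative fixed acceptance threshold.
def threshold_stream(stream, f, k, tau):
    """Keep an arriving element iff its marginal gain clears tau, up to k elements."""
    S = set()
    for e in stream:
        if len(S) >= k:
            break
        gain = f(frozenset(S | {e})) - f(frozenset(S))
        if gain >= tau:
            S.add(e)
    return S
```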

    Analyzing adverse events of drugs

    Ranking with Fairness Constraints

    Ranking algorithms are deployed widely to order a set of items in applications such as search engines, news feeds, and recommendation systems. Recent studies, however, have shown that, left unchecked, the output of ranking algorithms can result in decreased diversity in the type of content presented, promote stereotypes, and polarize opinions. In order to address such issues, we study a variant of the traditional ranking problem in which, in addition, there are fairness or diversity constraints. Given a collection of items along with 1) the value of placing an item in a particular position in the ranking, 2) the collection of sensitive attributes (such as gender, race, or political opinion) of each item, and 3) a collection of fairness constraints that, for each k, bound the number of items with each attribute that are allowed to appear in the top k positions of the ranking, the goal is to output a ranking that maximizes value with respect to the original rank quality metric while respecting the constraints. This problem encapsulates various well-studied problems related to bipartite and hypergraph matching as special cases, and it turns out to be hard to approximate even with simple constraints. Our main technical contributions are fast exact and approximation algorithms, along with complementary hardness results, that together come close to settling the approximability of this constrained ranking maximization problem. Unlike prior work on the approximability of constrained matching problems, our algorithm runs in linear time, even when the number of constraints is (polynomially) large; its approximation ratio does not depend on the number of constraints; and it produces solutions with small constraint violations. Our results rely on insights about the constrained matching problem when the objective function satisfies certain properties that appear in common ranking metrics such as discounted cumulative gain (DCG), Spearman's rho, or Bradley-Terry, along with the nested structure of fairness constraints.
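    To make the constrained problem concrete, here is a minimal greedy sketch that fills positions top-down, at each position considering only items whose attribute still has headroom under the prefix bound, and scoring the result with a DCG-style discount. It illustrates the feasibility structure only; the paper's exact and approximation algorithms, and their guarantees, are more involved. All names (items, score, attr, caps) are illustrative, not the paper's notation.

```python
# Minimal sketch of greedy position filling under per-prefix attribute caps,
# with a DCG-style position discount on the final value.
import math

def fair_rank(items, score, attr, caps, n_positions):
    """score: item -> float; attr: item -> attribute;
    caps[a][k] = max items with attribute a allowed in the top k+1 positions.
    Assumes caps[a] is non-decreasing in k, so checking the current prefix
    suffices for all later prefixes."""
    ranking, used = [], {a: 0 for a in caps}
    remaining = set(items)
    for pos in range(n_positions):
        # An item is feasible if one more of its attribute fits this prefix.
        feasible = [i for i in remaining if used[attr[i]] + 1 <= caps[attr[i]][pos]]
        if not feasible:
            break
        best = max(feasible, key=score)
        ranking.append(best)
        used[attr[best]] += 1
        remaining.remove(best)
    discount = lambda p: 1.0 / math.log2(p + 2)  # DCG-style discount
    value = sum(score(i) * discount(p) for p, i in enumerate(ranking))
    return ranking, value
```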