    Context and Learning in Novelty Detection

    We demonstrate the value of using context in a new-information detection system that achieved the highest precision scores at the Text Retrieval Conference's Novelty Track in 2004. To determine whether the information in a sentence has already been seen in previously read material, our system integrates information about the sentence's context with the novel words and named entities within the sentence, and uses a specialized learning algorithm to tune the system parameters.

    Deviation detection in text using conceptual graph interchange format and error tolerance dissimilarity function

    The rapid increase in the amount of textual data has brought a growing research interest in mining text to detect deviations. Specialized methods for specific domains have emerged to satisfy various needs in discovering rare patterns in text. This paper focuses on a graph-based approach to text representation and presents a novel error tolerance dissimilarity algorithm for deviation detection. We address two non-trivial problems: the semantic representation of text and the complexity of graph matching. We employ the conceptual graph interchange format (CGIF), a knowledge representation formalism, to capture the structure and semantics of sentences. We propose a novel error tolerance dissimilarity algorithm to detect deviations in the CGIFs. We evaluate our method by analyzing real-world financial statements to identify deviating performance indicators. We show that our method performs better than two related text-based graph similarity measures. The proposed method identifies deviating sentences, and its output correlates strongly with expert judgments. Furthermore, it offers error-tolerant matching of CGIFs and retains linear complexity as the number of CGIFs increases.

    Comparative Summarization of Document Collections

    Comparing documents is an important task that helps us understand the differences between them. Examples include comparing laws on the same or related subject matter in different jurisdictions and comparing the specifications of similar products from different manufacturers. The need for comparison does not stop at individual documents; it extends to large collections of documents, for example comparing the writing styles of an author early versus late in their life, identifying linguistic and lexical patterns of different political ideologies, or discovering commonalities of political arguments in disparate events. Comparing large document collections calls for automated algorithms. Every day a huge volume of documents is produced in social and news media. There has been a lot of research on summarizing individual documents, such as a news article, or document collections, such as a collection of news articles on a related topic or event. However, comparatively summarizing different document collections, or comparative summarization, is an under-explored problem in terms of methodology, datasets, evaluations and applicability in different domains. To address this, in this thesis we make three types of contributions to comparative summarization: methodology, datasets and evaluation, and empirical measurements on a range of settings where comparative summarization can be applied. We propose a new formulation of the problem of comparative summarization as competing binary classifiers. This formulation helps us develop new unsupervised and supervised methods for comparative summarization. Our methods are based on Maximum Mean Discrepancy (MMD), a metric that measures the distance between two sets of datapoints (or documents).
    The unsupervised methods incorporate information coverage, information diversity and discriminativeness of the prototypes based on a global model of sentence-sentence similarity, and can be optimized with greedy and gradient methods. We show the efficacy of the approach in summarizing a long-running news topic over time. Our supervised method improves on the unsupervised methods, and can learn the importance of prototypes based on surface features (e.g., position, length, presence of cue words) and combine different text feature representations. Our supervised method meets or exceeds state-of-the-art performance on benchmark datasets. We design new scalable automatic and crowd-sourced extrinsic evaluations of comparative summaries for cases where human-written ground-truth summaries are not available. To evaluate our methods, we develop two new datasets on controversial news topics, CONTROVNEWS2017 and NEWS2019+BIAS, which we use in different experiments. We use CONTROVNEWS2017, which consists of news articles on controversial topics, to evaluate our unsupervised methods in summarizing over time. We use NEWS2019+BIAS, which again consists of news articles on controversial news topics along with media bias labels, to empirically study the applicability of the methods. Finally, we measure the distinguishability and summarizability of document collections to quantify the applicability of our methods in different domains. We measure these metrics on the newly curated NEWS2019+BIAS dataset, comparing articles over time and across ideological leanings of media outlets. First, we observe that summarizability is proportional to distinguishability, and we identify the groups of articles that are less or more distinguishable. Second, distinguishability and summarizability can be improved through the choice of document representation, depending on the comparison being made, whether over time or across ideological leanings of media outlets.
    We also apply the comparative summarization method to the task of comparing stances in the social media domain.
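
The abstract above names Maximum Mean Discrepancy (MMD) as the distance between two sets of documents but does not state the estimator or kernel used. As a rough illustration only, the sketch below computes the standard biased estimate of squared MMD with an RBF kernel. The kernel choice, the `gamma` value, the function names and the toy two-dimensional "document vectors" are all assumptions made for this example, not details taken from the thesis.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, gamma: float = 1.0) -> float:
    """RBF (Gaussian) kernel: exp(-gamma * ||a - b||^2). Kernel choice is an assumption."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two sets of vectors.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


if __name__ == "__main__":
    # Hypothetical 2-D embeddings standing in for two document collections.
    group_a = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]
    group_b = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
    print("MMD^2(a, a):", mmd_squared(group_a, group_a))
    print("MMD^2(a, b):", mmd_squared(group_a, group_b))
```

Under these assumptions, a collection compared with itself gives a value of zero, and well-separated collections give a larger value. A prototype-selection method in the spirit of the abstract would presumably look for a small subset of sentences whose MMD to its own collection is low and to the other collection is high, but that selection step is not shown here.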
