14 research outputs found

    Towards a Practically Useful Text Simplification System

    While there is a vast amount of text written about nearly any topic, it is often difficult for someone unfamiliar with a specific field to understand. Automated text simplification aims to reduce the complexity of a document, making it more comprehensible to a broader audience. Much of the research in this field has traditionally focused on simplification sub-tasks, such as lexical, syntactic, or sentence-level simplification. However, current systems struggle to consistently produce high-quality simplifications. Phrase-based models tend to make too many poor transformations; recent neural models, on the other hand, produce grammatical output but often fail to make all the changes the original text needs. In this thesis, I discuss novel approaches for improving lexical and sentence-level simplification systems. For sentence simplification models, after noting that encouraging diversity at inference time leads to significant improvements, I take a closer look at the idea of diversity and perform an exhaustive comparison of diverse decoding techniques on other generation tasks. I also discuss limitations in the framing of current simplification tasks that prevent these models from yet being practically useful. I therefore propose a retrieval-based reformulation of the problem: starting with a document, I identify concepts critical to understanding its content, then retrieve documents relevant to each concept, re-ranking them based on the desired complexity level.
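    The retrieval-based reformulation above can be sketched in a few lines. This is a toy illustration, not the thesis's actual method: the complexity proxy (mean word length plus mean sentence length) and the relevance-minus-distance ranking score are illustrative assumptions.

```python
import re

def complexity(text: str) -> float:
    """Crude readability proxy: mean word length + mean sentence length."""
    words = re.findall(r"[A-Za-z']+", text)
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sents:
        return 0.0
    return sum(len(w) for w in words) / len(words) + len(words) / len(sents)

def rerank(docs: list[str], relevance: list[float], target: float) -> list[str]:
    """Order docs by relevance, penalised by distance from the target complexity."""
    scored = [(rel - abs(complexity(d) - target), d)
              for d, rel in zip(docs, relevance)]
    return [d for _, d in sorted(scored, reverse=True)]

docs = ["The cat sat on the mat.",
        "Feline quadrupeds habitually adopt recumbent postures upon textiles."]
# With a low complexity target and equal relevance, the simpler
# document should rank first.
print(rerank(docs, [1.0, 1.0], target=6.0)[0])
```

    A real system would replace the readability proxy with a learned complexity model and the relevance scores with an actual retriever.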

    MultiVENT: Multilingual Videos of Events with Aligned Natural Text

    Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
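    The retrieval-baseline setting can be illustrated as ranking videos by the similarity between a query embedding and precomputed video embeddings. The vectors below are toy placeholders; MultiVENT's actual model and features are not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, video_vecs: dict):
    """Return video ids sorted by similarity to the query, best first."""
    return sorted(video_vecs, key=lambda v: cosine(query_vec, video_vecs[v]),
                  reverse=True)

# toy embeddings standing in for multimodal video representations
videos = {"broadcast_clip": [0.9, 0.1, 0.0],
          "amateur_footage": [0.2, 0.8, 0.1]}
print(retrieve([1.0, 0.0, 0.0], videos))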

    Did That Happen? Predicting Social Media Posts That Are Indicative of What Happened in a Scene: A Case Study of a TV Show

    While popular television (TV) shows are airing, users interested in these shows publish social media posts about them. Analyzing social media posts related to a TV show can yield insights about what happened during scenes of the show. This is a challenging task, partly because a significant number of social media posts associated with a TV show or event may not clearly describe what happened during the event. In this work, we propose a method to predict which social media posts (associated with scenes of a TV show) are indicative of what transpired during those scenes. We evaluate our method on social media (Twitter) posts associated with an episode of a popular TV show, Game of Thrones. We show that for each of the identified scenes, our method distinguishes, with high AUCs, posts that are indicative of what happened in the scene from those that are not. In accordance with Twitter's policy, we will make the Tweet IDs of the posts used in this work publicly available. Published version: https://aclanthology.org/2022.lrec-1.781/
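    The evaluation above reports AUC. As a self-contained illustration, AUC can be computed directly from a binary classifier's scores via the rank statistic (the probability that a random positive outscores a random negative); the classifier itself is not reproduced here.

```python
def auc(labels, scores):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy example: 1 = indicative post, 0 = non-indicative
labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2]  # perfectly separates the two classes
print(auc(labels, scores))  # → 1.0
```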

    GEMv2: Multilingual NLG benchmarking in a single line of code

    Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make it easier to follow best practices in model evaluation, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure that lets dataset, model, and metric developers benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark. Peer reviewed.
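    GEMv2's actual API is not shown here; the toy sketch below only illustrates the modular idea the abstract describes, where datasets and metrics register themselves so that evaluating a model reduces to a single call. All names are hypothetical.

```python
DATASETS, METRICS = {}, {}

def register(registry, name):
    """Decorator that records a loader/metric under a name."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@register(DATASETS, "toy_simplification")
def load_toy():
    # (source sentence, reference) pairs
    return [("the feline reclined", "the cat sat")]

@register(METRICS, "exact_match")
def exact_match(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def evaluate(model, dataset, metric):
    """Look up the dataset and metric by name and score the model."""
    data = DATASETS[dataset]()
    preds = [model(src) for src, _ in data]
    refs = [ref for _, ref in data]
    return METRICS[metric](preds, refs)

# "single line" evaluation of a trivial constant model
print(evaluate(lambda s: "the cat sat", "toy_simplification", "exact_match"))
```

    The registry pattern is what lets new datasets or metrics plug in without touching the evaluation loop.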

    Simplification Using Paraphrases and Context-Based Lexical Substitution

    Lexical simplification involves identifying complex words or phrases that need to be simplified, and recommending simpler meaning-preserving substitutes that can be more easily understood. We propose a complex word identification (CWI) model that exploits both lexical and contextual features, and a simplification mechanism that relies on a word-embedding lexical substitution model to replace the detected complex words with simpler paraphrases. We compare our CWI and lexical simplification models to several baselines, and evaluate the performance of our simplification system against human judgments. The results show that our models detect complex words with higher accuracy than other commonly used methods, and propose good simplification substitutes in context. They also highlight the limited contribution of context features for CWI, which nonetheless improve simplification compared to context-unaware models.
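    The two-stage pipeline described above can be sketched as follows. This is a hedged toy version: a length/frequency rule stands in for the paper's feature-based CWI model, and a small paraphrase table stands in for its word-embedding substitution model.

```python
# toy stand-ins for a frequency lexicon and an embedding-based substituter
COMMON = {"the", "cat", "sat", "on", "mat", "big", "house", "a"}
PARAPHRASES = {"feline": "cat", "recumbent": "lying", "domicile": "house"}

def is_complex(word: str) -> bool:
    """Toy lexical features: rare (not in the common list) and long."""
    w = word.lower()
    return w not in COMMON and len(w) > 5

def simplify(sentence: str) -> str:
    """Replace detected complex words that have a known simpler substitute."""
    out = []
    for word in sentence.split():
        if is_complex(word) and word.lower() in PARAPHRASES:
            out.append(PARAPHRASES[word.lower()])
        else:
            out.append(word)
    return " ".join(out)

print(simplify("the feline entered the domicile"))  # → "the cat entered the house"
```

    In the real system both stages are learned; the staged structure (identify, then substitute in context) is the point of the sketch.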