
    Grouping business news stories based on salience of named entities

    In news aggregation systems that cover broad news domains, the same story may appear in multiple articles. Depending on the relative importance of the story, the number of versions can reach dozens or hundreds within a day. The text of these versions may be nearly identical or quite different. Linking multiple versions of a story into a single group brings several important benefits to the end-user: it reduces the cognitive load on the reader and signals the relative importance of the story. We present a grouping algorithm and explore several vector-based representations of input documents, from a baseline using keywords to a method using salience, a measure of the importance of named entities in the text. We demonstrate that features beyond keywords yield substantial improvements, verified on a manually annotated corpus of business news stories.
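A salience-based grouping of this kind can be sketched as below. The greedy single-pass strategy, the entity names, the salience values and the similarity threshold are all illustrative assumptions, not the paper's actual algorithm.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def group_stories(docs, threshold=0.5):
    """Greedy single-pass grouping: attach each document to the first
    existing group whose representative (first) document is similar enough,
    otherwise start a new group."""
    groups = []  # each group: list of (doc_id, vector)
    for doc_id, vec in docs:
        for group in groups:
            if cosine(group[0][1], vec) >= threshold:
                group.append((doc_id, vec))
                break
        else:
            groups.append([(doc_id, vec)])
    return [[doc_id for doc_id, _ in g] for g in groups]

# Hypothetical salience-weighted named-entity vectors (entity -> salience).
docs = [
    ("a1", {"Acme Corp": 0.9, "John Smith": 0.4, "merger": 0.3}),
    ("a2", {"Acme Corp": 0.8, "merger": 0.5}),
    ("a3", {"Globex": 0.9, "lawsuit": 0.6}),
]
print(group_stories(docs))  # -> [['a1', 'a2'], ['a3']]
```

Versions a1 and a2 share their most salient entity, so they cluster; a3 shares nothing and stands alone.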

    Classification and clustering in media monitoring : from knowledge engineering to deep learning

    This thesis addresses information extraction from financial news for decision support in the business domain. News is an important source of information for business decision makers: it reflects investors' expectations and affects companies' reputations. The vast number and variety of news sources necessitates text mining algorithms that collect the most crucial information and present it to the user in condensed form. The thesis presents the PULS media monitoring system and describes several news mining tasks, namely document clustering, multi-label news classification and text polarity detection. For each task, we present an end-to-end processing pipeline, starting from data preprocessing and clean-up. Particular attention is given to named entities (NEs), which serve as one of the inputs for all presented algorithms. Chapter 1 overviews the PULS news monitoring system and its niche within text mining for business intelligence. In Chapter 2 we propose a novel algorithm for news grouping, which uses NE salience and exploits the specific structure of news articles. In Chapter 3 we use automatically extracted NEs and entity descriptors in combination with keywords to improve SVM classifiers for large-scale multi-label text classification. We then propose a convolutional neural network (CNN) architecture that outperforms an ensemble of SVM classifiers on two different datasets, and compare various ways to represent NEs for CNN classifiers. In Chapter 4 we use a CNN classifier for entity-level business polarity detection. We compare three methods of re-using data annotated for a different though remotely related task and demonstrate that unsupervised knowledge transfer works better than other techniques that involve manual mapping.
    This dissertation examines how information that supports business decision-making can be extracted from news articles about the economy. News articles are important sources of information for business decision makers: they reflect investors' expectations and affect companies' reputations. Because the number of news sources is vast, text mining algorithms have had to be developed that collect the most important information from news articles and present it to the user in condensed form. The dissertation presents the PULS media monitoring system and describes how three data mining methods are used within it to analyse news articles: document clustering, multi-label news classification and text polarity detection. All the mining methods presented take as input texts processed by the information extraction stage of the PULS system, in which the original texts are annotated with named entities and other lower-level entities. The dissertation shows that these named entities are useful in nearly every media monitoring task. The information extraction stage of PULS thus produces features that the components of the system's machine learning stage then exploit. The dissertation examines several such components, which use both supervised and unsupervised learning methods as well as advanced deep learning models. It also shows how this two-stage architecture can process thousands of news articles in real time, with the aim of giving the end-user a deep understanding of events in the domain.
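A minimal sketch of combining keywords with named entities and entity descriptors as classifier features, as the thesis does for its SVM classifiers; the feature-naming scheme and the entity record fields here are illustrative assumptions, not the PULS representation.

```python
from collections import Counter

def combined_features(tokens, entities):
    """Merge keyword counts with named-entity features. `entities` is
    assumed to come from an upstream NE extraction step; each record has
    a name and an optional descriptor (e.g. "carmaker" for "Toyota")."""
    feats = Counter(f"kw={t.lower()}" for t in tokens)
    for ent in entities:
        feats[f"ne={ent['name']}"] += 1
        if ent.get("descriptor"):
            feats[f"desc={ent['descriptor']}"] += 1
    return dict(feats)

feats = combined_features(
    tokens=["Toyota", "recalls", "cars"],
    entities=[{"name": "Toyota", "descriptor": "carmaker"}],
)
```

The descriptor features let the classifier generalize across entities of the same type even when a particular company name never occurs in the training data.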

    Grounding event references in news

    Events are frequently discussed in natural language, and their accurate identification is central to language understanding. Yet they are diverse and complex in ontology and reference, so computational processing proves challenging. News provides a shared basis for communication by reporting events. We perform several studies into news event reference. One annotation study characterises each news report in terms of its update and topic events, but finds that topic is better considered through explicit references to background events. In this context, we propose the event linking task, which (analogous to named entity linking or disambiguation) models the grounding of references to notable events. It defines the disambiguation of an event reference as a link to the archival article that first reports it. When two references are linked to the same article, they need not be references to the same event. Event linking aims to provide an intuitive approximation to coreference, erring on the side of over-generation in contrast with the literature. The task is also distinguished in considering event references from multiple perspectives over time. We diagnostically evaluate the task by first linking references to past, newsworthy events in news and opinion pieces to an archive of the Sydney Morning Herald. The intensive annotation results in only a small corpus of 229 distinct links. However, we observe that a number of hyperlinks targeting online news correspond to event links. We thus acquire two large corpora of hyperlinks at very low cost. From these we learn weights for temporal and term overlap features in a retrieval system. These noisy data lead to significant performance gains over a bag-of-words baseline. While our initial system can accurately predict many event links, most will require deep linguistic processing for their disambiguation.
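A retrieval step combining term overlap and temporal features can be sketched as below. The Jaccard overlap, the linear combination and the weight values are illustrative assumptions, not the learned model described in the abstract.

```python
from datetime import date

def link_score(ref_terms, ref_date, cand_terms, cand_date,
               w_overlap=1.0, w_recency=0.01):
    """Score a candidate archival article for an event reference:
    Jaccard term overlap minus a penalty for temporal distance.
    In the described system the weights are learned from hyperlink
    corpora; the values here are placeholders."""
    overlap = len(ref_terms & cand_terms) / len(ref_terms | cand_terms)
    days = abs((ref_date - cand_date).days)
    return w_overlap * overlap - w_recency * days

def link(reference, candidates):
    """Return the id of the best-scoring candidate article."""
    ref_terms, ref_date = reference
    return max(candidates,
               key=lambda c: link_score(ref_terms, ref_date, c[1], c[2]))[0]

# Hypothetical event reference and candidate archive articles.
ref = ({"sydney", "storm", "blackout"}, date(2012, 6, 10))
cands = [
    ("early-report", {"sydney", "storm", "power"}, date(2012, 6, 8)),
    ("unrelated",    {"budget", "election"},       date(2012, 6, 9)),
]
print(link(ref, cands))  # -> early-report
```

The reference is grounded to the article that first reported the storm, since it maximizes term overlap at a small temporal distance.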

    TSPOONS: Tracking Salience Profiles Of Online News Stories

    News space is a relatively nebulous term describing the general discourse concerning events that affect the populace. Past research has focused on qualitatively analyzing news space in an attempt to answer big questions about how the populace relates to the news and how they respond to it. We ask: when do stories begin? Which stories stand out from the noise? In order to answer the big questions about news space, we need to track the course of individual stories in the news. By analyzing the specific articles that comprise stories, we can synthesize the information gained from several stories to see a more complete picture of the discourse. The individual articles, the groups of articles that become stories, and the overall themes that connect stories together all complete the narrative about what is happening in society. TSPOONS provides a framework for analyzing news stories and answering two main questions: what were the important stories during some time frame, and what were the important stories involving some topic? Drawing technical news stories from Techmeme.com, TSPOONS generates profiles of each news story, quantitatively measuring the importance, or salience, of news stories as well as quantifying the impact of these stories over time.
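In its crudest form, a salience-over-time profile is a time series of article counts per story. The sketch below assumes that simple schema; it is an illustration, not the actual TSPOONS profile format.

```python
from collections import Counter

def salience_profile(articles, story_id):
    """Daily article counts for one story, padded with zero-coverage
    days, as a crude salience-over-time profile. The `story`/`day`
    field names are illustrative, not the TSPOONS schema."""
    days = Counter(a["day"] for a in articles if a["story"] == story_id)
    span = range(min(days), max(days) + 1)
    return [days.get(d, 0) for d in span]

# Hypothetical article stream: story id and day offset of publication.
articles = [
    {"story": "s1", "day": 0}, {"story": "s1", "day": 0},
    {"story": "s1", "day": 2}, {"story": "s2", "day": 1},
]
print(salience_profile(articles, "s1"))  # -> [2, 0, 1]
```

Peaks in such a profile suggest when a story broke and when it resurfaced, which is the kind of signal the framework quantifies.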

    Proceedings of the First Workshop on Computing News Storylines (CNewsStory 2015)

    This volume contains the proceedings of the 1st Workshop on Computing News Storylines (CNewsStory 2015), held in conjunction with the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015) at the China National Convention Center in Beijing, on July 31st 2015. Narratives are at the heart of information sharing. Ever since people began to share their experiences, they have connected them to form narratives. The study of storytelling and the field of literary theory called narratology have developed complex frameworks and models related to various aspects of narrative, such as plot structures, narrative embeddings, characters' perspectives, reader response, point of view, narrative voice, narrative goals, and many others. These notions from narratology have been applied mainly in Artificial Intelligence and in formal semantic approaches to narratives (e.g. the Plot Units developed by Lehnert (1981)). In recent years, computational narratology has qualified as an autonomous field of study and research. Narrative has been the focus of a number of workshops and conferences (AAAI Symposia, the Interactive Storytelling Conference (ICIDS), Computational Models of Narrative). Furthermore, reference annotation schemes for narratives have been proposed (NarrativeML by Mani (2013)). The workshop aimed at bringing together researchers from different communities working on representing and extracting narrative structures in news, a text genre which is heavily used in NLP but which has received little attention with respect to narrative structure, representation and analysis. Advances in NLP technology have made it feasible to look beyond scenario-driven, atomic extraction of events from single documents and to work towards extracting story structures from multiple documents published over time as news streams.
Policy makers, NGOs and information specialists (such as journalists and librarians) are increasingly in need of tools that help them find salient stories in large amounts of information, so that they can implement policies more effectively, monitor the actions of “big players” in society and check facts. Their tasks often revolve around reconstructing cases with respect either to specific entities (e.g. persons or organizations) or to events (e.g. hurricane Katrina). Storylines represent explanatory schemas that enable us to make better selections of relevant information, as well as projections into the future. They offer valuable potential for exploiting news data in innovative ways. JRC.G.2-Global security and crisis management

    Local vs. National: How Twitter Reflects News Coverage of Colin Kaepernick Protests

    Local and national media dedicate different levels of coverage to issues depending on their relevance to their audiences. This study uses news outlets' social media activity to show that such coverage discrepancies occurred with former NFL quarterback Colin Kaepernick's National Anthem protest. Because his protest reached national headlines, Kaepernick suffered the same fate as many protesting athletes in the past. The study shows how national media carried his story to national headlines and framed his protest negatively. The findings show that local media were the least active of the three media levels (local, regional and national) in covering the Kaepernick protest, and that national media provided the most political-protest coverage of the three. Additionally, the results show how media outlets with ties to sports entities may limit their independence, and thus their coverage.

    From icon to naturalised icon:a linguistic analysis of media representations of the BP Deepwater Horizon crisis

    This research explores how news media reports construct representations of a business crisis through language. In an innovative approach to dealing with the vast pool of potentially relevant texts, media texts concerning the BP Deepwater Horizon oil spill are gathered from three time points: immediately after the explosion in 2010, one year later in 2011, and again in 2012. The three sets of 'BP texts' are investigated using discourse analysis and semi-quantitative methods within a semiotic framework that gives an account of language at the semiotic levels of sign, code, mythical meaning and ideology. The research finds in the texts three discourses of representation concerning the crisis, which show a movement from the ostensibly representational to the symbolic and conventional: a discourse of 'objective factuality', a discourse of 'positioning' and a discourse of 'redeployment'. This progression can be shown to have useful parallels with Peirce's sign classes of Icon, Index and Symbol, with their implied movement from clear motivation by the Object (in this case the disaster events) to an arbitrary, socially-agreed connection. However, the naturalisation of signs, whereby ideologies are encoded in ways of speaking and writing that present them as 'taken for granted', is at its most complete when it is least discernible. The findings suggest that media coverage is likely to move on from symbolic representation to a new kind of iconicity, through a fourth discourse of 'naturalisation'. Here the representation turns back towards ostensible factuality or iconicity, to become the 'naturalised icon'. This work adds to the study of media representation a heuristic for understanding how the meaning-making of a news story progresses. It offers a detailed account of what the stages of this progression 'look like' linguistically, and suggests scope for future research into both the language characteristics of phases and different news-reported phenomena.