9 research outputs found

    Navigating multilingual news collections using automatically extracted information

    We present a text analysis tool set that allows analysts in various fields to sieve through large collections of multilingual news items quickly and to find information that is relevant to them. For a given document collection, the tool set automatically clusters the texts into groups of similar articles, extracts names of places, people and organisations, lists the user-defined specialist terms found, links clusters and entities, and generates hyperlinks. Through its daily news analysis operating on thousands of articles per day, the tool also learns relationships between people and other entities. The fully functional prototype system allows users to explore and navigate multilingual document collections across languages and time. Comment: This paper describes the main functionality of the JRC's fully-automatic news analysis system NewsExplorer, which is freely accessible, currently in thirteen languages, at http://press.jrc.it/NewsExplorer/ . 8 page

    A tool set for the quick and efficient exploration of large document collections

    We are presenting a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to be particularly useful when users repeatedly have to sieve through large collections of documents such as those downloaded automatically from the internet. The proposed system takes a whole document collection as input. It first carries out some automatic analysis tasks (named entity recognition, geo-coding, clustering, term extraction), annotates the texts with the generated meta-information and stores the meta-information in a database. The system then generates a zoomable and hyperlinked geographic map enhanced with information on entities and terms found. When the system is used on a regular basis, it builds up a historical database that contains information on which names have been mentioned together with which other names or places, and users can query this database to retrieve information extracted in the past. Comment: 10 page

    Reading News Data

    Information overflow is both inspiring and depressing. What inspires is the relatively easy access to various information and communication resources, which in turn facilitate the exchange of ideas, knowledge and creativity. However, the unfathomability of an infinite information space creates a feeling of depression and anxiety. Thanks to digital technology, the speed at which information is generated and distributed significantly exceeds the speed at which it can be perceived. The more the hypertext ocean of the web is filled with content, the harder it is for human senses to ‘tame’ it. In the ambivalent nature of information super-abundance, technological optimism and technological pessimism constantly compete. In fact, the two perspectives on the role of technology in the world of people have always been in conflict, but the conflict has intensified strongly with the evolution and spread of the internet. This text looks at a conditional technological optimism, aiming not at postulating utopian aspirations, but at illustrating how scientific and technical elites do not lose their desire to overcome depressing complexity by seeking bold optimization solutions. The focus of this paper is on an innovative technological system for processing online news, namely the publicly available Europe Media Monitor (EMM). The interest in this monitoring tool is multifaceted. In most general terms it is interesting to trace the technological solutions with which computer science specialists are trying to discipline the information flow. However, EMM is also interesting as a powerful tool for understanding reality on the basis of statistically processed news databases. This study provides examples of how EMM-enabled media content processing options may be used as a basis for further detailed analysis. Also of importance are the social and institutional intentions behind the development of such a system: what are the motives and the uses associated with such an intersection between news and intelligent software

    Multilingual person name recognition and transliteration

    [French abstract, translated] We present a tool for spotting person names in articles from the international press that can recognise the different variants of the same name. The originality of our approach lies in identifying name variants across languages and writing systems, including Greek, Cyrillic and Arabic. Given our multilingual context, we use a standard internal representation of each name and a single similarity measure (instead of adopting the usual bilingual approach to transliteration). This module is part of a more general tool that analyses an average of 15,000 newspaper articles each day in order to group similar documents, both within the same language and across different languages.
    [English abstract] We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing systems. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 25,000 news articles per day to detect related news within the same and across different languages
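The idea of mapping every name to one internal representation and then applying a single similarity measure can be illustrated with a minimal Python sketch. This is not the authors' system: the function names `normalise`, `similarity` and `same_person`, the use of `difflib.SequenceMatcher`, and the 0.85 threshold are all assumptions made for illustration, and the sketch only handles Latin-script variants (no Greek, Cyrillic or Arabic transliteration).

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(name):
    """Map a name to a simple internal form: no diacritics, lowercase, single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (accents) left over after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


def similarity(a, b):
    """One similarity measure applied to the normalised forms (ratio in [0, 1])."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def same_person(a, b, threshold=0.85):
    """Treat two name strings as variants if they are similar enough.

    The threshold is a hypothetical value, not taken from the paper.
    """
    return similarity(a, b) >= threshold
```

With this sketch, `same_person("José Manuel Barroso", "Jose Manuel Barroso")` holds because both strings normalise to the same form, while clearly different names fall below the threshold.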

    Named entities as privileged information for hierarchical text clustering

    Text clustering is a text mining task which is often used to aid the organization, knowledge extraction, and exploratory search of text collections. Automatic text clustering is becoming essential as the volume and variety of digital text documents increase, whether in social networks and the Web or inside organizations. This paper explores the use of named entities as privileged information in a hierarchical clustering process, so as to improve cluster quality and interpretation. We carried out an experimental evaluation on three text collections (one written in Portuguese and two written in English) and the results show that named entities can be applied as privileged information to power clustering solutions in dynamic text collection scenarios. FAPESP (grant #2010/20564-8, #2012/13830-9, #2013/14757-6 and #2013/16039-3)

    New event detection and topic tracking in Turkish

    Topic detection and tracking (TDT) applications aim to organize the temporally ordered stories of a news stream according to the events. Two major problems in TDT are new event detection (NED) and topic tracking (TT). These problems focus on finding the first stories of new events and identifying all subsequent stories on a certain topic defined by a small number of sample stories. In this work, we introduce the first large-scale TDT test collection for Turkish, and investigate the NED and TT problems in this language. We present our test-collection-construction approach, which is inspired by the TDT research initiative. We show that in TDT for Turkish with some similarity measures, a simple word truncation stemming method can compete with a lemmatizer-based stemming approach. Our findings show that contrary to our earlier observations on Turkish information retrieval, in NED word stopping has an impact on effectiveness. We demonstrate that the confidence scores of two different similarity measures can be combined in a straightforward manner for higher effectiveness. The influence of several similarity measures on effectiveness is also investigated. We show that it is possible to deploy TT applications in Turkish that can be used in operational settings. © 2010 ASIS&T

    Distance, Time and Terms in First Story Detection

    First Story Detection (FSD) is an important application of online novelty detection within Natural Language Processing (NLP). Given a stream of documents, or stories, about news events in chronological order, the goal of FSD is to identify the very first story for each event. While a variety of NLP techniques have been applied to the task, FSD remains challenging because it is still not clear what the most crucial factor in defining “story novelty” is. Given these challenges, the thesis addressed in this dissertation is that the notion of novelty in FSD is multi-dimensional. To address this, the work presented has adopted a three-dimensional analysis of the relative qualities of FSD systems and gone on to propose a specific method that we argue significantly improves understanding and performance of FSD. FSD is of course not a new problem type; therefore, our first dimension of analysis consists of a systematic study of detection models for first story detection and the distances that are used in the detection models for defining novelty. This analysis presents a tripartite categorisation of the detection models based on the end points of the distance calculation. The study also considers issues of document representation explicitly, and shows that even in a world driven by distributed representations, the nearest neighbour detection model with TF-IDF document representations still achieves the state-of-the-art performance for FSD. We provide analysis of this important result and suggest potential causes and consequences. Events are introduced and change at a relatively slow rate relative to the frequency at which words come in and out of usage on a document by document basis. Therefore we argue that the second dimension of analysis should focus on the temporal aspects of FSD. Here we are concerned with not only the temporal nature of the detection process, e.g., the time/history window over the stories in the data stream, but also the processes that underpin the representational updates that underpin FSD. Through a systematic investigation of static representations, and also dynamic representations with both low and high update frequencies, we show that while a dynamic model unsurprisingly outperforms static models, the dynamic model in fact stops improving but stays steady when the update frequency gets higher than a threshold. Our third dimension of analysis moves across to the particulars of lexical content, and critically the effect of terms in the definition of story novelty. We provide a specific analysis of how terms are represented for FSD, including the distinction between static and dynamic document representations, and the effect of out-of-vocabulary terms and the specificity of a word in the calculation of the distance. Our investigation showed that term distributional similarity rather than scale of common terms across the background and target corpora is the most important factor in selecting background corpora for document representations in FSD. More crucially, in this work the simple idea of the new terms emerged as a vital factor in defining novelty for the first story
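The nearest neighbour detection model with TF-IDF document representations mentioned in this abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the dissertation's implementation: the whitespace tokenizer, the smoothed IDF formula, the cosine distance, the 0.8 novelty threshold and all function names are hypothetical choices, and earlier stories are simply re-weighted with the document frequencies seen so far.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def tfidf(tokens, doc_freq, n_docs):
    """Weight term counts by a smoothed IDF; unseen terms get the highest weight."""
    counts = Counter(tokens)
    return {
        term: tf * (math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0)
        for term, tf in counts.items()
    }


def cosine_distance(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def detect_first_stories(stream, threshold=0.8):
    """Flag a story as a first story if its nearest earlier story is farther than threshold."""
    doc_freq = Counter()  # document frequencies over the stories seen so far
    seen = []             # token lists of earlier stories
    flags = []
    for text in stream:
        tokens = tokenize(text)
        n_docs = len(seen)
        vec = tfidf(tokens, doc_freq, n_docs)
        # Novelty score: distance to the nearest neighbour among earlier stories.
        novelty = min(
            (cosine_distance(vec, tfidf(old, doc_freq, n_docs)) for old in seen),
            default=1.0,
        )
        flags.append(novelty > threshold)
        seen.append(tokens)
        doc_freq.update(set(tokens))
    return flags
```

For a stream in which the second story repeats most words of the first and the third shares no words with either, the sketch flags the first and third stories and not the second. Updating `doc_freq` after every story corresponds to a dynamic representation with the highest possible update frequency; a static variant would fix `doc_freq` from a background corpus instead.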

    Multilingual and cross-lingual news topic tracking
