282 research outputs found

    Multifaceted Geotagging for Streaming News

    News sources on the Web generate constant streams of information, describing the events that shape our world. In particular, geography plays a key role in the news, and understanding the geographic information present in news allows for its useful spatial browsing and retrieval. This process of understanding is called geotagging, and involves first finding in the document all textual references to geographic locations, known as toponyms, and second, assigning the correct lat/long values to each toponym, steps which are termed toponym recognition and toponym resolution, respectively. These steps are difficult due to ambiguities in natural language: some toponyms share names with non-location entities, and further, a given toponym can have many location interpretations. Removing these ambiguities is crucial for successful geotagging. To this end, geotagging methods are described which were developed for streaming news. First, a spatio-textual search engine named STEWARD, and an interactive map-based news browsing system named NewsStand are described, which feature geotaggers as central components, and served as motivating systems and experimental testbeds for developing geotagging methods. Next, a geotagging methodology is presented that follows a multifaceted approach involving a variety of techniques. First, a multifaceted toponym recognition process is described that uses both rule-based and machine learning–based methods to ensure high toponym recall. Next, various forms of toponym resolution evidence are explored. One such type of evidence is lists of toponyms, termed comma groups, whose toponyms share a common thread in their geographic properties that enables correct resolution. In addition to explicit evidence, authors take advantage of the implicit geographic knowledge of their audiences. 
Understanding the local places known by an audience, termed its local lexicon, affords great performance gains when geotagging articles from local newspapers, which account for the vast majority of news on the Web. Finally, considering windows of text of varying size around each toponym, termed adaptive context, allows for a tradeoff between geotagging execution speed and toponym resolution accuracy. Extensive experimental evaluations of all the above methods, using existing corpora and two newly created, large corpora of streaming news, show great performance gains over several prominent competing geotagging methods.
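
    The two steps named in this abstract, toponym recognition followed by toponym resolution, can be illustrated with a minimal sketch. It is not the method of the thesis: the gazetteer entries, the function names, and the "most populous candidate wins" heuristic are all assumptions made for illustration, standing in for the comma-group, local-lexicon, and adaptive-context evidence the abstract describes.

```python
# Toy gazetteer (hypothetical data): toponym -> candidate interpretations,
# each given as (description, latitude, longitude, approximate population).
GAZETTEER = {
    "Paris": [
        ("Paris, France", 48.86, 2.35, 2_100_000),
        ("Paris, Texas, USA", 33.66, -95.56, 25_000),
    ],
    "London": [
        ("London, UK", 51.51, -0.13, 8_800_000),
        ("London, Ontario, Canada", 42.98, -81.25, 400_000),
    ],
}


def recognize_toponyms(text):
    """Toponym recognition: naive lookup of each token in the gazetteer."""
    tokens = [token.strip(".,;:!?") for token in text.split()]
    return [token for token in tokens if token in GAZETTEER]


def resolve_toponym(name):
    """Toponym resolution: choose the most populous candidate.

    This default-sense heuristic is an assumption for illustration only.
    """
    return max(GAZETTEER[name], key=lambda candidate: candidate[3])


def geotag(text):
    """Return (toponym, (lat, lon)) pairs for every recognized toponym."""
    return [(name, resolve_toponym(name)[1:3]) for name in recognize_toponyms(text)]
```

    Calling `geotag("Floods hit Paris and London.")` pairs each recognized name with the coordinates of its chosen interpretation. A population default of this kind would misplace a story from a local paper in Paris, Texas, which is the sort of ambiguity the local lexicon is meant to address.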

    A survey on the geographic scope of textual documents.

    Recognizing references to places in texts is needed in many applications, such as search engines, location-based social media and document classification. In this paper we present a survey of methods and techniques for the recognition and identification of places referenced in texts. We discuss concepts and terminology, and propose a classification of the solutions given in the literature. We introduce a definition of the Geographic Scope Resolution (GSR) problem, dividing it in three steps: geoparsing, reference resolution, and grounding references. Solutions to the first two steps are organized according to the method used, and solutions to the third step are organized according to the type of output produced. We found that it is difficult to compare existing solutions directly to one another, because they often create their own benchmarking data, targeted to their own problem.

    A Word Cloud Model based on Hate Speech in an Online Social Media Environment

    Social media is known as a detector platform that is used to measure the activities of users in the real world. However, the huge and unfiltered feed of messages posted on social media triggers social warnings, particularly when these messages contain hate speech towards a specific individual or community. The negative effect of these messages on individuals or on society at large is of great concern to governments and non-governmental organizations. Word clouds provide a simple and efficient means of visually conveying the most common words in text documents. This research aims to develop a word cloud model based on hateful words in an online social media environment such as Google News. Several steps are involved, including data acquisition and pre-processing, feature extraction, model development, visualization, and viewing of the word cloud model result. The results present an image made up of text describing the top words. This model can be considered a simple way to exchange high-level information without overloading the user with details.
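
    The pre-processing and feature-extraction steps listed in this abstract reduce, in their simplest form, to counting word frequencies, which a word cloud then maps to font sizes. The sketch below shows only that counting step, using the Python standard library. The stopword list, the tokenization rule, and the function name are assumptions for illustration and are not taken from the paper.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on"}


def top_words(documents, n=3):
    """Return the n most frequent non-stopword tokens across all documents."""
    counts = Counter()
    for document in documents:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        words = re.findall(r"[a-z']+", document.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)
```

    The resulting `(word, count)` pairs are the usual input to a word cloud renderer, which scales each word by its count.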

    The GENIE System: classifying documents by combining mixed-techniques

    Today, automatic text classification is still an open problem, and its implementation in companies and organizations with large volumes of data in text format is not a trivial matter. To achieve optimum results many parameters come into play, such as the language, the context, the level of knowledge of the issues discussed, the format of the documents, or the type of language that has been used in the documents to be classified. In this paper we describe a multi-language rule-based pipeline system, called GENIE, used for automatic document categorisation. We have used several business corpora in order to test the real capabilities of our proposal, and we have studied the results of applying different stages of the pipeline over the same data to test the influence of each step in the categorisation process. The results obtained by this system are very promising, and in fact, the GENIE system is already being used in real production environments with very good results.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis managemen

    Evaluation in natural language processing

    European Summer School on Language Logic and Information (ESSLLI 2007) (Trinity College Dublin, Ireland, 6-17 August 2007)

    APREGOAR: Development of a geospatial database applied to local news in Lisbon

    Project Work presented as the partial requirement for obtaining a Master's degree in Geographic Information Systems and Science. There is valuable information in unstructured text format about the location, timing, and nature of events available in digital news content. Several ongoing efforts already attempt to extract event details from digital news sources, but often not with the nuance needed to accurately represent where things actually happen. Alternatively, journalists could manually associate attributes with events described in their articles while publishing, improving accuracy and confidence in these spatial and temporal attributes. These attributes could then be immediately available for evaluating the thematic, temporal, and spatial coverage of an agency's content, as well as improving the user experience of content exploration by providing additional dimensions that can be filtered. Though the technology of assigning geospatial and temporal dimensions for use in consumer-facing applications is not novel, it has yet to be applied at scale to the news. Additionally, most existing systems only support a single point definition of article locations, which may not represent well the actual place(s) of the events described within. This work defines an open source web application and underlying spatial database that supports i) the association of multiple polygons representing where each event occurs, and of time frames associated with the events, in line with the traditional thematic attributes associated with news articles; ii) the contextualization of each article via the addition of inline event maps to clarify to readers where the events of the article occur; and iii) the exploration of the added corpora via thematic, spatial, and temporal filters that display results in interactive coverage maps and lists of articles and events. The project was applied to the greater Lisbon area of Portugal. In addition to the above functionality, this project builds progressive gazetteers that can be reused as place associations, or for further meta-analysis of place as it is colloquially understood. It demonstrates the ease with which these additional dimensions may be incorporated with high confidence in definition accuracy, managed, and leveraged to improve news agency content management, reader understanding, and researcher exploration, or extracted for combination with other datasets to provide additional insights.