Classifying Attitude by Topic Aspect for English and Chinese Document Collections

Author: Wu Yejun
Publication date: 25/04/2008

The goal of this dissertation is to explore the design of tools to help users make sense of subjective information in English and Chinese by comparing attitudes on aspects of a topic in English and Chinese document collections. This involves two coupled challenges: topic aspect focus and attitude characterization. The topic aspect focus is specified by using information retrieval techniques to obtain documents on a topic that are of interest to a user and then allowing the user to designate a few segments of those documents to serve as examples for aspects that she wishes to see characterized. A novel feature of this work is that the examples can be drawn from documents in two languages (English and Chinese). A bilingual aspect classifier which applies monolingual and cross-language classification techniques is used to assemble automatically a large set of document segments on those same aspects. A test collection was designed for aspect classification by annotating consecutive sentences in documents from the Topic Detection and Tracking collections as aspect instances. Experiments show that classification effectiveness can often be increased by using training examples from both languages. Attitude characterization is achieved by classifiers which determine the subjectivity and polarity of document segments. Sentence attitude classification is the focus of the experiments in the dissertation because the best presently available test collection for Chinese attitude classification (the NTCIR-6 Chinese Opinion Analysis Pilot Task) is focused on sentence-level classification. A large Chinese sentiment lexicon was constructed by leveraging existing Chinese and English lexical resources, and an existing character-based approach for estimating the semantic orientation of other Chinese words was extended. A shallow linguistic analysis approach was adopted to classify the subjectivity and polarity of a sentence. 
Using the large sentiment lexicon with appropriate handling of negation, and leveraging sentence subjectivity density, sentence positivity and negativity, the resulting sentence attitude classifier was more effective than the best previously reported systems
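
To illustrate the kind of lexicon-based sentence classification with negation handling and subjectivity density that the abstract mentions, here is a minimal sketch. It is not the dissertation's classifier: the tiny English lexicon, the negator list, the two-token negation window, and the function name `sentence_polarity` are all assumptions made for this example.

```python
# Hypothetical toy resources; the dissertation's lexicon is far larger and Chinese-focused.
LEXICON = {"good": 1, "effective": 1, "bad": -1, "poor": -1}
NEGATORS = {"not", "never", "no"}


def sentence_polarity(tokens, lexicon=LEXICON, negators=NEGATORS, window=2):
    """Return (label, subjectivity_density) for a tokenized sentence.

    subjectivity_density is the fraction of tokens found in the lexicon.
    """
    score = 0
    hits = 0
    for i, tok in enumerate(tokens):
        value = lexicon.get(tok)
        if value is None:
            continue
        hits += 1
        # Flip the polarity when a negator occurs within `window` tokens before the word.
        preceding = tokens[max(0, i - window):i]
        if any(p in negators for p in preceding):
            value = -value
        score += value
    density = hits / len(tokens) if tokens else 0.0
    if hits == 0:
        return "objective", density
    if score > 0:
        return "positive", density
    if score < 0:
        return "negative", density
    return "neutral", density
```

A call such as `sentence_polarity("the system is not good".split())` treats "good" as negated because "not" falls inside the window. A real system would need word segmentation and a far richer treatment of negation scope.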

Digital Repository at the University of Maryland

Large scale evaluations of multimedia information retrieval: the TRECVid experience

Author: G. Kazai
J. G. Fiscus
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

Information Retrieval is a supporting technique which underpins a broad range of content-based applications including retrieval, filtering, summarisation, browsing, classification, clustering, automatic linking, and others. Multimedia information retrieval (MMIR) represents those applications when applied to multimedia information such as image, video, music, etc. In this presentation and extended abstract we are primarily concerned with MMIR as applied to information in digital video format. We begin with a brief overview of large scale evaluations of IR tasks in areas such as text, image and music, just to illustrate that this phenomenon is not just restricted to MMIR on video. The main contribution, however, is a set of pointers and a summarisation of the work done as part of TRECVid, the annual benchmarking exercise for video retrieval tasks

CiteSeerX

Crossref

Irish Universities

DCU Online Research Access Service

Event detection, tracking, and visualization in Twitter: a mention-anomaly-based approach

Author: Favre Cecile
Guille Adrien
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/05/2015

The ever-growing number of people using Twitter makes it a valuable source of timely information. However, detecting events in Twitter is a difficult task, because tweets that report interesting events are overwhelmed by a large volume of tweets on unrelated topics. Existing methods focus on the textual content of tweets and ignore the social aspect of Twitter. In this paper we propose MABED (i.e. mention-anomaly-based event detection), a novel statistical method that relies solely on tweets and leverages the creation frequency of dynamic links (i.e. mentions) that users insert in tweets to detect significant events and estimate the magnitude of their impact over the crowd. MABED also differs from the literature in that it dynamically estimates the period of time during which each event is discussed, rather than assuming a predefined fixed duration for all events. The experiments we conducted on both English and French Twitter data show that the mention-anomaly-based approach leads to more accurate event detection and improved robustness in the presence of noisy Twitter content. Qualitatively speaking, we find that MABED helps with the interpretation of detected events by providing clear textual descriptions and precise temporal descriptions. We also show how MABED can help in understanding users' interests. Furthermore, we describe three visualizations designed to favor an efficient exploration of the detected events.
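
The idea of scoring a word by its mention anomaly and then estimating the period over which an event is discussed can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the anomaly is taken as observed minus expected counts of mention-bearing tweets per time slice, the event period is taken as the contiguous run of slices with the largest summed anomaly, and the function names and input layout are hypothetical.

```python
def mention_anomaly(mention_counts, slice_sizes):
    """Observed minus expected mention-bearing tweets per time slice.

    mention_counts[i]: tweets in slice i that contain the word and at least one mention.
    slice_sizes[i]: total number of tweets in slice i.
    The expectation assumes such tweets are spread proportionally to slice size.
    """
    total_tweets = sum(slice_sizes)
    total_mentions = sum(mention_counts)
    if total_tweets == 0:
        return [0.0 for _ in mention_counts]
    return [
        obs - n * total_mentions / total_tweets
        for obs, n in zip(mention_counts, slice_sizes)
    ]


def best_interval(anomaly):
    """Return (start, end, magnitude) of the contiguous run with the largest sum.

    Indices are inclusive; `anomaly` must be non-empty. Uses Kadane's algorithm.
    """
    best = (0, 0, anomaly[0])
    cur_start, cur_sum = 0, 0.0
    for i, a in enumerate(anomaly):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_start, cur_sum = i, a
        else:
            cur_sum += a
        if cur_sum > best[2]:
            best = (cur_start, i, cur_sum)
    return best
```

Because the interval is chosen from the data, bursts of different lengths yield periods of different lengths, which is the contrast with a fixed event duration that the abstract draws.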

arXiv.org e-Print Archive

Crossref

HAL

Hal-Diderot

High-level feature detection from video in TRECVid: a 5-year retrospective of achievements

Author: A. F. Smeaton
A. Loui
A. P. Natsev
A. Smeulders
C. G. M. Snoek
C. G. Snoek
E. Yilmaz
M. G. Christel
M. Naphade
M. R. Naphade
P. Joly
P. Over
T. Volkmer
W. Kraaij
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

Successful and effective content-based access to digital video requires fast, accurate and scalable methods to determine the video content automatically. A variety of contemporary approaches to this rely on text taken from speech within the video, or on matching one video frame against others using low-level characteristics like colour, texture, or shapes, or on determining and matching objects appearing within the video. Possibly the most important technique, however, is one which determines the presence or absence of a high-level or semantic feature, within a video clip or shot. By utilizing dozens, hundreds or even thousands of such semantic features we can support many kinds of content-based video navigation. Critically however, this depends on being able to determine whether each feature is or is not present in a video clip. The last 5 years have seen much progress in the development of techniques to determine the presence of semantic features within video. This progress can be tracked in the annual TRECVid benchmarking activity where dozens of research groups measure the effectiveness of their techniques on common data and using an open, metrics-based approach. In this chapter we summarise the work done on the TRECVid high-level feature task, showing the progress made year-on-year. This provides a fairly comprehensive statement on where the state-of-the-art is regarding this important task, not just for one research group or for one approach, but across the spectrum. We then use this past and on-going work as a basis for highlighting the trends that are emerging in this area, and the questions which remain to be addressed before we can achieve large-scale, fast and reliable high-level feature detection on video

CiteSeerX

Crossref

Irish Universities

DCU Online Research Access Service

Growing Story Forest Online from Massive Breaking News

Author: Kong Linglong
Lai Kunfeng
Liu Bang
Niu Di
Xu Yu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 28/02/2018

We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. In solving these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted extensive evaluation based on 60 GB of real-world Chinese news data, although our ideas are not language-dependent and can easily be extended to other languages, through detailed pilot user experience studies. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks.

Comment: Accepted by CIKM 2017, 9 pages

arXiv.org e-Print Archive

Crossref

Transformational tagging for topic tracking in natural language.

Author: Ip Chun Wah Timmy
Publication date: 01/01/2000

Ip Chun Wah Timmy. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 113-120). Abstracts in English and Chinese.

Chapter 1: Introduction
1.1 Topic Detection and Tracking
1.1.1 What is a Topic?
1.1.2 What is Topic Tracking?
1.2 Research Contributions
1.2.1 Named Entity Tagging
1.2.2 Handling Unknown Words
1.2.3 Named-Entity Approach in Topic Tracking
1.3 Organization of Thesis

Chapter 2: Background
2.1 Previous Developments in Topic Tracking
2.1.1 BBN's Tracking System
2.1.2 CMU's Tracking System
2.1.3 Dragon's Tracking System
2.1.4 UPenn's Tracking System
2.2 Topic Tracking in Chinese
2.3 Part-of-Speech Tagging
2.3.1 A Brief Overview of POS Tagging
2.3.2 Transformation-based Error-Driven Learning
2.4 Unknown Word Identification
2.4.1 Rule-based approaches
2.4.2 Statistical approaches
2.4.3 Hybrid approaches
2.5 Information Retrieval Models
2.5.1 Vector-Space Model
2.5.2 Probabilistic Model
2.6 Chapter Summary

Chapter 3: System Overview
3.1 Segmenter
3.2 TEL Tagger
3.3 Unknown Words Identifier
3.4 Topic Tracker
3.5 Chapter Summary

Chapter 4: Named Entity Tagging
4.1 Experimental Data
4.2 Transformational Tagging
4.2.1 Notations
4.2.2 Corpus Utilization
4.2.3 Lexical Rules
4.2.4 Contextual Rules
4.3 Experiment and Result
4.3.1 Lexical Tag Initialization
4.3.2 Contribution of Lexical and Contextual Rules
4.3.3 Performance on Unknown Words
4.3.4 A Possible Benchmark
4.3.5 Comparison between TEL Approach and the Stochastic Approach
4.4 Chapter Summary

Chapter 5: Handling Unknown Words in Topic Tracking
5.1 Overview
5.2 Person Names
5.2.1 Forming possible named entities from OOV by grouping n-grams
5.2.2 Overlapping
5.3 Organization Names
5.4 Location Names
5.5 Dates and Times
5.6 Chapter Summary

Chapter 6: Topic Tracking in Chinese
6.1 Introduction of Topic Tracking
6.2 Experimental Data
6.3 Evaluation Methodology
6.3.1 Cost Function
6.3.2 DET Curve
6.4 The Named Entity Approach
6.4.1 Designing the Named Entities Set for Topic Tracking
6.4.2 Feature Selection
6.4.3 Integrated with Vector-Space Model
6.5 Experimental Results and Analysis
6.5.1 Notations
6.5.2 Stopword Elimination
6.5.3 TEL Tagging
6.5.4 Unknown Word Identifier
6.5.5 Error Analysis
6.6 Chapter Summary

Chapter 7: Conclusions and Future Work
7.1 Conclusions
7.2 Future Work

Bibliography

Appendix A: The POS Tags
Appendix B: Surnames and transliterated characters
Appendix C: Stopword List for Person Name
Appendix D: Organization suffixes
Appendix E: Location suffixes
Appendix F: Examples of Feature Table (Train set with condition D410)

CUHK Digital Repository

Towards cross-lingual alerting for bursty epidemic events

Author: Collier Nigel
Publication date: 01/01/2011

Background: Online news reports are increasingly becoming a source for event based early warning systems that detect natural disasters. Harnessing the massive volume of information available from multilingual newswire presents as many challenges as opportunities due to the patterns of reporting complex spatiotemporal events. Results: In this article we study the problem of utilising correlated event reports across languages. We track the evolution of 16 disease outbreaks using 5 temporal aberration detection algorithms on text-mined events classified according to disease and outbreak country. Using ProMED reports as a silver standard, comparative analysis of news data for 13 languages over a 129 day trial period showed improved sensitivity, F1 and timeliness across most models using cross-lingual events. We report a detailed case study analysis for Cholera in Angola 2010 which highlights the challenges faced in correlating news events with the silver standard. Conclusions: The results show that automated health surveillance using multilingual text mining has the potential to turn low value news into high value alerts if informed choices are used to govern the selection of models and data sources. An implementation of the C2 alerting algorithm using multilingual news is available at the BioCaster portal http://born.nii.ac.jp/?page=globalroundup
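
For readers unfamiliar with the C2 alerting algorithm named above, the following sketch shows the commonly described form of the EARS C2 statistic: each day's count is compared against the mean and standard deviation of a 7-day baseline that ends two days earlier, and an alert is raised when the standardized difference exceeds 3. The abstract does not state the parameters used in the BioCaster implementation, so the defaults here, the `min_sigma` floor for flat baselines, and the function name `c2_alerts` are assumptions.

```python
from statistics import mean, stdev


def c2_alerts(counts, baseline=7, guard=2, threshold=3.0, min_sigma=0.2):
    """Return the indices of days whose C2 statistic exceeds `threshold`.

    counts: daily event counts for one disease/country pair.
    baseline: number of days used to estimate the expected level.
    guard: gap (in days) between the baseline window and the day under test.
    min_sigma: assumed floor that avoids division by zero on flat baselines.
    """
    alerts = []
    for t in range(baseline + guard, len(counts)):
        window = counts[t - guard - baseline:t - guard]
        mu = mean(window)
        sigma = max(stdev(window), min_sigma)
        statistic = (counts[t] - mu) / sigma
        if statistic > threshold:
            alerts.append(t)
    return alerts
```

The guard band keeps the first days of an emerging outbreak out of the baseline, so a gradual rise does not inflate the expected level and mask itself.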

arXiv.org e-Print Archive

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central