Classifying Attitude by Topic Aspect for English and Chinese Document Collections
The goal of this dissertation is to explore the design of tools to help users make sense of subjective information in English and Chinese by comparing attitudes on aspects of a topic in English and Chinese document collections. This involves two coupled challenges: topic aspect focus and attitude characterization. The topic aspect focus is specified by using information retrieval techniques to obtain documents on a topic that are of interest to a user and then
allowing the user to designate a few segments of those documents to serve as examples for aspects that she wishes to see characterized. A novel feature of this work is that the examples can be drawn from documents in two languages (English and Chinese). A bilingual aspect classifier which applies monolingual and cross-language classification techniques is used to assemble automatically a large set of document segments on those same aspects. A test collection was designed for aspect classification by annotating consecutive sentences in documents from the Topic Detection and Tracking collections as aspect instances. Experiments show that classification effectiveness can often be
increased by using training examples from both languages.
Attitude characterization is achieved by classifiers which determine the subjectivity and polarity of document segments. Sentence attitude classification is the focus of the experiments in
the dissertation because the best presently available test collection for Chinese attitude classification (the NTCIR-6 Chinese Opinion Analysis Pilot Task) is focused on sentence-level
classification. A large Chinese sentiment lexicon was constructed by leveraging existing Chinese and English lexical resources, and an
existing character-based approach for estimating the semantic orientation of other Chinese words was extended. A shallow linguistic analysis approach was adopted to classify the subjectivity and polarity of a sentence. Using the large sentiment lexicon with appropriate handling of negation, and leveraging sentence subjectivity density, sentence positivity and negativity, the resulting sentence attitude classifier was more effective than the best previously reported systems.
Large scale evaluations of multimedia information retrieval: the TRECVid experience
Information Retrieval is a supporting technique which underpins a broad range of content-based applications including retrieval, filtering, summarisation, browsing, classification, clustering, automatic linking, and others. Multimedia information retrieval (MMIR) represents those applications when applied to multimedia information such as image, video, music, etc. In this presentation and extended abstract we are primarily concerned with MMIR as applied to information in digital video format. We begin with a brief overview of large scale evaluations of IR tasks in areas such as text, image and music, to illustrate that this phenomenon is not restricted to MMIR on video. The main contribution, however, is a set of pointers and a summarisation of the work done as part of TRECVid, the annual benchmarking exercise for video retrieval tasks.
Event detection, tracking, and visualization in Twitter: a mention-anomaly-based approach
The ever-growing number of people using Twitter makes it a valuable source of
timely information. However, detecting events in Twitter is a difficult task,
because tweets that report interesting events are overwhelmed by a large volume
of tweets on unrelated topics. Existing methods focus on the textual content of
tweets and ignore the social aspect of Twitter. In this paper we propose MABED
(i.e. mention-anomaly-based event detection), a novel statistical method that
relies solely on tweets and leverages the creation frequency of dynamic links
(i.e. mentions) that users insert in tweets to detect significant events and
estimate the magnitude of their impact over the crowd. MABED also differs from
the literature in that it dynamically estimates the period of time during which
each event is discussed, rather than assuming a predefined fixed duration for
all events. The experiments we conducted on both English and French Twitter
data show that the mention-anomaly-based approach leads to more accurate event
detection and improved robustness in the presence of noisy Twitter content.
Qualitatively speaking, we find that MABED helps with the interpretation of
detected events by providing clear textual descriptions and precise temporal
descriptions. We also show how MABED can help in understanding users' interests.
Furthermore, we describe three visualizations designed to support efficient
exploration of the detected events. Comment: 17 pages
High-level feature detection from video in TRECVid: a 5-year retrospective of achievements
Successful and effective content-based access to digital
video requires fast, accurate and scalable methods to determine the video content automatically. A variety of contemporary approaches to this rely on text taken from speech within the video, or on matching one video frame against others using low-level characteristics like
colour, texture, or shapes, or on determining and matching objects appearing within the video. Possibly the most important technique, however, is one which determines the presence or absence of a high-level or semantic feature within a video clip or shot. By utilizing dozens, hundreds or even thousands of such semantic features we can support many kinds of content-based video navigation. Critically, however, this depends on being able to determine whether each feature is or is not present in a video clip.
The last 5 years have seen much progress in the development of techniques to determine the presence of semantic features within video. This progress can be tracked in the annual TRECVid benchmarking activity where dozens of research groups measure the effectiveness of their techniques on common data and using an open, metrics-based approach. In this chapter we summarise the work
done on the TRECVid high-level feature task, showing the
progress made year-on-year. This provides a fairly comprehensive statement on where the state-of-the-art is regarding this important task, not just for one research group or for one approach, but across the spectrum. We then use this past and on-going work as a basis for highlighting the trends that are emerging in this area, and the questions which remain to be addressed before we can
achieve large-scale, fast and reliable high-level feature detection on video.
Growing Story Forest Online from Massive Breaking News
We describe our experience of implementing a news content organization system
at Tencent that discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. Our real-world system has distinct
requirements in contrast to previous studies on topic detection and tracking
(TDT) and event timeline or graph generation, in that we 1) need to accurately
and quickly extract distinguishable events from massive streams of long text
documents that cover diverse topics and contain highly redundant information,
and 2) must develop the structures of event stories in an online manner,
without repeatedly restructuring previously formed stories, in order to
guarantee a consistent user viewing experience. In solving these challenges, we
propose Story Forest, a set of online schemes that automatically clusters
streaming documents into events, while connecting related events in growing
trees to tell evolving stories. We conducted an extensive evaluation based on 60
GB of real-world Chinese news data and on detailed pilot user experience
studies, although our ideas are not language-dependent and can easily be
extended to other languages. The results demonstrate the superior
capability of Story Forest to accurately identify events and organize news text
into a logical structure that is appealing to human readers, compared to
multiple existing algorithm frameworks. Comment: Accepted by CIKM 2017, 9 pages
Transformational tagging for topic tracking in natural language.
Ip Chun Wah Timmy. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 113-120). Abstracts in English and Chinese.
Chapter 1 --- Introduction
Chapter 1.1 --- Topic Detection and Tracking
Chapter 1.1.1 --- What is a Topic?
Chapter 1.1.2 --- What is Topic Tracking?
Chapter 1.2 --- Research Contributions
Chapter 1.2.1 --- Named Entity Tagging
Chapter 1.2.2 --- Handling Unknown Words
Chapter 1.2.3 --- Named-Entity Approach in Topic Tracking
Chapter 1.3 --- Organization of Thesis
Chapter 2 --- Background
Chapter 2.1 --- Previous Developments in Topic Tracking
Chapter 2.1.1 --- BBN's Tracking System
Chapter 2.1.2 --- CMU's Tracking System
Chapter 2.1.3 --- Dragon's Tracking System
Chapter 2.1.4 --- UPenn's Tracking System
Chapter 2.2 --- Topic Tracking in Chinese
Chapter 2.3 --- Part-of-Speech Tagging
Chapter 2.3.1 --- A Brief Overview of POS Tagging
Chapter 2.3.2 --- Transformation-based Error-Driven Learning
Chapter 2.4 --- Unknown Word Identification
Chapter 2.4.1 --- Rule-based approaches
Chapter 2.4.2 --- Statistical approaches
Chapter 2.4.3 --- Hybrid approaches
Chapter 2.5 --- Information Retrieval Models
Chapter 2.5.1 --- Vector-Space Model
Chapter 2.5.2 --- Probabilistic Model
Chapter 2.6 --- Chapter Summary
Chapter 3 --- System Overview
Chapter 3.1 --- Segmenter
Chapter 3.2 --- TEL Tagger
Chapter 3.3 --- Unknown Words Identifier
Chapter 3.4 --- Topic Tracker
Chapter 3.5 --- Chapter Summary
Chapter 4 --- Named Entity Tagging
Chapter 4.1 --- Experimental Data
Chapter 4.2 --- Transformational Tagging
Chapter 4.2.1 --- Notations
Chapter 4.2.2 --- Corpus Utilization
Chapter 4.2.3 --- Lexical Rules
Chapter 4.2.4 --- Contextual Rules
Chapter 4.3 --- Experiment and Result
Chapter 4.3.1 --- Lexical Tag Initialization
Chapter 4.3.2 --- Contribution of Lexical and Contextual Rules
Chapter 4.3.3 --- Performance on Unknown Words
Chapter 4.3.4 --- A Possible Benchmark
Chapter 4.3.5 --- Comparison between TEL Approach and the Stochastic Approach
Chapter 4.4 --- Chapter Summary
Chapter 5 --- Handling Unknown Words in Topic Tracking
Chapter 5.1 --- Overview
Chapter 5.2 --- Person Names
Chapter 5.2.1 --- Forming possible named entities from OOV by grouping n-grams
Chapter 5.2.2 --- Overlapping
Chapter 5.3 --- Organization Names
Chapter 5.4 --- Location Names
Chapter 5.5 --- Dates and Times
Chapter 5.6 --- Chapter Summary
Chapter 6 --- Topic Tracking in Chinese
Chapter 6.1 --- Introduction of Topic Tracking
Chapter 6.2 --- Experimental Data
Chapter 6.3 --- Evaluation Methodology
Chapter 6.3.1 --- Cost Function
Chapter 6.3.2 --- DET Curve
Chapter 6.4 --- The Named Entity Approach
Chapter 6.4.1 --- Designing the Named Entities Set for Topic Tracking
Chapter 6.4.2 --- Feature Selection
Chapter 6.4.3 --- Integrated with Vector-Space Model
Chapter 6.5 --- Experimental Results and Analysis
Chapter 6.5.1 --- Notations
Chapter 6.5.2 --- Stopword Elimination
Chapter 6.5.3 --- TEL Tagging
Chapter 6.5.4 --- Unknown Word Identifier
Chapter 6.5.5 --- Error Analysis
Chapter 6.6 --- Chapter Summary
Chapter 7 --- Conclusions and Future Work
Chapter 7.1 --- Conclusions
Chapter 7.2 --- Future Work
Bibliography
Chapter A --- The POS Tags
Chapter B --- Surnames and transliterated characters
Chapter C --- Stopword List for Person Name
Chapter D --- Organization suffixes
Chapter E --- Location suffixes
Chapter F --- Examples of Feature Table (Train set with condition D410)
Towards cross-lingual alerting for bursty epidemic events
Background: Online news reports are increasingly becoming a source for
event-based early warning systems that detect natural disasters. Harnessing the
massive volume of information available from multilingual newswire presents as
many challenges as opportunities due to the patterns of reporting complex
spatiotemporal events. Results: In this article we study the problem of
utilising correlated event reports across languages. We track the evolution of
16 disease outbreaks using 5 temporal aberration detection algorithms on
text-mined events classified according to disease and outbreak country. Using
ProMED reports as a silver standard, comparative analysis of news data for 13
languages over a 129 day trial period showed improved sensitivity, F1 and
timeliness across most models using cross-lingual events. We report a detailed
case study analysis for Cholera in Angola 2010 which highlights the challenges
faced in correlating news events with the silver standard. Conclusions: The
results show that automated health surveillance using multilingual text mining
has the potential to turn low value news into high value alerts if informed
choices are used to govern the selection of models and data sources. An
implementation of the C2 alerting algorithm using multilingual news is
available at the BioCaster portal http://born.nii.ac.jp/?page=globalroundup