5,221 research outputs found
Parsing Thai Social Data: A New Challenge for Thai NLP
Dependency parsing (DP) is a task that analyzes text for syntactic structure
and relationship between words. DP is widely used to improve natural language
processing (NLP) applications in many languages such as English. Previous works
on DP are generally applicable to formally written languages. However, they do
not apply to informal languages such as the ones used in social networks.
Therefore, DP has to be researched and explored with such social network data.
In this paper, we explore and identify a DP model that is suitable for Thai
social network data. After that, we will identify the appropriate linguistic
unit as an input. The result showed that, the transition based model called,
improve Elkared dependency parser outperform the others at UAS of 81.42%.Comment: 7 Pages, 8 figures, to be published in The 14th International Joint
Symposium on Artificial Intelligence and Natural Language Processing
(iSAI-NLP 2019
Growing Story Forest Online from Massive Breaking News
We describe our experience of implementing a news content organization system
at Tencent that discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. Our real-world system has distinct
requirements in contrast to previous studies on topic detection and tracking
(TDT) and event timeline or graph generation, in that we 1) need to accurately
and quickly extract distinguishable events from massive streams of long text
documents that cover diverse topics and contain highly redundant information,
and 2) must develop the structures of event stories in an online manner,
without repeatedly restructuring previously formed stories, in order to
guarantee a consistent user viewing experience. In solving these challenges, we
propose Story Forest, a set of online schemes that automatically clusters
streaming documents into events, while connecting related events in growing
trees to tell evolving stories. We conducted extensive evaluation based on 60
GB of real-world Chinese news data, although our ideas are not
language-dependent and can easily be extended to other languages, through
detailed pilot user experience studies. The results demonstrate the superior
capability of Story Forest to accurately identify events and organize news text
into a logical structure that is appealing to human readers, compared to
multiple existing algorithm frameworks.Comment: Accepted by CIKM 2017, 9 page
DancingLines: An Analytical Scheme to Depict Cross-Platform Event Popularity
Nowadays, events usually burst and are propagated online through multiple
modern media like social networks and search engines. There exists various
research discussing the event dissemination trends on individual medium, while
few studies focus on event popularity analysis from a cross-platform
perspective. Challenges come from the vast diversity of events and media,
limited access to aligned datasets across different media and a great deal of
noise in the datasets. In this paper, we design DancingLines, an innovative
scheme that captures and quantitatively analyzes event popularity between
pairwise text media. It contains two models: TF-SW, a semantic-aware popularity
quantification model, based on an integrated weight coefficient leveraging
Word2Vec and TextRank; and wDTW-CD, a pairwise event popularity time series
alignment model matching different event phases adapted from Dynamic Time
Warping. We also propose three metrics to interpret event popularity trends
between pairwise social platforms. Experimental results on eighteen real-world
event datasets from an influential social network and a popular search engine
validate the effectiveness and applicability of our scheme. DancingLines is
demonstrated to possess broad application potentials for discovering the
knowledge of various aspects related to events and different media
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project
What Makes it Difficult to Understand a Scientific Literature?
In the artificial intelligence area, one of the ultimate goals is to make
computers understand human language and offer assistance. In order to achieve
this ideal, researchers of computer science have put forward a lot of models
and algorithms attempting at enabling the machine to analyze and process human
natural language on different levels of semantics. Although recent progress in
this field offers much hope, we still have to ask whether current research can
provide assistance that people really desire in reading and comprehension. To
this end, we conducted a reading comprehension test on two scientific papers
which are written in different styles. We use the semantic link models to
analyze the understanding obstacles that people will face in the process of
reading and figure out what makes it difficult for human to understand a
scientific literature. Through such analysis, we summarized some
characteristics and problems which are reflected by people with different
levels of knowledge on the comprehension of difficult science and technology
literature, which can be modeled in semantic link network. We believe that
these characteristics and problems will help us re-examine the existing machine
models and are helpful in the designing of new one.Comment: Accepted by SKG201
- …