Growing Story Forest Online from Massive Breaking News
We describe our experience of implementing a news content organization system
at Tencent that discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. Our real-world system has distinct
requirements in contrast to previous studies on topic detection and tracking
(TDT) and event timeline or graph generation, in that we 1) need to accurately
and quickly extract distinguishable events from massive streams of long text
documents that cover diverse topics and contain highly redundant information,
and 2) must develop the structures of event stories in an online manner,
without repeatedly restructuring previously formed stories, in order to
guarantee a consistent user viewing experience. In solving these challenges, we
propose Story Forest, a set of online schemes that automatically clusters
streaming documents into events, while connecting related events in growing
trees to tell evolving stories. We conducted an extensive evaluation on 60 GB
of real-world Chinese news data and through detailed pilot user experience
studies, although our ideas are not language-dependent and can easily be
extended to other languages. The results demonstrate the superior
capability of Story Forest to accurately identify events and organize news text
into a logical structure that is appealing to human readers, compared to
multiple existing algorithm frameworks.
Comment: Accepted by CIKM 2017, 9 pages
Matching Natural Language Sentences with Hierarchical Sentence Factorization
Semantic matching of natural language sentences or identifying the
relationship between two sentences is a core research problem underlying many
natural language tasks. Depending on whether training data is available, prior
research has proposed both unsupervised distance-based schemes and supervised
deep learning schemes for sentence matching. However, previous approaches
either omit or fail to fully utilize the ordered, hierarchical, and flexible
structures of language objects, as well as the interactions between them. In
this paper, we propose Hierarchical Sentence Factorization---a technique to
factorize a sentence into a hierarchical representation, with the components at
each different scale reordered into a "predicate-argument" form. The proposed
sentence factorization technique leads to the invention of: 1) a new
unsupervised distance metric which calculates the semantic distance between a
pair of text snippets by solving a penalized optimal transport problem while
preserving the logical relationship of words in the reordered sentences, and 2)
new multi-scale deep learning models for supervised semantic training, based on
factorized sentence hierarchies. We apply our techniques to text-pair
similarity estimation and text-pair relationship classification tasks, based on
multiple datasets such as STSbenchmark, the Microsoft Research paraphrase
identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments
show that the proposed hierarchical sentence factorization can be used to
significantly improve the performance of existing unsupervised distance-based
metrics as well as multiple supervised deep learning models based on the
convolutional neural network (CNN) and long short-term memory (LSTM).
Comment: Accepted by WWW 2018, 10 pages
A User-Centered Concept Mining System for Query and Document Understanding at Tencent
Concepts embody the knowledge of the world and facilitate the cognitive
processes of human beings. Mining concepts from web documents and constructing
the corresponding taxonomy are core research problems in text understanding and
support many downstream tasks such as query analysis, knowledge base
construction, recommendation, and search. However, we argue that most prior
studies extract formal and overly general concepts from Wikipedia or static web
pages, which do not represent the user perspective. In this paper, we
describe our experience of implementing and deploying ConcepT in Tencent QQ
Browser. It discovers user-centered concepts at the right granularity
conforming to user interests, by mining a large amount of user queries and
interactive search click logs. The extracted concepts have the proper
granularity, are consistent with user language styles and are dynamically
updated. We further present our techniques to tag documents with user-centered
concepts and to construct a topic-concept-instance taxonomy, which has helped
to improve search as well as news feeds recommendation in Tencent QQ Browser.
We performed extensive offline evaluation to demonstrate that our approach
could extract concepts of higher quality compared to several other existing
methods. Our system has been deployed in Tencent QQ Browser. Results from
online A/B testing involving a large number of real users suggest that the
Impression Efficiency of feeds users increased by 6.01% after incorporating the
user-centered concepts into the recommendation framework of Tencent QQ Browser.
Comment: Accepted by KDD 201
Towards Understanding the External links of Video Sharing Sites: Measurement and Analysis
Recently, many video sharing sites provide external links so that their video or audio contents can be embedded into external web sites. For example, users can copy the embedded URLs of the videos of YouTube and post on their own blogs. Clearly, the purpose of such functionality is to increase the distribution of the videos and the associated advertisement. In this paper, we provide a comprehensive measurement study and analysis on these external links. With the traces collected from two major VOD sites, YouTube and Youku of China, we show that the external links have various impact on the popularity of the VOD sites. More specifically, for videos that have been uploaded for eight months in Youku, around 15% of views can come from external links. Some contents are densely linked, for example, the comedy videos can attract more than 800 external links on average. We also study the relationship between the external links and the internal links. We show that there are correlations; for example, if a video is popular itself, it will likely have a larger number of external links. Another observation is that we always find that the external links of Youku usually have a higher impact than that of YouTube. We conjecture that a more regional site may enjoy a relatively higher impact from the external links.
Understanding the External Links of Video Sharing Sites: Measurement and Analysis
Recently, many video sharing sites provide external links so that their video or audio contents can be embedded into external web sites. For example, users can copy the embedded URLs of YouTube videos and post the links on their own blogs. Clearly, the purpose of this function is to increase the distribution of the videos and the associated advertisement. Does this function fulfill its purpose, and how can its effect be quantified? In this paper, we provide a comprehensive measurement study and analysis of these external links to answer these two questions. With traces collected from two major video sharing sites, YouTube and Youku of China, we show that the external links have various impacts on the popularity of the video sharing sites. More specifically, for videos that have been uploaded for eight months in Youku, around 15% of views can come from external links. Some contents are densely linked. For example, comedy videos can attract more than 800 external links on average. We also study the relationship between the external links and the internal links. We show that there are correlations; for example, if a video is popular itself, it is likely to have a large number of external links. Another observation is that the external links usually have a higher impact on Youku than on YouTube. We conjecture that external links are more likely to have a higher impact on a regional site than on a worldwide site.