20 research outputs found

    Growing Story Forest Online from Massive Breaking News

    We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. In solving these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation on 60 GB of real-world Chinese news data and through detailed pilot user experience studies, although our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks. Comment: Accepted by CIKM 2017, 9 pages
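The online constraint in this abstract (documents are grouped into events as they arrive, and earlier groupings are never restructured) can be illustrated with a generic single-pass clustering sketch. This is not the Story Forest algorithm: the class name `OnlineEventClusterer`, the bag-of-words cosine similarity, and the `threshold` value are all assumptions chosen for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineEventClusterer:
    """Hypothetical single-pass clusterer: each document is assigned once
    and never moved, so existing events are not restructured."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # summed term counts, one Counter per event
        self.members = []           # document ids, one list per event

    def add(self, doc_id, text):
        """Assign a document to the most similar event, or open a new one.

        Returns the index of the event the document was placed in.
        """
        vec = Counter(text.lower().split())
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # grow the event in place
            self.members[best].append(doc_id)
            return best
        self.centroids.append(vec)
        self.members.append([doc_id])
        return len(self.centroids) - 1
```

Calling `add` once per incoming article keeps the per-document cost linear in the number of existing events. A production system would presumably need richer features than raw term counts, which this sketch does not attempt.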

    Matching Natural Language Sentences with Hierarchical Sentence Factorization

    Semantic matching of natural language sentences or identifying the relationship between two sentences is a core research problem underlying many natural language tasks. Depending on whether training data is available, prior research has proposed both unsupervised distance-based schemes and supervised deep learning schemes for sentence matching. However, previous approaches either omit or fail to fully utilize the ordered, hierarchical, and flexible structures of language objects, as well as the interactions between them. In this paper, we propose Hierarchical Sentence Factorization, a technique to factorize a sentence into a hierarchical representation, with the components at each different scale reordered into a "predicate-argument" form. The proposed sentence factorization technique leads to the invention of: 1) a new unsupervised distance metric which calculates the semantic distance between a pair of text snippets by solving a penalized optimal transport problem while preserving the logical relationship of words in the reordered sentences, and 2) new multi-scale deep learning models for supervised semantic training, based on factorized sentence hierarchies. We apply our techniques to text-pair similarity estimation and text-pair relationship classification tasks, based on multiple datasets such as STSbenchmark, the Microsoft Research paraphrase identification (MSRP) dataset, the SICK dataset, etc. Extensive experiments show that the proposed hierarchical sentence factorization can be used to significantly improve the performance of existing unsupervised distance-based metrics as well as multiple supervised deep learning models based on the convolutional neural network (CNN) and long short-term memory (LSTM). Comment: Accepted by WWW 2018, 10 pages
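To give a feel for transport-based text distances of the kind this abstract builds on, the sketch below computes a relaxed lower bound on the optimal-transport cost between two word lists with uniform word weights (the relaxation used in relaxed Word Mover's Distance). It is not the penalized, order-preserving metric the paper proposes; the function names and the toy 2-D `embed` table are assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_transport_distance(words_a, words_b, embed):
    """Lower bound on the optimal-transport cost between two word lists.

    Each word carries equal mass. Dropping one marginal constraint lets every
    source word send all its mass to its nearest target word; the larger of
    the two one-directional costs is returned as the bound.
    """
    def one_way(src, dst):
        total = sum(min(euclidean(embed[s], embed[t]) for t in dst) for s in src)
        return total / len(src)

    return max(one_way(words_a, words_b), one_way(words_b, words_a))


# Toy embeddings, invented purely for this example.
embed = {"cat": (0.0, 0.0), "kitten": (0.0, 1.0), "car": (5.0, 0.0)}
near = relaxed_transport_distance(["cat"], ["kitten"], embed)
far = relaxed_transport_distance(["cat"], ["car"], embed)
```

With these toy vectors, `near` is smaller than `far`, matching the intuition that semantically closer words cost less to transport. Solving the full transport problem, with or without an ordering penalty, requires a linear-programming or Sinkhorn-style solver that is outside the scope of this sketch.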

    A User-Centered Concept Mining System for Query and Document Understanding at Tencent

    Concepts embody the knowledge of the world and facilitate the cognitive processes of human beings. Mining concepts from web documents and constructing the corresponding taxonomy are core research problems in text understanding and support many downstream tasks such as query analysis, knowledge base construction, recommendation, and search. However, we argue that most prior studies extract formal and overly general concepts from Wikipedia or static web pages, which do not represent the user perspective. In this paper, we describe our experience of implementing and deploying ConcepT in Tencent QQ Browser. It discovers user-centered concepts at the right granularity conforming to user interests, by mining a large amount of user queries and interactive search click logs. The extracted concepts have the proper granularity, are consistent with user language styles and are dynamically updated. We further present our techniques to tag documents with user-centered concepts and to construct a topic-concept-instance taxonomy, which has helped to improve search as well as news feeds recommendation in Tencent QQ Browser. We performed extensive offline evaluation to demonstrate that our approach could extract concepts of higher quality compared to several other existing methods. Our system has been deployed in Tencent QQ Browser. Results from online A/B testing involving a large number of real users suggest that the Impression Efficiency of feeds users increased by 6.01% after incorporating the user-centered concepts into the recommendation framework of Tencent QQ Browser. Comment: Accepted by KDD 201

    Understanding the External Links of Video Sharing Sites: Measurement and Analysis


    Towards Understanding the External links of Video Sharing Sites: Measurement and Analysis

    Recently, many video sharing sites provide external links so that their video or audio contents can be embedded into external web sites. For example, users can copy the embedded URLs of the videos of YouTube and post on their own blogs. Clearly, the purpose of such functionality is to increase the distribution of the videos and the associated advertisement. In this paper, we provide a comprehensive measurement study and analysis on these external links. With the traces collected from two major VOD sites, YouTube and Youku of China, we show that the external links have various impacts on the popularity of the VOD sites. More specifically, for videos that have been uploaded for eight months in Youku, around 15% of views can come from external links. Some contents are densely linked, for example, the comedy videos can attract more than 800 external links on average. We also study the relationship between the external links and the internal links. We show that there are correlations; for example, if a video is popular itself, it will likely have a larger number of external links. Another observation is that we always find that the external links of Youku usually have a higher impact than those of YouTube. We conjecture that a more regional site may enjoy a relatively higher impact from the external links.

    Understanding the External Links of Video Sharing Sites: Measurement and Analysis

    Recently, many video sharing sites provide external links so that their video or audio contents can be embedded into external web sites. For example, users can copy the embedded URLs of the videos of YouTube and post the URL links on their own blogs. Clearly, the purpose of such a function is to increase the distribution of the videos and the associated advertisement. Does this function fulfill its purpose, and how can its effect be quantified? In this paper, we provide a comprehensive measurement study and analysis on these external links to answer these two questions. With the traces collected from two major video sharing sites, YouTube and Youku of China, we show that the external links have various impacts on the popularity of the video sharing sites. More specifically, for videos that have been uploaded for eight months in Youku, around 15% of views can come from external links. Some contents are densely linked. For example, comedy videos can attract more than 800 external links on average. We also study the relationship between the external links and the internal links. We show that there are correlations; for example, if a video is popular itself, it is likely to have a large number of external links. Another observation is that the external links usually have a higher impact on Youku than on YouTube. We conjecture that external links are more likely to have a higher impact for a regional site than for a worldwide site.