121 research outputs found

    Growing Story Forest Online from Massive Breaking News

    We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation on 60 GB of real-world Chinese news data, complemented by detailed pilot user experience studies, although our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks. Comment: Accepted by CIKM 2017, 9 pages.
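
The online requirement described above (assign each incoming document to an event without restructuring earlier assignments) can be illustrated with a generic single-pass clustering sketch. This is not the Story Forest algorithm itself: the `Event` class, the `assign_online` function, the Jaccard similarity on keyword sets, and the threshold of 0.3 are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A hypothetical event: a set of keywords plus the ids of its documents."""
    keywords: set
    doc_ids: list = field(default_factory=list)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two keyword sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_online(events: list, doc_id: str, keywords: set,
                  threshold: float = 0.3) -> int:
    """Attach a document to the most similar event, or open a new one.

    Existing events are only appended to, never split or merged, so
    earlier assignments stay stable as the stream grows.
    """
    best_idx, best_sim = -1, 0.0
    for idx, event in enumerate(events):
        sim = jaccard(event.keywords, keywords)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx >= 0 and best_sim >= threshold:
        events[best_idx].doc_ids.append(doc_id)
        events[best_idx].keywords |= keywords
        return best_idx
    events.append(Event(keywords=set(keywords), doc_ids=[doc_id]))
    return len(events) - 1


# Toy stream of (document id, extracted keywords); the data is made up.
stream = [
    ("d1", {"earthquake", "japan", "tsunami"}),
    ("d2", {"election", "senate", "vote"}),
    ("d3", {"earthquake", "japan", "rescue"}),
]
events = []
assignments = [assign_online(events, doc_id, kw) for doc_id, kw in stream]
```

In this toy stream the first and third documents share two of four distinct keywords, so they end up in the same event, while the second opens a separate one. Linking related events into story trees, as the abstract describes, would need an additional step that this sketch leaves out.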

    An adaptive model checking test for functional linear model

    Numerous studies have been devoted to the estimation and inference problems for functional linear models (FLM). However, few works focus on the model checking problem, which ensures the reliability of results. The limited tests in this area do not have tractable null distributions or asymptotic analysis under alternatives. Also, the functional predictor is usually assumed to be fully observed, which is impractical. To address these problems, we propose an adaptive model checking test for FLM. It combines regular moment-based and conditional moment-based tests, and achieves model adaptivity via the dimension of a residual-based subspace. The advantages of our test are manifold. First, it has a tractable chi-squared null distribution and higher power under the alternatives than its components. Second, asymptotic properties under different underlying models are developed, including previously unstudied local alternatives. Third, the test statistic is constructed upon finite grid points, which incorporates the discrete nature of collected data. We derive the relationship between sample size and number of grid points required to maintain the asymptotic properties. In addition, we provide a data-driven approach to estimate the dimension leading to model adaptivity, which is promising in sufficient dimension reduction. We conduct comprehensive numerical experiments to demonstrate the advantages the test inherits from its two simple components.

    Development of a genome-wide multiple duplex-SSR protocol and its applications for the identification of selfed progeny in switchgrass

    Background: Switchgrass (Panicum virgatum) is a herbaceous crop under development as a cellulosic biofuel feedstock in the USA and Europe. As switchgrass is a naturally outcrossing species, accurate identification of selfed progeny is important to producing inbreds, which can be used in the production of heterotic hybrids. A technically reliable, time-saving and easy-to-use marker system is needed to quantify and characterize the breeding origin of progeny plants of targeted parents. Results: Genome-wide screening of 915 mapped microsatellite (simple sequence repeat, SSR) markers was conducted, and 842 (92.0%) produced clear and scorable bands on a pooled DNA sample of eight switchgrass varieties. A total of 166 primer pairs were selected on the basis of their relatively even distribution in the switchgrass genome and PCR amplification quality on 16 tetraploid genotypes. The mean polymorphic information content value for the 166 markers was 0.810, ranging from 0.116 to 0.959. From them, a core set of 48 loci, which had been mapped on 17 linkage groups, was further tested and optimized to develop 24 sets of duplex markers. Most (up to 87.5%) of the targeted but non-allelic amplicons within each duplex were separated by more than 10 bp. Using the established duplex PCR protocol, the selfing ratio (i.e., selfed/all progeny × 100%) was identified as 0% for a randomly selected open-pollinated 'Kanlow' genotype grown in the field, 15.4% for 22 field-grown plants of bagged inflorescences, and 77.3% for a selected plant grown in a growth chamber. Conclusions: The study developed a duplex SSR-based PCR protocol consisting of 48 markers, providing ample choices of non-tightly-linked loci across the switchgrass genome, and representing a powerful, time-saving and easy-to-use method for the identification of selfed progeny in switchgrass. The protocol should be a valuable tool in switchgrass breeding efforts. Peer reviewed. Plant and Soil Science

    Online Local Differential Private Quantile Inference via Self-normalization

    Based on binary inquiries, we developed an algorithm to estimate population quantiles under Local Differential Privacy (LDP). By self-normalizing, our algorithm provides asymptotically normal estimation with valid inference, resulting in tight confidence intervals without the need for nuisance parameters to be estimated. Our proposed method can be conducted fully online, leading to high computational efficiency and minimal storage requirements with $\mathcal{O}(1)$ space. We also proved an optimality result by an elegant application of a central limit theorem of Gaussian Differential Privacy (GDP) when targeting the frequently encountered median estimation problem. With mathematical proof and extensive numerical testing, we demonstrate the validity of our algorithm both theoretically and experimentally.
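
To make the idea of online quantile estimation from binary inquiries concrete, the sketch below pairs randomized response (a standard epsilon-LDP mechanism for one bit) with an averaged stochastic-approximation update. It is an assumption about how such a scheme could look, not the authors' algorithm, and it omits the self-normalized inference step entirely. The function name `ldp_quantile`, the step size `2 / n**0.6`, and the simulated Gaussian data are all illustrative choices.

```python
import math
import random


def ldp_quantile(data, tau=0.5, epsilon=1.0, q0=0.0, seed=0):
    """Estimate the tau-quantile from one privatized bit per user.

    Each user compares their value to the current estimate and reports
    the comparison bit through randomized response, which satisfies
    epsilon-LDP. The server debiases the bit and takes one stochastic
    approximation step, so it stores only O(1) state.
    """
    rng = random.Random(seed)
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))  # P(report true bit)
    q, q_bar = q0, 0.0
    for n, x in enumerate(data, start=1):
        bit = 1 if x <= q else 0
        report = bit if rng.random() < keep else 1 - bit
        # Unbiased estimate of P(X <= q) recovered from the noisy bit.
        debiased = (report - (1.0 - keep)) / (2.0 * keep - 1.0)
        q -= (2.0 / n ** 0.6) * (debiased - tau)
        q_bar += (q - q_bar) / n  # running average of the iterates
    return q_bar


# Simulated users whose values follow N(2, 1); the true median is 2.0.
rng = random.Random(1)
sample = [rng.gauss(2.0, 1.0) for _ in range(50000)]
estimate = ldp_quantile(sample)
```

The server keeps only the current iterate, the running average, and a counter, which matches the constant-space claim in the abstract. With these illustrative settings the averaged estimate should land near the true median, though the exact value depends on the seeds.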

    Gaussian Differential Privacy on Riemannian Manifolds

    We develop an advanced approach for extending Gaussian Differential Privacy (GDP) to general Riemannian manifolds. The concept of GDP stands out as a prominent privacy definition that strongly warrants extension to manifold settings, due to its central limit properties. By harnessing the power of the renowned Bishop-Gromov theorem in geometric analysis, we propose a Riemannian Gaussian distribution that integrates the Riemannian distance, allowing us to achieve GDP in Riemannian manifolds with bounded Ricci curvature. To the best of our knowledge, this work marks the first instance of extending the GDP framework to accommodate general Riemannian manifolds, encompassing curved spaces, and circumventing the reliance on tangent space summaries. We provide a simple algorithm to evaluate the privacy budget $\mu$ on any one-dimensional manifold and introduce a versatile Markov Chain Monte Carlo (MCMC)-based algorithm to calculate $\mu$ on any Riemannian manifold with constant curvature. Through simulations on one of the most prevalent manifolds in statistics, the unit sphere $S^d$, we demonstrate the superior utility of our Riemannian Gaussian mechanism in comparison to the previously proposed Riemannian Laplace mechanism for implementing GDP.

    Exploring the Training Robustness of Distributional Reinforcement Learning against Noisy State Observations

    In real scenarios, the state observations an agent receives may contain measurement errors or adversarial noise, misleading the agent into taking suboptimal actions or even collapsing during training. In this paper, we study the training robustness of distributional Reinforcement Learning (RL), a class of state-of-the-art methods that estimate the whole distribution, as opposed to only the expectation, of the total return. Firstly, we validate the contraction of distributional Bellman operators in the State-Noisy Markov Decision Process (SN-MDP), a typical tabular case that incorporates both random and adversarial state observation noises. In the noisy setting with function approximation, we then analyze the vulnerability of the least squared loss in expectation-based RL with either linear or nonlinear function approximation. By contrast, we theoretically characterize the bounded gradient norm of the distributional RL loss based on the categorical parameterization equipped with the Kullback-Leibler (KL) divergence. The resulting stable gradients during optimization in distributional RL account for its better training robustness against state observation noises. Finally, extensive experiments on the suite of environments verified that distributional RL is less vulnerable to both random and adversarial noisy state observations compared with its expectation-based counterpart.
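
The contrast between a bounded gradient for the categorical KL loss and an unbounded one for the squared loss can be seen in a toy calculation. The sketch below takes gradients with respect to logits and scalar predictions only, which is a simplification: the paper's bound concerns the full function approximator, and the helper names and the way noise is injected here are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_grad_norm(logits, target):
    """L2 norm of d KL(target || softmax(logits)) / d logits.

    The gradient equals softmax(logits) - target, a difference of two
    probability vectors, so its norm can never exceed sqrt(2).
    """
    probs = softmax(logits)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(probs, target)))


def squared_loss_grad(prediction, noisy_target):
    """Absolute gradient of (prediction - noisy_target)**2 w.r.t. prediction."""
    return abs(2.0 * (prediction - noisy_target))


target = [0.0, 0.0, 1.0]           # all return mass on the last atom
for noise in (1.0, 10.0, 100.0):
    logits = [noise, 0.0, -noise]  # increasingly wrong categorical prediction
    print(noise, kl_grad_norm(logits, target), squared_loss_grad(0.0, noise))
```

As the perturbation grows, the squared-loss gradient grows linearly with it, while the categorical gradient norm stays at or below the square root of 2 regardless of how wrong the logits are.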