121 research outputs found
Growing Story Forest Online from Massive Breaking News
We describe our experience of implementing a news content organization system
at Tencent that discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. Our real-world system has distinct
requirements in contrast to previous studies on topic detection and tracking
(TDT) and event timeline or graph generation, in that we 1) need to accurately
and quickly extract distinguishable events from massive streams of long text
documents that cover diverse topics and contain highly redundant information,
and 2) must develop the structures of event stories in an online manner,
without repeatedly restructuring previously formed stories, in order to
guarantee a consistent user viewing experience. In solving these challenges, we
propose Story Forest, a set of online schemes that automatically clusters
streaming documents into events, while connecting related events in growing
trees to tell evolving stories. We conducted extensive evaluation based on 60
GB of real-world Chinese news data and through detailed pilot user experience
studies, although our ideas are not language-dependent and can easily be
extended to other languages. The results demonstrate the superior
capability of Story Forest to accurately identify events and organize news text
into a logical structure that is appealing to human readers, compared to
multiple existing algorithm frameworks. Comment: Accepted by CIKM 2017, 9 pages
An adaptive model checking test for functional linear model
Numerous studies have been devoted to the estimation and inference problems
for functional linear models (FLM). However, few works focus on the model
checking problem, which ensures the reliability of results. The few existing
tests in this area lack tractable null distributions or asymptotic analysis under
alternatives. Also, the functional predictor is usually assumed to be fully
observed, which is impractical. To address these problems, we propose an
adaptive model checking test for FLM. It combines regular moment-based and
conditional moment-based tests, and achieves model adaptivity via the dimension
of a residual-based subspace. The advantages of our test are manifold. First,
it has a tractable chi-squared null distribution and higher power under the
alternatives than its components. Second, asymptotic properties under different
underlying models are developed, including local alternatives not previously studied.
Third, the test statistic is constructed on a finite set of grid points, which
reflects the discrete nature of collected data. We derive the
relationship between sample size and number of grid points needed to maintain the
asymptotic properties. In addition, we provide a data-driven approach to estimate
the dimension that yields model adaptivity, which is also promising for sufficient
dimension reduction. We conduct comprehensive numerical experiments to
demonstrate the advantages the test inherits from its two simple components.
Development of a genome-wide multiple duplex-SSR protocol and its applications for the identification of selfed progeny in switchgrass
Background: Switchgrass (Panicum virgatum) is a herbaceous crop for cellulosic biofuel feedstock development in the USA and Europe. As switchgrass is a naturally outcrossing species, accurate identification of selfed progeny is important for producing inbreds, which can be used in the production of heterotic hybrids. A technically reliable, time-saving and easily used marker system is needed to quantify and characterize the breeding origin of progeny plants of targeted parents. Results: Genome-wide screening of 915 mapped microsatellite (simple sequence repeat, SSR) markers was conducted, and 842 (92.0%) produced clear and scorable bands on a pooled DNA sample of eight switchgrass varieties. A total of 166 primer pairs were selected on the basis of their relatively even distribution in the switchgrass genome and PCR amplification quality on 16 tetraploid genotypes. The mean polymorphic information content value for the 166 markers was 0.810, ranging from 0.116 to 0.959. From them, a core set of 48 loci, which had been mapped on 17 linkage groups, was further tested and optimized to develop 24 sets of duplex markers. Most (up to 87.5%) of the targeted but non-allelic amplicons within each duplex were separated by more than 10 bp. Using the established duplex PCR protocol, the selfing ratio (i.e., selfed/all progeny × 100%) was identified as 0% for a randomly selected open-pollinated 'Kanlow' genotype grown in the field, 15.4% for 22 field-grown plants of bagged inflorescences, and 77.3% for a selected plant grown in a growth chamber. Conclusions: The study developed a duplex SSR-based PCR protocol consisting of 48 markers, providing ample choices of non-tightly-linked loci across the switchgrass genome, and representing a powerful, time-saving and easily used method for the identification of selfed progeny in switchgrass. The protocol should be a valuable tool in switchgrass breeding efforts. Peer reviewed. Plant and Soil Science
Online Local Differential Private Quantile Inference via Self-normalization
Based on binary inquiries, we developed an algorithm to estimate population
quantiles under Local Differential Privacy (LDP). By self-normalizing, our
algorithm provides asymptotically normal estimation with valid inference,
resulting in tight confidence intervals without the need for nuisance
parameters to be estimated. Our proposed method can be conducted fully online,
leading to high computational efficiency and minimal storage requirements with
$O(1)$ space. We also proved an optimality result by an elegant
application of one central limit theorem of Gaussian Differential Privacy (GDP)
when targeting the frequently encountered median estimation problem. With
mathematical proof and extensive numerical testing, we demonstrate the validity
of our algorithm both theoretically and experimentally.
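As an illustration of the general idea of estimating a quantile online from privatized binary inquiries, here is a minimal sketch. It is not the authors' algorithm: it combines standard randomized response with a Robbins-Monro style stochastic approximation and iterate averaging, and it omits the self-normalized inference step entirely. The function name `ldp_quantile`, its parameters, and the step-size schedule are all assumptions made for this example.

```python
import math
import random


def ldp_quantile(stream, tau, epsilon, q0=0.0, step_scale=1.0,
                 step_power=0.6, rng=None):
    """Online quantile estimate from randomized-response binary answers.

    Hypothetical sketch: each user only reveals a noisy answer to the
    question "is your value <= the current estimate?".
    """
    rng = rng or random.Random()
    # Probability of reporting the true bit; this choice gives epsilon-LDP
    # for a single binary answer (standard randomized response).
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    q = q0        # current iterate sent to the next user
    q_bar = q0    # running average of the iterates (the returned estimate)
    for n, x in enumerate(stream, start=1):
        bit = 1 if x <= q else 0
        # In a real deployment this flip happens on the user's device.
        report = bit if rng.random() < keep else 1 - bit
        # Debias: E[report] = (2*keep - 1) * bit + (1 - keep).
        debiased = (report - (1.0 - keep)) / (2.0 * keep - 1.0)
        # Move the estimate down if too many values fall below it, up otherwise.
        q -= step_scale * n ** (-step_power) * (debiased - tau)
        # Constant-memory running mean of the iterates.
        q_bar += (q - q_bar) / n
    return q_bar


if __name__ == "__main__":
    data_rng = random.Random(0)
    data = (data_rng.gauss(1.0, 1.0) for _ in range(20000))
    print(ldp_quantile(data, tau=0.5, epsilon=1.0, rng=random.Random(1)))
```

The loop stores only the current iterate and its running average, so memory use does not grow with the number of users, which is the sense in which such a procedure is fully online.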
Gaussian Differential Privacy on Riemannian Manifolds
We develop an advanced approach for extending Gaussian Differential Privacy
(GDP) to general Riemannian manifolds. The concept of GDP stands out as a
prominent privacy definition that strongly warrants extension to manifold
settings, due to its central limit properties. By harnessing the power of the
renowned Bishop-Gromov theorem in geometric analysis, we propose a Riemannian
Gaussian distribution that integrates the Riemannian distance, allowing us to
achieve GDP in Riemannian manifolds with bounded Ricci curvature. To the best
of our knowledge, this work marks the first instance of extending the GDP
framework to accommodate general Riemannian manifolds, encompassing curved
spaces, and circumventing the reliance on tangent space summaries. We provide a
simple algorithm to evaluate the privacy budget on any one-dimensional
manifold and introduce a versatile Markov Chain Monte Carlo (MCMC)-based
algorithm to calculate it on any Riemannian manifold with constant
curvature. Through simulations on one of the most prevalent manifolds in
statistics, the unit sphere, we demonstrate the superior utility of our
Riemannian Gaussian mechanism in comparison to the previously proposed
Riemannian Laplace mechanism for implementing GDP.
Exploring the Training Robustness of Distributional Reinforcement Learning against Noisy State Observations
In real scenarios, the state observations an agent receives may contain
measurement errors or adversarial noise, misleading the agent into taking
suboptimal actions or even causing it to collapse during training. In this paper, we study the
training robustness of distributional Reinforcement Learning (RL), a class of
state-of-the-art methods that estimate the whole distribution, as opposed to
only the expectation, of the total return. First, we validate the contraction
of distributional Bellman operators in the State-Noisy Markov Decision
Process (SN-MDP), a typical tabular case that incorporates both random and
adversarial state observation noise. In the noisy setting with function
approximation, we then analyze the vulnerability of the least squares loss in
expectation-based RL with either linear or nonlinear function approximation. By
contrast, we theoretically characterize the bounded gradient norm of the
distributional RL loss based on the categorical parameterization equipped with
the Kullback-Leibler (KL) divergence. The resulting stable gradients during
optimization in distributional RL account for its better training robustness
against state observation noise. Finally, extensive experiments on the suite
of environments verify that distributional RL is less vulnerable to both
random and adversarial noisy state observations than its
expectation-based counterpart.
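The contrast between the two losses can be illustrated with a short calculation. This is a sketch under simplifying assumptions, not the paper's derivation: the notation ($K$ atoms, logits $z$, target distribution $q$, scalar target $y$ and prediction $Q$) is introduced here for illustration, and only the gradient with respect to the outputs is considered, not the full gradient with respect to network parameters.

```latex
% Categorical parameterization: p_k = softmax(z)_k over K atoms, target q.
\[
  \frac{\partial}{\partial z_k}\,\mathrm{KL}(q \,\|\, p)
  = -\sum_{i=1}^{K} q_i\,(\delta_{ik} - p_k)
  = p_k - q_k ,
  \qquad
  \|\nabla_z \mathrm{KL}(q \,\|\, p)\|_2 \le \|p - q\|_1 \le 2 .
\]
% Least squares loss on a scalar value estimate Q with target y.
\[
  \frac{\partial}{\partial Q}\,\tfrac{1}{2}\,(Q - y)^2 = Q - y ,
\]
% which grows without bound as the (possibly noise-corrupted) target y moves
% away from Q, whereas the KL gradient above stays bounded for any target q.
```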