Strong Optical and UV Intermediate-Width Emission Lines in the Quasar SDSS J232444.80-094600.3: Dust-Free and Intermediate-Density Gas at the Skin of Dusty Torus?
Emission lines from the broad emission line region (BELR) and the narrow
emission line region (NELR) of active galactic nuclei (AGNs) are extensively
studied. However, emission lines from the region between these two are rarely
detected. We present a detailed analysis of the quasar SDSS
J232444.80-094600.3 (SDSS J2324-0946), which is remarkable for its strong
intermediate-width emission lines (IELs) with FWHM 1800 \kmps. The IEL
component is present in different emission lines, including the permitted
lines \lya\ 1216 and \civ\ 1549, the semiforbidden line \ciii\ 1909, and the
forbidden lines \oiii\ 4959, 5007. With the aid of photo-ionization models, we
find that the IELs are produced by gas with a hydrogen density of
, a distance to the central ionizing source of pc, a covering factor of
CF 6\%, and a dust-to-gas ratio of times that of the SMC. We suggest that
the strong IELs of this quasar are produced by nearly dust-free,
intermediate-density gas located at the skin of the dusty torus. Such strong
IELs, serving as a useful diagnostic, provide an avenue to study the
properties of the gas between the BELR and the NELR.
Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems
A crucial task in decision-making problems is reward engineering. It is
common in practice that no obvious choice of reward function exists. Thus, a
popular approach is to introduce human feedback during training and leverage
such feedback to learn a reward function. Among all policy learning methods
that use human feedback, preference-based methods have demonstrated substantial
success in recent empirical applications such as InstructGPT. In this work, we
develop a theory that provably shows the benefits of preference-based methods
in offline contextual bandits. In particular, we improve the modeling and
suboptimality analysis for policy learning methods run directly on
human-scored samples. We then compare this with the suboptimality guarantees
of preference-based methods and show that preference-based methods enjoy lower
suboptimality.
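The core contrast in the abstract above, learning a reward from pairwise preferences rather than from raw scores, can be illustrated with a toy Bradley-Terry model. This is a generic sketch of the preference-based setup, not the paper's construction; the linear reward, feature map, and all numbers here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-reward contextual bandit: r(x, a) = phi(x, a) . theta*
d, n_pairs = 4, 500
theta_star = rng.normal(size=d)

def phi(x, a):
    # Toy feature map for context x and binary action a in {0, 1}.
    return x * (1.0 if a == 1 else -1.0)

# Collect preference feedback: for each context, the two actions are compared
# and the label follows a Bradley-Terry model on the reward difference.
X = rng.normal(size=(n_pairs, d))
diffs = np.array([phi(x, 1) - phi(x, 0) for x in X])
p_prefer_1 = 1.0 / (1.0 + np.exp(-diffs @ theta_star))
y = (rng.uniform(size=n_pairs) < p_prefer_1).astype(float)

# Fit theta by maximum likelihood of the Bradley-Terry model,
# i.e. logistic regression on feature differences, via gradient ascent.
theta = np.zeros(d)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-diffs @ theta))
    theta += 0.1 * diffs.T @ (y - p) / n_pairs

# Greedy policy with respect to the learned reward on a fresh context.
x_new = rng.normal(size=d)
a_hat = int(phi(x_new, 1) @ theta > phi(x_new, 0) @ theta)
```

With enough comparisons, the learned parameter aligns with the true one, so the greedy policy on the learned reward approaches the optimal policy; the paper's contribution is to quantify this suboptimality and compare it against learning from raw human scores.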
Sample Complexity of Neural Policy Mirror Descent for Policy Optimization on Low-Dimensional Manifolds
Policy-based algorithms equipped with deep neural networks have achieved
great success in solving high-dimensional policy optimization problems in
reinforcement learning. However, current analyses cannot explain why they are
resistant to the curse of dimensionality. In this work, we study the sample
complexity of the neural policy mirror descent (NPMD) algorithm with
convolutional neural networks (CNNs) as function approximators. Motivated by
the empirical observation that many high-dimensional environments, such as
those taking images as states, have state spaces possessing low-dimensional
structure, we consider the state space to be a -dimensional manifold embedded
in the -dimensional Euclidean space with intrinsic dimension . We show that
in each iteration of NPMD, both the value function and the policy can be well
approximated by CNNs. The approximation errors are controlled by the size of
the networks, and the smoothness of the previous networks can be inherited. As
a result, by properly choosing the network size and hyperparameters, NPMD can
find an -optimal policy with samples in expectation, where indicates the
smoothness of the environment. Compared to previous work, our result shows
that NPMD can leverage the low-dimensional structure of the state space to
escape the curse of dimensionality, providing an explanation for the efficacy
of deep policy-based algorithms.
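The mirror descent update underlying NPMD, stripped of the neural approximation, is the classic multiplicative policy update pi_{k+1}(a|s) proportional to pi_k(a|s) exp(eta Q_k(s,a)). A tabular sketch on a made-up two-state MDP (illustration only; the paper's setting instead uses CNN approximators on manifold-structured state spaces):

```python
import numpy as np

# Tiny 2-state, 2-action MDP with hypothetical numbers, just for illustration.
# P[s, a, s'] = transition probability, R[s, a] = reward, gamma = discount.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
nS, nA = R.shape

def q_values(pi):
    # Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi for V,
    # then Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s').
    P_pi = np.einsum("sa,sab->sb", pi, P)
    R_pi = np.einsum("sa,sa->s", pi, R)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
    return R + gamma * P @ V

# Policy mirror descent with a KL regularizer reduces to the
# multiplicative update pi_{k+1}(a|s) ~ pi_k(a|s) * exp(eta * Q_k(s, a)).
pi = np.full((nS, nA), 0.5)  # start from the uniform policy
eta = 1.0
for _ in range(50):
    Q = q_values(pi)
    pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))  # stable exp
    pi /= pi.sum(axis=1, keepdims=True)

V_final = np.einsum("sa,sa->s", pi, q_values(pi))
```

Each iteration of this update monotonically improves the value function; the abstract's contribution is to show how many samples suffice when Q and pi must instead be approximated by CNNs over a low-dimensional manifold of states.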