Robustness Guarantees for Mode Estimation with an Application to Bandits
Mode estimation is a classical problem in statistics with a wide range of applications in machine learning. Despite this, there is little understanding of its robustness properties under possibly adversarial data contamination. In this paper, we give precise robustness guarantees as well as privacy guarantees under simple randomization. We then introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means. We prove regret guarantees for the problems of top arm identification, top-m arms identification, contextual modal bandits, and top arm recovery over infinitely many continuous arms. We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences, rendering modal bandits an attractive choice in situations where the rewards may have outliers or adversarial corruptions.
Comment: 12 pages, 7 figures, 14 appendix pages
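As a toy illustration of why the mode resists contamination where the mean does not, the sketch below contaminates a Gaussian sample with a cluster of large outliers and compares a histogram-based mode estimate against the sample mean. The histogram estimator and the contamination scheme here are our own simplifications, not the paper's construction.

```python
import numpy as np

def histogram_mode(x, bins=50):
    """Estimate the mode as the midpoint of the most populated histogram bin."""
    counts, edges = np.histogram(x, bins=bins)
    i = np.argmax(counts)
    return 0.5 * (edges[i] + edges[i + 1])

rng = np.random.default_rng(0)
clean = rng.normal(loc=2.0, scale=1.0, size=10_000)     # true mode at 2.0
outliers = rng.normal(loc=50.0, scale=1.0, size=1_000)  # ~9% adversarial-style contamination
data = np.concatenate([clean, outliers])

mean_est = data.mean()           # dragged far from 2.0 by the outlier cluster
mode_est = histogram_mode(data)  # stays near 2.0: the outliers never win a bin
```

The outlier cluster holds under a tenth of the mass, so no bin it occupies can outvote the bins around the true peak, while every outlier shifts the mean proportionally to its magnitude.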
Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination
In this work we revisit two classic high-dimensional online learning
problems, namely linear regression and contextual bandits, from the perspective
of adversarial robustness. Existing works in algorithmic robust statistics make
strong distributional assumptions that ensure that the input data is evenly
spread out or comes from a nice generative model. Is it possible to achieve
strong robustness guarantees even without distributional assumptions
altogether, where the sequence of tasks we are asked to solve is adaptively and
adversarially chosen?
We answer this question in the affirmative for both linear regression and contextual bandits. In fact, our algorithms succeed where conventional methods fail. In particular, we show strong lower bounds against Huber regression and, more generally, any convex M-estimator. Our approach is based on a novel alternating minimization scheme that interleaves ordinary least squares with a simple convex program that finds the optimal reweighting of the distribution under a spectral constraint. Our results obtain essentially optimal dependence on the contamination level, reach the optimal breakdown point, and naturally apply to infinite-dimensional settings where the feature vectors are represented implicitly via a kernel map.
Comment: 66 pages, 1 figure; v3: refined exposition and improved rate
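The flavor of alternating between least squares and reweighting can be sketched on a Huber-contaminated regression instance. The sketch below substitutes simple residual-based trimming for the paper's spectrally constrained convex reweighting program, so it illustrates only the alternating structure, not the method's guarantees.

```python
import numpy as np

def alternating_robust_ols(X, y, frac_keep=0.8, iters=10):
    """Alternate two steps: (1) fit OLS on the currently kept points,
    (2) re-select the frac_keep fraction with the smallest residuals.
    A crude stand-in for the paper's spectral reweighting convex program."""
    n = len(y)
    keep = np.arange(n)
    for _ in range(iters):
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = np.abs(y - X @ beta)
        keep = np.argsort(resid)[: int(frac_keep * n)]
    return beta

rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))
beta_true = np.ones(d)
y = X @ beta_true + 0.1 * rng.normal(size=n)
# Huber contamination: an adversary replaces 10% of the responses.
bad = rng.choice(n, size=n // 10, replace=False)
y[bad] = 100.0

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # badly biased by the corruptions
beta_rob = alternating_robust_ols(X, y)           # close to beta_true
```

After the first (contaminated) fit, the corrupted points have residuals near 100 while clean points stay small, so the re-selection step discards them and subsequent fits are nearly clean.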
Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power \alpha = 1/2 and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes the stochastic regime, the stochastically constrained adversarial regime (Wei and Luo), and the stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves a logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee. The algorithm also achieves adversarial and stochastic
optimality in the utility-based dueling bandit setting. We provide empirical
evaluation of the algorithm demonstrating that it significantly outperforms
UCB1 and EXP3 in stochastic environments. We also provide examples of
adversarial environments, where UCB1 and Thompson Sampling exhibit almost
linear regret, whereas our algorithm suffers only logarithmic regret. To the
best of our knowledge, this is the first example demonstrating vulnerability of
Thompson Sampling in adversarial environments. Last, but not least, we present
a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for general power \alpha, and explain why \alpha = 1/2 works best.
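The OMD step with 1/2-Tsallis entropy has a simple closed form up to a scalar normalizer, which makes the algorithm easy to sketch. The code below uses plain importance-weighted loss estimators instead of the paper's reduced-variance estimators, and the learning-rate constant is illustrative, so this is a sketch of the update, not a faithful reimplementation.

```python
import numpy as np

def tsallis_inf_weights(L_hat, eta):
    """OMD step with 1/2-Tsallis entropy: p_i = 4 / (eta * (L_hat_i - x))^2,
    where the normalizer x < min(L_hat) is found by bisection so sum(p) = 1."""
    K = len(L_hat)
    lo = L_hat.min() - 2.0 * np.sqrt(K) / eta  # at x = lo the weights sum to <= 1
    hi = L_hat.min() - 2.0 / eta               # at x = hi the weights sum to >= 1
    for _ in range(60):
        x = 0.5 * (lo + hi)
        if np.sum(4.0 / (eta * (L_hat - x)) ** 2) > 1.0:
            hi = x
        else:
            lo = x
    p = 4.0 / (eta * (L_hat - x)) ** 2
    return p / p.sum()

rng = np.random.default_rng(0)
K, T = 2, 3000
loss_mean = np.array([0.2, 0.8])   # arm 0 is best
L_hat = np.zeros(K)                # cumulative importance-weighted loss estimates
pulls = np.zeros(K, dtype=int)
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)         # anytime learning rate (illustrative constant)
    p = tsallis_inf_weights(L_hat, eta)
    a = rng.choice(K, p=p)
    loss = float(rng.random() < loss_mean[a])  # Bernoulli loss
    L_hat[a] += loss / p[a]        # plain importance weighting, not reduced-variance
    pulls[a] += 1
```

Arms with smaller cumulative estimated loss sit closer to the normalizer x and therefore receive larger weight; in a stochastic environment the play concentrates on the best arm while every arm keeps a positive exploration probability.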
Efficient Linear Multi-Armed Bandit Algorithms with Improved Performance
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Department of Statistics, 2022.2. Myunghee Cho Paik.
This thesis contains two proposed efficient algorithms: (i) Doubly Robust
Thompson Sampling (DRTS) and (ii) Hybridization by Randomization (HyRan).
DRTS applies the doubly robust method from the missing-data literature to Thompson Sampling with linear contexts (LinTS). A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm, while the rewards of the other arms remain missing. The dependence of the arm choice on the past context and reward pairs compounds the complexity of the regret analysis. Unlike previous works relying on missing-data techniques [Dimakopoulou et al., 2019; Kim and Paik, 2019], the proposed algorithm is designed to allow a novel additive regret decomposition, leading to an improved regret bound of order \tilde{O}(\phi^{-2}\sqrt{T}), where \phi^2 is the minimum eigenvalue of the covariance matrix of contexts and T is the time horizon. This is the first regret bound of LinTS stated via \phi^{2} without the context dimension d, and the regret bound of the proposed algorithm is \tilde{O}(d\sqrt{T}) in many practical scenarios, improving the bound of LinTS by a factor of \sqrt{d}. A benefit of the proposed method is that it utilizes all the context data, chosen or not, thus allowing it to circumvent the technical definition of unsaturated arms used in the theoretical analysis of LinTS. Empirical studies show the advantage of the proposed algorithm over LinTS.
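The doubly robust imputation that lets DRTS use all contexts can be sketched in a few lines: every arm, chosen or not, receives a pseudo-reward that is unbiased for its expected reward. The uniform exploration policy and the zero imputation model below are simplifications for illustration; they are not the algorithm's actual sampling scheme or imputation estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 3, 20_000
theta = np.array([1.0, -0.5, 0.3])  # true parameter of the linear reward model
pi = 1.0 / K                        # selection probability per arm (uniform, for illustration)

X_all, r_tilde = [], []
theta_imp = np.zeros(d)             # crude imputation estimate; DR stays unbiased regardless
for t in range(T):
    ctx = rng.normal(size=(K, d))   # one context vector per arm
    a = rng.integers(K)             # uniformly chosen arm
    reward = ctx[a] @ theta + 0.1 * rng.normal()
    # Doubly robust pseudo-reward for EVERY arm, chosen or not:
    #   r~_i = x_i . theta_imp + 1{a = i} / pi * (reward - x_i . theta_imp)
    pseudo = ctx @ theta_imp
    pseudo[a] += (reward - ctx[a] @ theta_imp) / pi
    X_all.append(ctx)
    r_tilde.append(pseudo)

X_all = np.concatenate(X_all)       # all K*T contexts enter the regression
r_tilde = np.concatenate(r_tilde)
theta_hat, *_ = np.linalg.lstsq(X_all, r_tilde, rcond=None)
```

Taking the expectation over the arm choice, E[r~_i | x_i] = x_i . theta for every arm, so the regression over all K*T context rows is consistent even though only one reward per round is observed.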
HyRan is a novel bandit algorithm that establishes a regret bound of \tilde{O}(\sqrt{dT}), which is optimal up to logarithmic factors. The novelty comes from two modifications: the first is to utilize all contexts, both selected and unselected, and the second is to randomize the contribution to the estimator. These modifications yield a novel decomposition of the cumulative regret into two main additive terms whose bounds can be derived by employing the structure of the compounding estimator. While previous algorithms such as SupLinUCB [Chu et al., 2011] have shown \tilde{O}(\sqrt{dT}) regret by exploiting independence via a phased algorithm, HyRan is the first to achieve \tilde{O}(\sqrt{dT}) regret while keeping the practical advantage of not resorting to generating independent samples. The numerical experiments show that the practical performance of our proposed algorithm is in line with the theoretical guarantees.
This thesis proposes efficient linear contextual bandit algorithms for sequential decision-making problems. In the linear contextual bandit setting, a learner faces a finite number of arms in a given environment, observes the context of each arm, and learns to select the action that maximizes the reward; the reward has a linear relationship with the context. Linear contextual bandit algorithms proposed to date estimate the relationship between contexts and rewards using only the chosen contexts and their rewards. This is inefficient: the contexts of the unchosen arms are merely observed and cannot be used for estimation. Consequently, in applications of bandits such as news article placement, advertisement recommendation, and mobile healthcare systems, the articles, advertisements, and interventions that are not selected cannot contribute to estimation. This thesis proposes two new linear contextual bandit algorithms that can also exploit the unchosen contexts in estimation. The first applies the doubly robust method from the missing-data literature, replacing unobserved rewards with pseudo-rewards so that unchosen contexts can be used in estimation, and thereby improves the regret bound by a factor of the square root of the context dimension. The second applies a simple randomization to combine a method that uses the unchosen contexts with one that uses only the chosen contexts, yielding a compound estimator, and the resulting algorithm is proved to attain a regret bound of the optimal rate. The thesis proves that the proposed algorithms improve theoretical performance by exploiting the unchosen contexts, and experiments on simulated data confirm that they outperform existing algorithms.

1 Doubly Robust Thompson Sampling with Linear Payoffs
1.1 Introduction
1.2 Related Works
1.3 Proposed Estimator and Algorithm
1.3.1 Settings and Assumptions
1.3.2 Doubly Robust Estimator
1.3.3 Algorithm
1.4 Theoretical Results
1.4.1 An Improved Regret Bound
1.4.2 Super-unsaturated Arms and a Novel Regret Decomposition
1.4.3 Bounds for the Cumulative Regret
1.5 Simulation Studies
1.6 Conclusion
1.7 Appendix
1.7.1 Detailed Analysis of the Resampling
1.7.1.1 Precise Definition of Action Selection
1.7.1.2 Computing the Probability of Selection
1.7.1.3 The Number of Maximum Possible Resamplings
1.7.2 Technical Lemmas
1.7.3 Proofs of Theoretical Results
1.7.3.1 Proof of Theorem 1.1
1.7.3.2 Proof of Lemma 1.2
1.7.3.3 Proof of Theorem 1.3
1.7.3.4 Proof of Lemma 1.4
1.7.3.5 Proof of Lemma 1.6
1.7.4 Implementation Details
1.7.4.1 Efficient Calculation of the Sampling Probability
1.7.5 A Review of Approaches to Missing Data and the Doubly-robust Method
1.7.5.1 Doubly-robust Method in Missing Data
1.7.5.2 Application to Bandit Settings
2 Near-optimal Algorithm for Linear Contextual Bandits with Compounding Estimator
2.1 Introduction
2.2 Related Works
2.3 Linear Contextual Bandit Problem
2.4 Proposed Methods
2.4.1 Compounding Estimator
2.4.2 HyRan Algorithm
2.5 Main Results
2.5.1 Regret Bound of HyRan
2.5.2 Regret Decomposition
2.5.3 A Matching Lower Bound
2.6 Numerical Experiments
2.7 Appendix
2.7.1 Technical Lemmas
2.7.2 Proof of Theorem 2.1
2.7.3 Proof of Lemma 2.3
2.7.4 Proof of Theorem 1.3
2.7.5 Proof of Lemma 2.5
2.7.6 Proof of Theorem 2.6
Motion Planning for Autonomous Vehicles in Partially Observable Environments
Uncertainties arising from sensor noise or from the unobservable maneuver intentions of other traffic participants accumulate in the data-processing chain of an autonomous vehicle and lead to an incomplete or misinterpreted representation of the environment. As a result, motion planners often exhibit conservative behavior.
This dissertation develops two motion planners that compensate for the deficits of the upstream processing modules by exploiting the vehicle's reactive capabilities. The work first presents an extensive analysis of the sources and classification of these uncertainties and identifies the properties of an ideal motion planner. It then addresses the mathematical modeling of the driving objectives and of the constraints that guarantee safety. The resulting planning problem is solved in real time with two different methods: first with nonlinear optimization, and then by formulating it as a partially observable Markov decision process (POMDP) and approximating the solution with sampling. The planner based on nonlinear optimization considers multiple maneuver options with individual occurrence probabilities and computes a single motion profile from them. It guarantees safety by ensuring the feasibility of a chance-constrained fallback option. The contribution to the POMDP framework focuses on improving sample efficiency in Monte Carlo planning. First, information rewards are defined that guide the samples toward actions yielding higher reward; sample selection for the reward-shaped problem is further improved by a general heuristic. Second, continuity in the reward structure is exploited for action selection, yielding significant performance improvements. Evaluations show that these planners achieve strong results in driving tests and in simulation studies with complex interaction models.
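The core idea of approximating a POMDP solution by sampling can be shown on a toy scenario: the unobservable quantity is another vehicle's intention to yield, and action values are estimated by Monte Carlo sampling from the belief. The scenario, rewards, and function names below are entirely hypothetical illustrations, not the dissertation's planner.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(action, other_yields):
    """Toy one-step interaction: merging in front of another vehicle."""
    if action == "wait":
        return -0.1                         # small time penalty for staying put
    return 1.0 if other_yields else -10.0   # merge succeeds, or a conflict occurs

def mc_plan(belief_yield, actions=("go", "wait"), n_samples=500):
    """Approximate Q(belief, action) by sampling the hidden intention from the belief."""
    q = {}
    for a in actions:
        samples = rng.random(n_samples) < belief_yield  # sample the 'yields' intention
        q[a] = float(np.mean([simulate(a, s) for s in samples]))
    return max(q, key=q.get), q

# With high confidence that the other vehicle yields, merging is worth the risk:
best_hi, _ = mc_plan(belief_yield=0.98)
# With an uncertain belief, the sampled outcomes favor the conservative action:
best_lo, _ = mc_plan(belief_yield=0.5)
```

The sampled returns encode how the belief trades off progress against risk; the sample-efficiency contributions described above (information rewards, reward-shaped sample selection, exploiting reward continuity) aim to reach such decisions with far fewer simulations.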