420 research outputs found

    Robustness Guarantees for Mode Estimation with an Application to Bandits

    Mode estimation is a classical problem in statistics with a wide range of applications in machine learning. Despite this, little is understood about its robustness properties under possibly adversarial data contamination. In this paper, we give precise robustness guarantees as well as privacy guarantees under simple randomization. We then introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means. We prove regret guarantees for the problems of top arm identification, top-m arms identification, contextual modal bandits, and top arm recovery over infinitely many continuous arms. We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences, rendering modal bandits an attractive choice in situations where the rewards may contain outliers or adversarial corruptions.
    Comment: 12 pages, 7 figures, 14 appendix pages
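    As a rough illustration of the modal-bandit idea above (not the paper's algorithm), the sketch below estimates each arm's mode with a simple histogram and ranks arms by that estimate; the histogram estimator, the round-robin sampling, and all constants are illustrative assumptions.

```python
import numpy as np

def sample_mode(samples, bins=20):
    """Histogram-based mode estimate of a 1-D sample (illustrative)."""
    counts, edges = np.histogram(samples, bins=bins)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])   # midpoint of the fullest bin

def modal_top_arm(pull, n_arms, horizon):
    """Toy top-arm identification that ranks arms by estimated mode, not mean.

    `pull(a)` returns a stochastic reward for arm `a`. Uniform round-robin
    sampling is used only to keep the sketch short; it is not the paper's
    algorithm.
    """
    rewards = [[] for _ in range(n_arms)]
    for t in range(horizon):
        a = t % n_arms
        rewards[a].append(pull(a))
    modes = [sample_mode(np.asarray(r)) for r in rewards]
    return int(np.argmax(modes))

# Example: arm 1 has the higher typical reward, but arm 0 receives rare
# large outliers that inflate its sample mean.
rng = np.random.default_rng(0)
def pull(a):
    r = rng.normal(0.3 if a == 0 else 0.6, 0.05)
    return r + (5.0 if a == 0 and rng.random() < 0.1 else 0.0)

print(modal_top_arm(pull, n_arms=2, horizon=2000))   # -> 1 (mean ranking would pick arm 0)
```

    Because occasional large outliers shift the sample mean but barely move the histogram peak, the mode-based ranking stays on the arm with the higher typical reward.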

    Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination

    In this work we revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits, from the perspective of adversarial robustness. Existing works in algorithmic robust statistics make strong distributional assumptions that ensure that the input data is evenly spread out or comes from a nice generative model. Is it possible to achieve strong robustness guarantees without any distributional assumptions at all, where the sequence of tasks we are asked to solve is adaptively and adversarially chosen? We answer this question in the affirmative for both linear regression and contextual bandits. In fact, our algorithms succeed where conventional methods fail: in particular, we show strong lower bounds against Huber regression and, more generally, any convex M-estimator. Our approach is based on a novel alternating minimization scheme that interleaves ordinary least squares with a simple convex program that finds the optimal reweighting of the distribution under a spectral constraint. Our results obtain essentially optimal dependence on the contamination level η, reach the optimal breakdown point, and naturally apply to infinite-dimensional settings where the feature vectors are represented implicitly via a kernel map.
    Comment: 66 pages, 1 figure; v3: refined exposition and improved rate
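    The following sketch conveys only the overall shape of the alternating minimization described above. It assumes a much simpler reweighting step (hard trimming of the worst-fitting eta-fraction of points) in place of the paper's convex program with a spectral constraint; it is a toy stand-in, not the authors' method.

```python
import numpy as np

def robust_ols_reweighting(X, y, eta=0.1, n_iters=20, ridge=1e-8):
    """Alternate between weighted least squares and reweighting (simplified)."""
    n, d = X.shape
    w = np.ones(n)
    beta = np.zeros(d)
    for _ in range(n_iters):
        # Weighted least squares under the current weights.
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X + ridge * np.eye(d), Xw.T @ y)
        # Reweighting: keep the (1 - eta) fraction of points fitting best.
        resid = np.abs(y - X @ beta)
        w = (resid <= np.quantile(resid, 1.0 - eta)).astype(float)
    return beta

# Example: 10% of the responses are adversarially corrupted.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
beta_true = np.arange(1.0, 6.0)
y = X @ beta_true + 0.1 * rng.normal(size=500)
y[:50] += 25.0                                # corrupted responses
print(np.round(robust_ols_reweighting(X, y), 2))
```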

    Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits

    We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime or the time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power α = 1/2 and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes the stochastic regime, the stochastically constrained adversarial regime (Wei and Luo), and the stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves a logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee. The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide an empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating the vulnerability of Thompson Sampling in adversarial environments. Last, but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for α ∈ [0, 1] and explain why α = 1/2 works best.
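    A minimal sketch of an OMD round with 1/2-Tsallis entropy in the spirit of the algorithm above: the sampling distribution is found by a Newton search for the normalizer, and losses are fed back through a plain importance-weighted estimator. The paper's reduced-variance estimator and exact constants are not reproduced here; the learning rate and iteration counts below are illustrative.

```python
import numpy as np

def tsallis_inf_sketch(get_loss, n_arms, horizon, rng=None):
    """Toy OMD loop with 1/2-Tsallis entropy regularization.

    `get_loss(t, arm)` returns the loss in [0, 1] of the chosen arm at round t.
    Returns the vector of cumulative importance-weighted loss estimates.
    """
    rng = rng or np.random.default_rng()
    L = np.zeros(n_arms)                       # cumulative loss estimates
    for t in range(1, horizon + 1):
        eta = 2.0 / np.sqrt(t)                 # illustrative learning rate
        # Newton search for the normalizer x so that p sums to one.
        x = L.min() - 2.0 / eta
        for _ in range(50):
            p = 4.0 / (eta * (L - x)) ** 2
            x -= (p.sum() - 1.0) / (eta * np.sum(p ** 1.5))
        p = 4.0 / (eta * (L - x)) ** 2
        p /= p.sum()                           # guard against numerical drift
        arm = rng.choice(n_arms, p=p)
        L[arm] += get_loss(t, arm) / p[arm]    # importance-weighted estimate
    return L

# Example: Bernoulli losses where arm 0 is best (mean loss 0.2 vs 0.5).
rng = np.random.default_rng(2)
est = tsallis_inf_sketch(lambda t, a: float(rng.random() < (0.2 if a == 0 else 0.5)),
                         n_arms=3, horizon=5000, rng=rng)
print(np.argmin(est))                          # typically 0
```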

    ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋œ ํšจ์œจ์ ์ธ ์„ ํ˜• ๋‹ค์ค‘ ์Šฌ๋กฏ ๋จธ์‹  ์•Œ๊ณ ๋ฆฌ์ฆ˜

    Doctoral dissertation -- Seoul National University Graduate School: Department of Statistics, College of Natural Sciences, 2022.2. Myunghee Cho Paik.

    This thesis proposes two efficient algorithms: (i) Doubly Robust Thompson Sampling (DRTS) and (ii) Hybridization by Randomization (HyRan). DRTS applies the doubly robust method from the missing-data literature to Thompson Sampling with linear contexts (LinTS). A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm, while the rewards of the other arms remain missing. The dependence of the arm choice on the past context-reward pairs compounds the complexity of the regret analysis. Different from previous works relying on missing-data techniques [Dimakopoulou et al., 2019, Kim and Paik, 2019], the proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound of order \tilde{O}(\phi^{-2}\sqrt{T}), where \phi^2 is the minimum eigenvalue of the covariance matrix of the contexts and T is the time horizon. This is the first regret bound for LinTS stated in terms of \phi^{2} without the context dimension d, and in many practical scenarios the bound of the proposed algorithm is \tilde{O}(d\sqrt{T}), improving the bound of LinTS by a factor of \sqrt{d}. A benefit of the proposed method is that it utilizes all the context data, chosen or not, which makes it possible to circumvent the technical definition of unsaturated arms used in the theoretical analysis of LinTS. Empirical studies show the advantage of the proposed algorithm over LinTS. HyRan is a novel bandit algorithm for which we establish a regret bound of \tilde{O}(\sqrt{dT}), which is optimal up to logarithmic factors. The novelty comes from two modifications: the first is to utilize all contexts, both selected and unselected, and the second is to randomize their contribution to the estimator. These modifications yield a novel decomposition of the cumulative regret into two main additive terms whose bounds can be derived by exploiting the structure of the compounding estimator. While previous algorithms such as SupLinUCB [Chu et al., 2011] have shown \tilde{O}(\sqrt{dT}) regret by exploiting independence via a phased algorithm, HyRan is the first to achieve \tilde{O}(\sqrt{dT}) regret while keeping its practical advantage, without resorting to generating independent samples. The numerical experiments show that the practical performance of the proposed algorithms is in line with the theoretical guarantees.

    This dissertation proposes efficient linear contextual bandit algorithms for sequential decision-making problems. A linear contextual bandit algorithm is a methodology in which, in an environment with a finite number of arms, the learner observes the context of each arm and identifies and selects the action that maximizes the reward; the reward has a linear relationship with the context. The linear contextual bandit algorithms proposed so far estimate the relationship between contexts and rewards using only the chosen contexts and their rewards, which is inefficient because the unchosen contexts are merely observed and cannot be used for estimation. In applications of contextual bandits such as news article placement, advertisement recommendation, and mobile health-care systems, this means that unselected articles, advertisements, or care options cannot contribute to the estimation. This dissertation proposes two new linear contextual bandit algorithms that can also exploit the unselected contexts. The first applies the doubly robust technique from missing-data analysis, replacing unobserved rewards with pseudo-rewards so that unselected contexts can be used for estimation, and thereby improves the regret bound by a factor of the square root of the context dimension. The second applies a simple randomization to define a compound estimator that mixes the method using unselected contexts with the method using only the chosen ones, and proves that the resulting algorithm attains a regret bound of the optimal rate. The dissertation proves that the new algorithms improve theoretical performance while exploiting unselected contexts, and results on simulated data confirm that they outperform existing algorithms.
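    A minimal sketch of the generic doubly robust imputation from the missing-data literature that DRTS builds on: unobserved rewards are replaced by model predictions, and the prediction for the chosen arm is corrected with an inverse-propensity term. The thesis' actual estimator, resampling scheme, and selection probabilities differ in detail; the names below are illustrative.

```python
import numpy as np

def dr_pseudo_rewards(contexts, chosen, reward, pi, beta_hat):
    """Doubly robust pseudo-rewards for one round (illustrative sketch).

    contexts : (K, d) array with one context per arm
    chosen   : index of the arm that was actually pulled
    reward   : observed reward of the chosen arm
    pi       : probability with which `chosen` was selected
    beta_hat : current regression estimate

    Unchosen arms receive the model prediction; the chosen arm's prediction
    is corrected by an inverse-propensity-weighted residual, so the
    pseudo-reward is unbiased if either the model or pi is correct.
    """
    preds = contexts @ beta_hat
    pseudo = preds.copy()
    pseudo[chosen] += (reward - preds[chosen]) / pi
    return pseudo
```

    The pseudo-rewards of all K arms, chosen or not, can then feed a ridge-regression update, which is how every observed context contributes to the estimator.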
    Table of Contents:
    1 Doubly Robust Thompson Sampling with Linear Payoffs
      1.1 Introduction
      1.2 Related Works
      1.3 Proposed Estimator and Algorithm
        1.3.1 Settings and Assumptions
        1.3.2 Doubly Robust Estimator
        1.3.3 Algorithm
      1.4 Theoretical Results
        1.4.1 An Improved Regret Bound
        1.4.2 Super-unsaturated Arms and a Novel Regret Decomposition
        1.4.3 Bounds for the Cumulative Regret
      1.5 Simulation Studies
      1.6 Conclusion
      1.7 Appendix
        1.7.1 Detailed Analysis of the Resampling
          1.7.1.1 Precise Definition of Action Selection
          1.7.1.2 Computing the Probability of Selection
          1.7.1.3 The Number of Maximum Possible Resamplings
        1.7.2 Technical Lemmas
        1.7.3 Proofs of Theoretical Results
          1.7.3.1 Proof of Theorem 1.1
          1.7.3.2 Proof of Lemma 1.2
          1.7.3.3 Proof of Theorem 1.3
          1.7.3.4 Proof of Lemma 1.4
          1.7.3.5 Proof of Lemma 1.6
        1.7.4 Implementation Details
          1.7.4.1 Efficient Calculation of the Sampling Probability
        1.7.5 A Review of Approaches to Missing Data and the Doubly-robust Method
          1.7.5.1 Doubly-robust Method in Missing Data
          1.7.5.2 Application to Bandit Settings
    2 Near-optimal Algorithm for Linear Contextual Bandits with Compounding Estimator
      2.1 Introduction
      2.2 Related Works
      2.3 Linear Contextual Bandit Problem
      2.4 Proposed Methods
        2.4.1 Compounding Estimator
        2.4.2 HyRan Algorithm
      2.5 Main Results
        2.5.1 Regret Bound of HyRan
        2.5.2 Regret Decomposition
        2.5.3 A Matching Lower Bound
      2.6 Numerical Experiments
      2.7 Appendix
        2.7.1 Technical Lemmas
        2.7.2 Proof of Theorem 2.1
        2.7.3 Proof of Lemma 2.3
        2.7.4 Proof of Theorem 1.3
        2.7.5 Proof of Lemma 2.5
        2.7.6 Proof of Theorem 2.6

    Motion Planning for Autonomous Vehicles in Partially Observable Environments

    Uncertainties arising from sensor noise or from the unobservable maneuver intentions of other traffic participants accumulate along the data-processing chain of an autonomous vehicle and lead to an incomplete or misinterpreted representation of the environment. As a result, motion planners often exhibit conservative behavior. This dissertation develops two motion planners that compensate for the deficits of the upstream processing modules by exploiting the vehicle's ability to react. The work first presents an extensive analysis of the causes and classification of the uncertainties and identifies the properties of an ideal motion planner. It then addresses the mathematical modeling of the driving objectives as well as the constraints that guarantee safety. The resulting planning problem is solved in real time with two different methods: first with nonlinear optimization, and then by formulating it as a partially observable Markov decision process (POMDP) and approximating the solution with samples. The planner based on nonlinear optimization considers several maneuver options with individual occurrence probabilities and computes a single motion profile from them. It guarantees safety by ensuring the feasibility of a chance-constrained fallback option. The contribution to the POMDP framework focuses on improving sample efficiency in Monte Carlo planning. First, information rewards are defined that guide the samples toward actions yielding a higher reward, and the selection of samples for the reward-shaped problem is improved by using a general heuristic. Second, continuity in the reward structure is exploited for action selection, yielding significant performance improvements. Evaluations show that these planners achieve strong results in driving tests and in simulation studies with complex interaction models.
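    A minimal sketch of the first planner's idea, assuming a toy cost model: candidate plans are scored by an expectation over maneuver hypotheses weighted by their probabilities, and a plan is accepted only if its fallback option remains feasible with sufficiently high probability. All function names and the confidence threshold are illustrative assumptions, not the dissertation's formulation.

```python
def expected_cost(plan, maneuver_costs, probs):
    """Expected cost of one candidate plan over maneuver hypotheses.

    maneuver_costs[i](plan) scores the plan under hypothesis i;
    probs[i] is the estimated probability of that hypothesis.
    """
    return sum(p * cost(plan) for p, cost in zip(probs, maneuver_costs))

def plan_with_fallback(candidates, maneuver_costs, probs,
                       fallback_feasible, confidence=0.95):
    """Pick the lowest expected-cost plan whose fallback trajectory is
    feasible with at least the required probability (toy chance constraint)."""
    feasible = [c for c in candidates if fallback_feasible(c) >= confidence]
    if not feasible:
        raise RuntimeError("no candidate satisfies the fallback chance constraint")
    return min(feasible, key=lambda c: expected_cost(c, maneuver_costs, probs))
```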