In machine learning, Domain Adaptation (DA) arises when the distribution
generating the test (target) data differs from the one generating the learning
(source) data. It is well known that DA is a hard task even under strong
assumptions, among which is covariate shift, where the source and target
distributions diverge only in their marginals, i.e. they have the same labeling
function. Another popular approach is to consider a hypothesis class that
brings the two distributions closer while implying a low error for both tasks.
This is a VC-dimension approach that restricts the complexity of a hypothesis
class in order to get good generalization. Instead, we propose a PAC-Bayesian
approach that seeks suitable weights to be given to each hypothesis in
order to build a majority vote. We prove a new DA bound in the PAC-Bayesian
context. This leads us to design the first DA-PAC-Bayesian algorithm based on
the minimization of the proposed bound. Doing so, we seek a \rho-weighted
majority vote that takes into account a trade-off between three quantities. The
first two quantities are, as usual in the PAC-Bayesian approach, (a) the
complexity of the majority vote (measured by a Kullback-Leibler divergence) and
(b) its empirical risk (measured by the \rho-average errors on the source
sample). The third quantity is (c) the capacity of the majority vote to
distinguish some structural difference between the source and target samples.

Comment: https://sites.google.com/site/multitradeoffs2012

Germain, Pascal

Habrard, Amaury

Laviolette, François

Morvant, Emilie

English

arXiv

A more advanced version has been published in ICML 2013: http://jmlr.org/proceedings/papers/v28/germain13.html https://sites.google.com/site/multitradeoffs2012/

HAL-UJM

PAC-Bayesian Learning and Domain Adaptation

Pascal Germain, Amaury Habrard, François Laviolette, Emilie Morvant

To cite this version: Pascal Germain, Amaury Habrard, François Laviolette, Emilie Morvant. PAC-Bayesian Learning and Domain Adaptation. Multi-Trade-offs in Machine Learning, NIPS 2012 Workshop, Dec 2012, Lake Tahoe, United States. <hal-00749366>

HAL Id: hal-00749366
https://hal.archives-ouvertes.fr/hal-00749366
Submitted on 10 Dec 2012

Pascal Germain
Département d'informatique et de génie logiciel
Université Laval, Québec, Canada
pascal.germain@ift.ulaval.ca

Amaury Habrard
Laboratoire Hubert Curien UMR CNRS 5516,
Univ. Jean Monnet, 42000 St-Etienne, France
amaury.habrard@univ-st-etienne.fr

François Laviolette
Département d'informatique et de génie logiciel
Université Laval, Québec, Canada
francois.laviolette@ift.ulaval.ca

Emilie Morvant
Aix-Marseille Univ., LIF-QARMA, CNRS,
UMR 7279, 13013, Marseille, France
emilie.morvant@lif.univ-mrs.fr

Abstract

In machine learning, Domain Adaptation (DA) arises when the distribution generating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is a hard task even under strong assumptions [1], among which is covariate shift, where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function.
Another popular approach is to consider a hypothesis class that brings the two distributions closer while implying a low error for both tasks [2]. This is a VC-dimension approach that restricts the complexity of a hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek a $\rho$-weighted majority vote that takes into account a trade-off between three quantities. The first two quantities are, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the $\rho$-average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.

Preliminaries

Domain Adaptation. We consider DA for binary classification tasks where $X \subseteq \mathbb{R}^d$ is the input space of dimension $d$ and $Y = \{-1, 1\}$ is the label set. We have two different distributions over $X \times Y$, called the source domain $P_S$ and the target domain $P_T$. $D_S$ and $D_T$ are the respective marginal distributions over $X$. We tackle the challenging task where we have no information about the labels on $P_T$. A learning algorithm is then provided with a labeled source sample $S = \{(\mathbf{x}^s_i, y^s_i)\}_{i=1}^{m}$ drawn i.i.d. from $P_S$, and an unlabeled target sample $T = \{\mathbf{x}^t_j\}_{j=1}^{m'}$ drawn i.i.d. from $D_T$. Let $h : X \to Y$ be a hypothesis function. The expected source error of $h$ over $P_S$ is the probability that $h$ commits an error,

$$R_{P_S}(h) = \mathbb{E}_{(\mathbf{x}^s, y^s) \sim P_S}\, I(h(\mathbf{x}^s) \neq y^s),$$

where $I(a) = 1$ if predicate $a$ is true and $0$ otherwise. The expected target error $R_{P_T}$ over $P_T$ is defined in a similar way.
$R_S$ is the empirical source error. The DA objective is then to find a low-error target hypothesis, even if no label information is available about the target domain. Clearly this task can be infeasible in general. However, under the assumption that there exist hypotheses in the hypothesis class $\mathcal{H}$ that perform well on both the source and the target domains, Ben-David et al. [2] provide the following guarantee,

$$\forall h \in \mathcal{H}, \quad R_{P_T}(h) \le R_{P_S}(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) + \nu, \qquad (1)$$

where $\nu \overset{\text{def}}{=} \min_{h \in \mathcal{H}} \left( R_{P_S}(h) + R_{P_T}(h) \right)$ is the error of the best joint hypothesis, and $d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T)$, called the $\mathcal{H}\Delta\mathcal{H}$-distance between the domain marginal distributions, quantifies how well hypotheses from $\mathcal{H}$ can “detect” differences between those two distributions. According to Equation (1), the lower this detection capability is for some given $\mathcal{H}$, the better the generalization guarantees. Hence, as pointed out in [2], Equation (1), together with the usual VC-bound theory, expresses a multiple trade-off between the accuracy of some particular hypothesis $h$, the complexity of the hypothesis class $\mathcal{H}$, and the “incapacity” of the hypotheses of $\mathcal{H}$ to detect differences between the source and the target domains.

PAC-Bayesian Learning of Linear Classifier. The PAC-Bayesian theory, first introduced by McAllester [3], traditionally considers majority votes over a set $\mathcal{H}$ of binary hypotheses. Given a prior distribution $\pi$ over $\mathcal{H}$ and a training set $S$, the learning process consists in finding the posterior distribution $\rho$ over $\mathcal{H}$ leading to good generalization. Indeed, the essence of this theory is to bound the risk of the stochastic Gibbs classifier $G_\rho$ associated with $\rho$. In order to predict the label of an example $\mathbf{x}$, the Gibbs classifier first draws a hypothesis $h$ from $\mathcal{H}$ according to $\rho$, then returns $h(\mathbf{x})$ as the predicted label. Note that the error of the Gibbs classifier corresponds to the expectation of the errors over $\rho$: $R_{P_S}(G_\rho) = \mathbb{E}_{h \sim \rho}\, R_{P_S}(h)$.
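To make the Gibbs classifier concrete, here is a minimal Python sketch. It assumes a small finite hypothesis set (three hand-written linear rules), a hand-picked posterior `rho`, and synthetic data; none of these come from the paper, and the function names are illustrative. It shows that the empirical Gibbs risk is the $\rho$-weighted average of the individual empirical errors, and that a Gibbs prediction is obtained by first drawing a hypothesis according to $\rho$.

```python
import numpy as np

def empirical_risks(hypotheses, X, y):
    """Empirical 0-1 error R_S(h) of each hypothesis on the sample (X, y)."""
    return np.array([np.mean(h(X) != y) for h in hypotheses])

def gibbs_risk(rho, hypotheses, X, y):
    """R_S(G_rho) = E_{h ~ rho} R_S(h): rho-weighted average of the errors."""
    return float(np.dot(rho, empirical_risks(hypotheses, X, y)))

def gibbs_predict(rho, hypotheses, x, rng):
    """Draw one hypothesis according to rho, then return its label for x."""
    h = hypotheses[rng.choice(len(hypotheses), p=rho)]
    return h(x[None, :])[0]

# Synthetic source sample (an assumption made for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

# Hypothetical finite hypothesis set and posterior weights.
hypotheses = [
    lambda Z: np.where(Z[:, 0] > 0, 1, -1),
    lambda Z: np.where(Z[:, 1] > 0, 1, -1),
    lambda Z: np.where(Z[:, 0] + Z[:, 1] > 0, 1, -1),
]
rho = np.array([0.5, 0.2, 0.3])

risk = gibbs_risk(rho, hypotheses, X, y)
print(f"empirical Gibbs risk: {risk:.3f}")
```

The Gibbs risk is linear in `rho`, which is what makes it convenient to bound and to optimize over the posterior.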
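The classical PAC-Bayesian theorem bounds the expected error $R_{P_S}(G_\rho)$ in terms of two major quantities: the empirical error $R_S(G_\rho) = \mathbb{E}_{h \sim \rho}\, R_S(h)$ on a sample $S$ and the Kullback-Leibler divergence $\mathrm{KL}(\rho \,\|\, \pi) \overset{\text{def}}{=} \mathbb{E}_{h \sim \rho} \ln \frac{\rho(h)}{\pi(h)}$.

Theorem 1 (as presented in [4]). For any domain $P_S \subseteq X \times Y$, for any set $\mathcal{H}$ of hypotheses, for any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, we have,

$$\Pr_{S \sim (P_S)^m} \left( \forall \rho \text{ on } \mathcal{H} : \ \mathrm{kl}\!\left( R_S(G_\rho) \,\|\, R_{P_S}(G_\rho) \right) \le \frac{1}{m} \left[ \mathrm{KL}(\rho \,\|\, \pi) + \ln \frac{\xi(m)}{\delta} \right] \right) \ge 1 - \delta,$$

where $\mathrm{kl}(q \,\|\, p) \overset{\text{def}}{=} q \ln \frac{q}{p} + (1 - q) \ln \frac{1 - q}{1 - p}$, and $\xi(m) \overset{\text{def}}{=} \sum_{k=0}^{m} \binom{m}{k} \left( \frac{k}{m} \right)^{k} \left( 1 - \frac{k}{m} \right)^{m-k}$.

Now, let $\mathcal{H}$ be a set of linear classifiers $h_{\mathbf{v}}(\mathbf{x}) \overset{\text{def}}{=} \mathrm{sgn}(\mathbf{v} \cdot \mathbf{x})$ such that $\mathbf{v} \in \mathbb{R}^d$ is a weight vector. By restricting the prior and the posterior to be Gaussian distributions, Langford and Shawe-Taylor [5] have specialized the PAC-Bayesian theory in order to bound the expected risk of any linear classifier $h_{\mathbf{w}} \in \mathcal{H}$ identified by a weight vector $\mathbf{w}$. More precisely, for a prior $\pi_0$ and a posterior $\rho_{\mathbf{w}}$ defined as spherical Gaussians with identity covariance matrix, respectively centered on vectors $\mathbf{0}$ and $\mathbf{w}$, i.e. for any $h_{\mathbf{v}} \in \mathcal{H}$,

$$\pi_0(h_{\mathbf{v}}) \overset{\text{def}}{=} \left( \frac{1}{\sqrt{2\pi}} \right)^{d} e^{-\frac{1}{2} \|\mathbf{v}\|^2} \quad \text{and} \quad \rho_{\mathbf{w}}(h_{\mathbf{v}}) \overset{\text{def}}{=} \left( \frac{1}{\sqrt{2\pi}} \right)^{d} e^{-\frac{1}{2} \|\mathbf{v} - \mathbf{w}\|^2}, \qquad (2)$$

we obtain that the expected risk of the Gibbs classifier $G_{\rho_{\mathbf{w}}}$ on a domain $P_S$ is given by,

$$R_{P_S}(G_{\rho_{\mathbf{w}}}) = \mathbb{E}_{(\mathbf{x}, y) \sim P_S}\, \mathbb{E}_{h_{\mathbf{v}} \sim \rho_{\mathbf{w}}}\, I(h_{\mathbf{v}}(\mathbf{x}) \neq y) = \mathbb{E}_{(\mathbf{x}, y) \sim P_S}\, \Phi\!\left( y\, \frac{\mathbf{w} \cdot \mathbf{x}}{\|\mathbf{x}\|} \right),$$

where $\Phi(a) \overset{\text{def}}{=} \frac{1}{2} \left[ 1 - \mathrm{Erf}\!\left( \frac{a}{\sqrt{2}} \right) \right]$. Moreover, the KL-divergence between the posterior and the prior distributions becomes simply $\mathrm{KL}(\rho_{\mathbf{w}} \,\|\, \pi_0) = \frac{1}{2} \|\mathbf{w}\|^2$. In this context, Theorem 1 becomes,

Corollary 1. For any domain $P_S \subseteq \mathbb{R}^d \times Y$ and any $\delta \in (0, 1]$, we have,

$$\Pr_{S \sim (P_S)^m} \left( \forall \mathbf{w} \in \mathbb{R}^d : \ \mathrm{kl}\!\left( R_S(G_{\rho_{\mathbf{w}}}) \,\|\, R_{P_S}(G_{\rho_{\mathbf{w}}}) \right) \le \frac{1}{m} \left[ \frac{1}{2} \|\mathbf{w}\|^2 + \ln \frac{\xi(m)}{\delta} \right] \right) \ge 1 - \delta.$$

Based on this specialization of the PAC-Bayesian theory to linear classifiers, Germain et al. [4] suggested minimizing the bound on $R_{P_S}(G_{\rho_{\mathbf{w}}})$ given by Corollary 1.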
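The quantities entering Corollary 1 can be computed directly. The sketch below is an illustration under assumptions, not the authors' implementation: the data are synthetic, the function names are made up for this example, and $\xi(m)$ is evaluated in log-space with the convention $0^0 = 1$. It computes the closed-form empirical Gibbs risk, the complexity term $\frac{1}{2}\|\mathbf{w}\|^2$, and the right-hand side of the inequality in Corollary 1.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def phi(a):
    """Phi(a) = (1 - Erf(a / sqrt(2))) / 2, the standard Gaussian upper tail."""
    return 0.5 * (1.0 - _erf(np.asarray(a, dtype=float) / math.sqrt(2.0)))

def gibbs_risk_linear(w, X, y):
    """Empirical Gibbs risk R_S(G_rho_w) = mean of Phi(y * w.x / ||x||)."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return float(np.mean(phi(margins)))

def kl_gaussians(w):
    """KL(rho_w || pi_0) = ||w||^2 / 2 for the spherical Gaussians of Eq. (2)."""
    return 0.5 * float(w @ w)

def xi(m):
    """xi(m) = sum_k C(m,k) (k/m)^k (1-k/m)^(m-k), with 0^0 taken as 1."""
    total = 0.0
    for k in range(m + 1):
        log_term = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        if k > 0:
            log_term += k * math.log(k / m)
        if k < m:
            log_term += (m - k) * math.log(1.0 - k / m)
        total += math.exp(log_term)
    return total

def corollary1_rhs(w, m, delta=0.05):
    """Right-hand side (1/m) [ ||w||^2 / 2 + ln(xi(m) / delta) ] of Corollary 1."""
    return (kl_gaussians(w) + math.log(xi(m) / delta)) / m

# Synthetic, linearly separable source sample (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true = np.array([1.0, -0.5])
y = np.where(X @ w_true > 0, 1, -1)

for scale in (1.0, 5.0):
    w = scale * w_true
    print(scale, gibbs_risk_linear(w, X, y), corollary1_rhs(w, len(y)))
```

On separable data, scaling $\mathbf{w}$ up lowers the empirical Gibbs risk but raises the complexity term, which is the trade-off the bound expresses.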
The resulting learning algorithm, called PBGD, performs a gradient descent in order to find an optimal weight vector $\mathbf{w}$. Doing so, PBGD realizes a trade-off between the empirical accuracy (expressed by $R_S(G_{\rho_{\mathbf{w}}})$) and the complexity (expressed by $\|\mathbf{w}\|^2$) of the learned linear classifier.

PAC-Bayesian Learning of Adapted Linear Classifier

DA Bound for the Gibbs Classifier. The originality of our contribution is to combine the PAC-Bayesian and DA frameworks. We define the notion of domain disagreement $\mathrm{dis}_\rho(D_S, D_T)$ to measure the structural difference between domain marginals in terms of a posterior distribution $\rho$ over $\mathcal{H}$,

$$\mathrm{dis}_\rho(D_S, D_T) \overset{\text{def}}{=} \mathbb{E}_{(h_1, h_2) \sim \rho^2} \left[ R_{D_T}(h_1, h_2) - R_{D_S}(h_1, h_2) \right],$$

where $R_{D'}(h_1, h_2) \overset{\text{def}}{=} \mathbb{E}_{\mathbf{x} \sim D'}\, I(h_1(\mathbf{x}) \neq h_2(\mathbf{x}))$. Unlike the distance $d_{\mathcal{H}\Delta\mathcal{H}}$ suggested by [2], our “distance” measure $\mathrm{dis}_\rho$ takes into account a $\rho$-average over all pairs of hypotheses in $\mathcal{H}$ instead of focusing on a single particular pair of hypotheses. It nevertheless allows us to derive the following bound, which offers a trade-off similar to that of Equation (1) but relates the source and target errors of the Gibbs classifier. For every probability distribution $\rho$ on $\mathcal{H}$, we have,

$$R_{P_T}(G_\rho) \le R_{P_S}(G_\rho) + \mathrm{dis}_\rho(D_S, D_T) + \lambda_\rho, \qquad (3)$$

where $\lambda_\rho = R_{P_T}(h^\star) + R_{P_S}(h^\star)$, with $h^\star = \mathrm{argmin}_{h \in \mathcal{H}} \left\{ \mathbb{E}_{h' \sim \rho} \left( R_{D_T}(h, h') - R_{D_S}(h, h') \right) \right\}$, measures the joint error of the hypothesis which minimizes the domain disagreement. Hence, similarly to Equation (1), we provide evidence that good DA is possible if $\mathrm{dis}_\rho(D_S, D_T)$ and $\lambda_\rho$ are low. Under this assumption, we propose to design the first DA-PAC-Bayesian algorithm, inspired by the PAC-Bayesian learning of linear classifiers [4]. We focus on the first two terms of Inequality (3), and we refer to this quantity as the expected adaptation loss,

$$B_{P_{\langle S,T \rangle}}(G_\rho) \overset{\text{def}}{=} R_{P_S}(G_\rho) + \mathrm{dis}_\rho(D_S, D_T),$$

where $P_{\langle S,T \rangle}$ denotes the joint distribution over $P_S \times D_T$.
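As an illustration of the domain disagreement, the following sketch estimates $\mathrm{dis}_\rho$ from samples for a finite hypothesis set. The finite set, the prediction matrices, and the function names are assumptions made for this example; the paper itself works with Gaussian posteriors over linear classifiers. Because $\rho^2$ is a product distribution, the $\rho^2$-average over pairs reduces to the quadratic form $\rho^\top (R_T - R_S)\, \rho$, where $R_T$ and $R_S$ are the pairwise disagreement matrices on each sample.

```python
import numpy as np

def pairwise_disagreement(preds):
    """Matrix of R_D(h_i, h_j): fraction of points where h_i and h_j differ.

    preds has shape (n_hypotheses, n_points) with entries in {-1, +1}.
    """
    return np.mean(preds[:, None, :] != preds[None, :, :], axis=2)

def domain_disagreement(rho, preds_source, preds_target):
    """Empirical dis_rho = E_{(h1,h2) ~ rho^2} [R_T(h1,h2) - R_S(h1,h2)]."""
    diff = pairwise_disagreement(preds_target) - pairwise_disagreement(preds_source)
    return float(rho @ diff @ rho)

# Hypothetical predictions of two hypotheses on four source and four target points.
preds_source = np.array([[1, 1, -1, -1],
                         [1, -1, -1, -1]])
preds_target = np.array([[1, 1, 1, -1],
                         [-1, -1, 1, -1]])
rho = np.array([0.5, 0.5])

dis = domain_disagreement(rho, preds_source, preds_target)
print(dis)
```

Note that the estimate can be negative: the quantity is signed, which is why the text puts “distance” in quotation marks.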
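The independence of each draw from $P_S$ and $D_T$ allows us to rewrite $B_{P_{\langle S,T \rangle}}$ as the expectation of the domain adaptation loss $L_{DA}$,

$$B_{P_{\langle S,T \rangle}}(G_\rho) = \mathbb{E}_{(h_1, h_2) \sim \rho^2}\, \mathbb{E}_{(\mathbf{x}^s, y^s, \mathbf{x}^t) \sim P_{\langle S,T \rangle}}\, L_{DA}(h_1, h_2, \mathbf{x}^s, y^s, \mathbf{x}^t), \qquad (4)$$

$$L_{DA}(h_1, h_2, \mathbf{x}^s, y^s, \mathbf{x}^t) \overset{\text{def}}{=} I(h_1(\mathbf{x}^s) \neq y^s) + I(h_1(\mathbf{x}^t) \neq h_2(\mathbf{x}^t)) - I(h_1(\mathbf{x}^s) \neq h_2(\mathbf{x}^s)).$$

Given $\langle S, T \rangle = \{(\mathbf{x}^s_i, y^s_i, \mathbf{x}^t_i)\}_{i=1}^{m}$, a sample of $m$ source-target pairs drawn i.i.d. from $P_{\langle S,T \rangle}$, the empirical adaptation loss of $G_\rho$ is $B_{\langle S,T \rangle}(G_\rho) = \mathbb{E}_{(h_1, h_2) \sim \rho^2}\, \frac{1}{m} \sum_{i=1}^{m} L_{DA}(h_1, h_2, \mathbf{x}^s_i, y^s_i, \mathbf{x}^t_i)$.

PAC-Bayesian Bounds for Domain Adaptation. We restrict ourselves to the case exhibited by Equation (2), where $\mathcal{H}$ is a set of linear classifiers and the posterior and prior distributions are Gaussians. First, we compute the expected adaptation loss $B_{P_{\langle S,T \rangle}}(G_{\rho_{\mathbf{w}}})$ of the Gibbs classifier $G_{\rho_{\mathbf{w}}}$ (remember that the posterior distribution is centered on the linear classifier $h_{\mathbf{w}}$). With $\Phi_{dis}(a) \overset{\text{def}}{=} 2\, \Phi(a)\, \Phi(-a)$, we obtain,

$$B_{P_{\langle S,T \rangle}}(G_{\rho_{\mathbf{w}}}) = \mathbb{E}_{(\mathbf{x}^s, y^s, \mathbf{x}^t) \sim P_{\langle S,T \rangle}} \left[ \Phi\!\left( y^s\, \frac{\mathbf{w} \cdot \mathbf{x}^s}{\|\mathbf{x}^s\|} \right) + \Phi_{dis}\!\left( \frac{\mathbf{w} \cdot \mathbf{x}^t}{\|\mathbf{x}^t\|} \right) - \Phi_{dis}\!\left( \frac{\mathbf{w} \cdot \mathbf{x}^s}{\|\mathbf{x}^s\|} \right) \right].$$

Now, we derive a new PAC-Bayesian theorem to bound the expected adaptation loss of linear classifiers. Theorem 2 is obtained from two key results. First, we use the specialization of the PAC-Bayesian theory to linear classifiers introduced by Corollary 1. Second, we need the methodology developed in [6, Theorem 5] to bound a loss relying on a pair of hypotheses $(h_1, h_2) \sim \rho^2$ (like our domain adaptation loss of Equation (4)). We then obtain $\mathrm{KL}(\rho_{\mathbf{w}}^2 \,\|\, \pi_0^2) = 2\, \mathrm{KL}(\rho_{\mathbf{w}} \,\|\, \pi_0) = \|\mathbf{w}\|^2$.

Theorem 2. For any domain $P_{\langle S,T \rangle} \subseteq \mathbb{R}^d \times Y \times \mathbb{R}^d$ and any $\delta \in (0, 1]$, we have,

$$\Pr_{\langle S,T \rangle \sim (P_{\langle S,T \rangle})^m} \left( \forall \mathbf{w} \in \mathbb{R}^d : \ \mathrm{kl}\!\left( B^*_{\langle S,T \rangle} \,\|\, B^*_{P_{\langle S,T \rangle}} \right) \le \frac{1}{m} \left[ \|\mathbf{w}\|^2 + \ln \frac{\xi(m)}{\delta} \right] \right) \ge 1 - \delta,$$

where $B^*_{\langle S,T \rangle} \overset{\text{def}}{=} \frac{1}{2} B_{\langle S,T \rangle}(G_{\rho_{\mathbf{w}}}) + \frac{1}{4}$ and $B^*_{P_{\langle S,T \rangle}} \overset{\text{def}}{=} \frac{1}{2} B_{P_{\langle S,T \rangle}}(G_{\rho_{\mathbf{w}}}) + \frac{1}{4}$ ensure that the values provided to the $\mathrm{kl}(\cdot \,\|\, \cdot)$ function are in the interval $[0, 1]$.

Designing the Algorithm. The algorithm DA-PBGD, described here, minimizes the upper bound given by Theorem 2 by gradient descent. The corresponding objective function is,

$$B(\langle S, T \rangle, \mathbf{w}, \delta) \overset{\text{def}}{=} \sup \left\{ \epsilon : \ \mathrm{kl}\!\left( B^*_{\langle S,T \rangle} \,\|\, \epsilon \right) \le \frac{1}{m} \left[ \|\mathbf{w}\|^2 + \ln \frac{\xi(m)}{\delta} \right] \right\},$$

for a fixed value of $\delta$.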
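The objective above can be evaluated numerically. The sketch below is one possible way to do it, under stated assumptions: it uses the closed-form empirical adaptation loss with the sign convention of $L_{DA}$ in Equation (4) (target disagreement added, source disagreement subtracted), computes $\xi(m)$ by direct summation, and inverts the $\mathrm{kl}$ constraint by bisection. The bisection, the synthetic rotated data, and all function names are choices made for this example, not the authors' code.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def phi(a):
    """Phi(a) = (1 - Erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 - _erf(np.asarray(a, dtype=float) / math.sqrt(2.0)))

def phi_dis(a):
    """Phi_dis(a) = 2 Phi(a) Phi(-a)."""
    return 2.0 * phi(a) * phi(-a)

def empirical_adaptation_loss(w, Xs, ys, Xt):
    """B_<S,T>(G_rho_w) on m source-target pairs (Xs and Xt have m rows each)."""
    a_s = (Xs @ w) / np.linalg.norm(Xs, axis=1)
    a_t = (Xt @ w) / np.linalg.norm(Xt, axis=1)
    return float(np.mean(phi(ys * a_s) + phi_dis(a_t) - phi_dis(a_s)))

def xi(m):
    """xi(m) = sum_k C(m,k) (k/m)^k (1-k/m)^(m-k), with 0^0 taken as 1."""
    total = 0.0
    for k in range(m + 1):
        t = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        if k > 0:
            t += k * math.log(k / m)
        if k < m:
            t += (m - k) * math.log(1.0 - k / m)
        total += math.exp(t)
    return total

def kl_bernoulli(q, p, eps=1e-12):
    """kl(q || p) between Bernoulli parameters, clipped away from 0 and 1."""
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def complexity_term(w, m, delta):
    """(1/m) [ ||w||^2 + ln(xi(m) / delta) ], the right-hand side of Theorem 2."""
    return (float(w @ w) + math.log(xi(m) / delta)) / m

def da_pbgd_objective(w, Xs, ys, Xt, delta=0.05):
    """Largest value in [B*, 1] whose kl from B* stays within the complexity term."""
    b_star = 0.5 * empirical_adaptation_loss(w, Xs, ys, Xt) + 0.25
    rhs = complexity_term(w, len(ys), delta)
    lo, hi = b_star, 1.0
    for _ in range(60):  # bisection on the increasing branch of kl(b_star || .)
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(b_star, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Synthetic example: the target sample is the source sample rotated by 30 degrees.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 2))
ys = np.where(Xs[:, 0] > 0, 1, -1)
theta = math.radians(30.0)
rotation = np.array([[math.cos(theta), -math.sin(theta)],
                     [math.sin(theta), math.cos(theta)]])
Xt = Xs @ rotation.T
w = np.array([1.0, 0.5])

print(da_pbgd_objective(w, Xs, ys, Xt))
```

A gradient-based optimizer would then minimize this value over $\mathbf{w}$; the sketch only evaluates the objective at a fixed $\mathbf{w}$.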
Consequently, our problem is to find the weight vector $\mathbf{w}^*$ that minimizes $B$ subject to the constraints $B > B^*_{\langle S,T \rangle}$ and $\mathrm{kl}(B^*_{\langle S,T \rangle} \,\|\, B) = \frac{1}{m} \left[ \|\mathbf{w}\|^2 + \ln \frac{\xi(m)}{\delta} \right]$. The gradient is obtained by computing the partial derivative of both sides of the latter equation with respect to $w_j$ (the $j$th component of $\mathbf{w}$). After solving for $\partial B / \partial w_j$, we find that the gradient is,

$$\frac{B(1 - B)}{2m \left( B - B^*_{\langle S,T \rangle} \right)} \left[ 4\mathbf{w} + \ln\!\left( \frac{B \left( 1 - B^*_{\langle S,T \rangle} \right)}{B^*_{\langle S,T \rangle} (1 - B)} \right) \sum_{i=1}^{m} \left[ \Phi'\!\left( y^s_i\, \frac{\mathbf{w} \cdot \mathbf{x}^s_i}{\|\mathbf{x}^s_i\|} \right) \frac{y^s_i\, \mathbf{x}^s_i}{\|\mathbf{x}^s_i\|} + \Phi'_{dis}\!\left( \frac{\mathbf{w} \cdot \mathbf{x}^t_i}{\|\mathbf{x}^t_i\|} \right) \frac{\mathbf{x}^t_i}{\|\mathbf{x}^t_i\|} - \Phi'_{dis}\!\left( \frac{\mathbf{w} \cdot \mathbf{x}^s_i}{\|\mathbf{x}^s_i\|} \right) \frac{\mathbf{x}^s_i}{\|\mathbf{x}^s_i\|} \right] \right],$$

where $\Phi'(a)$ and $\Phi'_{dis}(a)$ denote respectively the derivatives of $\Phi$ and $\Phi_{dis}$ evaluated at $a$. The kernel trick applied to DA-PBGD allows us to work with a dual weight vector $\boldsymbol{\alpha} \in \mathbb{R}^d$ that is a linear classifier in an augmented space. Given a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we have $h_{\mathbf{w}}(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i\, k(\mathbf{x}_i, \mathbf{x})$.

Figure 1: Illustration of the decision of DA-PBGD on 4 rotation angles: from left to right 20°, 30°, 40°, 50°. The source sample is shown in green and pink, the target sample in grey.

Experimental Results. Our DA-PBGD has been evaluated on a toy problem called inter-twinning moons and compared with: PBGD and SVM with no adaptation, the semi-supervised Transductive-SVM (TSVM) [7], the iterative DA algorithm DASVM [8], and the non-iterative version of DASF [9] based on the bound (1). We used a Gaussian kernel for all the methods. These preliminary results, illustrated in Tab. 1 and in Fig. 1, are very promising. Moreover, in Fig. 2, we clearly see the trade-off between the difficulty of the task and the minimization of the source risk in action: when the DA task is feasible, DA-PBGD prefers to minimize the domain disagreement even if it implies an increase of the empirical source error, but when this minimization becomes hard, i.e. the complexity of the task is high, it prefers to focus only on the empirical source error.

Among all the possible exciting perspectives, we notably aim to theoretically define elegant and relevant assumptions allowing one to control the $\lambda_\rho$ term of Eq. (3) to make our DA bound very tight.

Table 1: Average accuracy results for 4 rotation angles. DA-PBGD is more stable than the others and outperforms all the methods for 2 angles.

Rotation angle   20°     30°     40°     50°
PBGD             99.5    89.8    78.6    60
SVM              89.6    76      68.8    60
TSVM             100     78.9    74.6    70.9
DASVM            100     78.4    71.6    66.6
DASF             98      92      83      70
DA-PBGD          97.7    97.6    97.4    53.2

Figure 2: The trade-off between target and source errors according to the difficulty of the task (i.e. the rotation angle). [Plot of source error and target error against rotation angle.]

Acknowledgments. This work was supported in part by the French project VideoSense ANR-09-CORD-026, in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence IST-2007-216886, and in part by NSERC discovery grant 262067. This publication only reflects the authors' views.

References

[1] S. Ben-David and R. Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. In Proceedings of Algorithmic Learning Theory, pages 139–153, 2012.
[2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J.W. Vaughan. A theory of learning from different domains. Machine Learning Journal, 79(1-2):151–175, 2010.
[3] D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999.
[4] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian Learning of Linear Classifiers. In Proceedings of ICML, 2009.
[5] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems 15, pages 439–446. MIT Press, 2002.
[6] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, 2007.
[7] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
[8] L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Trans. Pattern Anal. Mach. Intell., 32(5), 2010.
[9] E. Morvant, A. Habrard, and S. Ayache. Parsimonious Unsupervised and Semi-Supervised Domain Adaptation with Good Similarity Functions. Knowledge and Information Systems, 33(2):309–349, 2012.


https://hal.archives-ouvertes.fr/hal-00749366
