Apprentissage automatique avec garanties de généralisation à l'aide de méthodes d'ensemble maximisant le désaccord (Machine learning with generalization guarantees using disagreement-maximizing ensemble methods)
We focus on machine learning, a branch of artificial intelligence. When solving a classification problem, a learning algorithm is given labelled data and must learn a function that can automatically classify future, unseen data. Many classical learning algorithms combine simple classifiers by building a weighted majority vote classifier out of them. In this thesis, we extend the use of the C-bound, a bound on the risk of the majority vote classifier. This bound is defined using two quantities: the individual performance of the voters, and the correlation of their errors (their disagreement). First, we design majority vote generalization bounds based on the C-bound. Then, we extend this bound from binary classification to generalized majority votes. Finally, we develop new learning algorithms whose performance is comparable to the state of the art, by constructing majority votes that maximize the voters' disagreement while controlling their individual performance. The generalization guarantees that we develop in this thesis belong to the family of PAC-Bayesian bounds. We generalize the PAC-Bayesian theory by introducing a general theorem from which the classical bounds of the literature can be recovered. Using this same theorem, we introduce generalization bounds based on the C-bound. We also simplify the proof process of PAC-Bayesian theorems, which eases the development of new families of bounds, and we introduce two such families. The first is based on a notion of complexity different from that of the usual bounds: the Rényi divergence instead of the classical Kullback-Leibler divergence. The second is specialized to transductive learning instead of inductive learning. The two learning algorithms that we introduce, MinCq and CqBoost, output a majority vote classifier that maximizes the disagreement between voters. A hyperparameter of the algorithms gives direct control over the individual performance of the voters. Because these two algorithms are designed to minimize PAC-Bayesian generalization bounds on the risk of the majority vote classifier, they come with rigorous theoretical guarantees. Through an empirical evaluation, we show that MinCq and CqBoost perform as well as classical state-of-the-art algorithms.
On the Generalization of the C-Bound to Structured Output Ensemble Methods
This paper generalizes an important result from the PAC-Bayesian literature for binary classification to the case of ensemble methods for structured outputs. We prove a generic version of the C-bound, an upper bound on the risk of models expressed as a weighted majority vote that is based on the first and second statistical moments of the vote's margin. This bound can be applied to more complex outputs such as multiclass and multilabel predictions, and it allows margin relaxations to be considered. These results open the way to developing new ensemble methods for structured output prediction with PAC-Bayesian guarantees.
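To make the role of the two margin moments concrete, here is a minimal sketch of the binary-classification C-bound in its usual form, 1 - (mu1^2 / mu2), where mu1 and mu2 are the first and second moments of the vote's margin and mu1 is assumed positive. The abstract does not spell out the formula, so this form, the restriction to labels and votes in {-1, +1}, and the function names `margins`, `empirical_c_bound` and `majority_vote_risk` are assumptions made for illustration; the structured-output generalization described above is not reproduced here.

```python
def margins(votes, labels, weights):
    """Margin of the weighted vote on each example.

    votes:   list of rows, one per example, each row holding every
             voter's prediction in {-1, +1}
    labels:  true labels in {-1, +1}
    weights: non-negative voter weights summing to 1
    """
    return [
        y * sum(w * v for w, v in zip(weights, row))
        for row, y in zip(votes, labels)
    ]


def empirical_c_bound(votes, labels, weights):
    """Empirical C-bound: 1 - mu1^2 / mu2 (illustrative binary form)."""
    m = margins(votes, labels, weights)
    mu1 = sum(m) / len(m)                 # first moment of the margin
    mu2 = sum(x * x for x in m) / len(m)  # second moment of the margin
    if mu1 <= 0:
        raise ValueError("the C-bound requires a positive first moment")
    return 1.0 - (mu1 * mu1) / mu2


def majority_vote_risk(votes, labels, weights):
    """Fraction of examples on which the weighted vote has margin <= 0."""
    m = margins(votes, labels, weights)
    return sum(1 for x in m if x <= 0) / len(m)
```

On any sample where the first moment is positive, the value returned by `empirical_c_bound` is at least the empirical majority vote risk on that same sample, which follows from the Cantelli-Chebyshev inequality. Turning it into a guarantee on unseen data requires the PAC-Bayesian machinery that the abstract refers to.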
SIGIR 2021 E-Commerce Workshop Data Challenge
The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for
"In-session prediction for purchase intent and recommendations". The challenge
addresses the growing need for reliable predictions within the boundaries of a
shopping session, as customer intentions can be different depending on the
occasion. The need for efficient procedures for personalization is even clearer
if we consider the e-commerce landscape more broadly: outside of giant digital
retailers, the constraints of the problem are stricter, due to smaller user
bases and the realization that most users are not frequently returning
customers. We release a new session-based dataset including more than 30M
fine-grained browsing events (product detail, add, purchase), enriched by
linguistic behavior (queries made by shoppers, with items clicked and items not
clicked after the query) and catalog meta-data (images, text, pricing
information). On this dataset, we ask participants to showcase innovative
solutions for two open problems: a recommendation task (where a model is shown
some events at the start of a session, and it is asked to predict future
product interactions); an intent prediction task, where a model is shown a
session containing an add-to-cart event, and it is asked to predict whether the
item will be bought before the end of the session. Comment: SIGIR eCOM 2021 Data Challenge
PAC-Bayesian Bounds based on the Rényi Divergence
We propose a simplified proof process for PAC-Bayesian generalization bounds that divides the proof into four successive inequalities, easing the "customization" of PAC-Bayesian theorems. We also propose a family of PAC-Bayesian bounds based on the Rényi divergence between the prior and posterior distributions, whereas most PAC-Bayesian bounds are based on the Kullback-Leibler divergence. Finally, we present an empirical evaluation of the tightness of each inequality of the simplified proof, for both the classical PAC-Bayesian bounds and those based on the Rényi divergence.
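As a point of reference for the two complexity measures named above, the following sketch computes the Kullback-Leibler divergence and the Rényi divergence of order alpha between a posterior Q and a prior P over a finite set of voters. It uses the standard textbook definitions. The abstract does not say which orders or which distributions the bounds use, so the discrete setting, the strictly positive prior, and the function names are assumptions made for illustration.

```python
import math


def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as lists of probabilities.

    Assumes p[i] > 0 wherever q[i] > 0.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def renyi_divergence(q, p, alpha):
    """Renyi divergence of order alpha:

        D_alpha(Q || P) = log(sum_i q_i^alpha * p_i^(1 - alpha)) / (alpha - 1)

    Defined here for alpha > 0, alpha != 1; assumes a strictly positive prior p.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(
        (qi ** alpha) * (pi ** (1.0 - alpha))
        for qi, pi in zip(q, p)
        if qi > 0
    )
    return math.log(total) / (alpha - 1.0)
```

The Rényi divergence is non-decreasing in alpha and tends to the Kullback-Leibler divergence as alpha approaches 1, so for alpha > 1 it is never smaller than the KL term it replaces.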
Recent Developments of an Opto-Electronic THz Spectrometer for High-Resolution Spectroscopy
We review the sources and detectors that can be employed in the THz range, then describe an opto-electronic source of monochromatic THz radiation. The resulting spectrometer has been applied to gas-phase spectroscopy. Air-broadening coefficients of HCN are determined, and the insensitivity of the technique to aerosols is demonstrated by the analysis of cigarette smoke. A multiple-pass sample cell has been used to improve sensitivity, allowing transitions of volatile organic compounds to be observed. A solution to the frequency metrology is presented and promises to yield accurate molecular line center measurements.
Pioglitazone together with imatinib in chronic myeloid leukemia: A proof of concept study
BACKGROUND We recently reported that peroxisome proliferator-activated receptor γ agonists target chronic myeloid leukemia (CML) quiescent stem cells in vitro by decreasing transcription of STAT5. Here in the ACTIM phase 2 clinical trial, we asked whether pioglitazone add-on therapy to imatinib would impact CML residual disease, as assessed by BCR-ABL1 transcript quantification. METHODS CML patients were eligible if treated with imatinib for at least 2 years at a stable daily dose, having yielded major molecular response (MMR) but not having achieved molecular response 4.5 (MR4.5) defined by BCR-ABL1/ABL1 IS RNA levels ≤ 0.0032%. After inclusion, patients started pioglitazone at a dosage of 30 to 45 mg/day in addition to imatinib. The primary objective was to evaluate the cumulative incidence of patients having progressed from MMR to MR4.5 over 12 months. RESULTS Twenty-four patients were included (age range, 24-79 years). No pharmacological interaction was observed between the drugs. The main adverse events were weight gain in 12 patients and a mean decrease of 0.4 g/dL in hemoglobin concentration. The cumulative incidence of MR4.5 was 56% (95% confidence interval, 37%-76%) by 12 months, despite a wide range of therapy duration (1.9-15.5 months), and 88% of 17 evaluable patients who were still on imatinib reached MR4.5 by 48 months. The cumulative incidence of MMR to MR4.5 spontaneous conversions over 12 months was estimated to be 23% with imatinib alone in a parallel cohort of patients. CONCLUSION Pioglitazone in combination with imatinib was well tolerated and yielded a favorable 56% rate. These results provide a proof of concept needing confirmation within a randomized clinical trial (EudraCT 2009-011675-79). Cancer 2017;123:1791-1799. © 2016 The Authors. Cancer published by Wiley Periodicals, Inc. on behalf of American Cancer Society.
This is an open access article under the terms of the Creative Commons Attribution NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context
Long noncoding RNAs (lncRNAs) are commonly dysregulated in tumors, but only a handful are known to play pathophysiological roles in cancer. We inferred lncRNAs that dysregulate cancer pathways, oncogenes, and tumor suppressors (cancer genes) by modeling their effects on the activity of transcription factors, RNA-binding proteins, and microRNAs in 5,185 TCGA tumors and 1,019 ENCODE assays. Our predictions included hundreds of candidate onco- and tumor-suppressor lncRNAs (cancer lncRNAs) whose somatic alterations account for the dysregulation of dozens of cancer genes and pathways in each of 14 tumor contexts. To demonstrate proof of concept, we showed that perturbations targeting OIP5-AS1 (an inferred tumor suppressor) and TUG1 and WT1-AS (inferred onco-lncRNAs) dysregulated cancer genes and altered proliferation of breast and gynecologic cancer cells. Our analysis indicates that, although most lncRNAs are dysregulated in a tumor-specific manner, some, including OIP5-AS1, TUG1, NEAT1, MEG3, and TSIX, synergistically dysregulate cancer pathways in multiple tumor contexts.
Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas
Although the MYC oncogene has been implicated in cancer, a systematic assessment of alterations of MYC, related transcription factors, and co-regulatory proteins, forming the proximal MYC network (PMN), across human cancers is lacking. Using computational approaches, we define genomic and proteomic features associated with MYC and the PMN across the 33 cancers of The Cancer Genome Atlas. Pan-cancer, 28% of all samples had at least one of the MYC paralogs amplified. In contrast, the MYC antagonists MGA and MNT were the most frequently mutated or deleted members, suggesting a role as tumor suppressors. MYC alterations were mutually exclusive with PIK3CA, PTEN, APC, or BRAF alterations, suggesting that MYC is a distinct oncogenic driver. Expression analysis revealed MYC-associated pathways in tumor subtypes, such as immune response and growth factor signaling; chromatin, translation, and DNA replication/repair were conserved pan-cancer. This analysis reveals insights into MYC biology and is a reference for biomarkers and therapeutics for cancers with alterations of MYC or the PMN.