3,216 research outputs found

    Apprentissage automatique avec garanties de généralisation à l'aide de méthodes d'ensemble maximisant le désaccord

    We focus on machine learning, a branch of artificial intelligence. When solving a classification problem, a learning algorithm is provided with labelled data and has the task of learning a function that will be able to automatically classify future, unseen data. Many classical learning algorithms are designed to combine simple classifiers by building a weighted majority vote classifier out of them. In this thesis, we extend the usage of the C-bound, a bound on the risk of the majority vote classifier. This bound is defined using two quantities: the individual performance of the voters, and the correlation of their errors (their disagreement). First, we design majority vote generalization bounds based on the C-bound. Then, we extend this bound from binary classification to generalized majority votes. Finally, we develop new learning algorithms with state-of-the-art performance, by constructing majority votes that maximize the voters' disagreement, while controlling their individual performance. The generalization guarantees that we develop in this thesis are in the family of PAC-Bayesian bounds. We generalize the PAC-Bayesian theory by introducing a general theorem, from which the classical bounds from the literature can be recovered. Using this same theorem, we introduce generalization bounds based on the C-bound. We also simplify the proof process of PAC-Bayesian theorems, easing the development of new families of bounds. We introduce two new families of PAC-Bayesian bounds. One is based on a different notion of complexity than usual bounds, the Rényi divergence, instead of the classical Kullback-Leibler divergence. The second family is specialized to transductive learning, instead of inductive learning. The two learning algorithms that we introduce, MinCq and CqBoost, output a majority vote classifier that maximizes the disagreement between voters. A hyperparameter of the algorithms gives direct control over the individual performance of the voters. Because these two algorithms are designed to minimize PAC-Bayesian generalization bounds on the risk of the majority vote classifier, they come with rigorous theoretical guarantees. By performing an empirical evaluation, we show that MinCq and CqBoost perform as well as classical state-of-the-art algorithms.
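
As a rough illustration of the C-bound described in this abstract, the sketch below computes its empirical value on a labelled sample for binary voters. It assumes the standard binary form of the bound, 1 - (E[M])^2 / E[M^2], where M is the margin of the weighted vote. The function name `empirical_c_bound` and the array layout are illustrative assumptions, not taken from the thesis, and this is not an implementation of MinCq or CqBoost. The value is a sample statistic only, with no generalization guarantee attached.

```python
import numpy as np


def empirical_c_bound(votes, labels, weights):
    """Empirical C-bound of a weighted majority vote (binary case).

    votes:   (n_samples, n_voters) array with entries in {-1, +1}
    labels:  (n_samples,) array with entries in {-1, +1}
    weights: (n_voters,) non-negative weights summing to 1
    """
    votes = np.asarray(votes, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Margin of the weighted vote on each example: y * sum_h Q(h) h(x).
    margins = labels * (votes @ weights)

    first_moment = margins.mean()
    second_moment = (margins ** 2).mean()

    # The bound is only defined when the expected margin is positive.
    if first_moment <= 0:
        raise ValueError("C-bound requires a positive expected margin")

    return 1.0 - first_moment ** 2 / second_moment


# Three voters, four examples; each voter errs on at most one example.
votes = [[1, 1, 1], [1, 1, -1], [1, -1, 1], [-1, 1, 1]]
labels = [1, 1, 1, 1]
weights = [1 / 3, 1 / 3, 1 / 3]
bound = empirical_c_bound(votes, labels, weights)
```

In this toy sample the majority vote makes no error, while each voter that errs does so on a different example. The second moment of the margin shrinks as the voters disagree more, which is the quantity the abstract says the algorithms try to exploit.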

    On the Generalization of the C-Bound to Structured Output Ensemble Methods

    This paper generalizes an important result from the PAC-Bayesian literature for binary classification to the case of ensemble methods for structured outputs. We prove a generic version of the C-bound, an upper bound over the risk of models expressed as a weighted majority vote that is based on the first and second statistical moments of the vote's margin. This bound may advantageously (i) be applied to more complex outputs such as multiclass and multilabel outputs, and (ii) allow margin relaxations to be considered. These results open the way to developing new ensemble methods for structured output prediction with PAC-Bayesian guarantees.

    SIGIR 2021 E-Commerce Workshop Data Challenge

    The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for "In-session prediction for purchase intent and recommendations". The challenge addresses the growing need for reliable predictions within the boundaries of a shopping session, as customer intentions can be different depending on the occasion. The need for efficient procedures for personalization is even clearer if we consider the e-commerce landscape more broadly: outside of giant digital retailers, the constraints of the problem are stricter, due to smaller user bases and the realization that most users are not frequently returning customers. We release a new session-based dataset including more than 30M fine-grained browsing events (product detail, add, purchase), enriched by linguistic behavior (queries made by shoppers, with items clicked and items not clicked after the query) and catalog meta-data (images, text, pricing information). On this dataset, we ask participants to showcase innovative solutions for two open problems: a recommendation task, where a model is shown some events at the start of a session and is asked to predict future product interactions; and an intent prediction task, where a model is shown a session containing an add-to-cart event and is asked to predict whether the item will be bought before the end of the session. Comment: SIGIR eCOM 2021 Data Challenge

    PAC-Bayesian Bounds based on the Rényi Divergence

    We propose a simplified proof process for PAC-Bayesian generalization bounds that allows the proof to be divided into four successive inequalities, easing the "customization" of PAC-Bayesian theorems. We also propose a family of PAC-Bayesian bounds based on the Rényi divergence between the prior and posterior distributions, whereas most PAC-Bayesian bounds are based on the Kullback-Leibler divergence. Finally, we present an empirical evaluation of the tightness of each inequality of the simplified proof, for both the classical PAC-Bayesian bounds and those based on the Rényi divergence.
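
To make the two complexity measures named in this abstract concrete, the sketch below compares the Rényi divergence of order alpha with the Kullback-Leibler divergence for discrete distributions. It uses the textbook definitions only; the function names are illustrative assumptions, the prior `p` is assumed strictly positive, and nothing here reproduces the paper's bounds or proof steps.

```python
import numpy as np


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions; assumes p > 0 wherever q > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q_i = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))


def renyi_divergence(q, p, alpha):
    """Renyi divergence D_alpha(q || p) = log(sum q^alpha p^(1-alpha)) / (alpha - 1).

    Assumes alpha > 0, alpha != 1, and strictly positive p.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.log(np.sum(q ** alpha * p ** (1.0 - alpha))) / (alpha - 1.0))


posterior = [0.7, 0.2, 0.1]      # hypothetical posterior over three voters
prior = [1 / 3, 1 / 3, 1 / 3]    # hypothetical uniform prior

kl = kl_divergence(posterior, prior)
renyi_2 = renyi_divergence(posterior, prior, alpha=2.0)
renyi_near_1 = renyi_divergence(posterior, prior, alpha=1.0001)
```

As alpha approaches 1 the Rényi divergence approaches the Kullback-Leibler divergence, and it does not decrease as alpha grows, so `renyi_2` is at least as large as `kl` here.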

    Recent Developments of an Opto-Electronic THz Spectrometer for High-Resolution Spectroscopy

    Sources and detectors that can be employed in the THz range are reviewed, followed by a description of an opto-electronic source of monochromatic THz radiation. The realized spectrometer has been applied to gas phase spectroscopy. Air-broadening coefficients of HCN are determined, and the insensitivity of this technique to aerosols is demonstrated by the analysis of cigarette smoke. A multiple pass sample cell has been used to obtain a sensitivity improvement, allowing transitions of volatile organic compounds to be observed. A solution to the frequency metrology is presented and promises to yield accurate molecular line center measurements.

    Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context

    Long noncoding RNAs (lncRNAs) are commonly dysregulated in tumors, but only a handful are known to play pathophysiological roles in cancer. We inferred lncRNAs that dysregulate cancer pathways, oncogenes, and tumor suppressors (cancer genes) by modeling their effects on the activity of transcription factors, RNA-binding proteins, and microRNAs in 5,185 TCGA tumors and 1,019 ENCODE assays. Our predictions included hundreds of candidate onco- and tumor-suppressor lncRNAs (cancer lncRNAs) whose somatic alterations account for the dysregulation of dozens of cancer genes and pathways in each of 14 tumor contexts. To demonstrate proof of concept, we showed that perturbations targeting OIP5-AS1 (an inferred tumor suppressor) and TUG1 and WT1-AS (inferred onco-lncRNAs) dysregulated cancer genes and altered proliferation of breast and gynecologic cancer cells. Our analysis indicates that, although most lncRNAs are dysregulated in a tumor-specific manner, some, including OIP5-AS1, TUG1, NEAT1, MEG3, and TSIX, synergistically dysregulate cancer pathways in multiple tumor contexts.

    Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas

    Although the MYC oncogene has been implicated in cancer, a systematic assessment of alterations of MYC, related transcription factors, and co-regulatory proteins, forming the proximal MYC network (PMN), across human cancers is lacking. Using computational approaches, we define genomic and proteomic features associated with MYC and the PMN across the 33 cancers of The Cancer Genome Atlas. Pan-cancer, 28% of all samples had at least one of the MYC paralogs amplified. In contrast, the MYC antagonists MGA and MNT were the most frequently mutated or deleted members, proposing a role as tumor suppressors. MYC alterations were mutually exclusive with PIK3CA, PTEN, APC, or BRAF alterations, suggesting that MYC is a distinct oncogenic driver. Expression analysis revealed MYC-associated pathways in tumor subtypes, such as immune response and growth factor signaling; chromatin, translation, and DNA replication/repair were conserved pan-cancer. This analysis reveals insights into MYC biology and is a reference for biomarkers and therapeutics for cancers with alterations of MYC or the PMN.