
    Identification of SNP interactions using logic regression

    Interactions of single nucleotide polymorphisms (SNPs) are assumed to be responsible for complex diseases such as sporadic breast cancer. Important goals of studies concerned with such genetic data are thus to identify combinations of SNPs that lead to a higher risk of developing a disease and to measure the importance of these interactions. Many approaches based on classification methods such as CART and Random Forests allow measuring the importance of single variables, but none of them can directly quantify the importance of combinations of variables. In this paper, we show how logic regression can be employed to identify SNP interactions explanatory for the disease status in a case-control study and propose two measures for quantifying the importance of these interactions for classification. These approaches are then applied to simulated data sets and to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. Keywords: Single Nucleotide Polymorphism, Feature Selection, Variable Importance Measure, GENICA

    Feedforward deep architectures for classification and synthesis

    This thesis by articles makes several contributions to the field of deep learning, with applications to both classification and synthesis of natural images. Specifically, we introduce several new techniques for the construction and training of deep feedforward networks, and present an empirical investigation into dropout, one of the most popular regularization strategies of the last several years. In the first article, we present a novel piecewise linear parameterization of neural networks, maxout, which allows each hidden unit of a neural network to effectively learn its own convex activation function. We demonstrate improvements on several object recognition benchmarks, and empirically investigate the source of these improvements, including an improved synergy with the recently proposed dropout regularization method. In the second article, we examine the dropout algorithm further. Focusing on networks of the popular rectified linear units (ReLU), we empirically examine several questions regarding dropout's remarkable effectiveness as a regularizer, including questions surrounding the fast test-time rescaling trick and the geometric mean it approximates, interpretations as an ensemble as compared with traditional ensembles, and the importance of using a bagging-like criterion for optimization. In the third article, we address a practical problem in industrial-scale application of deep networks for multi-label object recognition, namely improving an existing model's ability to discriminate between frequently confused classes. We accomplish this by using the network's own predictions to inform a partitioning of the label space, and augment the network with dedicated discriminative capacity addressing each of the partitions. Finally, in the fourth article, we tackle the problem of fitting implicit generative models of open domain collections of natural images using the recently introduced Generative Adversarial Networks (GAN) paradigm. We introduce an augmented training procedure which employs a denoising autoencoder, trained in a high-level feature space learned by the discriminator, to guide the generator towards feature encodings which more closely resemble the data. We quantitatively evaluate our findings using the recently proposed Inception score.
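
As a rough illustration of the maxout unit described in the abstract above, the following sketch computes a hidden unit as the maximum over several affine functions of its input. The function name `maxout`, the argument layout, and the two example units are assumptions made for illustration; they are not taken from the thesis.

```python
def maxout(x, weights, biases):
    """Maxout unit: the maximum over k affine pieces w_j . x + b_j.

    x is a list of d floats, weights is a list of k weight vectors
    (each a list of d floats), and biases is a list of k floats.
    """
    pieces = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(pieces)


def abs_unit(x):
    # Two pieces, z = x and z = -x: the unit computes |x|.
    return maxout([x], [[1.0], [-1.0]], [0.0, 0.0])


def relu_unit(x):
    # Two pieces, z = x and z = 0: the unit computes max(x, 0).
    return maxout([x], [[1.0], [0.0]], [0.0, 0.0])
```

Because a maximum of affine functions is always convex and piecewise linear, different weight settings give different convex activations; in a trained network the weights would be learned rather than fixed by hand as they are here.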

    A nonparametric point estimation technique using the m-out-of-n bootstrap

    We investigate a method that can be used to improve an existing point estimator by modifying the estimator and using the m-out-of-n bootstrap. The estimation method, known in the literature as bootstrap robust aggregating (or BRAGGing), will be applied in general to estimators that satisfy the smooth function model (for example, a mean, a variance, a ratio of means or variances, or a correlation coefficient), and then specifically to an estimator for the population mean. BRAGGing estimators based on both a naive and a corrected version of the m-out-of-n bootstrap will be considered. We conclude with proposed data-based choices of the resample size, m, as well as Monte Carlo studies illustrating the performance of the estimators when estimating the population mean for various distributions.
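
A minimal sketch of the general idea, under stated assumptions: each bootstrap resample draws m values with replacement from the n observations, the estimator is evaluated on every resample, and the replicates are aggregated with the median (the aggregation usually associated with bragging). The function name `bragging_estimate` and its parameters are hypothetical, the resample size m is simply supplied by the caller, and neither the corrected bootstrap version nor the data-based choice of m mentioned in the abstract is implemented.

```python
import random
import statistics


def bragging_estimate(data, estimator, m, n_boot=500, seed=0):
    """Median of an estimator over m-out-of-n bootstrap resamples.

    data: list of observations (n values)
    estimator: function mapping a list of values to a number
    m: resample size, chosen by the caller (assumption: 1 <= m <= n)
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    replicates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=m)  # m draws with replacement
        replicates.append(estimator(resample))
    return statistics.median(replicates)


sample = [2.1, 1.9, 2.4, 2.0, 15.0, 2.2, 1.8, 2.3, 2.0, 2.1]
bragged_mean = bragging_estimate(sample, statistics.mean, m=5)
```

Here `sample` contains one outlying value, the kind of case where a median-aggregated estimate of the mean may differ noticeably from the plain sample mean.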

    The Measure of a MAC: A Quasi-Experimental Protocol for Tokenizing Force Majeure Clauses in M&A Agreements

    We develop a protocol for using a well-known lawyer-coded data set on Material Adverse Change/Effect clauses in acquisition agreements to tokenize and calibrate a machine learning algorithm of textual analysis. Our protocol, built on both regular expression (RE) and latent semantic analysis (LSA) approaches, is designed to replicate, correct, and extend the reach of the hand-coded data. Our preliminary results indicate that both approaches perform well, though a hybridized approach improves predictive power even more. We employ Monte Carlo simulations to show that our results generally carry over to out-of-sample predictions. We conclude that similar approaches could be used much more broadly in empirical legal scholarship, most specifically in the study of transactional documents in business law.

    Asteroseismology from multi-month Kepler photometry: the evolved Sun-like stars KIC 10273246 and KIC 10920273

    The evolved main-sequence Sun-like stars KIC 10273246 (F-type) and KIC 10920273 (G-type) were observed with the NASA Kepler satellite for approximately ten months with a duty cycle in excess of 90%. Such continuous and long observations are unprecedented for solar-type stars other than the Sun. We aimed mainly at extracting estimates of p-mode frequencies, as well as of other individual mode parameters, from the power spectra of the light curves of both stars, thus providing scope for a full seismic characterization. The light curves were corrected for instrumental effects in a manner independent of the Kepler Science Pipeline. Estimation of individual mode parameters was based both on the maximization of the likelihood of a model describing the power spectrum and on a classic prewhitening method. Finally, we employed a procedure for selecting frequency lists to be used in stellar modeling. A total of 30 and 21 modes of degree l=0,1,2, spanning at least eight radial orders, have been identified for KIC 10273246 and KIC 10920273, respectively. Two avoided crossings (l=1 ridge) have been identified for KIC 10273246, whereas one avoided crossing plus another likely one have been identified for KIC 10920273. Good agreement is found between observed and predicted mode amplitudes for the F-type star KIC 10273246, based on a revised scaling relation. Estimates are given of the rotational periods, the parameters describing stellar granulation and the global asteroseismic parameters $\Delta\nu$ and $\nu_{\rm max}$. Comment: 15 pages, 15 figures, to be published in Astronomy & Astrophysics

    Ensembles of extremely randomized trees and some generic applications

    Peer reviewed. In this paper we present a new tree-based ensemble method called "Extra-Trees". This algorithm averages predictions of trees obtained by partitioning the input space with randomly generated splits, leading to significant improvements of precision and to various algorithmic advantages, in particular reduced computational complexity and scalability. We also discuss two generic applications of this algorithm, namely time-series classification and the automatic inference of near-optimal sequential decision policies from experimental data.
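
The core idea in the abstract above, averaging trees whose splits are drawn at random, can be sketched as follows. This is a deliberately simplified illustration and not the published algorithm: it assumes a single numeric input feature, draws exactly one uniformly random cut-point per node, and grows every tree on the full training set. All function names are hypothetical.

```python
import random
import statistics


def build_tree(points, rng, min_size=2):
    """Grow one randomized regression tree from a list of (x, y) pairs.

    A leaf is a float (mean of y); an internal node is (cut, left, right).
    """
    xs = [x for x, _ in points]
    leaf = statistics.fmean([y for _, y in points])
    lo, hi = min(xs), max(xs)
    if len(points) < min_size or lo == hi:
        return leaf
    cut = rng.uniform(lo, hi)  # random cut-point, not an optimized one
    left = [p for p in points if p[0] < cut]
    right = [p for p in points if p[0] >= cut]
    if not left or not right:  # degenerate cut: stop splitting here
        return leaf
    return (cut, build_tree(left, rng, min_size), build_tree(right, rng, min_size))


def predict_tree(tree, x):
    # Walk down from the root until a leaf value is reached.
    while isinstance(tree, tuple):
        cut, left, right = tree
        tree = left if x < cut else right
    return tree


def predict_forest(trees, x):
    # The ensemble prediction is the average over all trees.
    return statistics.fmean([predict_tree(t, x) for t in trees])


# Toy step function: y = 0 for x < 10, y = 1 otherwise.
train = [(float(x), 0.0 if x < 10 else 1.0) for x in range(20)]
rng = random.Random(0)
forest = [build_tree(train, rng) for _ in range(50)]
```

Because no split is optimized, each tree is cheap to build; the averaging step is what is expected to smooth out the variance of the individual random trees.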