
    Learning the optimal scale for GWAS through hierarchical SNP aggregation

    Motivation: Genome-Wide Association Studies (GWAS) seek to identify causal genomic variants associated with rare human diseases. The classical statistical approach for detecting these variants is based on univariate hypothesis testing, with healthy individuals being tested against affected individuals at each locus. Given that an individual's genotype is characterized by up to one million SNPs, this approach lacks precision, since it may yield a large number of false positives that can lead to erroneous conclusions about genetic associations with the disease. One way to improve the detection of true genetic associations is to reduce the number of hypotheses to be tested by grouping SNPs. Results: We propose a dimension-reduction approach which can be applied in the context of GWAS by making use of the haplotype structure of the human genome. We compare our method with standard univariate and multivariate approaches on both synthetic and real GWAS data, and we show that reducing the dimension of the predictor matrix by aggregating SNPs gives greater precision in the detection of associations between the phenotype and genomic regions.
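    As a rough illustration of the SNP-grouping idea described in this abstract, the sketch below clusters correlated SNPs hierarchically and runs one association test per aggregated group instead of one per SNP. The clustering cut-off, the mean-genotype aggregation and the rank-sum test are assumptions made for the example, not the authors' exact method.

```python
# Illustrative sketch only: reduce the number of GWAS hypotheses by grouping
# correlated SNPs before testing. The distance, cut height, aggregation rule
# and per-group test are all assumptions, not the paper's procedure.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n_ind, n_snp = 200, 500
genotypes = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)  # 0/1/2 allele counts
phenotype = rng.integers(0, 2, size=n_ind)                          # case/control labels

# 1. Hierarchical clustering of SNPs on a (1 - |correlation|) distance.
corr = np.corrcoef(genotypes.T)
dist = np.clip(1.0 - np.abs(corr), 0.0, None)
Z = linkage(dist[np.triu_indices(n_snp, k=1)], method="average")    # condensed distances
labels = fcluster(Z, t=0.5, criterion="distance")                   # arbitrary cut height

# 2. Aggregate SNPs within each cluster (here: mean allele count per cluster).
n_groups = labels.max()
aggregated = np.column_stack(
    [genotypes[:, labels == g].mean(axis=1) for g in range(1, n_groups + 1)]
)

# 3. One test per group instead of one per SNP: far fewer hypotheses to correct for.
pvals = [
    mannwhitneyu(aggregated[phenotype == 1, j], aggregated[phenotype == 0, j]).pvalue
    for j in range(n_groups)
]
print(f"{n_snp} SNPs reduced to {n_groups} grouped tests; smallest p-value = {min(pvals):.3g}")
```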

    Surveys, Astrometric Follow-up & Population Statistics

    Asteroid surveys are the backbone of asteroid science, and with this in mind we begin with a broad review of the impact of asteroid surveys on our field. We then provide a brief history of asteroid discoveries so as to place contemporary and future surveys in perspective. Surveys in the United States have discovered the vast majority of the asteroids, and this dominance has been consolidated since the publication of Asteroids III. Our descriptions of the asteroid surveys that have been operational since that time are focussed upon those that have contributed the vast majority of asteroid observations and discoveries. We also provide some insight into upcoming next-generation surveys that are sure to alter our understanding of the small bodies in the inner solar system and provide evidence to untangle their complicated dynamical and physical histories. The Minor Planet Center, the nerve center of the asteroid discovery effort, has improved its operations significantly in the past decade so that it can manage the increasing discovery rate, and ensure that it is well-placed to handle the data rates expected in the next decade. We also consider the difficulties associated with astrometric follow-up of newly identified objects. It seems clear that both of these efforts must operate in new modes in order to keep pace with expected discovery rates of next-generation ground- and space-based surveys. Comment: Chapter to appear in the book ASTEROIDS IV (University of Arizona Press, Space Science Series), edited by P. Michel, F. DeMeo and W. Bottke.

    Outlier identification in radiation therapy knowledge-based planning: A study of pelvic cases.

    PURPOSE: The purpose of this study was to apply statistical metrics to identify outliers and to investigate the impact of outliers on knowledge-based planning in radiation therapy of pelvic cases. We also aimed to develop a systematic workflow for identifying and analyzing geometric and dosimetric outliers. METHODS: Four groups (G1-G4) of pelvic plans were sampled in this study. These include the following three groups of clinical IMRT cases: G1 (37 prostate cases), G2 (37 prostate plus lymph node cases) and G3 (37 prostate bed cases). Cases in G4 were planned in accordance with a dynamic-arc radiation therapy procedure and include 10 prostate cases in addition to those from G1. The workflow was separated into two parts: (1) identifying geometric outliers, assessing outlier impact, and outlier cleaning; (2) identifying dosimetric outliers, assessing outlier impact, and outlier cleaning. G2 and G3 were used to analyze the effects of geometric outliers (first experiment outlined below), while G1 and G4 were used to analyze the effects of dosimetric outliers (second experiment outlined below). A baseline model was trained by regarding all G2 cases as inliers. G3 cases were then individually added to the baseline model as geometric outliers. The impact on the model was assessed by comparing leverages of inliers (G2) and outliers (G3). A receiver-operating-characteristic (ROC) analysis was performed to determine the optimal threshold. The experiment was repeated by training the baseline model with all G3 cases as inliers and perturbing the model with G2 cases as outliers. A separate baseline model was trained with 32 G1 cases. Each G4 case (dosimetric outlier) was subsequently added to perturb the model. Predictions of dose-volume histograms (DVHs) were made using these perturbed models for the remaining 5 G1 cases. A Weighted Sum of Absolute Residuals (WSAR) was used to evaluate the impact of the dosimetric outliers. RESULTS: The leverage of inliers and outliers was significantly different. The area under the curve (AUC) for differentiating G2 (outliers) from G3 (inliers) was 0.98 (threshold: 0.27) for the bladder and 0.81 (threshold: 0.11) for the rectum. For differentiating G3 (outlier) from G2 (inlier), the AUC (threshold) was 0.86 (0.11) for the bladder and 0.71 (0.11) for the rectum. A significant increase in WSAR was observed in the model with 3 dosimetric outliers for the bladder (P < 0.005 with Bonferroni correction), and in the model with only 1 dosimetric outlier for the rectum (P < 0.005). CONCLUSIONS: We established a systematic workflow for identifying and analyzing geometric and dosimetric outliers, and investigated statistical metrics for outlier detection. The results validated the necessity of outlier detection and clean-up to enhance model quality in clinical practice.
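    The leverage comparison at the core of the geometric-outlier experiment can be sketched in a few lines: leverage is the diagonal of the hat matrix of the training design, and cases above a threshold are flagged. The features and group sizes below are synthetic stand-ins; only the 0.27 bladder threshold quoted in the abstract is taken from the text.

```python
# Minimal sketch of leverage-based case screening, assuming a linear model of
# dose features on geometric features. Leverage is the diagonal of the hat
# matrix H = X (X^T X)^+ X^T; the data here are invented for illustration.
import numpy as np

def leverages(X: np.ndarray) -> np.ndarray:
    """Return the hat-matrix diagonal for a design matrix X (intercept added)."""
    Xc = np.column_stack([np.ones(len(X)), X])
    H = Xc @ np.linalg.pinv(Xc.T @ Xc) @ Xc.T
    return np.diag(H)

rng = np.random.default_rng(1)
inlier_features = rng.normal(0.0, 1.0, size=(37, 3))    # e.g. 37 G2-like training cases
candidate_case = rng.normal(3.0, 1.0, size=(1, 3))      # one G3-like case added to the model

X = np.vstack([inlier_features, candidate_case])
lev = leverages(X)

threshold = 0.27                                         # bladder threshold reported above
flagged = np.where(lev > threshold)[0]
print("high-leverage cases:", flagged, "| leverage of the added case:", round(lev[-1], 3))
```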

    Best practice of risk modelling in motor insurance: using GLM and Machine Learning approach

    Master's dissertation in Actuarial Science. Insurance pricing is becoming increasingly interesting and challenging because the volume of data available for analysis is growing explosively, and insurers urgently need to reconsider how to handle that data accurately and precisely. To implement pricing sophistication in motor insurance products, we apply machine learning techniques, including penalized GLMs and boosting methods, which help us identify the important features among a large number of candidate variables and detect potential interactions without manually trying the endless two-way combinations. To make proper use of these methods, we need to understand the research objective, the preliminary assumptions and the underlying statistical methodology. Although there is some evidence that machine learning models have higher predictive power than traditional GLMs (Generalized Linear Models), GLMs are more convenient and interpretable, especially as multiplicative models. A GLM is also easier to explain to stakeholders, so we still build our risk models as GLMs while absorbing the insights from the machine learning results. Model evaluation proceeds in stages: residual analysis on the training and validation datasets, and testing errors on the holdout dataset. After peer review, adjustments are applied to each model to make it significant and robust. The resulting models are expected to have high predictive power on out-of-sample data and can therefore be used in the future.
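    A hedged sketch of the two-step idea described above: a lasso-penalized Poisson GLM screens candidate rating factors, and a plain, interpretable GLM is then refit on the survivors. The variable names, penalty strength and simulated data are invented for illustration, and the boosting step is omitted.

```python
# Hedged sketch: lasso-penalized Poisson GLM for variable screening, followed
# by an unpenalized Poisson GLM on the retained variables. Feature names,
# penalty strength and the simulated data are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
X = pd.DataFrame({
    "driver_age": rng.uniform(18, 80, n),
    "vehicle_age": rng.uniform(0, 20, n),
    "bonus_malus": rng.uniform(50, 200, n),
    "noise_1": rng.normal(size=n),    # irrelevant candidates the penalty should drop
    "noise_2": rng.normal(size=n),
})
true_rate = np.exp(-3 + 0.02 * (80 - X["driver_age"]) + 0.005 * X["bonus_malus"])
claims = rng.poisson(true_rate)

Xd = sm.add_constant((X - X.mean()) / X.std())   # standardise so one penalty fits all

# Step 1: L1-penalized Poisson GLM (lasso) screens the candidate variables.
lasso = sm.GLM(claims, Xd, family=sm.families.Poisson()).fit_regularized(
    alpha=0.01, L1_wt=1.0
)
kept = [c for c, b in zip(Xd.columns, lasso.params) if c == "const" or abs(b) > 1e-6]

# Step 2: ordinary, interpretable Poisson GLM refit on the retained variables.
final = sm.GLM(claims, Xd[kept], family=sm.families.Poisson()).fit()
print("retained variables:", kept)
print(final.summary().tables[1])
```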

    Stability

    Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to "reasonable" perturbations to data and to the model used. Jackknife, bootstrap, and cross-validation are based on perturbations to data, while robust statistics methods deal with perturbations to models. In this article, a case is made for the importance of stability in statistics. Firstly, we motivate the necessity of stability for interpretable and reliable encoding models from brain fMRI signals. Secondly, we find strong evidence in the literature to demonstrate the central role of stability in statistical inference, such as sensitivity analysis and effect detection. Thirdly, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then utilized in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across over 2,000 voxels. Last, a novel "stability" argument is seen to drive new results that shed light on the intriguing interactions between sample-to-sample variability and heavier-tailed error distributions (e.g., double-exponential) in high-dimensional regression models with p predictors and n independent samples. In particular, when p/n → κ ∈ (0.3, 1) and the error distribution is double-exponential, the Ordinary Least Squares (OLS) is a better estimator than the Least Absolute Deviation (LAD) estimator. Comment: Published at http://dx.doi.org/10.3150/13-BEJSP14 in the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
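    The following sketch conveys the spirit of the estimation-stability idea behind ES-CV: for each Lasso penalty, measure how much the fit varies from fold to fold and prefer penalties whose fits are stable. The exact ES-CV selector in the paper combines this statistic with the ordinary cross-validation choice; the criterion and data below are illustrative only.

```python
# Rough, illustrative estimation-stability statistic for the Lasso: refit on
# CV folds, then measure fold-to-fold variation of the fitted values relative
# to their average. This is in the spirit of ES-CV, not the paper's exact rule.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n, p = 120, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                       # five truly active predictors
y = X @ beta + rng.normal(size=n)

alphas = np.logspace(-2, 0.5, 20)    # candidate penalty levels
kf = KFold(n_splits=5, shuffle=True, random_state=0)

es_scores = []
for a in alphas:
    fold_fits = []
    for train_idx, _ in kf.split(X):
        model = Lasso(alpha=a).fit(X[train_idx], y[train_idx])
        fold_fits.append(X @ model.coef_ + model.intercept_)   # fit evaluated on all of X
    fold_fits = np.array(fold_fits)
    mean_fit = fold_fits.mean(axis=0)
    # Stability statistic: average squared deviation of each fold's fit from
    # the mean fit, normalised by the size of the mean fit.
    es = np.mean(np.sum((fold_fits - mean_fit) ** 2, axis=1)) / np.sum(mean_fit ** 2)
    es_scores.append(es)

best_alpha = alphas[int(np.argmin(es_scores))]
print(f"most stable penalty among the candidates: alpha = {best_alpha:.3f}")
```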