
    Robust rank correlation based screening

    Independence screening is a variable selection method that uses a ranking criterion to select significant variables, particularly for statistical models with nonpolynomial dimensionality or "large p, small n" paradigms, where p can be as large as an exponential of the sample size n. In this paper we propose a robust rank correlation screening (RRCS) method to deal with ultra-high dimensional data. The new procedure is based on the Kendall τ correlation coefficient between the response and predictor variables rather than the Pearson correlation used by existing methods. The new method has four desirable features compared with existing independence screening methods. First, the sure independence screening property holds under only the existence of a second moment of the predictor variables, rather than exponential tails or the like, even when the number of predictor variables grows exponentially with the sample size. Second, it can handle semiparametric models such as transformation regression models and single-index models with a monotonicity constraint on the link function, without involving nonparametric estimation even when the models contain nonparametric functions. Third, the procedure is robust against outliers and influential points in the observations. Last, the use of indicator functions in rank correlation screening greatly simplifies the theoretical derivation, owing to the boundedness of the resulting statistics, compared with previous studies on variable screening. Simulations are carried out for comparison with existing methods, and a real data example is analyzed. Comment: Published at http://dx.doi.org/10.1214/12-AOS1024 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org). arXiv admin note: text overlap with arXiv:0903.525
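
    As an illustration of the screening step described above, the sketch below ranks predictors by their absolute Kendall τ correlation with the response and keeps the top d of them. It is a minimal Python sketch, not the authors' implementation; the function name, the toy data, and the choice d ≈ n/log(n) are assumptions made for the example.

        import numpy as np
        from scipy.stats import kendalltau

        def rank_correlation_screening(X, y, d=None):
            """Keep the d predictors with the largest |Kendall tau| against y."""
            n, p = X.shape
            if d is None:
                d = int(n / np.log(n))                 # a common screening size
            taus = np.array([kendalltau(X[:, j], y)[0] for j in range(p)])
            keep = np.argsort(-np.abs(taus))[:d]       # indices of the strongest predictors
            return keep, taus

        # toy example with p >> n: only the first three predictors matter
        rng = np.random.default_rng(0)
        n, p = 100, 2000
        X = rng.standard_t(df=3, size=(n, p))          # heavy-tailed predictors
        y = X[:, 0] + 0.8 * X[:, 1] - X[:, 2] + rng.standard_normal(n)
        selected, _ = rank_correlation_screening(X, y)
        print(sorted(selected[:10]))

    In the screening literature, the retained set would then typically be fitted with a lower-dimensional method.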

    Impact of SST Anomalies on Coral Reefs Damage Based on Copula Analysis

    The condition of coral reefs in Indonesia is alarming. One of the factors influencing coral reef damage is extreme climate change. The aim of this study is to determine the relationship between climate change, represented by the Sea Surface Temperature (SST) anomaly index, and coral reef damage in the West, Central, and East Regions of Indonesia. The method used in this study is Copula analysis. A copula is a statistical method for determining the relationship between two or more variables whose distributions need not be normal. First, the data are transformed to the Uniform [0,1] domain. Then, the Copula parameter is estimated and tested for significance. Lastly, the Copula with the highest log-likelihood value is selected to represent the relationship in the data. The results indicate that the percentage of coral reef damage in the West and Central Regions is related to the SST Nino 4 anomaly, while coral reef damage in the East Region is not related to any of the SST Nino anomalies. In the West Region, the Copula that best represents the relationship is the Gaussian Copula (parameter = -0.32), which implies that the higher the SST Nino 4 anomaly, the lower the percentage of coral reef damage, and vice versa. In Central Indonesia, the Frank Copula (parameter = -4.89) is selected; it has no tail dependence, so the SST Nino 4 anomaly and the percentage of damaged coral reefs in the Central Region have low correlation.
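
    The three steps named in the abstract (transform to Uniform [0,1], estimate the copula parameter, compare log-likelihoods) can be sketched for the Gaussian copula alone, as below. This is a hand-rolled illustration, not the study's code: the input arrays are synthetic stand-ins for the SST Nino 4 index and the damage percentages, and selecting the best family would repeat the log-likelihood computation with the Frank and other copula densities.

        import numpy as np
        from scipy.stats import norm, rankdata, multivariate_normal

        def to_uniform(x):
            """Empirical probability-integral transform to the (0, 1) domain."""
            return rankdata(x) / (len(x) + 1)

        def fit_gaussian_copula(x, y):
            """Estimate the Gaussian copula parameter and its log-likelihood."""
            zu, zv = norm.ppf(to_uniform(x)), norm.ppf(to_uniform(y))   # normal scores
            rho = np.corrcoef(zu, zv)[0, 1]                             # copula parameter estimate
            cov = np.array([[1.0, rho], [rho, 1.0]])
            z = np.column_stack([zu, zv])
            # copula log-density = joint normal log-density minus the two marginal ones
            loglik = (multivariate_normal.logpdf(z, mean=[0.0, 0.0], cov=cov)
                      - norm.logpdf(zu) - norm.logpdf(zv)).sum()
            return rho, loglik

        # synthetic stand-ins for the SST Nino 4 index and the damage percentage
        rng = np.random.default_rng(1)
        sst = rng.normal(size=60)
        damage = -0.3 * sst + rng.normal(size=60)
        print(fit_gaussian_copula(sst, damage))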

    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and how these features affect statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
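
    The spurious-correlation point can be made concrete with a short simulation, shown below: even when all predictors are independent of the response, the largest absolute sample correlation grows with the dimensionality. The sample size and dimensions are arbitrary choices for the illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 50                                         # small, fixed sample size
        y = rng.standard_normal(n)
        for p in (10, 1_000, 100_000):
            X = rng.standard_normal((n, p))            # predictors independent of y
            Xc = (X - X.mean(0)) / X.std(0)            # column-standardised predictors
            yc = (y - y.mean()) / y.std()
            corr = Xc.T @ yc / n                       # correlation of each column with y
            print(f"p = {p:>6}: max |corr| = {np.abs(corr).max():.2f}")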

    Copula models for epidemiological research and practice

    Investigating associations between random variables (rvs) is one of many topics at the heart of statistical science. Graphical displays show emerging patterns between rvs, and the strength of their association is conventionally quantified via correlation coefficients. When two or more of these rvs are thought of as outcomes, their association is governed by a joint probability distribution function (pdf). When the joint pdf is bivariate normal, scalar correlation coefficients will produce a satisfactory summary of the association; otherwise, alternative measures are needed. Local dependence functions, together with their corresponding graphical displays, quantify and show how the strength of the association varies across the span of the data. Additionally, the multivariate distribution function can be explicitly formulated and explored. Copulas model joint distributions of varying shapes by combining the separate (univariate) marginal cumulative distribution functions of each rv under a specified correlation structure. Copula models can be used to analyse complex relationships and incorporate covariates into their parameters. Therefore, they offer increased flexibility in modelling dependence between rvs. Copula models may also be used to construct bivariate analogues of centiles, an application for which few references are available in the literature, though it is of particular interest for many paediatric applications. Population centiles are widely used to highlight children or adults who have unusual univariate outcomes. Whilst the methodology for the construction of univariate centiles is well established, there has been very little work in the area of bivariate analogues of centiles, where two outcomes are jointly considered. Conditional models can increase the efficiency of centile analogues in the detection of individuals who require some form of intervention. Such adjustments can be readily incorporated into the modelling of the marginal distributions and of the dependence parameter within the copula model.
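
    A minimal sketch of the copula construction discussed above is given below: two marginal cumulative distribution functions are combined through a Gaussian copula into a joint distribution function, from which joint (bivariate) centile-type quantities can be read off. The normal marginals, the correlation value, and the height/weight example are assumptions made purely for illustration.

        import numpy as np
        from scipy.stats import norm, multivariate_normal

        def gaussian_copula_cdf(u, v, rho):
            """C(u, v) = Phi_2(Phi^{-1}(u), Phi^{-1}(v); rho)."""
            z = np.column_stack([norm.ppf(u), norm.ppf(v)])
            return multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]]).cdf(z)

        def joint_cdf(x, y, mx, sx, my, sy, rho):
            """Joint distribution function built from two normal marginals and the copula."""
            u = norm.cdf(x, loc=mx, scale=sx)
            v = norm.cdf(y, loc=my, scale=sy)
            return gaussian_copula_cdf(u, v, rho)

        # probability that a child is jointly below the median height and median weight,
        # under assumed normal marginals and a copula correlation of 0.6
        print(joint_cdf(110.0, 19.0, mx=110.0, sx=5.0, my=19.0, sy=2.5, rho=0.6))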

    Dialect Use, Language Abilities, and Emergent Literacy Skills of Prekindergarten Children Who Speak African American English

    Purpose. The purpose of this study was to gain a better understanding of the complex relationship between spoken language and emergent literacy skills for children who speak African American English (AAE). Therefore, this study examined children's language proficiency, dialect use, and emergent literacy skills at the beginning of Head Start preschool and throughout the entire academic year. Methods. This study analyzed scores from a database of 120 preschool children who spoke AAE. Data included narrative retells of the wordless picture book Frog Where Are You? that were transcribed using Systematic Analysis of Language Transcript (SALT) software. Narrative retells were then coded for dialect density (DDM), the Narrative Scoring Scheme (NSS), and an adapted Subordination Index (SI) score that accounted for AAE morphosyntactic features. Additional measures included the Peabody Picture Vocabulary Test (PPVT) and two subtests of the Phonological Awareness Literacy Screening for Preschool (PALS-PreK) (i.e., print awareness and alphabet knowledge). Taken together, these measures were analyzed for potential relationships using correlation analyses, repeated measures analysis of variance (ANOVA), and multiple regression analyses. Results. Analysis revealed significant negative correlations between DDM, print awareness, PPVT, and NSS at the beginning of Head Start. However, a multiple regression analysis indicated that there was no unique relationship between DDM and print awareness scores. Upon examining growth across the academic year, children demonstrated significant gains in their NSS and emergent literacy scores when comparing fall and spring performance; however, changes in dialect were not related to changes in NSS scores, and emergent literacy gains were again shown not to be exclusively related to dialect. Overall, NSS scores were the strongest predictors of emergent literacy measures across analyses, indicating that any relationship between dialect use and emergent literacy skills was fully explained by the children's oral language skills alone. Conclusions. Because dialect use did not uniquely predict language or emergent literacy skills, we concluded that, at this early stage in literacy development, dialect use is more of an independent factor. This adds to the work of Terry and Connor (2012), who found dialect use to be independent of word reading, receptive vocabulary abilities, and phonological awareness skills. These findings will help clinicians working with diverse speakers better understand the relationship between dialect use, language skills, and emergent literacy abilities, as well as better support children's literacy development at this crucial early stage. Due to small sample sizes and the inclusion of only two dimensions of emergent literacy skills, caution should be used when generalizing and interpreting the findings.
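
    The unique-contribution check described in the results could, in principle, look like the regression comparison below. This is a hypothetical sketch, not the study's analysis script: the file name and column names (print_awareness, ddm, nss, ppvt) are invented for illustration.

        import pandas as pd
        import statsmodels.formula.api as smf

        # hypothetical data: one row per child, fall scores only
        # columns: print_awareness, ddm (dialect density), nss (narrative), ppvt (vocabulary)
        df = pd.read_csv("fall_scores.csv")

        # does dialect density explain print awareness beyond the oral-language measures?
        full = smf.ols("print_awareness ~ ddm + nss + ppvt", data=df).fit()
        reduced = smf.ols("print_awareness ~ nss + ppvt", data=df).fit()
        print("R^2 change when adding DDM:", full.rsquared - reduced.rsquared)
        print(full.compare_f_test(reduced))            # F statistic, p-value, df difference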

    Non Stationarity and Market Structure Dynamics in Financial Time Series

    This thesis is an investigation of the time-changing nature of financial markets. Financial markets are complex systems with an intrinsic structure defined by the interplay of several variables. The technological advancements of the 'digital age' have exponentially increased the amount of data available to financial researchers and industry professionals over the last decade and, as a consequence, have highlighted the key role of interactions amongst variables. A critical characteristic of the financial system, however, is its time-changing nature: the multivariate structure of the system changes and evolves through time. This feature is critically relevant for classical statistical assumptions and has proven challenging to investigate and research. This thesis is devoted to the investigation of this property, providing evidence on the time-changing nature of the system, analysing the implications for traditional asset allocation practices, and proposing a novel methodology to identify and predict 'market states'. First, I analyse how classical model estimates are affected by time and what the consequences are for classical portfolio construction techniques. Focusing on elliptical models of daily returns, I present experiments on both the in-sample and out-of-sample likelihood of individual observations and show that the system changes significantly through time. Larger estimation windows lead to stable likelihood in the long run, but at the cost of lower likelihood in the short term. A key implication of these findings is that the optimality of fit in finance needs to be defined in terms of the holding period. In this context, I also show that sparse models and information filtering help cope with the effects of non-stationarity, avoiding the typical pitfalls of conventional portfolio optimization approaches. Having assessed and documented the time-changing nature of the financial system, I propose a novel methodology, which we call ICC (Inverse Covariance Clustering), to segment financial time series into market states. The ICC methodology makes it possible to study the evolution of the multivariate structure of the system by segmenting the time series based on their correlation structure. In the ICC framework, market states are identified by a reference sparse precision matrix and a vector of expectation values. In the estimation procedure, each multivariate observation is associated with a market state according to the minimisation of a penalized distance measure (e.g. likelihood, Mahalanobis distance). The procedure is computationally very efficient and can be used with a large number of assets. Furthermore, the ICC methodology allows temporal consistency to be controlled for, making it of high practical relevance for trading systems. I present a set of experiments investigating the features of the discovered clusters and comparing them to standard clustering techniques. I show that the ICC methodology successfully clusters different states of the markets in an unsupervised manner, outperforming standard baseline models. Further, I show that the procedure can be used efficiently to forecast out-of-sample future market states with significant prediction accuracy. Lastly, I test the significance of increasing the number of states used to model equity returns and how this parameter relates to the number of observations and the temporal consistency of the states. I present experiments to investigate (a) the likelihood of the overall model as more states are spanned and (b) the relevance of additional regimes as measured by the number of observations clustered. I find that the number of "market states" that optimally defines the system increases with the time spanned and the number of observations considered.
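
    A heavily simplified sketch of the ICC idea is given below: each observation is assigned to the state that minimises a Gaussian negative log-likelihood plus a penalty for switching state between consecutive time points, and the state means and precision matrices are then re-estimated. The thesis' sparse precision estimation and exact temporal segmentation are replaced here by a plain pseudo-inverse covariance and a greedy forward pass; the number of states, the penalty weight, and the toy data are assumptions.

        import numpy as np

        def icc_like_clustering(X, K=2, gamma=1.0, n_iter=25):
            """Toy variant of Inverse Covariance Clustering with a temporal-switching penalty."""
            T, d = X.shape
            # crude but stable initialisation: contiguous blocks of time
            labels = np.repeat(np.arange(K), int(np.ceil(T / K)))[:T]
            for _ in range(n_iter):
                mus, precs, logdets = [], [], []
                for k in range(K):
                    members = X[labels == k]
                    if len(members) <= d:                      # degenerate state: use all data
                        members = X
                    mus.append(members.mean(0))
                    P = np.linalg.pinv(np.cov(members.T) + 1e-6 * np.eye(d))
                    precs.append(P)
                    logdets.append(np.linalg.slogdet(P)[1])
                new = np.empty(T, dtype=int)
                prev = labels[0]
                for t in range(T):
                    diff = [X[t] - mus[k] for k in range(K)]
                    score = np.array([0.5 * diff[k] @ precs[k] @ diff[k] - 0.5 * logdets[k]
                                      for k in range(K)])      # Gaussian negative log-likelihood
                    score += gamma * (np.arange(K) != prev)    # temporal-consistency penalty
                    new[t] = prev = score.argmin()
                if np.array_equal(new, labels):
                    break
                labels = new
            return labels

        # toy usage: 600 observations of 5 "returns" with a volatility regime change at t = 200
        rng = np.random.default_rng(1)
        X = np.vstack([rng.standard_normal((200, 5)), 3.0 * rng.standard_normal((400, 5))])
        print(np.bincount(icc_like_clustering(X, K=2)))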

    Large-Scale Nonparametric and Semiparametric Inference for Large, Complex, and Noisy Datasets

    Massive Data bring new opportunities and challenges to data scientists and statisticians. On the one hand, Massive Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the size and dimensionality of Massive Data introduce unique statistical challenges and consequences for model misspecification. Some important factors are as follows. Complexity: since Massive Data are often aggregated from multiple sources, they often exhibit heavy-tailed behavior with nontrivial tail dependence. Noise: Massive Data usually contain various types of measurement error, outliers, and missing values. Dependence: in many data types, such as financial time series, functional magnetic resonance imaging (fMRI), and time-course microarray data, the samples are dependent, with relatively weak signals. These challenges are difficult to address and require new computational and statistical tools. More specifically, handling these challenges requires statistical methods that are robust to data complexity, noise, and dependence. Our work aims to make headway in resolving these issues. Notably, we give a unified framework for analyzing high-dimensional, complex, noisy datasets with temporal/spatial dependence. The proposed methods enjoy good theoretical properties, and their empirical usefulness is verified in large-scale neuroimage and financial data analyses.

    Nonparametric Statistical Inference with an Emphasis on Information-Theoretic Methods

    This book addresses contemporary statistical inference issues when no or minimal assumptions are imposed on the nature of the studied phenomenon. Information-theoretic methods play an important role in such scenarios. The approaches discussed include various high-dimensional regression problems, time series, and dependence analyses.