147 research outputs found

    Growth Estimators and Confidence Intervals for the Mean of Negative Binomial Random Variables with Unknown Dispersion

    The Negative Binomial distribution becomes highly skewed under extreme dispersion. Even at moderately large sample sizes, the sample mean exhibits a heavy right tail. The standard Normal approximation often does not provide adequate inferences about the data's mean in this setting. In previous work, we examined alternative methods of generating confidence intervals for the expected value, based upon Gamma and Chi-Squared approximations or tail probability bounds such as Bernstein's Inequality. We now propose growth estimators of the Negative Binomial mean. Under high dispersion, zero values are likely to be overrepresented in the data. A growth estimator constructs a Normal-style confidence interval by effectively removing a small, predetermined number of zeros from the data. We propose growth estimators based upon multiplicative adjustments of the sample mean and upon direct removal of zeros from the sample. These methods do not require estimating the nuisance dispersion parameter. We demonstrate that the growth estimators' confidence intervals provide improved coverage over a wide range of parameter values and asymptotically converge to the sample mean. Interestingly, the proposed methods succeed despite adding both bias and variance to the Normal approximation.
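    The two adjustments described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact estimator: the function names, the choice of k, and the use of the plain sample standard error are all assumptions.

```python
import math

def normal_ci(data, z=1.96):
    # Standard Normal-approximation confidence interval for the mean.
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    half = z * math.sqrt(var / n)
    return (mean - half, mean + half)

def growth_ci_zero_removal(data, k=1, z=1.96):
    # Illustrative sketch: drop up to k zeros from the sample, then
    # form the usual Normal-style interval on the reduced data.
    reduced = list(data)
    removed = 0
    while removed < k and 0 in reduced:
        reduced.remove(0)
        removed += 1
    return normal_ci(reduced, z)

def growth_ci_multiplicative(data, k=1, z=1.96):
    # Illustrative multiplicative variant: inflate the sample mean by
    # n / (n - k), mimicking the removal of k zeros, while keeping the
    # original standard-error term.
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    half = z * math.sqrt(var / n)
    center = mean * n / (n - k)
    return (center - half, center + half)
```

    Both variants shift the interval upward relative to the plain Normal interval, counteracting the overrepresentation of zeros, without ever estimating the dispersion parameter.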

    Optimal Hashing in External Memory

    Hash tables are a ubiquitous class of dictionary data structures. However, standard hash table implementations do not translate well into the external memory model, because they do not incorporate locality for insertions. Iacono and Pătrașcu established an update/query tradeoff curve for external-memory hash tables: a hash table that performs insertions in O(lambda/B) amortized IOs requires Omega(log_lambda N) expected IOs for queries, where N is the number of items that can be stored in the data structure, B is the size of a memory transfer, M is the size of memory, and lambda is a tuning parameter. They provide a complicated hashing data structure, which we call the IP hash table, that meets this curve for lambda that is Omega(log log M + log_M N). In this paper, we present a simpler external-memory hash table, the Bundle of Arrays Hash Table (BOA), that is optimal for a narrower range of lambda. The simplicity of BOAs allows them to be readily modified to achieve the following results:
    - A new external-memory data structure, the Bundle of Trees Hash Table (BOT), that matches the performance of the IP hash table while retaining some of the simplicity of the BOAs.
    - The Cache-Oblivious Bundle of Trees Hash Table (COBOT), the first cache-oblivious hash table. This data structure matches the optimality of BOTs and IP hash tables over the same range of lambda.
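    The tradeoff curve itself is easy to evaluate numerically. The sketch below drops all constants and simply plots both sides of the curve for a given tuning parameter; the function name and parameter choices are illustrative, not from the paper.

```python
import math

def ip_tradeoff(N, B, lam):
    # Amortized insert cost O(lam / B) IOs against the matching query
    # lower bound Omega(log_lam N) IOs from the Iacono-Patrascu curve.
    # Constants are dropped; this is illustrative only.
    insert_ios = lam / B
    query_ios = math.log(N) / math.log(lam)
    return insert_ios, query_ios
```

    Raising lambda makes insertions more expensive but queries cheaper, which is exactly the knob the curve exposes.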

    Time-Dependent Performance Comparison of Stochastic Optimization Algorithms

    This paper proposes a statistical methodology for comparing the performance of stochastic optimization algorithms that iteratively generate candidate optima. The fundamental data structure of the results of these algorithms is a time series. Algorithmic differences may be assessed through a procedure of statistical sampling and multiple hypothesis testing of time series data. Shilane et al. propose a general framework for performance comparison of stochastic optimization algorithms that result in a single candidate optimum. This project seeks to extend that framework to assess performance in time series data structures. The proposed methodology analyzes empirical data to determine the generation intervals in which algorithmic performance differences exist and may be used to guide the selection and design of optimization procedures. Such comparisons may be drawn for general performance metrics of any iterative stochastic optimization algorithm under any (typically unknown) data generating distribution. Additionally, this paper proposes a data reduction procedure to estimate performance differences in a more computationally feasible manner. In doing so, we provide a statistical framework to assess the performance of stochastic optimization algorithms and to design improved procedures for the task at hand.
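    A minimal sketch of the per-generation testing idea, under assumptions of my own: a pooled two-sample bootstrap test of equal means at each generation, with a Bonferroni correction across generations. The function names and the choice of test statistic are illustrative, not taken from the paper.

```python
import random

def bootstrap_pvalue(a, b, n_boot=2000, seed=0):
    # Two-sample bootstrap test of equal means: resample from the
    # pooled data under the null hypothesis and compare against the
    # observed difference in sample means.
    rng = random.Random(seed)
    obs = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_boot):
        ra = [rng.choice(pooled) for _ in a]
        rb = [rng.choice(pooled) for _ in b]
        if abs(sum(ra) / len(ra) - sum(rb) / len(rb)) >= obs:
            count += 1
    return count / n_boot

def compare_over_time(series_a, series_b, alpha=0.05):
    # series_x[t] holds one best-so-far value per independent run at
    # generation t.  Bonferroni-correct across the T tested
    # generations and report where the algorithms differ.
    T = len(series_a)
    return [t for t in range(T)
            if bootstrap_pvalue(series_a[t], series_b[t]) < alpha / T]
```

    The returned indices are the generation intervals in which a performance difference is detected, which is the quantity the methodology above is after.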

    A General Framework for Statistical Performance Comparison of Evolutionary Computation Algorithms

    This paper proposes a statistical methodology for comparing the performance of evolutionary computation algorithms. A two-fold sampling scheme for collecting performance data is introduced, and these data are analyzed using bootstrap-based multiple hypothesis testing procedures. The proposed method is sufficiently flexible to allow the researcher to choose how performance is measured, does not rely upon distributional assumptions, and can be extended to analyze many other randomized numeric optimization routines. As a result, this approach offers a convenient, flexible, and reliable technique for comparing algorithms in a wide variety of applications.
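    To make the distribution-free flavor concrete, here is a percentile-bootstrap confidence interval for the difference in mean final performance between two algorithms. This is a generic sketch of the bootstrap idea, not the paper's two-fold scheme; the function name and inputs (one best-found value per independent run) are assumptions.

```python
import random

def bootstrap_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    # Percentile-bootstrap confidence interval for the difference in
    # mean final fitness between two algorithms.  a and b each hold
    # one best-found value per independent run.
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

    If the interval excludes zero, the two algorithms are declared different at the chosen level; no Normality assumption on the fitness values is needed.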

    Assert(!Defined(Sequential I/O))

    The term sequential I/O is widely used in systems research with the intuitive understanding that it means consecutive access. From a survey of the literature, though, this intuitive understanding has translated into numerous, inconsistent definitions. Since sequential I/O is such a fundamental concept in systems research, we believe that a sequentiality metric should allow us to compare access patterns in a meaningful way. We explore access properties that could be incorporated into potential metrics for sequential I/O, including access size, gaps between accesses, multi-stream behavior, and inter-arrival time. We then analyze hundreds of large-scale storage traces and discuss how the potential metrics compare. Interestingly, we find that I/O traces considered highly sequential by one metric can be highly random according to another. We further demonstrate that many plausible metrics are weakly correlated, though metrics weighted by size have more consistency. While there may not be a single metric for sequential I/O that is best in all cases, we believe systems researchers should more carefully consider, and state, which definition they use.
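    The disagreement between metrics is easy to reproduce. Below are two plausible sequentiality metrics of my own construction (count-based versus size-weighted) applied to a trace of (offset, size) accesses; a trace of many small random accesses plus one large consecutive one scores low on the first metric and high on the second.

```python
def fraction_consecutive(accesses):
    # Count-based metric: fraction of accesses whose offset starts
    # exactly where the previous access ended (zero gap).
    hits = sum(1 for prev, cur in zip(accesses, accesses[1:])
               if cur[0] == prev[0] + prev[1])
    return hits / (len(accesses) - 1)

def bytes_consecutive(accesses):
    # Size-weighted metric: fraction of transferred bytes belonging
    # to accesses that continue the previous access.
    total = sum(size for _, size in accesses[1:])
    seq = sum(cur[1] for prev, cur in zip(accesses, accesses[1:])
              if cur[0] == prev[0] + prev[1])
    return seq / total
```

    The same trace can thus look "mostly random" or "almost entirely sequential" depending on which definition is applied, which is exactly the inconsistency the paper surveys.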

    Multi-scale Salient Feature Extraction on Mesh Models

    We present a new method for extracting multi-scale salient features on meshes, based on robust estimation of curvature at multiple scales. The correspondence between a salient feature and its scale of interest can be established straightforwardly: detailed features appear at small scales, while features carrying more global shape information show up at large scales. We demonstrate that this multi-scale description of features accords with human perception and can further be used in several applications, such as feature classification and viewpoint selection. Experiments show that our method is a very helpful multi-scale analysis tool for studying 3D shapes.
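    As a rough 1D analogue of the idea (the paper works on 3D meshes; this simplification, the turning-angle curvature proxy, and all names below are my own assumptions), one can estimate curvature on a polyline at several neighborhood scales and flag vertices whose curvature exceeds a threshold at each scale:

```python
import math

def turning_angle(points, i, scale):
    # Discrete curvature proxy at vertex i: the turning angle between
    # the segments reaching `scale` vertices back and forward.
    (x0, y0), (x1, y1), (x2, y2) = (points[i - scale], points[i],
                                    points[i + scale])
    a = math.atan2(y1 - y0, x1 - x0)
    b = math.atan2(y2 - y1, x2 - x1)
    d = b - a
    while d > math.pi:
        d -= 2 * math.pi
    while d < -math.pi:
        d += 2 * math.pi
    return abs(d)

def salient_vertices(points, scale, thresh=0.5):
    # Vertices whose turning angle at the given scale exceeds the
    # threshold: small scales respond to fine detail, large scales to
    # more global shape structure.
    return [i for i in range(scale, len(points) - scale)
            if turning_angle(points, i, scale) > thresh]
```

    On a mesh, the turning angle would be replaced by a robust surface-curvature estimate, but the scale parameter plays the same role.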

    A survey and classification of storage deduplication systems

    The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. Thus, it has been applied to different storage types, including archives and backups, primary storage, within solid state disks, and even to random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, thus underestimating the relevance of new research and development. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed.
    This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the Fundação para a Ciência e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156 and by FCT PhD scholarship SFRH-BD-71372-2010.
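    One concrete point in the design space above (granularity: fixed-size chunks; indexing: by cryptographic hash; timing: inline) can be sketched in a few lines. This is a toy illustration, not any surveyed system; the function names and chunk size are assumptions.

```python
import hashlib

def dedup_fixed(data, chunk_size=8):
    # Fixed-size, hash-indexed, inline deduplication: split the input
    # into chunks, index each chunk by its SHA-256 digest, and store
    # only the first copy of each distinct chunk.
    index = {}    # digest -> chunk bytes (the chunk store)
    recipe = []   # sequence of digests needed to rebuild the input
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        index.setdefault(digest, chunk)
        recipe.append(digest)
    return index, recipe

def restore(index, recipe):
    # Rebuild the original byte stream from the chunk store.
    return b"".join(index[d] for d in recipe)
```

    Varying the six classification criteria (e.g. content-defined instead of fixed chunking, or offline instead of inline timing) changes each step of this loop, which is why the trade-offs differ so much across storage types.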