Fast Computation of Kernel Estimators
The computational complexity of evaluating the kernel density estimate (or its derivatives) at m evaluation points given n sample points scales as O(nm), making it prohibitively expensive for large datasets. While approximate methods like binning can speed up the computation, they lack precise control over the accuracy of the approximation: there is no straightforward way of choosing the binning parameters a priori to achieve a desired approximation error. We propose a novel, computationally efficient ε-exact approximation algorithm for univariate Gaussian kernel-based density derivative estimation that reduces the computational complexity from O(nm) to linear O(n+m). The user specifies a desired accuracy ε, and the algorithm guarantees that the actual error between the approximation and the original kernel estimate is always less than ε. We also apply the proposed fast algorithm to speed up automatic bandwidth selection procedures. We compare our method to the best available binning methods in terms of speed and accuracy. Our experimental results show that the proposed method is almost twice as fast as the best binning methods and around five orders of magnitude more accurate. The software for the proposed method is available online.
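The speedup in ε-exact methods of this kind typically comes from factorizing the Gaussian kernel with a truncated series expansion about cluster centres, so each source point and each evaluation point is touched only O(p) times for a truncation order p. The sketch below illustrates that idea; it is not the authors' released implementation, and the grid-based clustering, the fixed truncation order p, the cutoff radius, and the name fast_gauss_sum are illustrative assumptions (a genuinely ε-exact method derives p and the radii from the requested ε).

```python
import numpy as np
from math import factorial

def fast_gauss_sum(x, q, y, h, p=12, cutoff=6.0):
    """Approximate G[j] = sum_i q[i] * exp(-((y[j] - x[i]) / h) ** 2).

    Sketch of the series factorization used by fast Gauss transforms:
    with u = (x - c) / h and v = (y - c) / h for a cluster centre c,

        exp(-(u - v)**2) = exp(-u**2) * exp(-v**2) * sum_k (2*u*v)**k / k!,

    so per-cluster moments cost O(n*p) and evaluation costs O(m*p) per
    nearby cluster.  Truncation order p and the cutoff radius control
    the accuracy; here they are fixed by hand, not derived from eps.
    """
    lo = min(x.min(), y.min())
    cx = np.floor((x - lo) / h).astype(int)      # grid cluster of each source
    ncl = cx.max() + 1
    centers = lo + (np.arange(ncl) + 0.5) * h    # centres, so |u| <= 1/2

    u = (x - centers[cx]) / h
    w = q * np.exp(-u ** 2)
    B = np.zeros((ncl, p))                       # per-cluster moments
    for k in range(p):
        np.add.at(B[:, k], cx, w * (2.0 ** k / factorial(k)) * u ** k)

    G = np.zeros(len(y))
    for c in range(ncl):
        v = (y - centers[c]) / h
        near = np.abs(v) < cutoff                # farther clusters contribute ~0
        acc = np.zeros(near.sum())
        vk = np.ones(near.sum())
        for k in range(p):                       # accumulate sum_k B[c,k] * v**k
            acc += B[c, k] * vk
            vk *= v[near]
        G[near] += np.exp(-v[near] ** 2) * acc
    return G

# Quick check against the direct O(n*m) sum:
rng = np.random.default_rng(0)
x, q = rng.normal(size=2000), np.full(2000, 1.0 / 2000)
y, h = np.linspace(-4.0, 4.0, 500), 0.3
exact = (q * np.exp(-((y[:, None] - x[None, :]) / h) ** 2)).sum(axis=1)
print(np.abs(fast_gauss_sum(x, q, y, h) - exact).max())  # small; shrinks as p grows
```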
Fast weighted summation of erfc functions
Abstract: Direct computation of the weighted sum of N complementary error functions at M points scales as O(MN). We present an O(M+N) ε-exact approximation algorithm to compute the same sum.
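For concreteness, the direct quadratic-cost computation the abstract refers to fits in a few lines; a minimal sketch, where the scale parameter h, the exact argument of erfc, and the name direct_erfc_sum are assumptions, since the abstract does not spell the sum out:

```python
import numpy as np
from scipy.special import erfc

def direct_erfc_sum(x, q, y, h=1.0):
    """Direct O(M*N) evaluation of E[j] = sum_i q[i] * erfc((y[j] - x[i]) / h).

    This is the quadratic baseline; the paper's O(M+N) eps-exact
    algorithm approximates the same sums to a user-specified accuracy.
    """
    return (q[None, :] * erfc((y[:, None] - x[None, :]) / h)).sum(axis=1)
```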
Efficient Kriging via Fast Matrix-Vector Products
Interpolating scattered data points is a problem of wide-ranging interest. Ordinary kriging is an optimal scattered data estimator, widely used in geosciences and remote sensing. A generalized version of this technique, called cokriging, can be used for image fusion of remotely sensed data. However, it is computationally very expensive for large data sets. We demonstrate the time efficiency and accuracy of approximating ordinary kriging through the use of fast matrix-vector products combined with iterative methods. Our implementations of the fast matrix-vector products are based on the fast multipole method and nearest-neighbor searching techniques.
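The connection between kriging and fast matrix-vector products is that an iterative solver only ever touches the dense covariance matrix through products with it. A minimal 1-D sketch using SciPy's MINRES; the Gaussian covariance model, length scale, nugget, and the name ordinary_kriging are illustrative assumptions, and the dense `K @ v` inside the matvec is precisely the step a fast multipole-type method would replace:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def ordinary_kriging(xs, zs, x0, length=1.0, nugget=1e-8):
    """Predict z(x0) by ordinary kriging, solved iteratively.

    The augmented system  [[K, 1], [1^T, 0]] [w; mu] = [k0; 1]  is
    symmetric but indefinite, so MINRES is used.  Its only expensive
    operation is the product with the covariance matrix K.
    """
    n = len(xs)

    def cov(a, b):                               # Gaussian covariance (assumed model)
        return np.exp(-(((a[:, None] - b[None, :]) / length) ** 2))

    K = cov(xs, xs) + nugget * np.eye(n)         # nugget for numerical stability

    def matvec(v):                               # [[K, 1], [1^T, 0]] @ v
        return np.concatenate([K @ v[:n] + v[n], [v[:n].sum()]])

    A = LinearOperator((n + 1, n + 1), matvec=matvec)
    rhs = np.concatenate([cov(np.atleast_1d(x0), xs).ravel(), [1.0]])
    sol, info = minres(A, rhs)
    w = sol[:n]                                  # kriging weights, sum to 1
    return float(w @ zs)

xs = np.array([0.0, 1.0, 2.5, 3.0])
zs = np.array([1.0, 2.0, 0.5, 1.5])
print(ordinary_kriging(xs, zs, 1.7))             # interpolated value near x = 1.7
```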
Extracting significant features from the HRTF
Proceedings of the 9th International Conference on Auditory Display (ICAD), Boston, MA, July 7-9, 2003.
The Head-Related Transfer Function (HRTF) characterizes the auditory cues created by scattering of sound off a person's anatomy. While it is known that features in the HRTF can be associated with various phenomena, such as head diffraction, head and torso reflection, knee reflection, and pinna resonances and antiresonances, identification of these phenomena is usually qualitative and/or heuristic. The objective of this paper is to decompose the HRTF and extract significant features that are perceptually important for source localization. Some of the significant features that have been identified are the pinna resonances and the notches in the spectrum caused by various parts of the body. We develop signal processing algorithms to decompose the HRTF into components and extract the features corresponding to each component. The support of NSF award ITR-0086075 is gratefully acknowledged.
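Of the features mentioned, spectral notches are the most directly computable: they appear as sharp local minima in the HRTF magnitude response. A minimal sketch of locating them, assuming a measured head-related impulse response is at hand; the function name, the FFT size, the 6 dB depth threshold, and the synthetic test notch are all illustrative assumptions, and this is not the paper's decomposition algorithm:

```python
import numpy as np
from scipy.signal import find_peaks, iirnotch, lfilter

def hrtf_notch_frequencies(hrir, fs, min_depth_db=6.0, nfft=4096):
    """Locate spectral notches in an HRTF magnitude response.

    hrir : 1-D head-related impulse response;  fs : sample rate (Hz).
    Notches are detected as prominent peaks of the negated
    log-magnitude spectrum -- a crude illustration of one feature the
    paper discusses, not its signal-decomposition approach.
    """
    H = np.fft.rfft(hrir, n=nfft)
    mag_db = 20.0 * np.log10(np.abs(H) + 1e-12)
    idx, _ = find_peaks(-mag_db, prominence=min_depth_db)
    return np.fft.rfftfreq(nfft, d=1.0 / fs)[idx]   # notch frequencies in Hz

# Sanity check on a synthetic "HRIR": an impulse passed through an 8 kHz notch.
fs = 44100.0
b, a = iirnotch(w0=8000.0, Q=10.0, fs=fs)
hrir = lfilter(b, a, np.r_[1.0, np.zeros(255)])
print(hrtf_notch_frequencies(hrir, fs))             # approximately [8000.]
```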
Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?
Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing (specifically, papers from arXiv and traditional publications performing an ML classification task on Twitter data) gives specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and whether the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally important question of whether such data is reliable in the first place.
Comment: 18 pages, includes appendix