Rapid Sampling for Visualizations with Ordering Guarantees
Visualizations are frequently used as a means to understand trends and gather
insights from datasets, but often take a long time to generate. In this paper,
we focus on the problem of rapidly generating approximate visualizations while
preserving crucial visual properties of interest to analysts. Our primary
focus will be on sampling algorithms that preserve the visual property of
ordering; our techniques will also apply to some other visual properties. For
instance, our algorithms can be used to generate an approximate visualization
of a bar chart very rapidly, where the comparisons between any two bars are
correct. We formally show that our sampling algorithms are generally applicable
and provably optimal in theory, in that they do not take more samples than
necessary to generate the visualizations with ordering guarantees. They also
work well in practice, correctly ordering output groups while taking orders of
magnitude fewer samples and much less time than conventional sampling schemes.
Comment: Tech Report. 17 pages. Condensed version to appear in VLDB Vol. 8 No.
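To make the idea concrete, here is a minimal sketch in the spirit of the abstract, not the paper's actual algorithm: draw samples from each group, maintain Hoeffding confidence intervals around the running means, and stop once all intervals are pairwise disjoint, so the estimated ordering of the bars is correct with high probability. The function names, batch size, and union-bound confidence split are illustrative assumptions.

```python
import math
import random

def order_with_guarantees(groups, delta=0.05, batch=100, value_range=1.0):
    """Sample each group until Hoeffding confidence intervals are pairwise
    disjoint, so the estimated ordering of group means is correct with
    probability >= 1 - delta. `groups` maps a name to a zero-argument
    sampler returning values in [0, value_range]. Illustrative sketch."""
    stats = {g: [0.0, 0] for g in groups}  # running sum and sample count
    while True:
        for g, sampler in groups.items():
            s = stats[g]
            for _ in range(batch):
                s[0] += sampler()
            s[1] += batch
        # Hoeffding half-width with a crude union bound over the groups.
        ivals = {}
        for g, (tot, n) in stats.items():
            eps = value_range * math.sqrt(
                math.log(2 * len(groups) / delta) / (2 * n))
            ivals[g] = (tot / n - eps, tot / n + eps)
        names = sorted(ivals, key=lambda g: ivals[g][0])
        # Stop when consecutive intervals no longer overlap.
        if all(ivals[a][1] < ivals[b][0] for a, b in zip(names, names[1:])):
            return [(g, stats[g][0] / stats[g][1]) for g in names]

# Example: three "bars" whose true means are 0.3, 0.5, and 0.7.
bars = {b: (lambda m: lambda: random.uniform(0, 2 * m))(m)
        for b, m in [("A", 0.3), ("B", 0.5), ("C", 0.7)]}
print(order_with_guarantees(bars, value_range=1.4))
```

The loop typically terminates after a few hundred samples per group, far fewer than a full scan, which is the qualitative behavior the abstract claims.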
Random effects compound Poisson model to represent data with extra zeros
This paper describes a compound Poisson-based random effects structure for
modeling zero-inflated data. Data with a large proportion of zeros are found in
many fields of applied statistics, for example in ecology when trying to model
and predict species counts (discrete data) or abundance distributions
(continuous data). Standard methods for modeling such data include mixture and
two-part conditional models. In contrast to these methods, the stochastic models
proposed here behave coherently with regard to a change of scale, since they
mimic the harvesting of a marked Poisson process in the modeling steps. Random
effects are used to account for inhomogeneity. In this paper, model design and
inference both rely on conditional thinking to understand the links between
various layers of quantities: parameters, latent variables including random
effects and zero-inflated observations. The potential of these parsimonious
hierarchical models for zero-inflated data is exemplified using two marine
macroinvertebrate abundance datasets from a large scale scientific bottom-trawl
survey. The EM algorithm with a Monte Carlo step based on importance sampling
is checked for this model structure on a simulated dataset: it proves to work
well for parameter estimation but parameter values matter when re-assessing the
actual coverage level of the confidence regions far from the asymptotic
conditions.
Comment: 4
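A minimal simulation sketch of the kind of hierarchical structure described, assuming a lognormal random effect on the Poisson intensity and Gamma-distributed marks; all parameter names and values are illustrative, not the paper's fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_site(lam=2.0, shape=1.5, scale=0.8, sigma=1.0, n_hauls=50):
    """Simulate zero-inflated continuous abundances from a compound Poisson
    with a lognormal site random effect: a haul catches
    N ~ Poisson(lam * exp(b)) individuals with Gamma-distributed weights,
    so Y = 0 exactly when N = 0. Parameter values are illustrative."""
    b = rng.normal(0.0, sigma)                       # site random effect
    n = rng.poisson(lam * np.exp(b), size=n_hauls)   # marked-Poisson counts
    y = np.array([rng.gamma(shape, scale, k).sum() if k else 0.0 for k in n])
    return y

y = np.concatenate([simulate_site() for _ in range(20)])
print(f"proportion of exact zeros: {np.mean(y == 0):.2f}")
```

Because zeros arise only through N = 0, rescaling the sampling effort (e.g. halving lam) changes the zero proportion coherently, which is the change-of-scale behavior the abstract emphasizes.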
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
What are the limits of automated Twitter sentiment classification? We analyze
a large set of manually labeled tweets in different languages, use them as
training data, and construct automated classification models. It turns out that
the quality of classification models depends much more on the quality and size
of training data than on the type of the model trained. Experimental results
indicate that there is no statistically significant difference between the
performance of the top classification models. We quantify the quality of
training data by applying various annotator agreement measures, and identify
the weakest points of different datasets. We show that the model performance
approaches the inter-annotator agreement when the size of the training set is
sufficiently large. However, it is crucial to regularly monitor the self- and
inter-annotator agreements, since doing so improves the training datasets and,
consequently, the model performance. Finally, we show that there is strong
evidence that humans perceive the sentiment classes (negative, neutral, and
positive) as ordered.
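As a concrete illustration of one such agreement measure, the sketch below computes a linearly weighted Cohen's kappa between two annotators; the weighting reflects the ordered reading of the sentiment classes, so confusing negative with positive costs more than confusing negative with neutral. The label coding and data are made up for illustration, and the paper may rely on different measures.

```python
import numpy as np

def weighted_kappa(a, b, labels=(-1, 0, 1)):
    """Linearly weighted Cohen's kappa for two annotators over ordinal
    sentiment labels (negative, neutral, positive). Illustrative sketch."""
    k = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    obs = np.zeros((k, k))
    for x, y in zip(a, b):
        obs[idx[x], idx[y]] += 1
    obs /= obs.sum()
    exp = np.outer(obs.sum(1), obs.sum(0))   # chance agreement
    # Linear disagreement weights: 0 on the diagonal, 1 in the corners.
    w = np.abs(np.subtract.outer(np.arange(k), np.arange(k))) / (k - 1)
    return 1 - (w * obs).sum() / (w * exp).sum()

ann1 = [-1, 0, 1, 1, 0, -1, 0, 1]
ann2 = [-1, 0, 1, 0, 0, -1, 1, 1]
print(f"weighted kappa: {weighted_kappa(ann1, ann2):.3f}")
```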
Bridging SMT and TM with translation recommendation
We propose a translation recommendation framework to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM. We describe an implementation of this framework using an SVM binary classifier. We exploit methods to fine-tune the classifier and investigate a variety of features of different types. We rely on automatic MT evaluation
metrics to approximate human judgements in our experiments. Experimental results show that our system can achieve 0.85 precision at 0.89 recall, excluding exact matches. Furthermore, it is possible for the end-user to achieve a desired balance between precision and recall by adjusting confidence levels.
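A rough sketch of such a recommender, assuming simple numeric features and using a margin threshold on an SVM decision function to trade recall for precision; the features, synthetic data, and threshold values are illustrative stand-ins for the paper's feature set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

# Toy stand-in features per segment, e.g. a TM fuzzy-match score and an
# SMT model score; the label says whether the SMT output needed less
# post-editing than the TM hit. All names and data are illustrative.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (0.8 * X[:, 1] - 0.6 * X[:, 0] + rng.normal(0.3, 0.5, 500) > 0).astype(int)

clf = SVC(kernel="rbf").fit(X[:400], y[:400])
scores = clf.decision_function(X[400:])

# Raising the confidence threshold trades recall for precision:
# only recommend the SMT output when the classifier margin is large enough.
for thresh in (0.0, 0.5, 1.0):
    pred = (scores > thresh).astype(int)
    p = precision_score(y[400:], pred, zero_division=0)
    r = recall_score(y[400:], pred)
    print(f"threshold {thresh:.1f}: precision {p:.2f}, recall {r:.2f}")
```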
Computationally Efficient and Robust BIC-Based Speaker Segmentation
An algorithm for automatic speaker segmentation based on the Bayesian information criterion (BIC) is presented. BIC tests are not performed for every window shift, as in previous approaches, but only when a speaker change is most probable to occur. This is done by estimating the next probable change point using a model of utterance durations. The inverse Gaussian is found to fit the distribution of utterance durations best. As a result, fewer BIC tests are needed, making the proposed system less computationally demanding in time and memory, and considerably more efficient with respect to missed speaker change points. A feature selection algorithm based on a branch-and-bound search strategy is applied in order to identify the most efficient features for speaker segmentation. Furthermore, a new theoretical formulation of BIC is derived by applying centering and simultaneous diagonalization. This formulation is considerably more computationally efficient than the standard BIC when the covariance matrices are estimated by estimators other than the usual maximum-likelihood ones. Two commonly used pairs of figures of merit are employed and their relationship is established. Computational efficiency is achieved through the speaker utterance modeling, whereas robustness is achieved by feature selection and the application of BIC tests at appropriately selected time instants. Experimental results indicate that the proposed modifications yield superior performance compared to existing approaches.
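For reference, the classical delta-BIC test at a candidate change point compares one full-covariance Gaussian over a window against two Gaussians split at that point; a positive value supports a speaker change. The sketch below implements this standard criterion, not the paper's centered and simultaneously diagonalized reformulation, with an illustrative penalty weight and synthetic MFCC-like data.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Standard BIC change test at frame t: one Gaussian over the whole
    window versus two Gaussians split at t. Positive => likely change."""
    n, d = X.shape
    def logdet(Z):
        # Log-determinant of the (biased) sample covariance of Z.
        return np.linalg.slogdet(np.cov(Z, rowvar=False, bias=True))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(X)
            - 0.5 * t * logdet(X[:t])
            - 0.5 * (n - t) * logdet(X[t:])
            - penalty)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (200, 12)),   # speaker A frames
               rng.normal(0.8, 1.4, (200, 12))])  # speaker B frames
print(f"delta BIC at true change: {delta_bic(X, 200):.1f}")
print(f"delta BIC mid-segment:    {delta_bic(X, 100):.1f}")
```

The duration-modeling idea in the abstract amounts to calling such a test only at time instants an inverse-Gaussian duration model flags as probable change points, instead of at every window shift.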
Approximate selective inference via maximum likelihood
This article considers a conditional approach to selective inference via
approximate maximum likelihood for data described by Gaussian models. There are
two important considerations in adopting a post-selection inferential
perspective. While one of them concerns the effective use of information in
data, the other aspect deals with the computational cost of adjusting for
selection. Our approximate proposal serves both these purposes: (i) it exploits
randomness for the efficient use of leftover information from
selection; (ii) it enables us to bypass potentially expensive MCMC sampling from
conditional distributions. At the core of our method is the solution to a
convex optimization problem which assumes a separable form across multiple
selection queries. This allows us to address the problem of tractable and
efficient inference in many practical scenarios, where more than one learning
query is conducted to define and perhaps redefine models and their
corresponding parameters. Through an in-depth analysis, we illustrate the
potential of our proposal and provide extensive comparisons with other
post-selective schemes in both randomized and non-randomized paradigms of
inference.
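To illustrate the flavor of the adjustment in the simplest possible case, the sketch below computes an approximate selective MLE for a single Gaussian mean after randomized selection, replacing MCMC with a one-dimensional smooth optimization; this toy setup and all thresholds are assumptions for illustration, not the paper's multi-query formulation.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def selective_mle(x, c=1.5, tau=1.0):
    """Toy randomized setting: observe X ~ N(mu, 1), add noise
    omega ~ N(0, tau^2), and report X only when X + omega > c. The
    conditional log-likelihood subtracts the log selection probability,
    here maximized directly by a bounded 1-D optimizer instead of
    sampling from the conditional law. Purely illustrative."""
    s = np.sqrt(1 + tau**2)
    def neg_loglik(mu):
        # -log phi(x - mu) + log P_mu(selection), up to constants.
        return 0.5 * (x - mu) ** 2 + norm.logcdf((mu - c) / s)
    return minimize_scalar(neg_loglik, bounds=(-10, 10), method="bounded").x

# Naive estimate vs selection-adjusted estimate for a borderline value:
x = 1.8
print(f"naive: {x:.2f}, selective MLE: {selective_mle(x):.2f}")
```

The adjusted estimate is pulled below the naive one, correcting the upward bias that surviving the selection step induces; the paper's contribution is making this kind of adjustment tractable across many simultaneous selection queries.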