6,626 research outputs found
Accuracy-based scoring for phrase-based statistical machine translation
Although the scoring features of state-of-the-art Phrase-Based Statistical Machine Translation (PB-SMT) models are weighted so as to optimise an objective function measuring
translation quality, the estimation of the features
themselves does not have any relation to such quality metrics. In this paper, we introduce a translation quality-based feature to PBSMT in a bid to improve the translation quality of the system. Our feature is estimated by averaging
the edit-distance between phrase pairs involved in the translation of oracle sentences, chosen by automatic evaluation metrics from the N-best outputs of a baseline system, and phrase pairs occurring in the N-best list. Using
our method, we report a statistically significant 2.11% relative improvement in BLEU score for the WMT 2009 Spanish-to-English translation task. We also report that using our
method we can achieve statistically significant improvements over the baseline using many other MT evaluation metrics, and a substantial increase in speed and reduction in memory use (due to a reduction in phrase-table size of 87%) while maintaining significant gains in
translation quality
Identifying Patch Correctness in Test-Based Program Repair
Test-based automatic program repair has attracted a lot of attention in
recent years. However, the test suites in practice are often too weak to
guarantee correctness and existing approaches often generate a large number of
incorrect patches.
To reduce the number of incorrect patches generated, we propose a novel
approach that heuristically determines the correctness of the generated
patches. The core idea is to exploit the behavior similarity of test case
executions. The passing tests on original and patched programs are likely to
behave similarly while the failing tests on original and patched programs are
likely to behave differently. Also, if two tests exhibit similar runtime
behavior, the two tests are likely to have the same test results. Based on
these observations, we generate new test inputs to enhance the test suites and
use their behavior similarity to determine patch correctness.
Our approach is evaluated on a dataset consisting of 139 patches generated
from existing program repair systems including jGenProg, Nopol, jKali, ACS and
HDRepair. Our approach successfully prevented 56.3\% of the incorrect patches
to be generated, without blocking any correct patches.Comment: ICSE 201
Classifying the Correctness of Generated White-Box Tests: An Exploratory Study
White-box test generator tools rely only on the code under test to select
test inputs, and capture the implementation's output as assertions. If there is
a fault in the implementation, it could get encoded in the generated tests.
Tool evaluations usually measure fault-detection capability using the number of
such fault-encoding tests. However, these faults are only detected, if the
developer can recognize that the encoded behavior is faulty. We designed an
exploratory study to investigate how developers perform in classifying
generated white-box test as faulty or correct. We carried out the study in a
laboratory setting with 54 graduate students. The tests were generated for two
open-source projects with the help of the IntelliTest tool. The performance of
the participants were analyzed using binary classification metrics and by
coding their observed activities. The results showed that participants
incorrectly classified a large number of both fault-encoding and correct tests
(with median misclassification rate 33% and 25% respectively). Thus the real
fault-detection capability of test generators could be much lower than
typically reported, and we suggest to take this human factor into account when
evaluating generated white-box tests.Comment: 13 pages, 7 figure
Automatic Quality Estimation for ASR System Combination
Recognizer Output Voting Error Reduction (ROVER) has been widely used for
system combination in automatic speech recognition (ASR). In order to select
the most appropriate words to insert at each position in the output
transcriptions, some ROVER extensions rely on critical information such as
confidence scores and other ASR decoder features. This information, which is
not always available, highly depends on the decoding process and sometimes
tends to over estimate the real quality of the recognized words. In this paper
we propose a novel variant of ROVER that takes advantage of ASR quality
estimation (QE) for ranking the transcriptions at "segment level" instead of:
i) relying on confidence scores, or ii) feeding ROVER with randomly ordered
hypotheses. We first introduce an effective set of features to compensate for
the absence of ASR decoder information. Then, we apply QE techniques to perform
accurate hypothesis ranking at segment-level before starting the fusion
process. The evaluation is carried out on two different tasks, in which we
respectively combine hypotheses coming from independent ASR systems and
multi-microphone recordings. In both tasks, it is assumed that the ASR decoder
information is not available. The proposed approach significantly outperforms
standard ROVER and it is competitive with two strong oracles that e xploit
prior knowledge about the real quality of the hypotheses to be combined.
Compared to standard ROVER, the abs olute WER improvements in the two
evaluation scenarios range from 0.5% to 7.3%
Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization
Relative to the large literature on upper bounds on complexity of convex
optimization, lesser attention has been paid to the fundamental hardness of
these problems. Given the extensive use of convex optimization in machine
learning and statistics, gaining an understanding of these complexity-theoretic
issues is important. In this paper, we study the complexity of stochastic
convex optimization in an oracle model of computation. We improve upon known
results and obtain tight minimax complexity estimates for various function
classes
- …