Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation
Large pre-trained language models achieve impressive results across many
tasks. However, recent works point out that pre-trained language models may
memorize a considerable fraction of their training data, leading to the privacy
risk of information leakage. In this paper, we propose a method named Ethicist
for targeted training data extraction through loss smoothed soft prompting and
calibrated confidence estimation, investigating how to recover the suffix in
the training data when given a prefix. To elicit memorization in the attacked
model, we tune soft prompt embeddings while keeping the model fixed. We further
propose a smoothing loss that smooths the loss distribution of the suffix
tokens to make it easier to sample the correct suffix. In order to select the
most probable suffix from a collection of sampled suffixes and estimate the
prediction confidence, we propose a calibrated confidence estimation method,
which normalizes the confidence of the generated suffixes with a local
estimation. We show that Ethicist significantly improves the extraction
performance on a recently proposed public benchmark. We also investigate
several factors influencing the data extraction performance, including decoding
strategy, model scale, prefix length, and suffix length. Our code is available
at https://github.com/thu-coai/Targeted-Data-Extraction.
Comment: ACL 2023 Long Paper (Main Conference)
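The calibrated confidence step can be illustrated with a minimal sketch (a simplified illustration, not the paper's exact method; the function name `calibrated_confidence` and the toy log-probabilities are hypothetical): each sampled suffix's probability is normalized against a local estimate computed from the whole sample pool, so confidences are comparable across prefixes.

```python
import math

def calibrated_confidence(suffix_logprobs):
    """Normalize each sampled suffix's probability by a local estimate:
    the total probability mass of all sampled suffixes for this prefix."""
    probs = [math.exp(lp) for lp in suffix_logprobs]
    local = sum(probs)  # local estimation from the sample pool
    return [p / local for p in probs]

# Pick the most probable suffix and report its calibrated confidence.
logprobs = [-2.0, -0.5, -3.0]          # toy sequence log-likelihoods
conf = calibrated_confidence(logprobs)
best = max(range(len(conf)), key=conf.__getitem__)
```

Because the normalizer is computed per prefix, a suffix that merely has high absolute likelihood does not automatically receive high confidence; it must dominate its local competitors.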
Credible Set Estimation, Analysis, and Applications in Synthetic Aperture Radar Canonical Feature Extraction
Traditional estimation schemes such as Maximum A Posteriori (MAP) or Maximum Likelihood Estimation (MLE) determine the most likely parameter set associated with received signal data. However, traditional schemes do not retain the entire posterior distribution, provide no confidence information associated with the final solution, and often rely on simple sampling methods that induce significant errors. Traditional schemes also perform inadequately when applied to complex signals, which often yield multi-modal parameter sets. Credible Set Estimation (CSE) provides a powerful and flexible alternative: it accurately computes posterior distributions, retains confidence information, and provides a complete set of credible solutions. Determining a credible region becomes especially important in Synthetic Aperture Radar (SAR) Automated Target Recognition (ATR) problems, where signal complexity leads to multiple potential parameter sets. The presented research validates methods for CSE, extends them to high-dimension/large observation sets, incorporates Bayesian methods into previous work on SAR canonical feature extraction, and evaluates the CSE algorithm. The results in this thesis show that the CSE implementation of Gaussian-quadrature techniques reduces computational error of the posterior distribution by up to twelve orders of magnitude, that the presented formula for computing the posterior distribution enables numerical evaluation for large observation sets (greater than 7,300 observations), and that the algorithm is capable of producing M-dimensional parameter estimates when applied to SAR canonical features. As such, CSE provides an ideal estimation scheme for radar, communications, and other statistical problems where retaining the entire posterior distribution and associated confidence intervals is desirable.
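As a rough illustration of why quadrature helps here (a minimal sketch under assumed names, not the thesis's implementation; `quadrature_evidence`, the toy log-posterior, and the node count are all assumptions), Gauss-Legendre quadrature evaluates the posterior normalizing constant of a smooth integrand to near machine precision where naive sampling would leave significant error:

```python
import math
import numpy as np
from numpy.polynomial.legendre import leggauss

def quadrature_evidence(log_post, a, b, n=200):
    """Approximate Z = integral of exp(log_post(theta)) over [a, b]
    with n-node Gauss-Legendre quadrature."""
    nodes, weights = leggauss(n)                   # nodes/weights on [-1, 1]
    theta = 0.5 * (b - a) * nodes + 0.5 * (b + a)  # map nodes to [a, b]
    scale = 0.5 * (b - a)                          # Jacobian of the mapping
    return scale * np.sum(weights * np.exp(log_post(theta)))

# Unnormalized Gaussian posterior: the exact constant is sqrt(2*pi).
Z = quadrature_evidence(lambda t: -0.5 * t**2, -8.0, 8.0)
```

The same scheme retains the full (normalized) posterior rather than a single point estimate, which is the property the abstract emphasizes.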
Bio-inspired speed detection and discrimination
In the field of computer vision, a crucial task is the detection of motion
(also called optical flow extraction). This operation allows analysis such as
3D reconstruction, feature tracking, time-to-collision and novelty detection
among others. Most of the optical flow extraction techniques work within a
finite range of speeds. Usually, the range of detection is extended towards
higher speeds by combining some multiscale information in a serial
architecture. This serial multi-scale approach suffers from the problem of
error propagation related to the number of scales used in the algorithm. On the
other hand, biological experiments show that human motion perception seems to
follow a parallel multiscale scheme. In this work we present a bio-inspired
parallel architecture to perform detection of motion, providing a wide range of
operation and avoiding error propagation associated with the serial
architecture. To test our algorithm, we perform relative-error comparisons
between the classical and the proposed techniques, showing that the parallel
architecture achieves motion detection with results comparable to the
serial approach.
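The contrast between serial and parallel multiscale schemes can be shown with a toy 1-D shift estimator (a hypothetical sketch, not the paper's architecture; function names and the SSD matcher are assumptions): each scale runs an independent estimator, so an error at one scale never propagates to another, and fast motion outside the finest scale's range is still recovered at a coarser scale.

```python
def estimate_shift(a, b, max_shift):
    """Best integer shift s (b[i+s] matches a[i]) by minimizing mean SSD."""
    best_s, best_err = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        pairs = [(a[i], b[i + s]) for i in range(len(a)) if 0 <= i + s < len(a)]
        err = sum((x - y) ** 2 for x, y in pairs) / len(pairs)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

def parallel_multiscale_shift(a, b, scales=(1, 2, 4), max_shift=3):
    """Run an independent estimator at every scale (no serial refinement)
    and keep the scale whose match residual is lowest."""
    results = []
    for k in scales:
        s, err = estimate_shift(a[::k], b[::k], max_shift)
        results.append((err, k * s))  # rescale shift back to full resolution
    return min(results)[1]

# A pattern moving 4 samples per frame exceeds the finest scale's range
# (max_shift=3) but is recovered exactly at the coarser scales.
a = [0] * 24
a[3:8] = [1, 2, 3, 2, 1]
b = [0] * 4 + a[:20]   # same pattern shifted right by 4
speed = parallel_multiscale_shift(a, b)
```

In a serial cascade, the coarse estimate would seed the finer levels, so a coarse error contaminates every subsequent scale; here each scale's residual is judged independently.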
Controlling Risk of Web Question Answering
Web question answering (QA) has become an indispensable component in modern
search systems, which can significantly improve users' search experience by
providing a direct answer to users' information need. This could be achieved by
applying machine reading comprehension (MRC) models over the retrieved passages
to extract answers with respect to the search query. With the development of
deep learning techniques, state-of-the-art MRC performances have been achieved
by recent deep methods. However, existing studies on MRC seldom address the
predictive uncertainty issue, i.e., how likely the prediction of an MRC model
is wrong, leading to uncontrollable risks in real-world Web QA applications. In
this work, we first conduct an in-depth investigation of the risks of Web QA.
We then introduce a novel risk-control framework, which consists of a qualify
model for uncertainty estimation based on the probe idea, and a decision model
for selective output. For the evaluation of risk-aware Web QA, we introduce
risk-related metrics rather than the traditional MRC metrics of EM and F1.
The empirical results over both the real-world Web QA dataset and the academic
MRC benchmark collection demonstrate the effectiveness of our approach.
Comment: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
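The selective-output idea and its risk-related metrics can be sketched in a few lines (a hypothetical simplification: the paper's qualify and decision models are learned components, whereas here a fixed confidence threshold stands in for them). The system answers only when its estimated confidence clears the threshold, and is judged by coverage (fraction answered) and risk (error rate among answered) instead of EM/F1:

```python
def selective_qa(predictions, threshold):
    """Answer only when estimated confidence clears the threshold.
    Each prediction is (confidence, is_correct); returns (coverage, risk)."""
    answered = [(c, ok) for c, ok in predictions if c >= threshold]
    if not answered:
        return 0.0, 0.0           # nothing answered: zero coverage, zero risk
    coverage = len(answered) / len(predictions)
    risk = sum(1 for _, ok in answered if not ok) / len(answered)
    return coverage, risk

# Toy predictions: raising the threshold trades coverage for lower risk.
preds = [(0.95, True), (0.90, True), (0.60, False), (0.40, False), (0.85, True)]
coverage, risk = selective_qa(preds, threshold=0.8)
```

Sweeping the threshold traces a risk-coverage curve, which is the natural evaluation object for risk-aware QA.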
Estimating the maximum possible earthquake magnitude using extreme value methodology: the Groningen case
The area-characteristic, maximum possible earthquake magnitude is
required by the earthquake engineering community, disaster management agencies
and the insurance industry. The Gutenberg-Richter law predicts that earthquake
magnitudes follow a truncated exponential distribution. In the geophysical
literature several estimation procedures were proposed, see for instance Kijko
and Singh (Acta Geophys., 2011) and the references therein. Estimation of
this maximum magnitude is of course an extreme value problem to which the classical methods for
endpoint estimation could be applied. We argue that recent methods on truncated
tails at high levels (Beirlant et al., Extremes, 2016; Electron. J. Stat.,
2017) constitute a more appropriate setting for this estimation problem. We
present upper confidence bounds to quantify uncertainty of the point estimates.
We also compare methods from the extreme value and geophysical literature
through simulations. Finally, the different methods are applied to the
magnitude data for the earthquakes induced by gas extraction in the Groningen
province of the Netherlands.
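As a rough illustration of the setting (a deliberately naive sketch, not the truncated-tail estimators of Beirlant et al.; function names, parameters, and the spacing-based estimator are assumptions): magnitudes follow a truncated Gutenberg-Richter (exponential) law on [m0, M], and a simple extreme-value endpoint estimate adds the top spacing to the sample maximum to correct its downward bias.

```python
import math
import random

def sample_truncated_exponential(n, beta, m0, m_max, rng):
    """Draw magnitudes from the truncated Gutenberg-Richter law on
    [m0, m_max] by inverting the truncated-exponential CDF."""
    c = 1.0 - math.exp(-beta * (m_max - m0))   # truncation mass
    return [m0 - math.log(1.0 - c * rng.random()) / beta for _ in range(n)]

def endpoint_estimate(magnitudes):
    """Naive endpoint estimate: sample maximum plus the largest spacing,
    M_hat = m_(n) + (m_(n) - m_(n-1)). The sample maximum alone always
    underestimates the true endpoint."""
    m = sorted(magnitudes)
    return m[-1] + (m[-1] - m[-2])

rng = random.Random(0)
mags = sample_truncated_exponential(5000, beta=2.0, m0=1.5, m_max=4.0, rng=rng)
m_hat = endpoint_estimate(mags)
```

The point estimate alone says nothing about uncertainty, which is why the abstract's upper confidence bounds for the endpoint matter in practice.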