Customs Import Declaration Datasets
Given the huge volume of cross-border flows, effective and efficient control
of trade is increasingly crucial for protecting people and society from illicit
trade. However, the limited accessibility of transaction-level trade datasets
hinders open research, and many customs administrations have not yet benefited
from recent progress in data-driven risk management. In this
paper, we introduce an import declaration dataset to facilitate the
collaboration between domain experts in customs administrations and researchers
from diverse domains, such as data science and machine learning. The dataset
contains 54,000 artificially generated trades with 22 key attributes,
synthesized with a conditional tabular GAN (CTGAN) while preserving correlations between features.
Synthetic data offers several advantages. First, releasing the dataset is not
subject to the restrictions that prohibit disclosing the original import data,
and the fabrication step minimizes the re-identification risk that may exist in
trade statistics. Second, the published data follow a distribution similar to
that of the source data, so they can be used in various downstream tasks. Hence, our
dataset can be used as a benchmark for testing the performance of any
classification algorithm. Along with the data and its generation process, we
release baseline code for fraud detection tasks and empirically show that more
advanced algorithms detect fraud better.
Datasets: https://github.com/Seondong/Customs-Declaration-Dataset
ColdGANs: Taming Language GANs with Cautious Sampling Strategies
Training regimes based on Maximum Likelihood Estimation (MLE) suffer from
known limitations, often leading to poorly generated text sequences. At the
root of these limitations is the mismatch between training and inference, i.e.
the so-called exposure bias, exacerbated by considering only the reference
texts as correct, while in practice several alternative formulations could be
as good. Generative Adversarial Networks (GANs) can mitigate those limitations,
but the discrete nature of text has hindered their application to language
generation: the approaches proposed so far, based on Reinforcement Learning,
have been shown to underperform MLE. Departing from previous works, we analyze
the exploration step in GANs applied to text generation, and show how classical
sampling results in unstable training. We propose to consider alternative
exploration strategies in a GAN framework that we name ColdGANs, where we force
the sampling to be close to the distribution modes to get smoother learning
dynamics. For the first time, to the best of our knowledge, the proposed
language GANs compare favorably to MLE, and obtain improvements over the
state-of-the-art on three generative tasks, namely unconditional text
generation, question generation, and abstractive summarization.
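The "cold" sampling idea, forcing samples to stay close to the distribution modes, can be sketched with a temperature-scaled softmax. This is a hedged illustration, not the paper's actual exploration strategies (which are more elaborate); the logit values below are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Boltzmann distribution over logits. A temperature below 1 sharpens
    the distribution toward its mode ("cold" sampling), which yields less
    noisy exploration than sampling at temperature 1."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]              # hypothetical token scores
warm = softmax(logits, temperature=1.0)    # standard sampling distribution
cold = softmax(logits, temperature=0.3)    # mass concentrated on the mode
```

Sampling from `cold` rarely leaves the highest-scoring token, which is the smoother learning dynamic the abstract refers to.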