Local-Global Vectors to Improve Unigram Terminology Extraction
The present paper explores a novel method that integrates efficient distributed representations with terminology extraction. We show that the information from a small number of observed instances can be combined with local and global word embeddings to remarkably improve the term extraction results on unigram terms. To do so, we pass the terms extracted by other tools to a filter made of the local-global embeddings and a classifier, which in turn decides whether or not a term candidate is a term. The filter can also be used as a hub to merge different term extraction tools into a single higher-performing system. We compare filters that use the skipgram architecture and filters that employ the CBOW architecture for the task at hand.
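The filtering idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes hypothetical pre-trained local (domain) and global (general-corpus) embeddings for each candidate, concatenates them, and applies a linear classifier with made-up weights to decide term vs. non-term.

```python
import numpy as np

def term_filter(local_vec, global_vec, w, b):
    """Score a unigram candidate: concatenate its local (domain) and
    global (general-corpus) embeddings and apply a logistic classifier."""
    x = np.concatenate([local_vec, global_vec])
    score = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # probability in (0, 1)
    return score >= 0.5

# Toy 3-dim embeddings for two candidates (hypothetical values).
local_a, global_a = np.array([1.0, 0.2, 0.1]), np.array([0.1, 0.0, 0.9])
local_b, global_b = np.array([0.0, 0.1, 0.0]), np.array([0.1, 0.1, 0.0])

# Hypothetical trained weights: reward strong local (in-domain) signal,
# penalise strong general-domain signal.
w = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
b = 0.0

print(term_filter(local_a, global_a, w, b))  # True: domain-specific candidate
print(term_filter(local_b, global_b, w, b))  # False: filtered out
```

In practice the weights would be learned from the small number of labelled instances the abstract mentions, and the filter would sit downstream of any existing extraction tool, accepting or rejecting its candidates.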
Term Evaluator: A Tool for Terminology Annotation and Evaluation
There are several methods and available tools for terminology extraction, but the quality of the extracted terms is not always high. Hence, an important consideration in terminology extraction is to assess the quality of the extracted terms. In this paper, we propose and make available a tool for annotating the correctness of terms extracted by three term-extraction tools. This tool facilitates term annotation by using a domain-specific dictionary, a set of filters, and an annotation memory, and allows for post-hoc evaluation. We present a study in which two human judges used the developed tool for term annotation. Their annotations were then analyzed to determine the efficiency of term extraction tools by measures of precision, recall, and F-score, and to calculate the inter-annotator agreement rate.
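The evaluation measures named in the abstract can be computed from annotated term lists in a few lines. The sketch below is illustrative only (the item names are made up), using set overlap for precision/recall/F-score and Cohen's kappa as one common inter-annotator agreement rate:

```python
def prf(extracted, gold):
    """Precision, recall, and F1 of an extractor's term list vs. gold terms."""
    tp = len(set(extracted) & set(gold))      # true positives
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def cohen_kappa(a1, a2):
    """Agreement between two annotators' binary labels, chance-corrected."""
    n = len(a1)
    po = sum(x == y for x, y in zip(a1, a2)) / n          # observed agreement
    p1, p2 = sum(a1) / n, sum(a2) / n                     # per-annotator "yes" rates
    pe = p1 * p2 + (1 - p1) * (1 - p2)                    # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

p, r, f = prf(["neural net", "foo"], ["neural net", "embedding"])
print(p, r, f)            # 0.5 0.5 0.5
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(kappa)              # 0.5
```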
DataDAM: Efficient Dataset Distillation with Attention Matching
Researchers have long tried to minimize training costs in deep learning while
maintaining strong generalization across diverse datasets. Emerging research on
dataset distillation aims to reduce training costs by creating a small
synthetic set that contains the information of a larger real dataset and
ultimately achieves test accuracy equivalent to a model trained on the whole
dataset. Unfortunately, the synthetic data generated by previous methods are
not guaranteed to distribute and discriminate as well as the original training
data, and they incur significant computational costs. Despite promising
results, there still exists a significant performance gap between models
trained on condensed synthetic sets and those trained on the whole dataset. In
this paper, we address these challenges using efficient Dataset Distillation
with Attention Matching (DataDAM), achieving state-of-the-art performance while
reducing training costs. Specifically, we learn synthetic images by matching
the spatial attention maps of real and synthetic data generated by different
layers within a family of randomly initialized neural networks. Our method
outperforms the prior methods on several datasets, including CIFAR10/100,
TinyImageNet, ImageNet-1K, and subsets of ImageNet-1K across most of the
settings, and achieves improvements of up to 6.5% and 4.1% on CIFAR100 and
ImageNet-1K, respectively. We also show that our high-quality distilled images
have practical benefits for downstream applications, such as continual learning
and neural architecture search.
Comment: Accepted at the International Conference on Computer Vision (ICCV) 202
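The attention-matching objective can be sketched in a few lines. This is a simplified illustration under assumptions, not DataDAM's implementation: each feature map of shape (C, H, W) is collapsed into a spatial attention vector by summing powered absolute activations over channels, and real vs. synthetic attention is matched layer-by-layer with a squared distance.

```python
import numpy as np

def spatial_attention(feat, p=2):
    """Collapse a (C, H, W) feature map into a normalised spatial
    attention vector by summing |activation|**p over channels."""
    a = np.sum(np.abs(feat) ** p, axis=0).ravel()
    return a / (np.linalg.norm(a) + 1e-8)

def attention_matching_loss(real_feats, syn_feats):
    """Squared distance between real and synthetic attention maps,
    averaged over the layers of a (randomly initialised) network."""
    return float(np.mean([
        np.sum((spatial_attention(r) - spatial_attention(s)) ** 2)
        for r, s in zip(real_feats, syn_feats)
    ]))

rng = np.random.default_rng(0)
real = [rng.normal(size=(8, 4, 4)) for _ in range(3)]   # 3 toy layers
syn = [rng.normal(size=(8, 4, 4)) for _ in range(3)]

loss_self = attention_matching_loss(real, real)   # identical features: 0.0
loss_diff = attention_matching_loss(real, syn)    # mismatched features: > 0
print(loss_self, loss_diff)
```

In the actual method this loss would be minimised with respect to the synthetic images themselves, with features drawn from many randomly initialised networks rather than fixed toy arrays.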
PEPSI: Practically Efficient Private Set Intersection in the Unbalanced Setting
Two parties with private data sets can find shared elements using a Private
Set Intersection (PSI) protocol without revealing any information beyond the
intersection. Circuit PSI protocols privately compute an arbitrary function of
the intersection, such as its cardinality, and are often employed in an
unbalanced setting where one party has more data than the other. Existing
protocols are either computationally inefficient or require extensive
server-client communication on the order of the larger set. We introduce
Practically Efficient PSI or PEPSI, a non-interactive solution where only the
client sends its encrypted data. PEPSI can process an intersection of 1024
client items with a million server items in under a second, using less than 5
MB of communication. Our work is over 4 orders of magnitude faster than an
existing non-interactive circuit PSI protocol and requires only 10% of the
communication. It is also up to 20 times faster than the work of Ion et al.,
which computes a limited set of functions and has communication costs
proportional to the larger set. Our work is the first to demonstrate that
non-interactive circuit PSI can be practically applied in an unbalanced
setting.
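To make the PSI functionality concrete, here is a toy sketch of the classic interactive Diffie-Hellman-style PSI for set-intersection cardinality, which is a different (and much older) construction than PEPSI's non-interactive protocol: each party blinds the hashed items with its own secret exponent, the other party re-blinds them, and matching items collide on the same doubly blinded value.

```python
import hashlib
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Real protocols use
# standardised prime-order groups or elliptic curves; this is only a sketch.
P = 2 ** 127 - 1

def h2g(item):
    """Hash an item into the multiplicative group mod P."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P

def blind(items, secret):
    """Raise each hashed item to a party's secret exponent mod P."""
    return {pow(h2g(x), secret, P) for x in items}

client_set = {"alice@x.org", "bob@x.org", "carol@x.org"}
server_set = {"bob@x.org", "carol@x.org", "dave@x.org"}

a = secrets.randbelow(P - 2) + 1   # client's secret exponent
b = secrets.randbelow(P - 2) + 1   # server's secret exponent

# Each side blinds its own items, then the other side re-blinds them;
# shared items end up as the same value h(x)**(a*b) mod P on both sides.
client_double = {pow(v, b, P) for v in blind(client_set, a)}
server_double = {pow(v, a, P) for v in blind(server_set, b)}

intersection_size = len(client_double & server_double)
print(intersection_size)  # 2 shared items
```

Note the contrast with the abstract: this toy scheme needs both parties to send blinded sets (communication on the order of both set sizes), whereas PEPSI is non-interactive and only the client sends encrypted data.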
Interpretation for Variational Autoencoder Used to Generate Financial Synthetic Tabular Data
Synthetic data, artificially generated by computer programs, has become more widely used in the financial domain to mitigate privacy concerns. Variational Autoencoder (VAE) is one of the most popular deep-learning models for generating synthetic data. However, VAE is often considered a “black box” due to its opaqueness. Although some studies have been conducted to provide explanatory insights into VAE, research focusing on explaining how the input data could influence VAE to create synthetic data, especially for tabular data, is still lacking. Yet in the financial industry, most data are stored in a tabular format. This paper proposes a sensitivity-based method to assess the impact of input tabular data on how VAE synthesizes data. This sensitivity-based method can provide both global and local interpretations efficiently and intuitively. To test this method, a simulated dataset and three Kaggle banking tabular datasets were employed. The results confirmed the applicability of this proposed method.
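A generic sensitivity analysis of this kind can be sketched with finite differences: perturb one input feature at a time and measure how much the model's output moves. The code below is an illustration under assumptions, with a stand-in linear-plus-tanh function in place of a trained VAE:

```python
import numpy as np

def sensitivity(model, x, eps=1e-4):
    """Finite-difference sensitivity of a model's output to each input
    feature: perturb one feature at a time, measure the output shift."""
    base = model(x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        xp = x.copy()
        xp[i] += eps
        scores[i] = np.linalg.norm(model(xp) - base) / eps
    return scores

# Stand-in for an encoder/decoder: a toy map that ignores feature 2.
W = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.0]])
toy_model = lambda x: np.tanh(W @ x)

s = sensitivity(toy_model, np.array([0.1, -0.2, 0.3]))
print(s)  # feature 0 dominates; feature 2 has exactly zero effect
```

Averaging such per-sample scores over a dataset gives a global view of which tabular columns drive the generator, while a single sample's scores give the local interpretation the abstract refers to.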
Distributed specificity for automatic terminology extraction
The present article explores two novel methods that integrate distributed representations with terminology extraction. Both methods assess the specificity of a word (unigram) to the target corpus by leveraging its distributed representation in the target domain as well as in the general domain. The first approach adopts this distributed specificity as a filter, and the second directly applies it to the corpus. The filter can be mounted on any other Automatic Terminology Extraction (ATE) method, allows merging any number of other ATE methods, and achieves remarkable results with minimal training. The direct approach does not perform as well as the filtering approach, but it reemphasizes that, with distributed specificity as the word representation, very little data is required to train an ATE classifier. This encourages more minimally supervised ATE algorithms in the future.
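One simple way to realise the comparison of domain and general representations is shown below. This is a hypothetical sketch, not the article's method: it scores specificity as one minus the cosine similarity between a word's domain embedding and its general-domain embedding, and it assumes the two embedding spaces have already been aligned so the vectors are directly comparable.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def distributed_specificity(domain_vec, general_vec):
    """Specificity of a unigram to the target domain: the less its domain
    embedding resembles its general-domain embedding, the more specific
    it is (score in [0, 2])."""
    return 1.0 - cosine(domain_vec, general_vec)

# Hypothetical embeddings: "kernel" shifts meaning in an OS corpus,
# while "the" behaves the same in both corpora.
kernel_dom, kernel_gen = np.array([0.9, 0.1, 0.0]), np.array([0.1, 0.8, 0.3])
the_dom, the_gen = np.array([0.5, 0.5, 0.1]), np.array([0.5, 0.5, 0.1])

print(distributed_specificity(kernel_dom, kernel_gen) >
      distributed_specificity(the_dom, the_gen))  # True
```

A score like this could serve either role the abstract describes: thresholded as a filter on another extractor's candidates, or used directly as the word representation fed to an ATE classifier.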