The Split Matters: Flat Minima Methods for Improving the Performance of GNNs
When training a Neural Network, it is optimized using the available training
data with the hope that it generalizes well to new or unseen testing data. At
the same absolute value, a flat minimum in the loss landscape is presumed to
generalize better than a sharp minimum. Methods for determining flat minima
have been mostly researched for independent and identically distributed
(i.i.d.) data such as images. Graphs are inherently non-i.i.d. since the vertices
are edge-connected. We investigate flat minima methods and combinations of
those methods for training graph neural networks (GNNs). We use GCN and GAT as
well as extend Graph-MLP to work with more layers and larger graphs. We conduct
experiments on small and large citation, co-purchase, and protein datasets with
different train-test splits in both the transductive and inductive training
procedures. Results show that flat minima methods can improve the performance of
GNN models by over 2 points if the train-test split is randomized. Following
Shchur et al., randomized splits are essential for a fair evaluation of GNNs,
as other (fixed) splits like 'Planetoid' are biased. Overall, we provide
important insights for improving and fairly evaluating flat minima methods on
GNNs. We recommend that practitioners always use weight averaging techniques, in
particular EWA when using early stopping. While weight averaging techniques are
only sometimes the best performing method, they are less sensitive to
hyperparameters, need no additional training, and keep the original model
unchanged. All source code is available at
https://github.com/Foisunt/FMMs-in-GNNs
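As an illustration of the weight averaging recommended above, here is a minimal sketch. It assumes that EWA refers to an exponentially weighted moving average of model weights, which the abstract does not spell out. The function name `ewa_update`, the `decay` parameter, and the toy snapshots are hypothetical and are not taken from the linked repository.

```python
def ewa_update(avg_weights, new_weights, decay=0.9):
    """Blend the running average with the newest weights (assumed EWA rule)."""
    return [decay * a + (1.0 - decay) * w
            for a, w in zip(avg_weights, new_weights)]


# Toy "training run": three snapshots of a two-parameter model.
snapshots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Initialise the average with the first snapshot, then fold in the rest.
avg = list(snapshots[0])
for weights in snapshots[1:]:
    avg = ewa_update(avg, weights, decay=0.5)
```

The averaged copy is kept separately from the trained weights, so the original model stays unchanged, in line with the property the abstract highlights.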
Memorization of Named Entities in Fine-tuned BERT Models
Privacy preserving deep learning is an emerging field in machine learning
that aims to mitigate the privacy risks in the use of deep neural networks. One
such risk is training data extraction from language models that have been
trained on datasets that contain personal and privacy-sensitive information.
In our study, we investigate the extent of named entity memorization in
fine-tuned BERT models. We use single-label text classification as a
representative downstream task and employ three different fine-tuning setups in
our experiments, including one with Differential Privacy (DP). We create a
large number of text samples from the fine-tuned BERT models utilizing a custom
sequential sampling strategy with two prompting strategies. We search in these
samples for named entities and check if they are also present in the
fine-tuning datasets. We experiment with two benchmark datasets in the domains
of emails and blogs. We show that the application of DP has a detrimental
effect on the text generation capabilities of BERT. Furthermore, we show that a
fine-tuned BERT does not generate more named entities specific to the
fine-tuning dataset than a BERT model that is pre-trained only. This suggests
that BERT is unlikely to emit personal or privacy sensitive named entities.
Overall, our results are important to understand to what extent BERT-based
services are prone to training data extraction attacks.
Comment: accepted at CD-MAKE 202
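The entity check described in the abstract can be sketched roughly as follows. This is an illustration under assumptions, not the authors' pipeline: it takes the named entities as already extracted (the study would need an entity recognizer for that step), and the function name and the example entities are hypothetical.

```python
def entities_seen_in_finetuning(sample_entities, finetune_entities):
    """Return the generated entities that also occur in the fine-tuning data."""
    return set(sample_entities) & set(finetune_entities)


# Hypothetical entities found in generated samples and in the fine-tuning set.
generated = ["Alice Example", "Berlin", "Acme Corp"]
finetune = ["Acme Corp", "Bob Sample", "Berlin"]

overlap = entities_seen_in_finetuning(generated, finetune)
overlap_rate = len(overlap) / len(set(generated))
```

Comparing this overlap between a fine-tuned model and a model that is only pre-trained is one way to read the comparison the abstract reports.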
Reducing a Set of Regular Expressions and Analyzing Differences of Domain-specific Statistic Reporting
Due to the large amount of daily scientific publications, it is impossible to
manually review each one. Therefore, an automatic extraction of key information
is desirable. In this paper, we examine STEREO, a tool for extracting
statistics from scientific papers using regular expressions. By adapting an
existing regular expression inclusion algorithm for our use case, we decrease
the number of regular expressions used in STEREO by about . We reveal
common patterns from the condensed rule set that can be used for the creation
of new rules. We also apply STEREO, which was previously trained in the
life-sciences and medical domain, to a new scientific domain, namely
Human-Computer-Interaction (HCI), and re-evaluate it. According to our
research, statistics in the HCI domain are similar to those in the medical
domain, although a higher percentage of APA-conform statistics were found in
the HCI domain. Additionally, we compare extraction on PDF and LaTeX source
files, finding LaTeX to be more reliable for extraction.
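To make the idea of regular-expression-based statistics extraction concrete, the sketch below matches one APA-style t-test report. The pattern is a hypothetical example written for this illustration. It is not one of STEREO's rules, and real reporting styles vary far more than this single pattern covers.

```python
import re

# Hypothetical pattern for reports such as "t(28) = 2.45, p < .05".
APA_T_TEST = re.compile(
    r"t\((?P<df>\d+(?:\.\d+)?)\)\s*=\s*(?P<t>-?\d*\.?\d+),"
    r"\s*p\s*(?P<rel>[<=>])\s*(?P<p>\d*\.\d+)"
)


def extract_t_tests(text):
    """Return one dict of captured fields per t-test report found in text."""
    return [match.groupdict() for match in APA_T_TEST.finditer(text)]


sentence = "The groups differed significantly, t(28) = 2.45, p < .05."
results = extract_t_tests(sentence)
```

Named groups keep the degrees of freedom, test statistic, and p-value separate, which makes it easier to compare reporting styles across domains.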
Reconstructing Native American Population History
The peopling of the Americas has been the subject of extensive genetic, archaeological and linguistic research; however, central questions remain unresolved [1–5]. One contentious issue is whether the settlement occurred via a single [6–8] or multiple streams of migration from Siberia [9–15]. The pattern of dispersals within the Americas is also poorly understood. To address these questions at higher resolution than was previously possible, we assembled data from 52 Native American and 17 Siberian groups genotyped at 364,470 single nucleotide polymorphisms. We show that Native Americans descend from at least three streams of Asian gene flow. Most descend entirely from a single ancestral population that we call “First American”. However, speakers of Eskimo-Aleut languages from the Arctic inherit almost half their ancestry from a second stream of Asian gene flow, and the Na-Dene-speaking Chipewyan from Canada inherit roughly one-tenth of their ancestry from a third stream. We show that the initial peopling followed a southward expansion facilitated by the coast, with sequential population splits and little gene flow after divergence, especially in South America. A major exception is in Chibchan-speakers on both sides of the Panama Isthmus, who have ancestry from both North and South America.