Data Mining Applications in Banking Sector While Preserving Customer Privacy
In real-life data mining applications, organizations cooperate by using each other's data on the same data mining task for more accurate results, even though they may have different security and privacy concerns. Privacy-preserving data mining (PPDM) comprises rules and techniques that allow parties to collaborate on data mining applications while keeping their data private. The objective of this paper is to present a number of PPDM protocols and show how PPDM can be used in data mining applications in the banking sector. For this purpose, the paper discusses homomorphic cryptosystems and secure multiparty computation. Supported by experimental analysis, the paper demonstrates that data mining tasks commonly used in the banking sector, such as clustering and Bayesian networks (association rules), can be performed efficiently and securely. This is the first study that combines PPDM protocols with data mining applications in banking. DOI: 10.28991/ESJ-2022-06-06-014
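As a flavor of the secure multiparty computation the paper builds on, additive secret sharing lets several parties compute a joint sum without any party learning another's input. The sketch below is a generic illustration, not the paper's protocol; the bank counts and values are made up.

```python
import random

PRIME = 2**61 - 1  # public modulus; all arithmetic is done mod PRIME

def share(value, n_parties):
    """Split `value` into n additive shares that sum to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recover the shared value by summing all shares mod PRIME."""
    return sum(shares) % PRIME

# Three banks each hold a private count; no single share reveals any input.
private_counts = [120, 45, 310]
all_shares = [share(v, 3) for v in private_counts]

# Each party locally sums the shares it received (one per bank) ...
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]

# ... and only the combined result is revealed.
total = reconstruct(partial_sums)
print(total)  # 475
```

Any single share (or any proper subset of shares) is uniformly random, so only the aggregate leaks.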
FLIPS: Federated Learning using Intelligent Participant Selection
This paper presents the design and implementation of FLIPS, a middleware system for managing data and participant heterogeneity in federated learning (FL) training workloads. In particular, we examine the benefits of label distribution clustering for participant selection in federated learning. FLIPS clusters the parties involved in an FL training job based on the label distribution of their data a priori and, during FL training, ensures that each cluster is equitably represented among the participants selected. FLIPS can support the most common FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt, and FedYogi. To manage platform heterogeneity and dynamic resource availability, FLIPS incorporates a straggler management mechanism to handle changing capacities in distributed, smart community applications. Privacy of label distributions, clustering, and participant selection is ensured through a trusted execution environment (TEE). Our comprehensive empirical evaluation compares FLIPS with random participant selection as well as two other "smart" selection mechanisms, Oort and gradient clustering, using two real-world datasets, two different non-IID distributions, and three common FL algorithms (FedYogi, FedProx, and FedAvg). We demonstrate that FLIPS significantly improves convergence, achieving 17-20% higher accuracy with 20-60% lower communication costs, and that these benefits endure in the presence of straggler participants.
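The cluster-then-select idea can be sketched as follows. This is a deliberately simplified stand-in: clients are grouped by their dominant label rather than by FLIPS's actual clustering algorithm, and the client data and function names are hypothetical.

```python
import random
from collections import defaultdict

def label_histogram(labels, num_classes):
    """Normalized label distribution of one client's local dataset."""
    hist = [0.0] * num_classes
    for y in labels:
        hist[y] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def cluster_by_distribution(histograms):
    """Toy stand-in for clustering: group clients by their dominant label."""
    clusters = defaultdict(list)
    for client_id, hist in enumerate(histograms):
        clusters[hist.index(max(hist))].append(client_id)
    return list(clusters.values())

def select_participants(clusters, k, seed=0):
    """Pick k participants round-robin across clusters, so every cluster
    is represented before any cluster contributes a second client."""
    rng = random.Random(seed)
    pools = [rng.sample(c, len(c)) for c in clusters]
    chosen, i = [], 0
    while len(chosen) < k and any(pools):
        if pools[i % len(pools)]:
            chosen.append(pools[i % len(pools)].pop())
        i += 1
    return chosen

# Hypothetical clients with skewed (non-IID) label data over 3 classes.
client_labels = [[0, 0, 1], [0, 0, 0], [2, 2, 1], [1, 1, 1], [2, 2, 2]]
hists = [label_histogram(ls, 3) for ls in client_labels]
clusters = cluster_by_distribution(hists)
print(select_participants(clusters, 3))
```

With three clusters and k = 3, each round of selection draws exactly one client per cluster, which is the equitable-representation property described above.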
Technologies and Applications for Big Data Value
This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications in the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is then arranged in two parts. The first part, "Technologies and Methods", contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part, "Processes and Applications", details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences: first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI; second, practitioners and industry experts engaged in data-driven systems, software design, and deployment projects who are interested in employing these advanced methods to address real-world problems.
Zero-knowledge Proof Meets Machine Learning in Verifiability: A Survey
With the rapid advancement of artificial intelligence technology, the use of machine learning models is gradually becoming part of daily life. High-quality models rely not only on efficient optimization algorithms but also on training and learning processes built upon vast amounts of data and computational power. In practice, however, due to challenges such as limited computational resources and data privacy concerns, users who need models often cannot train machine learning models locally. This has led them to explore alternative approaches such as outsourced learning and federated learning. While these methods address the feasibility of model training effectively, they introduce concerns about the trustworthiness of the training process, since computations are not performed locally. Similar trustworthiness issues arise with outsourced model inference. These two problems can be summarized as the trustworthiness problem of model computations: how can one verify that the results computed by other participants are derived according to the specified algorithm, model, and input data? To address this challenge, verifiable machine learning (VML) has emerged. This paper presents a comprehensive survey of zero-knowledge proof-based verifiable machine learning (ZKP-VML) technology. We first analyze the potential verifiability issues that may exist in different machine learning scenarios. Subsequently, we provide a formal definition of ZKP-VML. We then conduct a detailed analysis and classification of existing works based on their technical approaches. Finally, we discuss the key challenges and future directions in the field of ZKP-based VML.
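A minimal concrete instance of a zero-knowledge proof is Schnorr's protocol for proving knowledge of a discrete logarithm, shown below made non-interactive with the Fiat-Shamir heuristic. The group parameters are toy-sized for readability and are completely insecure in practice, and ZKP-VML systems use far more elaborate proof systems (e.g. zk-SNARKs), but the verify-without-revealing structure is the same.

```python
import hashlib
import random

# Toy Schnorr group (insecure demo sizes): p = 2q + 1, and g generates the
# order-q subgroup of Z_p^*.
p, q, g = 23, 11, 2

def prove(secret_x, seed=None):
    """Prove knowledge of x such that g^x = y (mod p) without revealing x."""
    rng = random.Random(seed)
    y = pow(g, secret_x, p)
    r = rng.randrange(q)               # prover's random nonce
    t = pow(g, r, p)                   # commitment
    c = int(hashlib.sha256(f"{g}{y}{t}".encode()).hexdigest(), 16) % q  # Fiat-Shamir challenge
    s = (r + c * secret_x) % q         # response
    return y, t, s

def verify(y, t, s):
    """Check g^s == t * y^c (mod p), which holds iff the prover knew x."""
    c = int(hashlib.sha256(f"{g}{y}{t}".encode()).hexdigest(), 16) % q
    return pow(g, s, p) == (t * pow(y, c, p)) % p

y, t, s = prove(secret_x=7, seed=42)
print(verify(y, t, s))  # True
```

The verifier learns only that the prover knows some x with g^x = y; the response s reveals nothing about x because it is masked by the random nonce r.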
New Statistical Algorithms for the Analysis of Mass Spectrometry Time-Of-Flight Mass Data with Applications in Clinical Diagnostics
Mass spectrometry (MS) based techniques have emerged as a standard for large-scale protein analysis. The ongoing progress in terms of more sensitive machines and improved data analysis algorithms has led to a constant expansion of its fields of application. Recently, MS was introduced into clinical proteomics with the prospect of early disease detection using proteomic pattern matching.
Analyzing biological samples (e.g. blood) by mass spectrometry generates
mass spectra that represent the components (molecules) contained in a
sample as masses and their respective relative concentrations.
In this work, we are interested in those components that are constant within a group of individuals but differ markedly between individuals of two distinct groups. These distinguishing components, which depend on a particular medical condition, are generally called biomarkers. Since not all biomarkers found by the algorithms are of equal (discriminating) quality, we are only interested in a small biomarker subset that, as a combination, can be used as a fingerprint for a disease. Once a fingerprint for a particular disease (or medical condition) is identified, it can be used in clinical diagnostics to classify unknown spectra.
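The fingerprint idea — select a small discriminating feature subset, then classify new spectra against it — can be sketched as below. This is a generic illustration, not the thesis's algorithms; the t-like scoring function, the nearest-centroid classifier, and the tiny synthetic dataset are all hypothetical.

```python
from math import sqrt

def t_score(a, b):
    """Absolute Welch-style separation score for one feature across two groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return abs(ma - mb) / sqrt(va / len(a) + vb / len(b) + 1e-12)

def fingerprint(group_a, group_b, k):
    """Indices of the k most discriminating features between the two groups."""
    n = len(group_a[0])
    scores = [t_score([s[i] for s in group_a], [s[i] for s in group_b])
              for i in range(n)]
    return sorted(range(n), key=lambda i: -scores[i])[:k]

def classify(spectrum, group_a, group_b, features):
    """Assign to the class whose centroid is nearer on the fingerprint features."""
    def dist(group):
        cent = [sum(s[i] for s in group) / len(group) for i in features]
        return sum((spectrum[i] - c) ** 2 for i, c in zip(features, cent))
    return "A" if dist(group_a) < dist(group_b) else "B"

# Tiny synthetic example: feature 1 separates the groups, the rest are noise.
healthy = [[5, 1, 7], [6, 2, 7], [5, 1, 6]]
disease = [[5, 9, 7], [6, 8, 6], [5, 9, 7]]
fp = fingerprint(healthy, disease, k=1)
print(fp, classify([5, 8, 7], healthy, disease, fp))  # [1] B
```

Real pipelines must additionally handle peak detection, alignment, and multiple-testing control, which this sketch omits.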
In this thesis we have developed new algorithms for automatic extraction of
disease specific fingerprints from mass spectrometry data. Special emphasis has
been put on designing highly sensitive methods with respect to signal detection.
Thanks to our statistically based approach, our methods are able to detect signals, such as hormones, even below the noise level inherent in data acquired by common MS machines.
To provide collaborating groups with access to these new classes of algorithms, we have created a web-based analysis platform that provides all necessary interfaces for data transfer, data analysis, and result inspection.
To demonstrate the platform's practical relevance, it has been used in several clinical studies, two of which are presented in this thesis. These studies showed that our platform is superior to commercial systems with respect to fingerprint identification. As an outcome of these studies, several fingerprints for different cancer types (bladder, kidney, testicle, pancreas, colon, and thyroid) have been detected and validated. The clinical partners in fact emphasize that these results would have been impossible with a less sensitive analysis tool (such as the currently available systems).
In addition to the issue of reliably finding and handling signals in noise, we faced the problem of handling very large amounts of data, since an average dataset for an individual is about 2.5 gigabytes in size and we have data from hundreds to thousands of persons. To cope with these large datasets, we developed a new framework for a heterogeneous (quasi) ad-hoc Grid: an infrastructure that allows the integration of thousands of computing resources (e.g. desktop computers, computing clusters, or specialized hardware such as IBM's Cell processor in a PlayStation 3).
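The thesis framework targets a distributed, heterogeneous grid; as a minimal local analogue, the same scatter/gather pattern can be sketched with a thread pool: split a large dataset into chunks, process the chunks in parallel, and merge the per-chunk results. The per-chunk analysis here is a placeholder, not the thesis's spectrum-processing code.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk):
    """Stand-in for per-chunk spectrum analysis (here: sum of intensities)."""
    return sum(chunk)

def scatter_gather(data, n_workers=4, chunk_size=1000):
    """Split `data` into chunks, process them in parallel, merge the results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(analyze_chunk, chunks))

print(scatter_gather(list(range(10_000))))  # 49995000
```

A real grid adds scheduling, fault tolerance, and data transfer on top of this pattern, but the decomposition into independent chunk-level tasks is what makes thousands of heterogeneous resources usable at all.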
- …