4 research outputs found
FedPNN: One-shot Federated Classification via Evolving Clustering Method and Probabilistic Neural Network hybrid
Protecting data privacy is paramount in fields such as finance, banking,
and healthcare. Federated Learning (FL) has attracted widespread attention due
to its decentralized, distributed training and its ability to protect
privacy while obtaining a global shared model. However, FL presents challenges
such as communication overhead and limited resource capability. This motivated
us to propose a two-stage federated learning approach toward the objective of
privacy protection, which is a first-of-its-kind study: (i) in
the first stage, a synthetic dataset is generated by employing two different
distributions as noise to the vanilla conditional tabular generative
adversarial neural network (CTGAN), resulting in a modified CTGAN, and (ii) in the
second stage, the Federated Probabilistic Neural Network (FedPNN) is developed
and employed for building a globally shared classification model. We also
employed synthetic dataset metrics to check the quality of the generated
synthetic dataset. Further, we proposed a meta-clustering algorithm whereby the
cluster centers obtained from the clients are clustered at the server for
training the global model. Despite PNN being a one-pass learning classifier,
its complexity depends on the training data size. Therefore, we employed a
modified evolving clustering method (ECM), another one-pass algorithm, to
cluster the training data, thereby increasing the speed further. Moreover, we
conducted a sensitivity analysis by varying Dthr, a hyperparameter of ECM, at the
server and at the client, one at a time. The effectiveness of our approach is
validated on four finance and medical datasets.
Comment: 27 pages, 13 figures, 7 tables
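To make the idea in the abstract above concrete, here is a minimal sketch of classifying with a probabilistic neural network (a Parzen-window classifier) whose pattern layer holds cluster centers rather than all training samples. It is not the authors' implementation: the clustering step is a simplified one-pass, distance-threshold scheme standing in for the modified ECM, and the names one_pass_centers, pnn_predict, dthr, and sigma, as well as the toy data, are assumptions made for illustration.

```python
import math

def one_pass_centers(points, dthr):
    """Simplified one-pass clustering (a stand-in, not the paper's modified ECM):
    open a new cluster when the nearest center is farther than dthr,
    otherwise fold the point into that center's running mean."""
    centers, counts = [], []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if not dists or min(dists) > dthr:
            centers.append(tuple(p))
            counts.append(1)
        else:
            j = dists.index(min(dists))
            counts[j] += 1
            centers[j] = tuple(c + (x - c) / counts[j]
                               for c, x in zip(centers[j], p))
    return centers

def pnn_predict(x, centers_by_class, sigma=0.5):
    """Return the class with the largest average Gaussian kernel response at x."""
    scores = {}
    for label, centers in centers_by_class.items():
        total = 0.0
        for c in centers:
            sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        scores[label] = total / len(centers)
    return max(scores, key=scores.get)

# Toy data (hypothetical): two classes in two dimensions.
train = {
    "class_a": [(0.0, 0.0), (0.2, 0.0), (0.1, 0.2)],
    "class_b": [(1.0, 1.0), (1.2, 1.0), (2.0, 2.0)],
}
centers_by_class = {label: one_pass_centers(pts, dthr=0.5)
                    for label, pts in train.items()}
prediction = pnn_predict((0.1, 0.1), centers_by_class)
```

Because prediction cost grows with the number of stored patterns, a larger dthr yields fewer centers and faster prediction, at the risk of blurring class boundaries. In a federated setting, the same clustering routine could in principle be applied again at the server to the centers pooled from the clients, which is the spirit of the meta-clustering step described in the abstract.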
Parallel and Streaming Wavelet Neural Networks for Classification and Regression under Apache Spark
Wavelet neural networks (WNN) have been applied in many fields to solve
regression as well as classification problems. In the era of big data,
data is generated at a brisk pace, and it is imperative to analyze it as soon as
it is generated because the nature of the data may change
dramatically in short time intervals. The need is heightened because big
data is all-pervasive and poses computational challenges for data scientists.
Therefore, in this paper, we built an efficient Scalable, Parallelized Wavelet
Neural Network (SPWNN), which employs the parallel stochastic gradient descent
(SGD) algorithm. SPWNN is designed and developed under both static and
streaming environments in the horizontal parallelization framework. SPWNN is
implemented by using Morlet and Gaussian functions as activation functions.
This study is conducted on big datasets such as gas sensor data, which has more
than 4 million samples, and high-dimensional medical research data, which has
more than 10,000 features. The experimental analysis
indicates that in the static environment, SPWNN with Morlet activation function
outperformed SPWNN with Gaussian on the classification datasets. However, in
the case of regression, the opposite was observed. In contrast, in the
streaming environment, Gaussian outperformed Morlet on the classification
datasets and Morlet outperformed Gaussian on the regression datasets. Overall, the
proposed SPWNN architecture achieved a speedup of 1.32-1.40.
Comment: 25 pages; 2 Tables; 7 Figures
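As an illustration of the two activation functions named in the abstract above, the sketch below evaluates a single-output wavelet neural network forward pass with either one. The abstract does not state the exact functional forms, so the forms used here, cos(1.75 t) exp(-t^2 / 2) for Morlet and exp(-t^2) for Gaussian, are assumptions based on forms commonly used in the WNN literature; the function wnn_forward and its parameter names are hypothetical, and no parallel or streaming logic is shown.

```python
import math

def morlet(t):
    # Commonly used Morlet wavelet form (assumed, not taken from the abstract).
    return math.cos(1.75 * t) * math.exp(-t * t / 2.0)

def gaussian(t):
    # Commonly used Gaussian activation form (assumed, not taken from the abstract).
    return math.exp(-t * t)

def wnn_forward(x, weights, translations, dilations, out_weights, activation):
    """Forward pass of a one-hidden-layer WNN: each hidden node applies the
    wavelet to its weighted input after shifting by a translation and
    scaling by a dilation; the output is a weighted sum of hidden responses."""
    hidden = []
    for w_j, b_j, a_j in zip(weights, translations, dilations):
        z = sum(w * xi for w, xi in zip(w_j, x))
        hidden.append(activation((z - b_j) / a_j))
    return sum(v * h for v, h in zip(out_weights, hidden))

# Toy parameters (hypothetical): 2 inputs, 2 hidden nodes.
weights = [(0.4, -0.3), (0.1, 0.8)]
translations = [0.0, 0.0]
dilations = [1.0, 2.0]
out_weights = [0.5, -0.25]

y_morlet = wnn_forward((0.0, 0.0), weights, translations, dilations, out_weights, morlet)
y_gauss = wnn_forward((1.0, 2.0), weights, translations, dilations, out_weights, gaussian)
```

Swapping the activation argument is the only change needed to compare the two variants on the same weights, which mirrors the Morlet versus Gaussian comparison reported in the abstract.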
Explainable Artificial Intelligence and Causal Inference based ATM Fraud Detection
Gaining the trust of customers and showing them empathy are critical
in the financial domain. Frequent occurrence of fraudulent activities affects
these two factors. Hence, financial organizations and banks must take utmost
care to mitigate them. Among them, fraudulent ATM transactions are a common
problem faced by banks. Fraud datasets pose several critical challenges:
the dataset is highly imbalanced, the fraud pattern keeps
changing, etc. Owing to the rarity of fraudulent activities, fraud detection
can be formulated as either a binary classification problem or a one-class
classification (OCC) problem. In this study, we applied both formulations to an ATM
transactions dataset collected from India. In binary classification, we
investigated the effectiveness of various oversampling techniques, such as the
Synthetic Minority Oversampling Technique (SMOTE) and its variants and Generative
Adversarial Networks (GAN). Further, we employed
various machine learning techniques, viz., Naive Bayes (NB), Logistic Regression
(LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF),
Gradient Boosting Tree (GBT), and Multi-layer Perceptron (MLP). GBT outperformed
the rest of the models by achieving 0.963 AUC, and DT came second with 0.958
AUC. DT is the winner if the complexity and interpretability aspects are
considered. Among all the oversampling approaches, SMOTE and its variants were
observed to perform better. In OCC, IForest attained 0.959 CR, and OCSVM
secured second place with 0.947 CR. Further, we incorporated explainable
artificial intelligence (XAI) and causal inference (CI) in the fraud detection
framework and studied it through various analyses.
Comment: 34 pages; 21 Figures; 8 Tables
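The oversampling step mentioned in the abstract above can be illustrated with a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a bare-bones illustration, not the study's pipeline and not a replacement for a library implementation; the function name smote_sketch, its parameters, and the toy points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nn)))
    return synthetic

# Toy minority class (hypothetical fraud samples in two features).
fraud = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_sketch(fraud, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies within the bounding box of the minority class. Oversampling of this kind is normally applied to the training split only, so that evaluation metrics such as AUC are computed on untouched data.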
Parallel bi-objective evolutionary algorithms for scalable feature subset selection via migration strategy under Spark
Feature subset selection (FSS) for classification is inherently a
bi-objective optimization problem, where the task is to obtain a feature subset
which yields the maximum possible area under the receiver operating
characteristic curve (AUC) with minimum cardinality of the feature subset. In
today's world, a huge amount of data is generated by all human
activities. To mine such voluminous data, which is often high-dimensional, there is
a need to develop parallel and scalable frameworks. In this first-of-its-kind
study, we propose and develop an iterative MapReduce-based framework for
bi-objective evolutionary algorithm (EA) based wrappers under Apache Spark
with the migration strategy. To accomplish this, we parallelized the
non-dominated sorting based algorithms, namely the non-dominated sorting genetic
algorithm (NSGA-II) and non-dominated sorting particle swarm optimization (NSPSO), as well as
the decomposition-based algorithm, namely the multi-objective evolutionary
algorithm based on decomposition (MOEA-D), and named them P-NSGA-II-IS,
P-NSPSO-IS, P-MOEA-D-IS, respectively. We proposed a modified MOEA-D by
incorporating the non-dominated sorting principle while parallelizing it.
Throughout the study, AUC is computed by logistic regression (LR). We test the
effectiveness of the proposed methodology on various datasets. It is noteworthy
that P-NSGA-II turns out to be statistically significant, ranking in the
top two positions on most datasets. We also reported the empirical attainment
plots, speedup analysis, mean AUC obtained by the most repeated feature
subset and by the least cardinal feature subset with the highest AUC, and
diversity analysis using hypervolume.
Comment: 32 pages, 11 Tables, 8 figures
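The bi-objective formulation in the abstract above (maximize AUC, minimize the number of selected features) can be sketched with a small Pareto-dominance filter, which is the building block behind non-dominated sorting. This is only an illustration of the dominance relation, not the parallel framework described in the abstract; the function names and the candidate (AUC, cardinality) pairs are made up for the example.

```python
def dominates(a, b):
    """a and b are (auc, n_features) pairs. a dominates b if it is no worse
    on both objectives (higher AUC, fewer features) and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]

# Hypothetical feature subsets summarized as (AUC, cardinality).
candidates = [(0.90, 5), (0.92, 8), (0.88, 5), (0.92, 10), (0.85, 2)]
front = pareto_front(candidates)
```

Here (0.88, 5) is dominated by (0.90, 5) and (0.92, 10) is dominated by (0.92, 8), so three trade-off solutions remain. A wrapper method would obtain each AUC by training a classifier, such as the logistic regression mentioned in the abstract, on the corresponding feature subset.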