460 research outputs found
Adaptive One-Dimensional Convolutional Neural Network for Tabular Data
This study introduces an innovative approach for tackling the credit risk prediction problem using an Adaptive One-Dimensional Convolutional Neural Network (1D CNN). The proposed methodology is designed for one-dimensional data, such as tabular data, through a combination of feed-forward and back-propagation phases. During the feed-forward phase, neuron outputs are computed by applying convolution operations to previous layer outputs, along with bias terms and activation functions. The subsequent back-propagation phase updates weights and biases to minimize prediction errors. A custom weight initialization algorithm tailored to Leaky ReLU activation is employed to enhance model adaptability. The core of the proposed algorithm lies in its ability to process each training data sample across layers, optimizing weights and biases to achieve accurate predictions. Comprehensive evaluations are conducted on various machine learning algorithms, including Gaussian Naive Bayes, Logistic Regression, ensemble methods, and neural networks. The proposed Adaptive 1D CNN emerges as the top performer, consistently surpassing other methods in precision, recall, F1-score, and accuracy. This success is attributed to its specialized weight initialization, effective back-propagation, and integration of 1D convolutional layers
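The feed-forward step described in this abstract (a convolution over the previous layer's outputs, plus a bias, passed through an activation) can be illustrated with a minimal sketch. The abstract does not specify the custom initialization, so the `he_init` helper below uses the standard He (Kaiming) scaling for Leaky ReLU as a stand-in. All function names and the default slope of 0.01 are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return x if x > 0 else slope * x


def he_init(kernel_size, slope=0.01, rng=None):
    """Draw kernel weights with He-style scaling for Leaky ReLU.

    Assumption: a stand-in for the paper's custom initializer, which the
    abstract does not describe. std = sqrt(2 / ((1 + slope^2) * fan_in)).
    """
    rng = rng or random.Random(0)
    std = math.sqrt(2.0 / ((1.0 + slope ** 2) * kernel_size))
    return [rng.gauss(0.0, std) for _ in range(kernel_size)]


def conv1d_forward(row, kernel, bias=0.0, slope=0.01):
    """Slide the kernel over one tabular row (stride 1, no padding).

    As is conventional in CNNs, this is a cross-correlation: the kernel
    is not flipped before the sliding dot product.
    """
    k = len(kernel)
    return [
        leaky_relu(sum(w * x for w, x in zip(kernel, row[i:i + k])) + bias, slope)
        for i in range(len(row) - k + 1)
    ]


# One feature row from a hypothetical credit-risk table.
features = [0.2, 1.5, -0.7, 3.1, 0.0]
activations = conv1d_forward(features, he_init(3), bias=0.1)
```

A row of five features and a kernel of size three yields three activations; the back-propagation phase would then adjust `kernel` and `bias` from the prediction error.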
Cross-Device Tracking: Matching Devices and Cookies
The number of computers, tablets and smartphones is increasing rapidly, which
entails the ownership and use of multiple devices to perform online tasks. As
people move across devices to complete these tasks, their identities become
fragmented. Understanding the usage and transition between those devices is
essential to develop efficient applications in a multi-device world. In this
paper we present a solution to deal with the cross-device identification of
users based on semi-supervised machine learning methods to identify which
cookies belong to an individual using a device. The method proposed in this
paper scored third in the ICDM 2015 Drawbridge Cross-Device Connections
challenge, demonstrating its good performance
An academic review: applications of data mining techniques in finance industry
With the development of Internet technologies, data volumes are doubling every two years, faster than predicted by Moore’s Law. Big Data analytics has become particularly important for enterprise business. Modern computational technologies provide effective tools to help understand the vast amounts of accumulated data and leverage this information to gain insights into the finance industry. Because the finance industry has no physical products to manufacture, data has become the most valuable asset of financial organisations for obtaining actionable business insights. This is where data mining techniques come to the rescue, by allowing access to the right information at the right time. These techniques are used by the finance industry in various areas such as fraud detection, intelligent forecasting, credit rating, loan management, customer profiling, money laundering, marketing and prediction of price movements, to name a few. This work surveys the research on data mining techniques applied to the finance industry from 2010 to 2015. The review finds that stock prediction and credit rating have received the most attention from researchers, compared to loan prediction, money laundering and time series prediction. Due to the dynamics, uncertainty and variety of data, nonlinear mapping techniques have been studied more deeply than linear techniques. It has also been shown that hybrid methods are more accurate in prediction, closely followed by the neural network technique. This survey could provide an overview of applications of data mining techniques in the finance industry and a summary of methodologies for researchers in this area. In particular, it could give a good picture of data mining techniques in computational finance for beginners who want to work in the field
A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams
Unlabelled data appear in many domains and are particularly relevant to
streaming applications, where even though data is abundant, labelled data is
rare. To address the learning problems associated with such data, one can
ignore the unlabelled data and focus only on the labelled data (supervised
learning); use the labelled data and attempt to leverage the unlabelled data
(semi-supervised learning); or assume some labels will be available on request
(active learning). The first approach is the simplest, yet the amount of
labelled data available will limit the predictive performance. The second
relies on finding and exploiting the underlying characteristics of the data
distribution. The third depends on an external agent to provide the required
labels in a timely fashion. This survey pays special attention to methods that
leverage unlabelled data in a semi-supervised setting. We also discuss the
delayed labelling issue, which impacts both fully supervised and
semi-supervised methods. We propose a unified problem setting, discuss the
learning guarantees and existing methods, and explain the differences between
related problem settings. Finally, we review the current benchmarking practices
and propose adaptations to enhance them
EC3: Combining Clustering and Classification for Ensemble Learning
Classification and clustering algorithms have been proved to be successful
individually in different contexts. Both of them have their own advantages and
limitations. For instance, although classification algorithms are more powerful
than clustering methods in predicting class labels of objects, they do not
perform well when there is a lack of sufficient manually labeled reliable data.
On the other hand, although clustering algorithms do not produce label
information for objects, they provide supplementary constraints (e.g., if two
objects are clustered together, it is more likely that the same label is
assigned to both of them) that one can leverage for label prediction of a set
of unknown objects. Therefore, systematic utilization of both these types of
algorithms together can lead to better prediction performance. In this paper,
we propose a novel algorithm, called EC3, that merges classification and
clustering together in order to support both binary and multi-class
classification. EC3 is based on a principled combination of multiple
classification and multiple clustering methods using an optimization function.
We theoretically show the convexity and optimality of the problem and solve it
by block coordinate descent method. We additionally propose iEC3, a variant of
EC3 that handles imbalanced training data. We perform an extensive experimental
analysis by comparing EC3 and iEC3 with 14 baseline methods (7 well-known
standalone classifiers, 5 ensemble classifiers, and 2 existing methods that
merge classification and clustering) on 13 standard benchmark datasets. We show
that our methods outperform other baselines for every single dataset, achieving
up to 10% higher AUC. Moreover, our methods are faster (1.21 times faster than
the best baseline) and more resilient to noise and class imbalance than the best
baseline method. Comment: 14 pages, 7 figures, 11 tables
SUPPORT OF MANAGERIAL DECISION MAKING BY TRANSDUCTIVE LEARNING
Transductive inference has been introduced as a novel paradigm towards building predictive classification models from empirical data. Such models are routinely employed to support decision making in, e.g., marketing, risk management and manufacturing. To that end, the characteristics of the new philosophy are reviewed and their implications for typical decision problems are examined. The paper's objective is to explore the potential of transductive learning for corporate planning. The analysis reveals two main factors that govern the applicability of transduction in business settings, decision scope and urgency. In a similar fashion, two major drivers for its effectiveness are identified and empirical experiments are undertaken to confirm their influence. The results evidence that transductive classifiers are well superior to their inductive counterparts if their specific application requirements are fulfilled
Extending the Tsetlin Machine With Integer-Weighted Clauses for Increased Interpretability
Despite significant effort, building models that are both interpretable and
accurate is an unresolved challenge for many pattern recognition problems. In
general, rule-based and linear models lack accuracy, while deep learning
interpretability is based on rough approximations of the underlying inference.
Using a linear combination of conjunctive clauses in propositional logic,
Tsetlin Machines (TMs) have shown competitive performance on diverse
benchmarks. However, to do so, many clauses are needed, which impacts
interpretability. Here, we address the accuracy-interpretability challenge in
machine learning by equipping the TM clauses with integer weights. The
resulting Integer Weighted TM (IWTM) deals with the problem of learning which
clauses are inaccurate and thus must team up to obtain high accuracy as a team
(low weight clauses), and which clauses are sufficiently accurate to operate
more independently (high weight clauses). Since each TM clause is formed
adaptively by a team of Tsetlin Automata, identifying effective weights becomes
a challenging online learning problem. We address this problem by extending
each team of Tsetlin Automata with a stochastic searching on the line (SSL)
automaton. In our novel scheme, the SSL automaton learns the weight of its
clause in interaction with the corresponding Tsetlin Automata team, which, in
turn, adapts the composition of the clause by the adjusting weight. We evaluate
IWTM empirically using five datasets, including a study of interpretability. On
average, IWTM uses 6.5 times fewer literals than the vanilla TM and 120 times
fewer literals than a TM with real-valued weights. Furthermore, in terms of
average F1-Score, IWTM outperforms simple Multi-Layered Artificial Neural
Networks, Decision Trees, Support Vector Machines, K-Nearest Neighbor, Random
Forest, XGBoost, Explainable Boosting Machines, and standard and real-value
weighted TMs. Comment: 20 pages, 10 figures
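The idea of a weighted linear combination of conjunctive clauses can be sketched as follows. This covers inference only: the Tsetlin Automata and SSL learning steps are omitted, and the data layout (literal tuples, polarity, weight) and function names are hypothetical choices for this sketch rather than the paper's notation. As a simplifying choice, an empty clause is treated as not firing.

```python
def clause_output(literals, x):
    """Evaluate a conjunctive clause on a binary input vector x.

    literals: list of (feature_index, negated) pairs.
    Sketch choice: an empty clause does not fire.
    """
    return bool(literals) and all(
        (not x[i]) if negated else bool(x[i]) for i, negated in literals
    )


def weighted_vote(clauses, x):
    """Sum signed integer weights of the clauses that fire, then threshold.

    clauses: list of (literals, polarity, weight), with polarity +1 or -1
    and a non-negative integer weight.
    """
    score = sum(
        polarity * weight
        for literals, polarity, weight in clauses
        if clause_output(literals, x)
    )
    return 1 if score >= 0 else 0


# Four hand-written clauses that together encode XOR of two inputs.
xor_clauses = [
    ([(0, False), (1, True)], +1, 3),   # x0 AND NOT x1 votes for class 1
    ([(0, True), (1, False)], +1, 3),   # NOT x0 AND x1 votes for class 1
    ([(0, False), (1, False)], -1, 2),  # x0 AND x1 votes against
    ([(0, True), (1, True)], -1, 2),    # NOT x0 AND NOT x1 votes against
]
predictions = [weighted_vote(xor_clauses, x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```

A single clause with a high weight can carry a decision alone, whereas low-weight clauses must fire together to reach the same score, which is the intuition behind needing fewer literals overall.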
Searching for Needles in the Cosmic Haystack
Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Minority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance.
Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection
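The SMOTE imbalance treatment mentioned in this abstract works by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The following is a minimal sketch of that interpolation step only, not the dissertation's pipeline. The function name, the default `k=2`, and the toy points are assumptions for illustration, and the sketch expects at least two minority samples.

```python
import random


def smote(minority, n_new, k=2, rng=None):
    """Generate n_new synthetic minority samples by interpolation.

    For each new sample: pick a random minority point, pick one of its k
    nearest minority neighbours (squared Euclidean distance), and place the
    synthetic point at a random position on the segment between them.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


# Three hypothetical minority-class feature vectors.
pulsar_like = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
extra = smote(pulsar_like, n_new=5)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling densifies the minority region rather than duplicating existing rows.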