An academic review: applications of data mining techniques in finance industry
With the growth of Internet technologies, data volumes are doubling every two years, faster than predicted by Moore's Law, and Big Data analytics has become particularly important for enterprise business. Modern computational technologies provide effective tools for understanding this accumulated data and leveraging it to gain insight into the finance industry. Because the finance industry manufactures no physical products, data has become the most valuable asset of financial organisations for deriving actionable business insights. Data mining techniques address this need by providing access to the right information at the right time. The finance industry applies these techniques in areas such as fraud detection, forecasting, credit rating, loan management, customer profiling, money-laundering detection, marketing and prediction of price movements, to name a few. This work surveys research on data mining techniques applied to the finance industry from 2010 to 2015. The review finds that stock prediction and credit rating have received the most attention from researchers, compared to loan prediction, money laundering and time series prediction. Owing to the dynamics, uncertainty and variety of the data, nonlinear mapping techniques have been studied more deeply than linear techniques. Hybrid methods have also been shown to be the most accurate in prediction, closely followed by neural network techniques. This survey provides an overview of applications of data mining techniques in the finance industry and a summary of methodologies for researchers in this area; in particular, it offers a good starting point for beginners who want to work in computational finance.
A review of tools, models and techniques for long-term assessment of distribution systems using OpenDSS and parallel computing
Many distribution system studies require long-term evaluations (e.g. for one year or more): energy loss minimization, reliability assessment, or optimal rating of distributed energy resources should be based on long-term simulations of the distribution system. This paper summarizes the work carried out by the authors to perform long-term studies of large distribution systems using an OpenDSS-MATLAB environment and parallel computing. The paper details the tools, models, and procedures used by the authors for optimal allocation of distributed resources, reliability assessment of distribution systems with and without distributed generation, optimal rating of energy storage systems, and impact analysis of the solid-state transformer. Since in most cases the developed procedures were implemented for a multicore installation, a summary of the capabilities required for parallel computing applications is also included. The approaches chosen for these studies used the traditional Monte Carlo method, clustering techniques, or genetic algorithms. Custom-made models for use with OpenDSS were required in some studies; a summary of the characteristics of those models and their implementation is also included.
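The authors' OpenDSS-MATLAB tool chain is not reproduced here, but the parallel Monte Carlo pattern the paper relies on can be sketched in a few lines of Python; the worker simulate_year and its internals are illustrative placeholders, not the paper's models.

    # Minimal sketch: parallel Monte Carlo evaluation of yearly scenarios.
    # simulate_year is a hypothetical stand-in for one long-term power-flow
    # run (in the paper this would be driven through OpenDSS).
    import random
    from multiprocessing import Pool

    def simulate_year(seed):
        rng = random.Random(seed)
        # Placeholder physics: random hourly load multipliers accumulated
        # into a crude proxy for annual energy losses (arbitrary units).
        return sum(0.02 * rng.uniform(0.4, 1.2) ** 2 for _ in range(8760))

    if __name__ == "__main__":
        with Pool(processes=4) as pool:                   # one worker per core
            losses = pool.map(simulate_year, range(100))  # 100 Monte Carlo scenarios
        print(sum(losses) / len(losses))                  # expected annual losses

Each scenario is independent, so distributing them across cores with a process pool mirrors the multicore setup described in the paper.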
Analyzing Granger causality in climate data with time series classification methods
Attribution studies in climate science aim to scientifically ascertain the influence of climatic variations on natural or anthropogenic factors. Many of these studies adopt the concept of Granger causality to infer statistical cause-effect relationships while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested.
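For context on the traditional autoregressive baseline the article compares against, a minimal lag-based Granger-causality check can be sketched as follows; the variable names and lag order are illustrative and this is not the article's test suite.

    # Minimal sketch of a lag-based Granger-causality check: does adding
    # p lags of x reduce the residual error of an AR(p) model for y?
    import numpy as np

    def granger_rss(y, x, p=2):
        n = len(y)
        Y = y[p:]
        lags_y = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
        lags_x = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
        ones = np.ones((n - p, 1))

        def rss(design):
            beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
            resid = Y - design @ beta
            return float(resid @ resid)

        restricted = rss(np.hstack([ones, lags_y]))    # y's own past only
        full = rss(np.hstack([ones, lags_y, lags_x]))  # plus x's past
        return restricted, full  # x "Granger-causes" y if the full model fits markedly better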
Novel methods for multi-view learning with applications in cyber security
Modern data is complex. It exists in many different forms, shapes and kinds. Vectors, graphs, histograms, sets, intervals, etc.: they each have distinct and varied structural properties. Tailoring models to the characteristics of various feature representations has been the subject of considerable research. In this thesis, we address the challenge of learning from data that is described by multiple heterogeneous feature representations.
This situation arises often in cyber security contexts. Data from a computer network can be represented by a graph of user authentications, a time series of network traffic, a tree of process events, etc. Each representation provides a complementary view of the holistic state of the network, and so data of this type is referred to as multi-view data. Our motivating problem in cyber security is anomaly detection: identifying unusual observations in a joint feature space, which may not appear anomalous marginally.
Our contributions include the development of novel supervised and unsupervised methods, which are applicable not only to cyber security but to multi-view data in general. We extend the generalised linear model to operate in a vector-valued reproducing kernel Hilbert space implied by an operator-valued kernel function, which can be tailored to the structural characteristics of multiple views of data. This is a highly flexible algorithm, able to predict a wide variety of response types. A distinguishing feature is the ability to simultaneously identify outlier observations with respect to the fitted model. Our proposed unsupervised learning model extends multidimensional scaling to directly map multi-view data into a shared latent space. This vector embedding captures both commonalities and disparities that exist between multiple views of the data. Throughout the thesis, we demonstrate our models using real-world cyber security datasets.
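The thesis's embedding model itself is not detailed in this abstract; as a simple point of comparison, a naive multi-view baseline is classical multidimensional scaling applied to distances averaged across views (all names below are illustrative).

    # Naive baseline sketch: classical MDS on squared distances averaged
    # over views; it yields one shared latent space but, unlike the
    # proposed model, cannot separate commonalities from disparities.
    import numpy as np

    def classical_mds(D, k=2):
        """Embed points into R^k from a matrix of squared distances D."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ D @ J                          # double-centred Gram matrix
        vals, vecs = np.linalg.eigh(B)
        top = np.argsort(vals)[::-1][:k]
        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

    def shared_embedding(distance_matrices, k=2):
        D = np.mean([d ** 2 for d in distance_matrices], axis=0)
        return classical_mds(D, k)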
On variants of stochastic gradient descent
Stochastic Gradient Descent (SGD) has played a crucial role in the success of modern machine learning methods. The popularity of SGD arises due to its ease of implementation, low memory and computational requirements, and applicability to a wide variety of optimization problems. However, SGD suffers from numerous issues; chief amongst them are high variance, slow rate of convergence, poor generalization, non-robustness to outliers, and poor performance for imbalanced classification. In this thesis, we propose variants of stochastic gradient descent, to tackle one or more of these issues for different problem settings.
In the first chapter, we analyze the trade-off between variance and complexity to improve the convergence rate of SGD. A common alternative in the literature to SGD is Stochastic Variance Reduced Gradient (SVRG), which achieves linear convergence. However, SVRG involves the computation of a full gradient every few epochs, which is often intractable. We propose the Cheap Stochastic Variance Reduced Gradient (CheapSVRG) algorithm that attains linear convergence up to a neighborhood around the optimum without requiring a full gradient computation step.
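The abstract does not spell out CheapSVRG; for context, the SVRG baseline it relaxes can be sketched as below, where the per-epoch full-gradient pass is exactly the costly step CheapSVRG avoids (grad_i is a hypothetical per-sample gradient oracle).

    # Sketch of standard SVRG: a full gradient is computed once per outer
    # epoch and used as a control variate for each stochastic step.
    import numpy as np

    def svrg(w0, grad_i, n, lr=0.1, epochs=10, inner_steps=None):
        inner_steps = inner_steps or n
        rng = np.random.default_rng(0)
        w = w0.copy()
        for _ in range(epochs):
            w_snap = w.copy()
            full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)  # costly pass
            for _ in range(inner_steps):
                i = rng.integers(n)
                g = grad_i(w, i) - grad_i(w_snap, i) + full_grad  # variance-reduced gradient
                w -= lr * g
        return w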
In the second chapter, we compare the generalization capabilities of adaptive and non-adaptive methods for over-parameterized linear regression. Of the many possible interpolating solutions, SGD tends to gravitate towards the one with minimum l2-norm, while adaptive methods do not. We provide specific conditions on the pre-conditioner matrices under which a subclass of adaptive methods has the same generalization guarantees as SGD for over-parameterized linear regression. With synthetic examples and real data, we show that minimum-norm solutions are not, by themselves, a reliable certificate of better generalization.
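The minimum-norm behaviour of SGD referred to above is easy to reproduce in a toy over-parameterized regression; the dimensions and step size below are illustrative.

    # Toy check: SGD started from zero on an over-parameterized least-squares
    # problem converges to the minimum l2-norm interpolating solution,
    # i.e. the pseudoinverse solution.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20, 100                            # fewer samples than features
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    w = np.zeros(d)
    for _ in range(20000):                    # plain SGD on squared loss
        i = rng.integers(n)
        w -= 0.01 * (X[i] @ w - y[i]) * X[i]

    w_min_norm = np.linalg.pinv(X) @ y        # minimum-norm interpolating solution
    print(np.linalg.norm(w - w_min_norm))     # near zero

Because every update is a multiple of some row of X, the iterates stay in the row space of X, which is why the interpolating solution SGD reaches is the minimum-norm one.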
In the third chapter, we propose a simple variant of SGD that guarantees robustness. Instead of updating with a single sample, this variant, MKL-SGD, draws a mini-batch and updates with the sample that has the lowest loss. For the noiseless framework with and without outliers, we provide conditions under which MKL-SGD converges to a provably better solution than SGD in the worst case. We also provide a standard rate-of-convergence analysis for both noiseless and noisy settings.
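A single step of the min-loss selection rule described above can be sketched as follows; the squared loss and the names are illustrative.

    # Sketch of the min-loss selection rule: draw a mini-batch, keep only
    # the sample with the smallest current loss, and take an SGD step on it.
    import numpy as np

    def min_loss_sgd_step(w, X, y, rng, lr=0.1, batch_size=8):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        losses = 0.5 * (X[idx] @ w - y[idx]) ** 2  # per-sample squared losses
        i = idx[np.argmin(losses)]                 # lowest-loss sample in the batch
        return w - lr * (X[i] @ w - y[i]) * X[i]   # ordinary SGD step on that sample

Outliers tend to have large losses throughout training, so they are rarely the minimum-loss sample in a batch and rarely drive an update, which is the source of the robustness claim.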
In the final chapter, we tackle the challenges introduced by imbalanced class distributions in SGD. Instead of using every sample to update the parameters, our proposed Balancing SGD (B-SGD) algorithm rejects samples with low loss, as they are redundant and play no role in determining the separating hyperplane. Imposing this label-dependent, loss-based thresholding scheme on incoming samples allows us to improve the rate of convergence and achieve better generalization.
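A single update under the loss-based thresholding idea can be sketched as follows; the logistic loss, labels in {-1, +1}, and per-class thresholds are illustrative assumptions, not the thesis's exact rule.

    # Sketch of label-dependent loss thresholding: a sample contributes an
    # update only if its current loss exceeds the threshold of its class.
    import numpy as np

    def b_sgd_style_step(w, x, label, thresholds, lr=0.1):
        loss = np.log1p(np.exp(-label * (x @ w)))            # logistic loss
        if loss < thresholds[label]:
            return w                                         # low-loss sample: rejected as redundant
        grad = -label * x / (1.0 + np.exp(label * (x @ w)))  # gradient of logistic loss
        return w - lr * grad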