7,617 research outputs found
Using Feature Selection with Machine Learning for Generation of Insurance Insights
Insurance is a data-rich sector, hosting large volumes of customer data that are analysed to evaluate risk. Machine learning techniques are increasingly used in the effective management of insurance risk. Insurance datasets by their nature, however, are often of poor quality, with noisy subsets of data (or features). Choosing the right features is a significant pre-processing step in the creation of machine learning models, and the inclusion of irrelevant and redundant features has been shown to degrade the performance of learning models. In this article, we propose a framework for improving predictive machine learning techniques in the insurance sector via the selection of relevant features. The experimental results, based on five publicly available real insurance datasets, show the importance of applying feature selection to remove noisy features before applying machine learning techniques, allowing the algorithms to focus on influential features. An additional business benefit is the identification of the most and least important features in the datasets. These insights can prove useful for decision making and strategy development in business problems beyond the direct target of the downstream algorithms. In our experiments, machine learning techniques based on the features suggested by feature selection algorithms outperformed the same techniques applied to the full feature set. Specifically, subsets containing 20% and 50% of the features in our five datasets improved downstream clustering and classification performance compared to the full datasets. This indicates the potential of feature selection in the insurance sector both to improve model performance and to highlight influential features for business insights.
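As a rough illustration of the pipeline described above, the sketch below ranks features by mutual information and compares a classifier trained on the top 20% and 50% of features against one trained on the full set. It assumes scikit-learn and uses a synthetic stand-in dataset; the estimator, scoring function and thresholds are illustrative choices, not the framework from the article.

# Hedged sketch: does a classifier trained on the top 20% / 50% of features
# (ranked by mutual information) match or beat one trained on all features?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an insurance dataset with noisy/redundant features.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           n_redundant=20, random_state=0)

def mean_accuracy(features):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, features, y, cv=5).mean()

print("all features  :", round(mean_accuracy(X), 3))
for fraction in (0.5, 0.2):
    k = max(1, int(fraction * X.shape[1]))
    selected = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)
    print(f"top {fraction:.0%} features:", round(mean_accuracy(selected), 3))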
Automated Classification for Electrophysiological Data: Machine Learning Approaches for Disease Detection and Emotion Recognition
Smart healthcare is a health service system that uses technologies such as artificial intelligence and big data to alleviate pressure on healthcare systems. Much recent research has focused on automatic disease diagnosis and recognition; our research concentrates on the automatic classification of electrophysiological signals, which are measurements of the body's electrical activity. Specifically, for electrocardiogram (ECG) and electroencephalogram (EEG) data, we develop a series of algorithms for automatic cardiovascular disease (CVD) classification, emotion recognition and seizure detection.
Using ECG signals obtained from wearable devices, the candidate developed novel signal processing and machine learning methods for continuous monitoring of heart conditions. Compared with traditional methods that rely on devices in clinical settings, the methods developed in this thesis are much more convenient to use. To identify arrhythmia patterns in the noisy ECG signals obtained from wearable devices, CNN and LSTM models are used, and a wavelet-based CNN is proposed to enhance performance, as sketched below.
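A minimal sketch of how such a wavelet-based CNN can be wired, assuming PyWavelets and PyTorch: a continuous wavelet transform turns a 1-D ECG segment into a 2-D scalogram, which a small CNN then classifies. The wavelet, scales, network layout and class count are placeholders, not the architecture developed in the thesis.

# Hedged sketch: CWT scalogram of an ECG segment fed to a small 2-D CNN.
import numpy as np
import pywt
import torch
import torch.nn as nn

def ecg_to_scalogram(segment, scales=np.arange(1, 65), wavelet="morl"):
    coeffs, _ = pywt.cwt(segment, scales, wavelet)          # (64, n_samples)
    return torch.tensor(coeffs, dtype=torch.float32).unsqueeze(0)  # add channel dim

class WaveletCNN(nn.Module):
    def __init__(self, n_classes=5):                        # e.g. AAMI beat classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

segment = np.random.randn(360)                  # placeholder one-second beat at 360 Hz
logits = WaveletCNN()(ecg_to_scalogram(segment).unsqueeze(0))   # batch of one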
An emotion recognition method based on a single-channel ECG is developed, in which a novel exploitative and explorative GWO-SVM algorithm is proposed to achieve high-performance emotion classification. Its attraction is that it learns the SVM hyperparameters automatically while preventing the search from falling into local solutions, thereby achieving better performance than existing algorithms.
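For illustration only, a simplified grey wolf optimizer (the plain GWO update rule, not the exploitative and explorative variant proposed in the thesis) searching SVM hyperparameters C and gamma by cross-validated accuracy; the scikit-learn breast cancer data stands in for ECG-derived emotion features.

# Hedged sketch: plain GWO tuning SVM hyperparameters in log10 space.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)      # placeholder feature matrix
rng = np.random.default_rng(0)
low, high = np.log10([1e-2, 1e-4]), np.log10([1e3, 1e1])   # bounds for log10(C), log10(gamma)

def fitness(pos):
    C, gamma = 10.0 ** pos
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

wolves = rng.uniform(low, high, size=(8, 2))    # the pack of candidate solutions
scores = np.array([fitness(w) for w in wolves])
best_pos, best_score = wolves[np.argmax(scores)].copy(), scores.max()
for t in range(15):
    a = 2 * (1 - t / 15)                        # exploration factor decays to 0
    leaders = wolves[np.argsort(scores)[::-1][:3]]          # alpha, beta, delta wolves
    for i in range(len(wolves)):
        steps = []
        for leader in leaders:
            A = a * (2 * rng.random(2) - 1)     # A = 2*a*r1 - a
            C_vec = 2 * rng.random(2)           # C = 2*r2
            steps.append(leader - A * np.abs(C_vec * leader - wolves[i]))
        wolves[i] = np.clip(np.mean(steps, axis=0), low, high)
        scores[i] = fitness(wolves[i])
        if scores[i] > best_score:
            best_pos, best_score = wolves[i].copy(), scores[i]
print("best C=%.3g, gamma=%.3g, accuracy=%.3f" % (10 ** best_pos[0], 10 ** best_pos[1], best_score))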
A novel EEG-based seizure detector is developed, in which the EEG signals are transformed into the spectral-temporal domain so that the dimensionality of the input features to the CNN is significantly reduced, while the detector still achieves superior detection performance.
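The dimensionality-reduction argument can be sketched with a short-time Fourier transform: a long raw EEG window becomes a compact spectral-temporal map before it reaches the CNN. The sampling rate, window length and frequency band below are assumptions for illustration, not the settings used in the thesis.

# Hedged sketch: spectral-temporal transform shrinking the CNN's input.
import numpy as np
from scipy.signal import spectrogram

fs = 256                                        # assumed sampling rate in Hz
eeg = np.random.randn(30 * fs)                  # placeholder 30 s single-channel window
freqs, times, Sxx = spectrogram(eeg, fs=fs, nperseg=fs, noverlap=fs // 2)
band = freqs <= 30.0                            # keep a clinically relevant low band (assumption)
feature_map = np.log1p(Sxx[band])               # spectral-temporal image fed to the CNN
print(eeg.shape, "->", feature_map.shape)       # (7680,) -> (31, 59), roughly a 4x reduction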
Dense semantic labeling of sub-decimeter resolution images with convolutional neural networks
Semantic labeling (or pixel-level land-cover classification) in ultra-high
resolution imagery (< 10cm) requires statistical models able to learn high
level concepts from spatial data, with large appearance variations.
Convolutional Neural Networks (CNNs) achieve this goal by learning
discriminatively a hierarchy of representations of increasing abstraction.
In this paper we present a CNN-based system relying on a
downsample-then-upsample architecture. Specifically, it first learns a rough
spatial map of high-level representations by means of convolutions and then
learns to upsample them back to the original resolution by deconvolutions. By
doing so, the CNN learns to densely label every pixel at the original
resolution of the image. This results in many advantages, including i)
state-of-the-art numerical accuracy, ii) improved geometric accuracy of
predictions and iii) high efficiency at inference time.
We test the proposed system on the Vaihingen and Potsdam sub-decimeter
resolution datasets, involving semantic labeling of aerial images of 9cm and
5cm resolution, respectively. These datasets are composed of many large and
fully annotated tiles allowing an unbiased evaluation of models making use of
spatial information. We do so by comparing two standard CNN architectures, standard patch classification and prediction of local label patches employing only convolutions, with the proposed full patch labeling employing deconvolutions. All the systems compare favorably with or outperform a
state-of-the-art baseline relying on superpixels and powerful appearance
descriptors. The proposed full patch labeling CNN outperforms these models by a
large margin, also showing a very appealing inference time.
Comment: Accepted in IEEE Transactions on Geoscience and Remote Sensing, 2017
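A minimal sketch of the downsample-then-upsample idea in PyTorch: strided convolutions learn a coarse map of high-level features, and transposed convolutions (deconvolutions) learn to upsample it back to a per-pixel label map at the input resolution. The layer sizes and the six-class output are illustrative and far smaller than the networks evaluated in the paper.

# Hedged sketch: tiny conv-deconv network for dense (per-pixel) labeling.
import torch
import torch.nn as nn

class ConvDeconv(nn.Module):
    def __init__(self, n_classes=6):                   # e.g. the ISPRS label set
        super().__init__()
        self.encoder = nn.Sequential(                  # 4x spatial downsampling
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                  # learned upsampling back to input size
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))           # per-pixel class scores

patch = torch.randn(1, 3, 256, 256)                    # one 3-band image patch
print(ConvDeconv()(patch).shape)                       # torch.Size([1, 6, 256, 256])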
A Contribution to land cover and land use mapping in Portugal with multi-temporal Sentinel-2 data and supervised classification
Dissertation presented as the partial requirement for obtaining a Master's degree in Geographic Information Systems and Science.
Remote sensing techniques have been widely employed to map and monitor land cover and land use, important elements for the description of the environment. The current land cover and land use mapping paradigm takes advantage of a variety of data options with appropriate spatial, spectral and temporal resolutions, along with advances in technology. This has enabled the creation of automated data processing workflows, integrated with classification algorithms, to accurately map large areas with multi-temporal data. In Portugal, the General Directorate for Territory (DGT) is developing an operational Land Cover Monitoring System (SMOS), which includes an annual land cover cartography product (COSsim) based on an automatic process using supervised classification of multi-temporal Sentinel-2 data. In this context, a range of experiments is being conducted to improve map accuracy and classification efficiency. This study provides a contribution to DGT's work. A classification of the biogeographic region of Trás-os-Montes in the north of Portugal was performed for the agricultural year of 2018 using Random Forest and an intra-annual multi-temporal Sentinel-2 dataset, with stratification of the study area and a combination of manually and automatically extracted training samples, the latter based on existing reference datasets. This classification was compared to a benchmark classification conducted without stratification and with training data collected automatically only. In addition, an assessment of the influence of training sample size on classification accuracy was conducted. The main focus of this study was to investigate whether the use of
classification uncertainty to create an improved training dataset could increase classification accuracy. A process of extracting additional training samples from areas of high classification uncertainty was conducted, then a new classification was performed and the results were compared. Classification accuracy assessment for all proposed experiments was conducted using overall accuracy, precision, recall and F1-score. The use of stratification and the combination of training strategies resulted in a classification accuracy of 66.7%, in contrast to 60.2% for the benchmark classification. Although the difference was not statistically significant, visual inspection of both maps indicated that stratification and the introduction of manual training contributed to mapping land cover more accurately in some areas. Regarding the influence of sample size on classification accuracy, the results indicated a small difference, not statistically significant, even after a reduction of over 90% in the sample size. This supports the findings of other studies which suggested that Random Forest has low sensitivity to variations in training sample size. However, the results might have been influenced by the training strategy employed, which uses spectral subclasses, thus creating spectral diversity in the samples independently of their size. With respect to the use of classification uncertainty to improve the training sample, a slight increase in accuracy of approximately 1% was observed, which was not statistically significant. This result could have been affected by limitations in the process of collecting additional sampling units, which resulted in a lack of additional training for some classes (e.g. agriculture) and an overall imbalanced training dataset. Additionally, some classes had their additional training sampling units collected from a limited number of polygons, which could limit the spectral diversity of the new samples. Nevertheless, visual inspection of the map suggested that the new training contributed to reducing confusion between some classes, improving map agreement with ground truth. Further investigation could explore more deeply the potential of classification uncertainty, especially focusing on problems related to the collection of the additional samples.
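A compact sketch of the uncertainty idea investigated above, assuming scikit-learn: a Random Forest is trained on multi-temporal Sentinel-2 features, per-pixel classification uncertainty is taken as one minus the margin between the two highest class probabilities, and the most uncertain pixels are flagged as candidates for additional training samples. The arrays, class count and thresholds are placeholders, not DGT's operational workflow.

# Hedged sketch: margin-based classification uncertainty for sample augmentation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_pixels, n_features = 10_000, 40              # e.g. 10 dates x 4 bands (illustrative)
X = np.random.rand(n_pixels, n_features)       # stand-in for the Sentinel-2 time series
y = np.random.randint(0, 8, n_pixels)          # stand-in for the land cover labels

initial = np.random.rand(n_pixels) < 0.05      # mask of initially labelled pixels
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X[initial], y[initial])

proba = rf.predict_proba(X)                    # per-pixel class probabilities
top2 = np.sort(proba, axis=1)[:, -2:]          # second-best and best probability
uncertainty = 1.0 - (top2[:, 1] - top2[:, 0])  # small margin -> high uncertainty
candidates = np.argsort(uncertainty)[::-1][:500]   # pixels to photo-interpret and add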
Spectral Feature Selection for Data Mining
This timely introduction to spectral feature selection illustrates the potential of this powerful dimensionality reduction technique in high-dimensional data processing. It presents the theoretical foundations of spectral feature selection, its connections to other algorithms, and its use in handling both large-scale data sets and small sample problems. Readers learn how to use spectral feature selection to solve challenging problems in real-life applications and discover how general feature selection and extraction are connected to spectral feature selection. Source code for the algorithms is available online.
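As a concrete taste of the spectral approach, a minimal Laplacian Score computation (one of the classic spectral feature selection criteria): features that best preserve the local structure of an RBF affinity graph over the samples receive the lowest scores. This is a generic textbook formulation, not the book's reference implementation.

# Hedged sketch: Laplacian Score feature ranking on an RBF affinity graph.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def laplacian_scores(X, gamma=1.0):
    S = rbf_kernel(X, gamma=gamma)             # sample-by-sample affinity graph
    d = S.sum(axis=1)                          # node degrees
    L = np.diag(d) - S                         # graph Laplacian
    scores = []
    for f in X.T:                              # one score per feature
        f_c = f - (f @ d) / d.sum()            # remove the degree-weighted mean
        scores.append((f_c @ L @ f_c) / (f_c @ (d * f_c)))
    return np.array(scores)                    # lower score = better feature

X = np.random.rand(200, 10)                    # placeholder data matrix
ranking = np.argsort(laplacian_scores(X))      # feature indices, best first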