    Towards An efficient unsupervised feature selection methods for high-dimensional data

    With the proliferation of data, the number of dimensions has increased significantly, producing what is known as high-dimensional data. This increase in dimensionality results in redundant and non-representative features, which pose challenges to existing machine learning algorithms. Firstly, they add extra processing time to the learning algorithms and therefore lengthen their running time. Secondly, they reduce the accuracy of the learning algorithms, which overfit the data because of these redundant and non-representative features. Lastly, they require greater storage capacity. This thesis is concerned with reducing the data dimensions for machine learning algorithms in order to improve their accuracy and run-time efficiency. The reduction is carried out by selecting a reduced set of representative and non-redundant features that approximates the original feature space. Three research issues are addressed to achieve the main aim of this thesis. The first research task addresses the accurate selection of representative features from high-dimensional data. An efficient and accurate similarity-based unsupervised feature selection method (called AUFS) is proposed to tackle high dimensionality by selecting representative features without the need for class labels. The proposed AUFS method extends the k-means clustering algorithm to partition the features into k clusters based on different similarity measures. Then, the proposed centroid-based feature selection method is used to select the representative features. The second research task is intended to select representative features in streaming features applications, where the number of features increases while the number of instances remains fixed. Streaming features applications pose challenges for feature selection methods.
These dynamic features applications have the following characteristics: a) features are generated sequentially and are processed one by one upon arrival while the number of instances/points remains fixed; and b) the complete feature space is not known in advance. A new method, Unsupervised Feature Selection for Streaming Features (UFSSF), is proposed to select representative features given these characteristics. UFSSF further extends the k-means clustering algorithm to decide incrementally whether to add a newly arrived feature to the existing set of representative features. Features that are not representative are discarded. The last research task involves reducing the dimensionality of multi-view data where both the number of features and the number of instances can increase over time. Multi-view learning provides complementary information for machine learning algorithms. However, it results in high dimensionality because the data is considered from different views: every extra view adds extra dimensions. In particular, existing solutions assume that the number of views is static; however, this is not realistic in real applications, where new views can be added. Therefore, an Online Unsupervised Feature Selection for Dynamic Views (OUDVFS) method is proposed. As we are targeting unsupervised learning, we propose a new clustering-based feature selection method that incrementally clusters the views. The set of selected representative features is updated at each clustering step.
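
The clustering-then-centroid idea described in this abstract can be illustrated with a minimal sketch. This is not the thesis's AUFS implementation: the function name `select_representative_features`, the use of Euclidean distance on standardised features as the similarity measure, and the parameters `n_iter` and `seed` are all assumptions made for illustration. Each feature (column) is treated as a point, the features are grouped with plain k-means, and the feature closest to each centroid is kept as that cluster's representative.

```python
import numpy as np

def select_representative_features(X, k, n_iter=50, seed=0):
    """Cluster the columns of X with plain k-means and return, for each
    non-empty cluster, the index of the feature closest to its centroid.
    Illustrative sketch only; not the AUFS method itself."""
    rng = np.random.default_rng(seed)
    # Treat each feature (column) as a point and standardise it so that
    # scale differences do not dominate the Euclidean distance (assumption).
    F = np.asarray(X, dtype=float).T
    F = (F - F.mean(axis=1, keepdims=True)) / (F.std(axis=1, keepdims=True) + 1e-12)
    # Initialise centroids with k distinct, randomly chosen features.
    centroids = F[rng.choice(F.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every feature to every centroid.
        dist = np.linalg.norm(F[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = F[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute the assignment against the final centroids.
    dist = np.linalg.norm(F[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    selected = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size:
            # Keep the member feature nearest to the cluster centroid.
            selected.append(int(idx[dist[idx, j].argmin()]))
    return sorted(selected)
```

Calling `select_representative_features(X, k=10)` on an instances-by-features array would return at most ten column indices. No class labels are used, which is what makes the selection unsupervised.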

    Data mining of vehicle telemetry data

    Driving is a safety-critical task that requires a high level of attention and workload from the driver. Despite this, people often perform secondary tasks such as eating or using a mobile phone, which increase workload levels and divert cognitive and physical attention from the primary task of driving. As well as these distractions, the driver may also be overloaded for other reasons, such as dealing with an incident on the road or holding conversations in the car. One solution to this distraction problem is to limit the functionality of in-car devices while the driver is overloaded. This can take the form of withholding an incoming phone call or delaying the display of a non-urgent piece of information about the vehicle. In order to design and build these adaptations in the car, we must first have an understanding of the driver's current level of workload. Traditionally, driver workload has been monitored using physiological sensors or camera systems in the vehicle. However, physiological systems are often intrusive, and camera systems can be expensive and are unreliable in poor light conditions. It is important, therefore, to use methods that are non-intrusive, inexpensive and robust, such as sensors already installed on the car and accessible via the Controller Area Network (CAN)-bus. This thesis presents a data mining methodology for this problem, as well as for others in domains with similar types of data, such as human activity monitoring. It focuses on the variable selection stage of the data mining process, where inputs are chosen for models to learn from and make inferences. Selecting inputs from vehicle telemetry data is challenging because there are many irrelevant variables with a high level of redundancy. Furthermore, data in this domain often contains biases because only relatively small amounts can be collected and processed, leading to some variables appearing more relevant to the classification task than they really are.
Over the course of this thesis, a detailed variable selection framework that addresses these issues for telemetry data is developed. A novel blocked permutation method is developed and applied to mitigate biases when selecting variables from potentially biased temporal data. This approach is computationally infeasible when variable redundancies are also considered, and so a novel permutation redundancy measure with similar properties is proposed. Finally, a known redundancy structure between features in telemetry data is used to enhance the feature selection process in two ways. First, the benefits of performing raw signal selection, feature extraction, and feature selection in different orders are investigated. Second, a two-stage variable selection framework is proposed and the two permutation-based methods are combined. Throughout the thesis, it is shown through classification evaluations and inspection of the features that these permutation-based selection methods are appropriate for selecting features from CAN-bus data.
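
To make the idea of permuting temporal data in blocks concrete, here is a minimal sketch of a generic block permutation test. It is not the thesis's method: the function name `blocked_permutation_pvalue`, the choice of absolute Pearson correlation as the relevance measure, and the parameters `block_len`, `n_perm` and `seed` are assumptions for illustration. The target series is cut into contiguous blocks and the blocks are shuffled whole, so short-range temporal structure inside each block is preserved while the alignment with the candidate variable is broken.

```python
import numpy as np

def blocked_permutation_pvalue(x, y, block_len, n_perm=200, seed=0):
    """Estimate how often a block-shuffled copy of y is at least as strongly
    correlated with x as the real y is (a permutation p-value).
    Illustrative sketch only; not the thesis's blocked permutation method."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Relevance measure (assumption): absolute Pearson correlation.
    observed = abs(np.corrcoef(x, y)[0, 1])
    # Contiguous blocks; the last block may be shorter than block_len.
    blocks = [y[i:i + block_len] for i in range(0, len(y), block_len)]
    exceed = 0
    for _ in range(n_perm):
        # Shuffle the order of the blocks, not the samples inside them.
        order = rng.permutation(len(blocks))
        y_perm = np.concatenate([blocks[i] for i in order])
        if abs(np.corrcoef(x, y_perm)[0, 1]) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

A small returned value suggests that the observed relevance of `x` is unlikely to arise from temporal structure alone. The block length is a tuning choice: longer blocks preserve more autocorrelation but yield fewer distinct permutations.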