
    Development and Identification of Metrics to Predict the Impact of Dimension Reduction Techniques on Classical Machine Learning Algorithms for Still Highway Images

    We are witnessing an influx of data: images, text, video, and more. Their high dimensionality and large volume make it challenging to apply machine learning to obtain actionable insight. This thesis explores several aspects pertaining to dimension reduction: dimension reduction methods, metrics to measure distortion, image preprocessing, and more. Faster training and inference times on reduced data, and smaller models that can be deployed on commodity hardware, are critical advantages of dimension reduction. For this study, classical machine learning methods were explored owing to their solid mathematical foundation and interpretability. The dataset used is a time series of images from several camera feeds observing the traffic, weather and road conditions along highways. The time-series nature of the dataset gives rise to interesting questions which are investigated in this work. For instance, can machine learning models trained on past data be used on future camera feed data? This is highly desirable and yet difficult due to changing weather, road conditions, traffic conditions and scenery. Can dimension reduction models obtained from past data be used for reducing the dimensionality of future data? This thesis also examines the difference in the performance of machine learning methods before and after the application of dimension reduction. It tests some existing metrics that measure the quality of a dimension-reduced data set and introduces several new ones. It also examines the application of image pre-processing methods to boost the performance of classifiers. The classification performance with and without random sampling has been studied as well.
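A hypothetical sketch of the kind of pipeline described above (not the thesis's actual methods or metrics): synthetic "image" feature vectors are reduced with PCA, one simple distortion measure is reported, and a classical classifier is compared before and after reduction. All data shapes and labels are placeholders.

```python
# Hypothetical sketch: PCA reduction, a simple distortion metric, and a
# classical classifier trained on original versus reduced representations.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))     # stand-in for flattened highway-camera images
y = rng.integers(0, 2, size=1000)     # stand-in for a binary label such as clear/snowy

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pca = PCA(n_components=50).fit(X_tr)  # reduction model fit on "past" data only
Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)

# One possible distortion metric: mean squared reconstruction error on held-out data.
recon_error = np.mean((X_te - pca.inverse_transform(Z_te)) ** 2)

acc_full = accuracy_score(y_te, LinearSVC(dual=False).fit(X_tr, y_tr).predict(X_te))
acc_reduced = accuracy_score(y_te, LinearSVC(dual=False).fit(Z_tr, y_tr).predict(Z_te))
print(f"reconstruction error: {recon_error:.4f}")
print(f"accuracy original: {acc_full:.3f}  accuracy reduced: {acc_reduced:.3f}")
```

Applying the same fitted PCA model to later batches without refitting mirrors the question of reusing a reduction model learned on past data for future camera feeds.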

    Feature Selection Methods with Applications in Electrical Load Forecasting

    The purpose of this thesis is two-fold: to implement and evaluate a feature selection method, the Fast Correlation-Based Filter (FCBF) by Yu et al., applied to a meteorological data set consisting of 19 weather variables from 606 locations in Scandinavia, and to investigate whether geography can be exploited in the search for relevant features. Four areas are chosen as target areas, where load prediction error is evaluated as a measure of goodness. A subset of the total data set is used to lower the computation time: only Swedish locations, and only data from SMHI. The impact of using different subsets of weather features, as well as of selecting features from several locations, is investigated using FCBF and epsilon-Support Vector Regression. A modification to the FCBF algorithm is tested in one of the experiments, using Pearson correlation in place of symmetrical uncertainty. An investigation of how the relationships between features change with distance is performed, and the results are then used to motivate a greedy feature selection method. FCBF, even when implemented with the naive approximation of marginal and conditional entropy, filtered the total data set from 3180 down to approximately 20 features with a prediction error of less than 1% for three of the target areas and 1.71% for the fourth. Further tests lowered the number of features even further without significantly affecting the prediction error. Using FCBF to rank the weather variables for a single area proved less than optimal, which may be attributed to the many extremely small intra-feature SU values. Selecting locations based on distance from the target area resulted in prediction errors better than random sampling and comparable to the filter, while still keeping the number of features low. The very best feature selection results were only slightly lower than a base case, suggesting that the present experimental setting may not be enough to draw definitive conclusions regarding the efficacy of the selection methods. Two possible contributing factors are the unoptimized model used and the choice to investigate the impact on average load over a 24-hour window. Future studies may also wish to extend the geographical investigation to use coordinates or direction in conjunction with distance from the target area, as some indication of latitude-dependent behavior was found, most likely contributed by the elongated shape of Sweden.

Finding useful information in a large data set to better predict the consumption of electricity: data describing the weather at different places in Scandinavia shows a lot of redundancy, which may affect its usefulness in predicting future electricity consumption. This master's thesis tests two methods for removing useless or harmful information. Predicting the consumption of electricity on a city-wide scale allows those who manage equipment, generate and store electricity, and buy and sell energy to better plan the maintenance of their equipment, and to ensure that there are enough electrons flowing through your wall socket when you plug in your new computer. The predictions are done using artificial intelligence methods that look for patterns in data that can be used to determine the magnitude of electric consumption in the future.
One of the main problems in performing accurate predictions is finding the right data to use; choosing the wrong variables may lead to poor predictions, which in turn may lead to equipment failure or other costly decisions for the energy providers and utility companies. The data sets typically used for these kinds of predictions describe different aspects of future weather. Since weather is a natural phenomenon that varies depending on how far apart two points are, we may assume that there will be a lot of data showing basically the same thing; the weather in Lund is probably not very different from the weather in Malmö, while the weather in Umeå might differ much more from the other two cities. In this case, we call the data from Lund and Malmö redundant in light of each other. The goal of this thesis has been to investigate methods that sort through the data set in order to find useful data, which we call relevant, and remove redundant information. Two approaches are taken. First, we look at the properties of the data itself and measure relevancy and redundancy by seeing whether there is a significant similarity between pairs of variables. For this purpose, an algorithm called the Fast Correlation-Based Filter is implemented and evaluated. The filter searches through the data set without considering all possible combinations of variables in order to make it faster. Second, we look at the possibility of choosing relevant data based on geographical location. Motivated by the fact that weather data from places close to each other are very similar, it is possible to sort through the data set just by using distance from the city where electrical consumption is being predicted. Both methods show promising results when tested on predicting the daily average electricity consumption for four areas, managing to remove over 99% of the data while still producing accurate predictions. Further tests should investigate the computations performed for the statistical measure used, as well as see how useful the methods are on data of higher resolution.
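A minimal sketch of the FCBF idea follows, assuming already-discretized features and a simple plug-in entropy estimate; the thesis's entropy approximation, thresholds and Pearson-correlation variant are not reproduced, and the toy data is hypothetical.

```python
# Toy FCBF-style filter: keep features with high symmetrical uncertainty (SU)
# against the target, then drop features that an already-kept, more relevant
# feature makes redundant.
import numpy as np
from collections import Counter

def entropy(seq):
    """Plug-in entropy (bits) of a discrete sequence."""
    p = np.array(list(Counter(seq).values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x.tolist()), entropy(y.tolist())
    mi = hx + hy - entropy(list(zip(x.tolist(), y.tolist())))
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0

def fcbf(X, y, delta=0.0):
    """Indices of 'predominant' features: SU with the target above delta and not
    covered by an already-selected feature (SU(f, j) >= SU(j, target))."""
    su_t = np.array([symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])])
    selected = []
    for j in np.argsort(su_t)[::-1]:
        if su_t[j] <= delta:
            break
        if all(symmetrical_uncertainty(X[:, f], X[:, j]) < su_t[j] for f in selected):
            selected.append(int(j))
    return selected

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 6))
X[:, 4] = X[:, 0]                           # deliberately redundant copy of column 0
y = (X[:, 0] + X[:, 1] > 2).astype(int)
print(fcbf(X, y, delta=0.01))               # the duplicate and the noise columns are pruned
```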

    From Intrusion Detection to Attacker Attribution: A Comprehensive Survey of Unsupervised Methods

    Over the last five years there has been an increase in the frequency and diversity of network attacks. This holds true as more and more organisations admit compromises on a daily basis. Many misuse- and anomaly-based Intrusion Detection Systems (IDSs) that rely on signatures, supervised learning or statistical methods have been proposed in the literature, but their trustworthiness is debatable. Moreover, as this work uncovers, current IDSs are based on obsolete attack classes that do not reflect current attack trends. For these reasons, this paper provides a comprehensive overview of unsupervised and hybrid methods for intrusion detection, discussing their potential in the domain. We also present and highlight the importance of feature engineering techniques that have been proposed for intrusion detection. Furthermore, we argue that current IDSs should evolve from simple detection to correlation and attribution, and we discuss how IDS data could be used to reconstruct and correlate attacks to identify attackers with the use of advanced data analytics techniques. Finally, we argue how the present IDS attack classes can be extended to match modern attacks, and propose three new classes covering outgoing network communication.
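To make the class of unsupervised methods concrete, here is a small hedged example, not taken from any surveyed system: an Isolation Forest trained on unlabeled flow records flags outliers such as a port-scan-like burst. The flow features and traffic statistics are invented.

```python
# Hypothetical illustration of unsupervised, anomaly-based intrusion detection.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# columns: bytes sent, bytes received, duration (s), distinct destination ports
normal = rng.normal([5e3, 8e3, 2.0, 3.0], [1e3, 2e3, 0.5, 1.0], size=(1000, 4))
scan = rng.normal([2e2, 1e2, 0.1, 200.0], [50.0, 30.0, 0.05, 20.0], size=(20, 4))
flows = np.vstack([normal, scan])

detector = IsolationForest(contamination=0.02, random_state=0).fit(flows)
labels = detector.predict(flows)          # -1 = anomalous, 1 = normal
print("flagged flows:", int((labels == -1).sum()), "of", len(flows))
```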

    Integrating Kano Model with Data Mining Techniques to Enhance Customer Satisfaction

    The business world is becoming more competitive over time; therefore, businesses are forced to improve their strategies in every single aspect. Determining the elements that contribute to clients' contentment is thus one of the critical needs of businesses seeking to develop successful products in the market. The Kano model is one of the models that help determine which features must be included in a product or service to improve customer satisfaction. The model focuses on highlighting the most relevant attributes of a product or service, along with customers' estimation of how these attributes can be used to predict satisfaction with specific services or products. This research aims to develop a method that integrates the Kano model and data mining approaches to select relevant attributes that drive customer satisfaction, with a specific focus on higher education. The significant contribution of this research is to improve the quality of the academic support and development services that United Arab Emirates University provides to its students by solving the problem of selecting features that are not methodically correlated with customer satisfaction, thereby reducing the risk of investing in features that could ultimately be irrelevant to enhancing customer satisfaction. Questionnaire data were collected from 646 students at United Arab Emirates University. The experiments suggest that Extreme Gradient Boosting Regression produces the best results for this kind of problem. Based on the integration of the Kano model and the feature selection method, the number of features used to predict customer satisfaction is reduced to four. It was found that integrating either Chi-Square or Analysis of Variance (ANOVA) feature selection with the Kano model gives higher values of the Pearson correlation coefficient and R². Moreover, predictions made using the union of the Kano model's most important features and the most frequent features among 8 clusters show high-performance results.
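A minimal sketch of the selection-plus-regression step, using synthetic Likert-scale survey answers; scikit-learn's GradientBoostingRegressor stands in for XGBoost and f_regression stands in for ANOVA-style scoring, so none of this is the study's actual code or data.

```python
# Illustrative sketch: keep 4 survey attributes by an F-test, then fit a
# gradient-boosting regressor to predict a satisfaction score.
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
X = rng.integers(1, 6, size=(646, 20)).astype(float)           # 20 Likert-scale items
y = 0.6 * X[:, 0] + 0.3 * X[:, 3] + rng.normal(0, 0.5, 646)    # synthetic satisfaction score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

selector = SelectKBest(f_regression, k=4).fit(X_tr, y_tr)      # keep four attributes
model = GradientBoostingRegressor(random_state=0).fit(selector.transform(X_tr), y_tr)

pred = model.predict(selector.transform(X_te))
r, _ = pearsonr(y_te, pred)
print(f"R^2 = {r2_score(y_te, pred):.3f}, Pearson r = {r:.3f}")
```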

    Swarm intelligence for clustering dynamic data sets for web usage mining and personalization.

    Swarm Intelligence (SI) techniques were inspired by bee swarms, ant colonies and, most recently, bird flocks. Flock-based Swarm Intelligence (FSI) has several unique features, namely decentralized control, collaborative learning, high exploration ability, and inspiration from dynamic social behavior. Thus, FSI offers a natural choice for modeling dynamic social data and solving problems in such domains. One particular case of dynamic social data is online/web usage data, which is rich in information about user activities, interests and choices. This natural analogy between SI and social behavior is the main motivation for the topic of investigation in this dissertation, with a focus on Flock-based systems, which have not been well investigated for this purpose. More specifically, we investigate the use of Flock-based SI to solve two related and challenging problems by developing algorithms that form critical building blocks of intelligent personalized websites, namely, (i) providing a better understanding of online users and their activities or interests, for example using clustering techniques that can discover the groups that are hidden within the data; and (ii) reducing information overload by providing guidance to users on websites and services, typically by using web personalization techniques such as recommender systems. Recommender systems aim to recommend items that will potentially be liked by a user. To support a better understanding of online user activities, we developed clustering algorithms that address two challenges of mining online usage data: the need for scalability to large data and the need to adapt clustering to dynamic data sets. To address the scalability challenge, we developed new clustering algorithms using a hybridization of traditional Flock-based clustering with faster K-Means-based partitional clustering algorithms. We tested our algorithms on synthetic data, real UCI Machine Learning Repository benchmark data, and a data set consisting of real Web user sessions. Having linear complexity with respect to the number of data records, the resulting algorithms are considerably faster than traditional Flock-based clustering (which has quadratic complexity). Moreover, our experiments demonstrate that scalability was gained without sacrificing quality. To address the challenge of adapting to dynamic data, we developed a dynamic clustering algorithm that can handle the following dynamic properties of online usage data: (1) new data records can be added at any time (for example, a new user is added on the site); (2) existing data records can be removed at any time, for example a user who no longer subscribes to a service or whose account is terminated for violating policies; (3) new parts of existing records can arrive at any time, or old parts of an existing record can change; a user's record can change as a result of additional activity such as purchasing new products, returning a product, rating new products, or modifying an existing rating. We tested our dynamic clustering algorithm on synthetic dynamic data, and on a data set consisting of real online user ratings for movies. Our algorithm was shown to handle the dynamic nature of the data without sacrificing quality, compared to a traditional Flock-based clustering algorithm that is re-run from scratch with each change in the data.
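The dissertation's hybrid algorithms are not reproduced here; as a small illustration of the partitional, linear-cost component of such a hybrid, and of absorbing newly arriving records without re-clustering from scratch, the following sketch uses scikit-learn's MiniBatchKMeans on hypothetical session features.

```python
# Minimal sketch of the K-Means-style partitional side of a flock/K-Means hybrid:
# MiniBatchKMeans costs O(batch size) per update and can incorporate new user
# sessions incrementally instead of re-running clustering from scratch.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(3)
model = MiniBatchKMeans(n_clusters=5, random_state=0)

for day in range(10):                      # simulate batches of sessions arriving over time
    batch = rng.normal(size=(500, 30))     # stand-in for 500 new session feature vectors
    model.partial_fit(batch)               # update centroids incrementally

new_sessions = rng.normal(size=(3, 30))
print(model.predict(new_sessions))         # assign fresh sessions to existing clusters
```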
To support reducing online information overload, we developed a Flock-based recommender system to predict the interests of users, focusing in particular on collaborative filtering or social recommender systems. Our Flock-based recommender algorithm (FlockRecom) iteratively adjusts the position and speed of dynamic flocks of agents on a visualization panel, such that each agent represents a user. It then generates the top-n recommendations for a user based on the ratings of the users represented by its neighboring agents. Our recommendation system was tested on a real data set consisting of online user ratings for a set of jokes, and compared to traditional user-based Collaborative Filtering (CF). Our results demonstrated that our recommender system starts performing at the same level of quality as traditional CF and then, with more iterations for exploration, surpasses CF's recommendation quality in terms of precision and recall. Another unique advantage of our recommendation system compared to traditional CF is its ability to generate more variety or diversity in the set of recommended items. Our contributions advance the state of the art in Flock-based SI for clustering and making predictions on dynamic Web usage data, and therefore have an impact on improving the quality of online services.
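A sketch of the recommendation step described above, with agent positions on the panel replaced by random placeholders (the flocking dynamics themselves are not implemented): unseen items are scored by the ratings of the target user's nearest agents.

```python
# Toy neighbour-based top-n recommendation: `positions` stands in for the agent
# coordinates that a flocking phase would produce on its visualization panel.
import numpy as np

def top_n_from_neighbors(ratings, positions, user, n_items=3, k_neighbors=5):
    """ratings: (n_users, n_items) array with np.nan for unrated items."""
    dist = np.linalg.norm(positions - positions[user], axis=1)
    neighbors = np.argsort(dist)[1:k_neighbors + 1]          # skip the user itself
    vals = ratings[neighbors]
    counts = (~np.isnan(vals)).sum(axis=0)                   # how many neighbours rated each item
    sums = np.where(np.isnan(vals), 0.0, vals).sum(axis=0)
    scores = np.where(counts > 0, sums / np.maximum(counts, 1), -np.inf)
    scores[~np.isnan(ratings[user])] = -np.inf               # do not re-recommend seen items
    return np.argsort(scores)[::-1][:n_items]

rng = np.random.default_rng(5)
ratings = rng.uniform(1, 5, size=(50, 20))
ratings[rng.random(ratings.shape) < 0.6] = np.nan            # most entries are unrated
positions = rng.uniform(0, 100, size=(50, 2))                # placeholder flock panel positions
print(top_n_from_neighbors(ratings, positions, user=0))
```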

    Using UAV-Based Imagery to Determine Volume, Groundcover, and Growth Rate Characteristics of Lentil (Lens culinaris Medik.)

    Plant growth rate is an essential phenotypic parameter for crop physiologists and plant breeders to understand in order to quantify potential crop productivity at specific stages throughout the growing season. While plant growth rate information can be attained through manual collection of biomass, this procedure is rarely performed due to the prohibitively large effort and destruction of plant material that it requires. Unmanned Aerial Vehicles (UAVs) offer great potential for rapid collection of imagery which can be utilized to quantify plant growth rate. In this study, six diverse lines of lentil were grown in three replicates of microplots, with six biomass collection time-points throughout the growing season over five site-years. Aerial imagery at each biomass collection time point was collected from a UAV and utilized to produce stitched two-dimensional orthomosaics and three-dimensional point clouds. Analysis of this imagery produced quantification of groundcover and vegetation volume on an individual plot basis. Comparison with manually measured above-ground biomass suggests strong correlation, indicating great potential for UAVs to be utilized in plant breeding programs for evaluation of groundcover and vegetation volume. Nonlinear logistic models were fit to multiple data collection points throughout the growing season. The model's growth rate and G50 parameters, where G50 is the number of growing degree days (GDD) required to accumulate 50% of maximum growth, are capable of quantifying growth rate and have potential utility in plant research and plant breeding programs. Predicted maximum volume was identified as a potential proxy for whole-plot biomass measurement. Six new phenotypes have been described that can be accurately and efficiently collected from field trials with the use of UAVs or other overhead image-collection systems. These phenotypes are: Area Growth Rate, Area G50, Area Maximum Predicted Growth, Volume Growth Rate, Volume G50, and Volume Maximum Predicted Growth.
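A minimal sketch of fitting such a growth curve, assuming the common three-parameter logistic form V(g) = Vmax / (1 + exp(-r (g - G50))) over accumulated growing degree days g; the thesis's exact parameterization and data are not reproduced, and the numbers below are synthetic.

```python
# Fit a three-parameter logistic growth curve to plot-level vegetation volume
# observed at six collection time points (synthetic numbers used here).
import numpy as np
from scipy.optimize import curve_fit

def logistic(gdd, v_max, rate, g50):
    """v_max: maximum predicted growth; rate: growth rate; g50: GDD at 50% of max."""
    return v_max / (1.0 + np.exp(-rate * (gdd - g50)))

gdd = np.array([200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0])
volume = logistic(gdd, 2.5, 0.01, 700.0) + np.random.default_rng(2).normal(0.0, 0.05, 6)

params, _ = curve_fit(logistic, gdd, volume, p0=[volume.max(), 0.01, gdd.mean()])
v_max, rate, g50 = params
print(f"max predicted volume = {v_max:.2f}, growth rate = {rate:.4f}, G50 = {g50:.0f} GDD")
```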

    Upgrading decision support systems with Cloud-based environments and machine learning

    Business Intelligence (BI) is a process for analyzing raw data and displaying it in order to make it easier for business users to take the right decision at the right time. In the market we can find several BI platforms. One commonly used BI solution is called MicroStrategy, which allows users to build and display reports. Machine Learning (ML) is a process of using algorithms to search for patterns in data, which are then used to predict and/or classify other data. In recent years, these two fields have been integrated with one another in order to complement the predictive side of BI and enable higher-quality results for the client. The consulting company (CC) where I worked has several solutions related to Data & Analytics built on top of MicroStrategy. Those solutions were all demonstrable on a server installed on-premises. This server was also utilized to build proofs of concept (PoC) to be used as demos for other potential clients. CC also develops new PoCs for clients from the ground up, with the objective of showcasing to the client what can be displayed in order to optimize business management. CC was using a local, out-of-date server to demo the PoCs to clients, which suffered from stability and reliability issues. To address these issues, the server has been migrated and set up in a cloud-based solution using a Microsoft Azure-based Virtual Machine, where it now performs functions similar to its previous iteration. This move has made the server more reliable, made developing new solutions easier for the team, and enabled a new kind of service (Analytics as a Service). My work at CC was focused on one main task: migration of the demo server for CC solutions (which included PoCs for testing purposes, one of which is a machine learning model to predict wind turbine failures). As stated above, the migration was successful, and the prediction models, albeit with mostly negative results, successfully demonstrated the development of large PoCs.