
    Anomaly Detection Based on Multiple Streams Clustering for Train Real-Time Ethernet

    With the increasing traffic of the train communication network (TCN), real-time Ethernet is becoming the development trend. However, the Train Control and Management System (TCMS) inevitably faces more security threats than before because of the openness of the Ethernet communication protocol, so it is necessary to introduce an effective security mechanism into the TCN. We therefore propose a train real-time Ethernet anomaly detection system (TREADS). TREADS introduces a multiple-streams clustering algorithm to realize anomaly detection; the algorithm considers the correlation between data dimensions and adopts a decay window to pay more attention to recent data. In the experiments, the reliability of TREADS is tested on a TRDP data set collected from a real network environment, and models of anomaly detection algorithms are established for evaluation. Experimental results show that TREADS provides a high reliability guarantee and that the algorithm detects and analyzes network anomalies more efficiently and accurately.
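    The decay-window idea mentioned above can be illustrated with a minimal Python sketch; this is not the authors' TREADS algorithm, and the class name, the half_life parameter and the score threshold are assumptions made purely for illustration:

    import numpy as np

    # Minimal sketch of decay-weighted stream summarisation for anomaly scoring
    # (illustrative only, not the TREADS algorithm itself).
    class DecayingCentroid:
        def __init__(self, dim, half_life=100.0):
            self.mean = np.zeros(dim)                # decay-weighted centroid
            self.weight = 0.0                        # total decayed weight seen
            self.decay = 0.5 ** (1.0 / half_life)    # per-sample decay factor

        def update(self, x):
            # Shrink the influence of older samples, then absorb the new one.
            self.weight = self.weight * self.decay + 1.0
            self.mean += (x - self.mean) / self.weight

        def score(self, x):
            # Distance to the decayed centroid acts as a simple anomaly score.
            return float(np.linalg.norm(x - self.mean))

    # Toy usage: score a stream of 4-dimensional traffic feature vectors.
    model = DecayingCentroid(dim=4)
    rng = np.random.default_rng(0)
    for t in range(500):
        sample = rng.normal(0.0, 1.0, size=4)
        if model.weight > 10 and model.score(sample) > 4.0:
            print(f"t={t}: possible anomaly, score={model.score(sample):.2f}")
        model.update(sample)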

    Handling Imbalanced Data through Re-sampling: Systematic Review

    Handling imbalanced data is an important issue that can affect the validity and reliability of results. One common approach to addressing it is re-sampling: a technique that balances the class distribution of a dataset by either over-sampling the minority class or under-sampling the majority class. Over-sampling adds more copies of minority class examples to the dataset, while under-sampling removes some of the majority class examples; combining both techniques is usually called hybrid sampling. It is important to note that re-sampling can affect a model's performance, so the model should be evaluated with different metrics, and complementary techniques such as cost-sensitive learning and anomaly detection should also be considered. Collecting additional data, where feasible, can likewise improve model performance. In this systematic review, we aim to provide an overview of existing methods for re-sampling imbalanced data. We focus on methods that have been proposed in the literature and evaluate their effectiveness through a thorough examination of experimental results. The goal of this review is to give practitioners a comprehensive understanding of the different re-sampling methods available, as well as their strengths and weaknesses, to help them make informed decisions when dealing with imbalanced data.
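    As a concrete illustration of the two basic strategies described above, the following sketch randomly over-samples the minority class and randomly under-samples the majority class of a binary label array; it is a toy example (libraries such as imbalanced-learn offer tested implementations):

    import numpy as np

    def oversample_minority(X, y, rng=np.random.default_rng(0)):
        # Duplicate random minority-class rows until both classes are equal in size.
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        need = counts.max() - counts.min()
        extra = rng.choice(np.where(y == minority)[0], size=need, replace=True)
        return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

    def undersample_majority(X, y, rng=np.random.default_rng(0)):
        # Keep only a random subset of majority-class rows of minority-class size.
        classes, counts = np.unique(y, return_counts=True)
        majority = classes[np.argmax(counts)]
        keep_major = rng.choice(np.where(y == majority)[0], size=counts.min(), replace=False)
        keep = np.concatenate([keep_major, np.where(y != majority)[0]])
        return X[keep], y[keep]

    # Toy usage: 90 negatives versus 10 positives.
    X = np.arange(200, dtype=float).reshape(100, 2)
    y = np.array([0] * 90 + [1] * 10)
    X_over, y_over = oversample_minority(X, y)        # class counts become 90 / 90
    X_under, y_under = undersample_majority(X, y)     # class counts become 10 / 10
    print(np.bincount(y_over), np.bincount(y_under))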

    Distributed mining of convoys in large scale datasets

    The tremendous increase in the use of mobile devices equipped with GPS and other location sensors has resulted in the generation of a huge amount of movement data. In recent years, mining this data to understand the collective mobility behavior of humans, animals and other objects has become popular. Numerous mobility patterns, and algorithms for mining them, have been proposed, each representing a specific movement behavior. The convoy pattern is one such pattern; it can be used to find groups of people moving together in public transport or to prevent traffic jams. A convoy is a set of at least m objects moving together for at least k consecutive timestamps, where m and k are user-defined parameters. Existing algorithms for detecting convoy patterns do not scale to real-life dataset sizes. Therefore, in this paper, we propose a generic distributed convoy pattern mining algorithm called DCM and show how such an algorithm can be implemented using the MapReduce framework. We present a cost model for DCM and a detailed theoretical analysis backed by experimental results, and we show the effect of partition size on the performance of DCM. The results from our experiments on different datasets and hardware setups show that our distributed algorithm is scalable in terms of data size and number of nodes, and more efficient than any existing sequential or distributed convoy pattern mining algorithm, showing speed-ups of up to 16 times over SPARE, the state-of-the-art distributed co-movement pattern mining framework. DCM is thus able to process large datasets which SPARE cannot handle.
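    The convoy definition above (at least m objects clustered together for at least k consecutive timestamps) can be sketched on a single machine as follows; this only illustrates the definition, not the distributed DCM algorithm, and the eps value and toy trajectories are assumptions:

    from collections import defaultdict
    import numpy as np
    from sklearn.cluster import DBSCAN

    def convoys(trajectories, m=2, k=3, eps=5.0):
        # trajectories: dict object_id -> list of (x, y), one position per timestamp.
        ids = sorted(trajectories)
        timestamps = len(trajectories[ids[0]])
        candidates = []   # (object set, number of consecutive timestamps together)
        found = []
        for t in range(timestamps):
            pts = np.array([trajectories[i][t] for i in ids])
            labels = DBSCAN(eps=eps, min_samples=m).fit_predict(pts)
            clusters = defaultdict(set)
            for obj, lab in zip(ids, labels):
                if lab != -1:
                    clusters[lab].add(obj)
            new_candidates = []
            for group in clusters.values():
                extended = False
                for objs, length in candidates:
                    common = objs & group
                    if len(common) >= m:
                        new_candidates.append((common, length + 1))
                        extended = True
                if not extended:
                    new_candidates.append((group, 1))
            candidates = new_candidates
            # A convoy is reported once per timestamp it keeps persisting.
            found += [(objs, length) for objs, length in candidates if length >= k]
        return found

    # Toy usage: objects 0 and 1 travel together; object 2 drifts away.
    traj = {0: [(0, 0), (1, 0), (2, 0), (3, 0)],
            1: [(0, 1), (1, 1), (2, 1), (3, 1)],
            2: [(0, 2), (5, 20), (9, 40), (14, 60)]}
    print(convoys(traj, m=2, k=3, eps=5.0))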

    ALOJA: A framework for benchmarking and predictive analytics in Hadoop deployments

    This article presents the ALOJA project and its analytics tools, which leverage machine learning to interpret Big Data benchmark performance data and to guide tuning. ALOJA is part of a long-term collaboration between BSC and Microsoft to automate the characterization of cost-effectiveness of Big Data deployments, currently focusing on Hadoop. Hadoop presents a complex run-time environment, where costs and performance depend on a large number of configuration choices. The ALOJA project has created an open, vendor-neutral repository featuring over 40,000 Hadoop job executions and their performance details. The repository is accompanied by a test-bed and tools to deploy and evaluate the cost-effectiveness of different hardware configurations, parameters and Cloud services. Despite early success within ALOJA, a comprehensive study requires automation of modeling procedures to allow analysis of large and resource-constrained search spaces. The predictive analytics extension, ALOJA-ML, provides an automated system for knowledge discovery by modeling environments from observed executions. The resulting models can forecast execution behaviors, predicting execution times for new configurations and hardware choices. This also enables model-based anomaly detection and efficient benchmark guidance by prioritizing executions. In addition, the community can benefit from the ALOJA data-sets and framework to improve the design and deployment of Big Data applications. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595). This work is partially supported by the Ministry of Economy of Spain under contracts TIN2012-34557 and 2014SGR1051.
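    The kind of model ALOJA-ML builds can be illustrated with a small scikit-learn sketch that learns execution time from run configuration features; the feature set and the synthetic data below are assumptions made purely so the example runs end to end:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Synthetic runs: mappers, reducers, disk type (0 = HDD, 1 = SSD), RAM per node (GB).
    rng = np.random.default_rng(42)
    n = 1000
    X = np.column_stack([
        rng.integers(1, 17, n),
        rng.integers(1, 9, n),
        rng.choice([0, 1], n),
        rng.integers(8, 129, n),
    ])
    # Fake "execution time" in seconds, used only for demonstration.
    y = 600 / X[:, 0] + 300 / X[:, 1] + 120 * (1 - X[:, 2]) + rng.normal(0, 10, n)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("R^2 on held-out runs:", round(model.score(X_test, y_test), 3))
    print("Predicted time for 8 mappers, 4 reducers, SSD, 64 GB RAM:",
          round(model.predict([[8, 4, 1, 64]])[0], 1))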

    Big data clustering with varied density based on MapReduce

    The DBSCAN algorithm is a prevalent density-based clustering method, whose most important feature is its ability to detect clusters of arbitrary shape and varied size as well as noise data. Nevertheless, the algorithm faces a number of challenges, including failure to find clusters of varied densities. On the other hand, with the rapid development of the information age, plenty of data are produced every day, so much that a single machine alone cannot process it; hence, new technologies are required to store and extract information from such volumes. A volume of data that is beyond the capabilities of existing software is called big data. In this paper, we introduce a new algorithm for clustering big data with varied density using a Hadoop platform running MapReduce. The main idea of this research is the use of local density to find each point's density; this strategy avoids connecting clusters with varying densities. The proposed algorithm is implemented and compared with other algorithms using the MapReduce paradigm and shows the best varied-density clustering capability and scalability.
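    The local-density idea at the core of the proposal can be sketched on a single machine as follows; this only illustrates estimating per-point density from the k-th nearest neighbour distance, not the MapReduce algorithm itself, and k and the toy data are assumptions:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def local_density(points, k=4):
        # Density of each point estimated from the distance to its k-th neighbour.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
        dist, _ = nn.kneighbors(points)          # column 0 is the point itself
        return 1.0 / (dist[:, k] + 1e-12)        # denser points get larger values

    # Toy usage: a tight cluster and a sparse cluster of 50 points each.
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.1, size=(50, 2)),
                     rng.normal(5, 1.0, size=(50, 2))])
    dens = local_density(pts)
    print("mean density, tight cluster :", round(float(dens[:50].mean()), 1))
    print("mean density, sparse cluster:", round(float(dens[50:].mean()), 1))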

    BIG DATA ANALYTICS SOLUTION FOR SMALL CELLS DEPLOYMENT USING MACHINE LEARNING TECHNIQUES

    This thesis presents a novel small cell planning solution using machine learning. Telecom service providers are interested in estimating various trends in order to plan future upgrades and deployments driven by real data. Fundamentally, the service provider landscape is changing: the number of devices in the network, such as small cells, is increasing to cater to growing demand, and the increasing amount of data has caused a big data revolution that is having an impact on telecom. With advanced big data analytics solutions and fine-grained analytics in real time, changes in bandwidth needs from one place to another throughout the day, week and month become predictable. Hence, big data analytics solutions can help decide the footprint of small cells and deploy them efficiently. In this thesis, I have used the open big data published at https://dandelion.eu/datamine/open-big-data/ under the Open Data Commons Open Database License (ODbL). This dataset provides information about telecommunication activity over the city of Milano; it is the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network. Data mining is the technique of finding concealed and interesting patterns in a dataset, which can be used in decision making and future prediction. In this thesis, data preprocessing has been performed on the Hadoop framework using Hive with Cloudera's open source platform, CDH cloudera-quickstart-vm-5.3.0-0-vmware. The (Eps, MinPts) DBSCAN density-based spatial clustering algorithm is used to cluster the geospatial data. DBSCAN clusters a spatial dataset based on two parameters, namely the physical distance between points and a minimum cluster size, and this method is a good fit for spatial latitude-longitude data. The solution is implemented with the scikit-learn machine learning platform, one of the most widely used machine learning platforms in Python, which provides a wide range of supervised and unsupervised learning algorithms via a consistent interface. For validation of the clustering results, the data mining tool WEKA 3.6.11 is used, and for benchmarking of the proposed solution, the DBSCAN clustering results are compared with WEKA's cluster results. The final results show that the solution is very promising in three respects: it is able to reveal all the objects in the datasets on the basis of user-defined algorithm input parameters, the input parameters have a decisive impact on the clustering result, and it can extract spatially, temporally and semantically separated clusters. The detected clusters are visualized using the Matplotlib plotting library for Python, WEKA, and the geojson.io online tool.
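    A minimal version of the DBSCAN step described above, applied to latitude/longitude points with scikit-learn, might look like the following; the coordinates and the Eps and MinPts values are assumptions chosen only for illustration:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical latitude/longitude points around Milano; the last one is isolated.
    coords_deg = np.array([
        [45.4642, 9.1900],
        [45.4654, 9.1866],
        [45.4668, 9.1905],
        [45.4780, 9.2260],
        [45.4791, 9.2275],
        [45.0000, 9.0000],
    ])
    coords_rad = np.radians(coords_deg)          # the haversine metric expects radians

    earth_radius_m = 6371000.0
    eps_m = 500.0                                # Eps: neighbourhood radius of 500 m
    db = DBSCAN(eps=eps_m / earth_radius_m,      # convert metres to radians
                min_samples=2,                   # MinPts
                metric="haversine", algorithm="ball_tree").fit(coords_rad)
    print(db.labels_)                            # two clusters plus one noise point (-1)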

    Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

    Data collection for scientific applications is increasing exponentially and is forecast to soon reach peta- and exabyte scales. Applications which process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X over a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated using two real-world radio astronomy data sets. Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018), ACM, New York, NY, USA, Article 11, 11 pages.
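    The pairing of feature selection with a RandomForest classifier described above can be sketched in a few lines of scikit-learn; the paper's own pipeline is written in Scala on Spark, and the synthetic candidate features and the choice of k below are assumptions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Synthetic 3-class data standing in for real pulsar candidate features.
    X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                               n_classes=3, n_clusters_per_class=1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    full = RandomForestClassifier(n_estimators=200, random_state=0)
    reduced = make_pipeline(SelectKBest(f_classif, k=10),
                            RandomForestClassifier(n_estimators=200, random_state=0))
    full.fit(X_train, y_train)
    reduced.fit(X_train, y_train)
    print("accuracy, all 40 features:", round(full.score(X_test, y_test), 3))
    print("accuracy, top 10 features:", round(reduced.score(X_test, y_test), 3))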

    Scaling DBSCAN-like algorithms for event detection systems in Twitter

    The increasing use of mobile social networks has lately transformed news media. Real-world events are nowadays reported in social networks much faster than in traditional channels. As a result, the autonomous detection of events from networks like Twitter has gained a lot of interest in both research and media groups. DBSCAN-like algorithms constitute a well-known clustering approach to retrospective event detection. However, scaling such algorithms to geographically large regions and temporally long periods presents two major shortcomings. First, detecting real-world events from the vast amount of tweets can no longer be performed on a single machine. Second, the tweeting activity varies a lot within these broad space-time regions, limiting the use of global parameters. Against this background, we propose to scale DBSCAN-like event detection techniques by parallelizing and distributing them through a novel density-aware MapReduce scheme. The proposed scheme partitions tweet data according to its spatial and temporal features and tailors local DBSCAN parameters to local tweet densities. We implement the scheme in Apache Spark and evaluate its performance on a dataset composed of geo-located tweets in the Iberian Peninsula during the course of several football matches. The results point to the benefits of our proposal against other state-of-the-art techniques in terms of speed-up and detection accuracy.
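    The density-aware idea (partition tweets spatially, then tune DBSCAN's eps to each partition's local density) can be sketched on a single machine as follows; the grid size, the eps heuristic and the synthetic tweet coordinates are assumptions, not the paper's actual scheme:

    from collections import defaultdict
    import numpy as np
    from sklearn.cluster import DBSCAN

    def detect_events(points, cell_deg=0.5, min_samples=5):
        # Partition points into coarse lat/lon grid cells (the "map" side of the idea).
        cells = defaultdict(list)
        for p in points:
            cells[(int(p[0] // cell_deg), int(p[1] // cell_deg))].append(p)
        labels_per_cell = {}
        for cell, pts in cells.items():
            pts = np.array(pts)
            density = len(pts) / (cell_deg * cell_deg)    # points per squared degree
            eps = min(0.1, 5.0 / np.sqrt(density))        # denser cell -> smaller eps
            labels_per_cell[cell] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
        return labels_per_cell

    # Toy usage: a dense burst of tweets near one hypothetical venue, sparse noise elsewhere.
    rng = np.random.default_rng(1)
    burst = rng.normal([40.44, -3.60], 0.002, size=(200, 2))
    noise = rng.uniform([36.0, -9.5], [43.5, 3.0], size=(100, 2))
    labels = detect_events(np.vstack([burst, noise]))
    print({cell: int((lab != -1).sum()) for cell, lab in labels.items()})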