12 research outputs found

    SLDPC: Towards Second Order Learning for Detecting Persistent Clusters in Data Streams

    Research on data stream clustering algorithms has so far focused on adapting algorithms designed for static datasets to data streams, and on improving those adapted algorithms. Such algorithms perform first-order learning, from data to clusters. This paper raises a new question about second-order learning of cluster models from data streams and presents a learning algorithm that detects persistent clusters across consecutive clustering snapshots of a data stream. In this work, we first collect a sequence of cluster snapshots, i.e. the output clusters at selected query points, and then identify the persistent clusters within a given timeframe. The algorithm is evaluated on collections of synthetic datasets, and the experimental results demonstrate its effectiveness in detecting such persistent clusters.
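The abstract describes matching clusters across consecutive snapshots; a minimal sketch of one way this could work (not the paper's exact method) is to represent each snapshot's clusters as sets of point IDs and call a cluster persistent if it can be chained through every snapshot via a minimum Jaccard overlap. The threshold and chaining rule here are illustrative assumptions.

```python
# Sketch: a cluster from the first snapshot is "persistent" if a
# sufficiently overlapping cluster exists in every later snapshot.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def persistent_clusters(snapshots, min_overlap=0.5):
    """Return clusters from the first snapshot that survive all snapshots."""
    persistent = []
    for cluster in snapshots[0]:
        current = cluster
        alive = True
        for snap in snapshots[1:]:
            # find the best-matching cluster in the next snapshot
            best = max(snap, key=lambda c: jaccard(current, c))
            if jaccard(current, best) < min_overlap:
                alive = False
                break
            current = best
        if alive:
            persistent.append(cluster)
    return persistent

# Three snapshots: the first cluster drifts but survives; the second dissolves.
snapshots = [
    [{1, 2, 3, 4}, {10, 11}],
    [{1, 2, 3, 5}, {10, 12}],
    [{2, 3, 4, 5}, {20, 21}],
]
result = persistent_clusters(snapshots)
```

Chaining through the *best* match (rather than requiring identity) lets a persistent cluster drift gradually, which is exactly the streaming situation the paper targets.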

    Data Reduction Approach Based on Fog Computing in IoT Environment

    This paper investigates a data processing model for a real experimental environment in which data are collected from several IoT devices on an edge server, where a clustering-based data reduction model is implemented. Only representative data are then transmitted to a cloud-hosted service instead of the raw data. In our model, the subtractive clustering algorithm is employed, for the first time, on streamed IoT data with high efficiency. The developed services show the real impact of the data reduction technique at the fog node on overall system performance. High accuracy and a high reduction rate have been obtained, as shown by visualizing the data before and after reduction.
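Subtractive clustering, the technique the abstract names, scores every point with a density-like "potential", takes the highest-potential point as a representative, suppresses potential near it, and repeats. The sketch below follows the standard Chiu-style formulation; the radii `ra`/`rb` and the fixed number of centers are simplifying assumptions, not the paper's settings.

```python
import math

def subtractive_clustering(points, ra=1.0, n_centers=2):
    """Pick n_centers representative points by Chiu-style potentials."""
    alpha = 4.0 / ra ** 2          # neighbourhood radius for potential
    rb = 1.5 * ra                  # suppression radius (conventionally 1.5*ra)
    beta = 4.0 / rb ** 2
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    potential = [sum(math.exp(-alpha * d2(p, q)) for q in points)
                 for p in points]
    centers = []
    for _ in range(n_centers):
        i = max(range(len(points)), key=lambda k: potential[k])
        centers.append(points[i])
        p_star = potential[i]      # snapshot before in-place suppression
        for k in range(len(points)):
            potential[k] -= p_star * math.exp(-beta * d2(points[k], points[i]))
    return centers

# Two obvious groups; only the two representatives would go to the cloud.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
reps = subtractive_clustering(data, ra=1.0, n_centers=2)
```

Transmitting `reps` instead of `data` is the data-reduction step: the cloud service receives one representative per dense region rather than every raw reading.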

    Adaptive Estimation of the Weighting Coefficient in a Dynamic Data Clustering Algorithm

    We consider an adaptive algorithm for determining the coefficients of the damped window model of a dynamic EM algorithm for data stream clustering. The algorithm is intended for clustering normally distributed data in R^n whose parameters change over time, which corresponds to the situation in real dynamic systems such as computer systems, communication networks, and the like. Computing the weighting coefficients requires storing only a limited amount of data; the algorithm is efficiently computable and can be applied in real-time systems. Results of a computational experiment (on a simulation model of the stream) are presented.
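The damped-window idea the abstract describes can be sketched for a single Gaussian parameter: a forgetting factor down-weights old observations so the running mean and variance track a drifting stream, while only O(1) state is stored. This is a generic exponentially weighted estimator, stated as an assumption; it is not the paper's adaptive coefficient-selection algorithm.

```python
class DampedGaussianEstimator:
    """Exponentially weighted estimate of a drifting Gaussian's mean/variance."""
    def __init__(self, lam=0.95):
        self.lam = lam      # forgetting factor: weight decay per step
        self.w = 0.0        # total decayed weight seen so far
        self.mean = 0.0
        self.m2 = 0.0       # decayed sum of squared deviations

    def update(self, x):
        self.w = self.lam * self.w + 1.0
        delta = x - self.mean
        self.mean += delta / self.w
        self.m2 = self.lam * self.m2 + delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.w if self.w > 1.0 else 0.0

# The stream's mean jumps from 0 to 10; the estimate follows the drift.
est = DampedGaussianEstimator(lam=0.9)
for x in [0.0] * 50 + [10.0] * 50:
    est.update(x)
```

Because `w` converges to 1/(1 - lam), old data can never outweigh recent data, which is what lets the estimator follow parameter drift in real time.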

    Identification of Human Factors in Aviation Incidents Using a Data Stream Approach

    This paper investigates the use of data stream analytics to better predict the presence of human factors in aviation incidents as new incident reports arrive. As new incident data become available, the fresh information can help not only evaluate but also improve existing models. First, we use four algorithms in batch learning to establish a baseline for comparison: Naive Bayes (NB), Cost Sensitive Classifier (CSC), Hoeffding Tree (VFDT), and OzaBagADWIN (OBA). The traditional measure of classification accuracy is used to test their performance. The results show that, among the four, NB and CSC are the best classification algorithms. We then test the classifiers in a data stream setting, using two performance measurement methods: Holdout and Interleaved Test-Then-Train (Prequential). The Kappa statistic charts of the Prequential measure with a sliding window show that NB exhibits the best performance, better than the other algorithms. The two measurement methods, batch learning with 10-fold cross-validation and data stream evaluation with the Prequential measure, yield one consistent result. CSC is suitable for unbalanced data in batch learning, but it does not achieve the best Kappa statistic on the data stream. Valid incremental algorithms still need to be developed for data streams with unbalanced labels.
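The Interleaved Test-Then-Train (Prequential) protocol the paper uses is simple to state: each arriving instance is first used to *test* the current model, then to *train* it, so no holdout set is needed. The sketch below uses a stand-in majority-class learner, not one of the paper's four algorithms.

```python
from collections import Counter

class MajorityClass:
    """Toy incremental learner: always predicts the most frequent label so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(stream, learner):
    """Interleaved Test-Then-Train: test on each instance, then train on it."""
    correct = total = 0
    for x, y in stream:
        if learner.predict(x) == y:   # test first...
            correct += 1
        learner.learn(x, y)           # ...then train
        total += 1
    return correct / total

# An unbalanced stream: 3 "pos" for every "neg", echoing the paper's
# point that plain accuracy flatters majority-class behaviour.
stream = [({"f": i}, "pos" if i % 4 else "neg") for i in range(100)]
acc = prequential_accuracy(stream, MajorityClass())
```

The majority-class baseline scores roughly the majority-class rate here, which is why the paper leans on the Kappa statistic, a chance-corrected measure, for its unbalanced incident data.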

    Grouping of Village Status in West Java Province Using the Manhattan, Euclidean and Chebyshev Methods on the K-Mean Algorithm

    The Ministry of Villages, Development of Disadvantaged Areas and Transmigration (Ministry of Village PDTT) is a ministry within the Indonesian Government in charge of rural development, empowerment of rural communities, accelerated development of disadvantaged areas, and transmigration. The Village Potential Data for 2014 (Podes 2014) for West Java Province, issued by the Central Statistics Agency in collaboration with the Ministry of Village PDTT, is an unlabeled dataset covering 5,319 villages. The Podes 2014 data for West Java Province were compiled based on the level of village development in Indonesia, with the village as the unit of analysis. Based on Regulation of the Minister of Villages, Disadvantaged Areas and Transmigration of the Republic of Indonesia number 2 of 2016 concerning the village development index, villages are classified into five statuses, namely Very Disadvantaged Village, Disadvantaged Village, Developing Village, Advanced Village, and Independent Village, according to their ability to manage and increase the potential of social, economic, and ecological resources. Village status is in practice inseparable from village development, which depends on government funding support. However, village development funds have not been distributed effectively and accurately according to the conditions and potential of each village, owing to the lack of clear information about village status. Information on which villages need more funding and attention from the government is therefore still lacking. Data mining offers methods for grouping objects in a dataset into classes with the same characteristics (clustering). One algorithm that can be used for the clustering process is k-means, which groups data by computing the distance from each data point to the nearest centroid.
In this study, different distance calculations in the k-means algorithm are compared: Manhattan, Euclidean, and Chebyshev. Validation tests were carried out using execution time and the Davies-Bouldin index. In these tests, the Podes 2014 data for West Java Province were grouped into all five village statuses, with the following cluster sizes: Very Disadvantaged Village, 694 villages; Disadvantaged Village, 567 villages; Developing Village, 1,440 villages; Advanced Village (Desa Maju), 1,557 villages; and Independent Village, 1,061 villages. For distance calculation, Chebyshev had the most efficient execution time of 1 second, compared with 1.6 seconds for Euclidean and 2.4 seconds for Manhattan. Meanwhile, the Euclidean method had the most optimal Davies-Bouldin index of 0.886, compared with 0.926 for Manhattan and 0.990 for Chebyshev.
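The three distance measures the study compares differ only in how coordinate differences are aggregated, and each plugs directly into the k-means assignment step. The centroids and points below are illustrative values, not the Podes 2014 centroids.

```python
def manhattan(a, b):
    # sum of absolute coordinate differences (L1)
    return sum(abs(u - v) for u, v in zip(a, b))

def euclidean(a, b):
    # square root of the sum of squared differences (L2)
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def chebyshev(a, b):
    # largest single coordinate difference (L-infinity)
    return max(abs(u - v) for u, v in zip(a, b))

def assign(points, centroids, dist):
    """k-means assignment step: index of the nearest centroid per point."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (4.0, 7.0)]
labels = {d.__name__: assign(points, centroids, d)
          for d in (manhattan, euclidean, chebyshev)}
```

Chebyshev inspects only the largest coordinate gap, which is consistent with the study's finding that it is the fastest of the three, while the metrics can disagree on borderline points and thus yield different Davies-Bouldin scores.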

    Parallel clustering of high-dimensional social media data streams

    We introduce Cloud DIKW as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. Recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. Due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts to meet the constraints of real-time social stream clustering through parallelization. We focus on two system-level issues. Most stream processing engines like Apache Storm organize distributed workers in the form of a directed acyclic graph, making it difficult to dynamically synchronize the state of parallel workers. We tackle this challenge by creating a separate synchronization channel using a pub-sub messaging system. Due to the sparsity of the high-dimensional vectors, the size of centroids grows quickly as new data points are assigned to the clusters. Traditional synchronization that directly broadcasts cluster centroids becomes too expensive and limits the scalability of the parallel algorithm. We address this problem by communicating only dynamic changes of the clusters rather than the whole centroid vectors. Our algorithm under Cloud DIKW can process the Twitter 10% data stream in real-time with 96-way parallelism. By natural improvements to Cloud DIKW, including advanced collective communication techniques developed in our Harp project, we will be able to process the full Twitter stream in real-time with 1000-way parallelism. 
Our use of powerful general software subsystems will enable many other applications that need integration of streaming and batch data analytics. Comment: IEEE/ACM CCGrid 2015: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015
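The paper's synchronization fix, broadcasting only the entries of a sparse centroid that changed since the last sync rather than the whole vector, can be sketched with dict-based centroids. The message format and helper names below are illustrative, not Cloud DIKW's actual wire protocol.

```python
def apply_point(centroid, delta, point, weight=1.0):
    """Fold a sparse data point into a centroid, recording touched entries."""
    for term, value in point.items():
        centroid[term] = centroid.get(term, 0.0) + weight * value
        delta[term] = centroid[term]   # only modified entries enter the delta

def sync(replica, delta):
    """A remote replica applies the delta instead of a full centroid copy."""
    replica.update(delta)

centroid, replica, delta = {}, {}, {}
# Two sparse "tweets" arrive; only three dimensions are ever touched,
# so the sync message stays small no matter how large the vocabulary is.
for pt in [{"obama": 1.0, "vote": 2.0}, {"vote": 1.0, "poll": 3.0}]:
    apply_point(centroid, delta, pt)
sync(replica, delta)
```

The broadcast cost is proportional to the number of *changed* dimensions per sync interval rather than the full centroid size, which is what keeps the parallel algorithm scalable as centroids grow.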

    Design and evaluation of a cloud native data analysis pipeline for cyber physical production systems

    Since 1991, with the birth of the World Wide Web, the rate of data growth has been rising, reaching record levels in the last couple of years. Big companies tackled this data growth with expensive and enormous data centres to process the data and extract value from it. With social media, the Internet of Things (IoT), new business processes, monitoring, and multimedia, the capacities of those data centres became a problem and required continuous, expensive expansion. Thus, Big Data was something only a few could access. This changed fast when Amazon launched Amazon Web Services (AWS) around 15 years ago, giving rise to the public cloud. At that time the capabilities were still new and limited, but 10 years later the cloud was a whole new business that changed the Big Data landscape forever. It not only commoditised computing power but came with a pricing model that gave medium and small players access to it. In consequence, new problems arose regarding the nature of these distributed systems and the software architectures required for proper data processing. The present work analyses typical Big Data workloads and proposes an architecture for a cloud native data analysis pipeline. Lastly, it provides a chapter on tools and services that can be used in the architecture, taking advantage of their open source nature and cloud pricing models. Fil: Ferrer Daub, Facundo Javier. Universidad Católica de Córdoba. Instituto de Ciencias de la Administración; Argentina

    Data Stream Clustering: A Review

    The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with little prior information about the data and does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models, and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity, and clustering accuracy. A comparison of these algorithms is given along with still-open problems. We indicate popular data stream repositories and datasets, and stream processing tools and platforms. Open problems in data stream clustering are also discussed. Comment: Has been accepted for publication in Artificial Intelligence Review
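The time window models this review covers can be sketched by their weight functions: a landmark window weights all points equally, a sliding window counts only the last w points, and a damped window decays weights exponentially with age. These are the standard textbook forms, stated here as an assumption about the review's definitions.

```python
import math
from collections import deque

def landmark_weight(age):
    # landmark model: every point since the landmark counts equally
    return 1.0

def sliding_weight(age, w=100):
    # sliding model: only the most recent w points count
    return 1.0 if age < w else 0.0

def damped_weight(age, lam=0.01):
    # damped model: weight halves every 1/lam time units, 2^(-lam * age)
    return math.pow(2.0, -lam * age)

# A sliding window itself is often just a bounded buffer:
window = deque(maxlen=3)
for x in [1, 2, 3, 4, 5]:
    window.append(x)   # oldest element is evicted automatically
```

The damped model is the usual response to concept drift, since outdated points fade smoothly instead of being dropped at a hard boundary the way the sliding model drops them.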