30 research outputs found

    A Spark-based parallel fuzzy C median algorithm for web log big data

    Nowadays, the World Wide Web (WWW) is regarded as an exceptionally large data repository, and it grows more complex and voluminous every day. We are, in effect, starved for knowledge while drowning in data. For these reasons, clustering is one of the most important data mining tools for extracting useful information from the web. Research on small datasets has produced numerous successful clustering techniques; nevertheless, these techniques do not give adequate results when dealing with very large datasets. The main problems are excessive computational complexity and long running times, which are unacceptable in real-time contexts, so it is essential to process this enormous amount of information in a timely manner. This paper proposes an efficient parallel Fuzzy C median algorithm based on Spark for large-scale web log data. The algorithm's performance is evaluated on the PySpark platform using the Rand Index and the SSE (sum of squared errors). According to the experimental findings, the Spark-based parallel Fuzzy C median method performs better
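
    A minimal sketch of how one such iteration could look in PySpark, assuming a standard fuzzy C-medians update (fuzzifier m = 2, Euclidean distances, coordinate-wise weighted medians); the paper's actual pipeline and update rules are not given in the abstract and are not reproduced here.

    import numpy as np
    from pyspark import SparkContext

    def memberships(x, centers, m=2.0):
        # Standard fuzzy-membership formula: closer centers get more weight.
        d = np.array([np.linalg.norm(x - c) + 1e-9 for c in centers])
        inv = d ** (-2.0 / (m - 1.0))
        return inv / inv.sum()

    def weighted_median(values, weights):
        # Smallest value whose cumulative weight reaches half the total.
        order = np.argsort(values)
        cum = np.cumsum(weights[order])
        return values[order][np.searchsorted(cum, cum[-1] / 2.0)]

    def update_centers(points_rdd, centers, m=2.0):
        b = points_rdd.context.broadcast(centers)
        # Map: emit (cluster_index, (point, u_ij ** m)) for every point.
        pairs = points_rdd.flatMap(
            lambda x: [(j, (x, u ** m))
                       for j, u in enumerate(memberships(x, b.value, m))])
        grouped = pairs.groupByKey().mapValues(list).collectAsMap()
        # Reduce: new center = coordinate-wise weighted median per cluster.
        new_centers = []
        for j in range(len(centers)):
            pts = np.array([p for p, _ in grouped[j]])
            w = np.array([wt for _, wt in grouped[j]])
            new_centers.append(np.array(
                [weighted_median(pts[:, dim], w)
                 for dim in range(pts.shape[1])]))
        return new_centers

    sc = SparkContext(appName="fuzzy-c-medians-sketch")
    data = sc.parallelize([np.array(x) for x in
                           [[1.0, 2.0], [1.1, 1.9], [8.0, 8.2], [7.9, 8.1]]])
    centers = [np.array([0.0, 0.0]), np.array([9.0, 9.0])]
    for _ in range(5):  # fixed iteration budget, for the sketch only
        centers = update_centers(data, centers)
    print(centers)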

    Implementation of Parallel K-Means Algorithm to Estimate Adhesion Failure in Warm Mix Asphalt

    Warm Mix Asphalt (WMA) is prepared at lower temperatures than Hot Mix Asphalt (HMA), which makes it more susceptible to moisture damage and eventually leads to stripping caused by adhesion failure. Moreover, the assessment of adhesion failure depends on the investigator's subjective visual assessment skills. Image processing has therefore gained popularity as a way to address the inaccuracy of visual assessment. For image processing algorithms to attain high accuracy, minimizing the loss of pixels is essential, but high-quality image samples take longer to process because of their greater resolution, so the execution time of the algorithm is also an important quality criterion. This manuscript proposes a parallel k-means for image processing (PKIP) algorithm using multiprocessing and distributed computing to assess adhesion failure in WMA and HMA samples subjected to three different moisture-sensitivity conditions (dry, one, and three freeze-thaw cycles) and fractured by the indirect tensile test. For the proposed experiment, the number of clusters was set to ten (k = 10), and the cost of the k-means objective was computed to analyse the adhesion failure. The results show that the PKIP algorithm reduces execution time by 30% to 46% compared with the sequential k-means algorithm when implemented using multiprocessing and distributed computing. Regarding adhesion failure, the WMA specimens subjected to a higher degree of moisture exposure showed relatively lower adhesion failure than the HMA samples under the same levels of moisture sensitivity
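
    The map/reduce structure behind such a parallel k-means can be sketched with Python's multiprocessing; the chunking, worker count, and fixed iteration budget below are illustrative assumptions, not the authors' PKIP implementation.

    import numpy as np
    from multiprocessing import Pool

    def assign_chunk(args):
        # Map step: label each pixel with its nearest centroid and return
        # per-cluster partial sums and counts for the reduce step.
        pixels, centroids = args
        d = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        k = len(centroids)
        sums = np.zeros_like(centroids)
        counts = np.zeros(k)
        for j in range(k):
            mask = labels == j
            sums[j] = pixels[mask].sum(axis=0)
            counts[j] = mask.sum()
        return sums, counts

    def parallel_kmeans(pixels, k=10, iters=20, workers=4):
        rng = np.random.default_rng(0)
        centroids = pixels[rng.choice(len(pixels), k, replace=False)]
        chunks = np.array_split(pixels, workers)
        with Pool(workers) as pool:
            for _ in range(iters):
                partials = pool.map(assign_chunk,
                                    [(c, centroids) for c in chunks])
                # Reduce step: recompute centroids from the partial results.
                sums = sum(p[0] for p in partials)
                counts = sum(p[1] for p in partials)
                centroids = sums / np.maximum(counts, 1)[:, None]
        return centroids

    # Usage on a flattened RGB image of shape (n_pixels, 3), e.g.:
    # centroids = parallel_kmeans(img.reshape(-1, 3).astype(float))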

    Big data clustering with varied density based on MapReduce

    DBSCAN is a prevalent density-based clustering algorithm whose most important feature is the ability to detect clusters of arbitrary shape and to handle noise data. Nevertheless, the algorithm faces a number of challenges, notably its failure to find clusters of varied densities. At the same time, with the rapid development of the information age, vast amounts of data are produced every day, more than a single machine alone can process; new technologies are therefore required to store and extract information from this volume of data. Data whose volume is beyond the capabilities of existing software is called big data. In this paper, we introduce a new algorithm for clustering big data with varied density on a Hadoop platform running MapReduce. The main idea of this research is to use a local-density estimate at each point. This strategy avoids connecting clusters of differing densities. The proposed algorithm is implemented and compared with other MapReduce-based algorithms and shows the best varied-density clustering capability and scalability
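
    The local-density idea can be sketched outside MapReduce; the k-th-neighbour density estimate and the eps_ratio similarity threshold below are common textbook choices assumed for illustration, not necessarily the paper's exact criterion.

    import numpy as np

    def local_densities(points, k=4):
        # Density proxy per point: inverse distance to its k-th nearest
        # neighbour (column 0 of each sorted row is the point itself).
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        kth = np.sort(d, axis=1)[:, k]
        return 1.0 / (kth + 1e-12)

    def may_connect(rho, i, j, eps_ratio=2.0):
        # Points may join the same cluster only if their local densities are
        # within a factor of eps_ratio; this is what keeps a sparse cluster
        # from being glued to an adjacent dense one.
        lo, hi = sorted((rho[i], rho[j]))
        return hi / lo <= eps_ratio

    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.1, (20, 2)),   # dense blob
                     rng.normal(5, 1.0, (20, 2))])  # sparse blob
    rho = local_densities(pts)
    print(may_connect(rho, 0, 1), may_connect(rho, 0, 25))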

    A Web-Based Dynamic Cluster Application on K-Means for Classifying Home Industry Data

    The main problem currently facing the Regional Government of the Bangka Belitung Islands Province is the difficulty of classifying home industry data according to Minister of PPPA Regulation No. 2 of 2016 into the categories beginner, developing, and advanced. To address this problem, an extension of the K-means algorithm, the Dynamic Cluster on K-means algorithm, is proposed, with the aim of producing optimal clusters when grouping home industry data through a web-based intelligent application. This study uses the SEMMA data mining analysis method, which includes stages such as data sampling, data description, data transformation, data modeling, and data evaluation. A sample of 3,466 home industries was used. The algorithm's performance was evaluated using the Davies Bouldin Index (DBI) cluster validity measure. Experimental results show that the Dynamic Cluster on K-means algorithm reaches its optimum at the fifth iteration, with the following outcome: the beginner cluster (C1) contains 3,214 records, the developing cluster (C2) 167 records, and the advanced cluster (C3) 85 records. The cluster validity evaluation shows that the Dynamic Cluster on K-means algorithm obtains a smaller DBI value than the standard K-means algorithm, with a DBI of 0.184. The implementation of the dynamic cluster algorithm on K-means for grouping home industry data at the P3ACSKB Office of the Bangka Belitung Islands Province is shown to produce higher-quality clusters
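
    The Davies Bouldin Index used above for validation (lower is better) is available off the shelf; a minimal sketch of such an evaluation with scikit-learn, on synthetic stand-in data rather than the 3,466 home-industry records:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    X = np.random.default_rng(0).random((300, 4))  # placeholder feature matrix
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("DBI:", davies_bouldin_score(X, labels))  # the paper reports 0.184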

    On hierarchical clustering-based approach for RDDBS design

    Distributed database system (DDBS) design remains an open challenge even after decades of research, especially in dynamic network settings. To meet the demands of high-speed data gathering and of managing and preserving huge systems, it is important to construct a distributed database for real-time data storage. Fragmentation schemes such as horizontal, vertical, and hybrid are widely used in DDBS design, and data allocation cannot be done without first physically fragmenting the data, since the fragmentation process is the foundation of DDBS design. Extensive research has been conducted to develop effective solutions for DDBS design problems, but the great majority barely consider the RDDBS's initial design. This work therefore proposes a clustering-based horizontal fragmentation and allocation technique that handles both the early and late stages of DDBS design. Fragmentation and allocation are performed simultaneously so that each operation flows into the next without any increase in complexity. The main goals of this approach are to minimize communication costs, response time, and irrelevant data access. Most importantly, the proposed approach can effectively improve RDDBS performance by simultaneously fragmenting and allocating multiple relations. Through simulations and experiments on synthetic and real databases, we demonstrate the viability of our strategy and show that it considerably lowers communication costs for typical access patterns at both the early and late stages of design
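
    As a rough illustration of the general direction, the sketch below clusters tuples hierarchically into horizontal fragments and greedily allocates each fragment to the site that accesses it most; the access matrix and the cost rule are assumptions for illustration, not the paper's model.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def fragment_and_allocate(tuple_features, site_access, n_fragments=3):
        # Hierarchical (Ward) clustering over tuple features gives the
        # horizontal fragments.
        labels = fcluster(linkage(tuple_features, method="ward"),
                          n_fragments, criterion="maxclust")
        allocation = {}
        for f in range(1, n_fragments + 1):
            members = np.where(labels == f)[0]
            # Greedy rule: place fragment f at the site with the highest
            # total access frequency over its tuples.
            allocation[f] = int(site_access[members].sum(axis=0).argmax())
        return labels, allocation

    rng = np.random.default_rng(0)
    tuple_features = rng.random((30, 4))          # placeholder tuple encoding
    site_access = rng.integers(0, 10, (30, 3))    # accesses per tuple per site
    print(fragment_and_allocate(tuple_features, site_access))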

    Distributed k-Means with Outliers in General Metrics

    Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem, which, given a set P of points from a metric space and a parameter k < |P|, requires finding a subset S ⊂ P of k points, dubbed centers, which minimizes the sum of all squared distances of points in P from their closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known polynomial-time sequential (possibly bicriteria) approximation algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics
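
    The three-round coreset structure can be imitated sequentially: summarize each partition by weighted representatives, solve a weighted k-means on the union, then discard the z farthest points as outliers. The summary size and the final outlier rule below are simplifications for illustration, not the paper's algorithm.

    import numpy as np
    from sklearn.cluster import KMeans

    def local_coreset(part, size):
        # Per-reducer step: compress a partition into `size` weighted centers,
        # each weighted by the number of points it represents.
        km = KMeans(n_clusters=size, n_init=4, random_state=0).fit(part)
        return km.cluster_centers_, np.bincount(km.labels_, minlength=size)

    def kmeans_with_outliers(points, k, z, coreset_size=50, partitions=4):
        parts = np.array_split(points, partitions)
        reps, w = zip(*(local_coreset(p, coreset_size) for p in parts))
        C, W = np.vstack(reps), np.concatenate(w)
        # Final round: weighted k-means on the coreset union, then drop the
        # z points farthest from their nearest center.
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
            C, sample_weight=W)
        d = np.linalg.norm(points[:, None] - km.cluster_centers_[None],
                           axis=2).min(axis=1)
        inliers = np.argsort(d)[:-z] if z > 0 else np.arange(len(points))
        return km.cluster_centers_, inliers

    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 1, (1000, 2)), rng.normal(20, 1, (1000, 2)),
                     [[100.0, 100.0], [-90.0, 80.0]]])  # two planted outliers
    centers, inliers = kmeans_with_outliers(pts, k=2, z=2)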

    The application of K-means clustering for province clustering in Indonesia of the risk of the COVID-19 pandemic based on COVID-19 data

    This study clusters the provinces of Indonesia by the risk of the COVID-19 pandemic based on coronavirus disease 2019 (COVID-19) data obtained from the Indonesian COVID-19 Task Force (SATGAS COVID-19) on 19 April 2020. Provinces were grouped according to confirmed, death, and recovered COVID-19 cases using the K-means clustering method, which produced 3 provincial groups. Clustering provinces on COVID-19 cases is an attempt to determine the closeness or similarity of provinces with respect to confirmed, recovered, and death cases. The results of the provincial clustering are expected to give the government input for policies that restrict community activities, or other policies, to curb the spread of COVID-19
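
    A minimal sketch of the grouping described above: provinces as rows, (confirmed, recovered, death) counts as features, K-means with k = 3. The names and numbers are placeholders, not the SATGAS COVID-19 figures of 19 April 2020.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    provinces = ["A", "B", "C", "D"]           # hypothetical province labels
    cases = np.array([[3032, 228, 270],        # confirmed, recovered, died
                      [ 603,  91,  58],
                      [ 101,  14,   9],
                      [  35,   6,   2]], dtype=float)
    X = StandardScaler().fit_transform(cases)  # scale features before K-means
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    for p, c in zip(provinces, labels):
        print(p, "-> cluster", c)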

    Designing a relational model to identify relationships between suspicious customers in anti-money laundering (AML) using social network analysis (SNA)

    The stability of a country's economy and political system depends heavily on its anti-money laundering (AML) policy. If government policies cannot handle money laundering activities appropriately, control of the economy can pass to criminals. The current literature provides various technical solutions for controlling such activities, such as clustering-based anomaly detection, rule-based systems, and decision tree algorithms, which can help identify suspicious customers or transactions. However, it offers no effective solutions for identifying relationships between suspicious customers or transactions. The current challenge in the field is to identify the links between suspicious customers involved in money laundering. To address this challenge, this paper discusses the difficulties of identifying relationships such as business and family ties and proposes a model that identifies links between suspicious customers using social network analysis (SNA). The proposed model aims to uncover the mafias and groups involved in money laundering activities, thereby helping to prevent money laundering and potential terrorist financing. It is based on relational data from customer profiles and on social network metrics for identifying suspicious customers and transactions. A series of experiments conducted on financial data show promising results, from which financial institutions can gain real benefits
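
    A toy sketch of the SNA direction: customers as nodes, transactions as weighted edges, and a centrality ranking of customers who share a connected component with already-flagged ones. The data, flagging rule, and thresholds are illustrative assumptions.

    import networkx as nx

    G = nx.Graph()
    transfers = [("alice", "bob", 9500), ("bob", "carol", 9400),
                 ("carol", "alice", 9300), ("dave", "erin", 120)]
    for src, dst, amount in transfers:
        G.add_edge(src, dst, weight=amount)

    flagged = {"alice"}  # customers already marked suspicious upstream
    centrality = nx.betweenness_centrality(G)
    for component in nx.connected_components(G):
        if component & flagged:
            # Everyone sharing a component with a flagged customer is a
            # candidate link; rank by centrality for analyst review.
            for c in sorted(component - flagged, key=centrality.get,
                            reverse=True):
                print(c, "linked to flagged customer(s); centrality",
                      round(centrality[c], 3))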

    Development of a simulator for networks of evolutionary processors (NEPs) in the cloud (Spark)

    Máster Universitario en Investigación e Innovación en Tecnologías de la Información y las Comunicaciones (i2-TIC).
    Nature-inspired computing has become one of the most frequently used techniques for handling complex problems such as NP-hard optimization problems. This kind of computing has several advantages over traditional computing, including resiliency, parallel data processing, and low power consumption. One of the active research areas in nature-inspired algorithms is Networks of Evolutionary Processors (NEPs). A NEP consists of several cells attached together in a graph: the cells are the nodes, and the edges transfer data between them. In this thesis we construct a NEPs system implemented on the Hadoop/Spark environment. The Spark platform is essential to this work because of the capabilities it supplies: it is a suitable environment for solving complicated problems and a natural choice for designing the NEPs system in a distributed manner. For this reason, the thesis details how to install, design, and operate the system on Apache Spark. The NEPs simulation is delivered in this work, together with an analysis of the system's parameters for performance evaluation, examining each factor that affects the performance of the NEPs individually. After testing the system, it became clear that running NEPs on a decentralized cloud ecosystem is an effective method for handling data of different formats and for executing optimization problems such as the Adleman, 3-colorability, and Massive-NEP problems. Moreover, the scheme is robust and can be adapted to data that scales up to big data, characterized by its volume and heterogeneity; in this context, heterogeneity refers to collecting data from different sources. The utilization of Spark as the platform for the NEPs system also has its advantages: the environment is characterized by fast task handling, building on the Hadoop architecture and its map and reduce functions. Distributing the tasks of the NEPs system on this cloud-based environment made it possible to obtain sound results in all three of the examples investigated
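
    Independent of Spark, the core NEP cycle alternates an evolution step (cells rewrite their words with substitution rules) and a communication step (words move along edges when they pass the receiving cell's input filter). A bare-bones sketch with deliberately simplified rule and filter formats, not the thesis' actual data structures:

    def evolve(words, rules):
        # Apply each substitution rule (a -> b) once to every word.
        out = set(words)
        for a, b in rules:
            out |= {w.replace(a, b, 1) for w in words if a in w}
        return out

    def communicate(cells, edges, filters):
        # A copy of a word enters a neighbouring cell if that cell's input
        # filter (a set of required symbols) is satisfied by the word.
        moved = {c: set(ws) for c, ws in cells.items()}
        for u, v in edges:
            for w in cells[u]:
                if filters[v] <= set(w):
                    moved[v].add(w)
        return moved

    cells = {0: {"aab"}, 1: set()}
    cells = {c: evolve(ws, [("a", "b")]) for c, ws in cells.items()}
    cells = communicate(cells, [(0, 1)], {0: set(), 1: {"b"}})
    print(cells)  # cell 1 now holds the words containing the symbol "b"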

    DENCAST: distributed density-based clustering for multi-target regression

    Recent developments in sensor networks and mobile computing have led to a huge increase in generated data that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality-sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: it significantly outperforms (p-value < 0.05) state-of-the-art distributed regression methods, in both single- and multi-target settings
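
    DENCAST's scalability rests on locality-sensitive hashing, which finds neighbour candidates without all-pairs distance computations. A minimal sketch of one standard scheme, random-hyperplane LSH; the paper's exact hash family and Spark pipeline are not reproduced here.

    import numpy as np

    def lsh_buckets(X, n_planes=8, seed=0):
        # Points whose signs agree on all random hyperplanes share a bucket,
        # so nearby points are likely to collide and density estimation only
        # needs to compare points within the same bucket.
        rng = np.random.default_rng(seed)
        planes = rng.normal(size=(n_planes, X.shape[1]))
        signs = (X @ planes.T) > 0
        buckets = {}
        for i, signature in enumerate(map(tuple, signs)):
            buckets.setdefault(signature, []).append(i)
        return buckets

    X = np.random.default_rng(1).normal(size=(1000, 16))
    for signature, members in list(lsh_buckets(X).items())[:3]:
        print(len(members), "points share signature", signature[:4], "...")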