Building a Log File Analysis Program Using the MapReduce System to Identify Attacks on Web Applications
Query Injection and Cross Site Scripting attacks on web applications can be identified from web server log files. Sites with large numbers of visitors face the problem that their web server log files grow so large that processing them takes a long time. Parallel processing can help process such large data sets so that execution times are shorter, but the machinery that manages parallel execution, such as distributing data and computation, controlling the system, and handling hardware failures, is very complicated to implement. MapReduce is a library that automatically handles data and computation distribution, system control, and hardware failure handling, so that writing programs that run in parallel is no longer complicated.
The goal of this research is to build a log file analysis program that identifies attacks on web applications by implementing the MapReduce system. The MapReduce framework used is Hadoop. The execution time of the resulting application is compared with that of existing log file analysis programs, namely Webalizer and AWStats.
The log file analysis program runs in parallel on four interconnected computers, while the comparison applications run on a single computer. The input to the program is web server log files of various sizes, and the output is an analysis of attacks on the web application.
The execution time comparison shows that MapReduce requires a longer execution time than Webalizer and AWStats. This is because the MapReduce system is better suited to running on a large number of computers, with files too large to be stored and processed on a single computer.
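As an illustration of how such an analysis could be expressed as a MapReduce job, the sketch below shows a Hadoop mapper that flags log lines matching naive Query Injection and Cross Site Scripting signatures. The class name and regular expressions are assumptions for illustration, not the program described above; a standard sum reducer (such as Hadoop's bundled IntSumReducer) would then total the matches per attack type.

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (attackType, 1) for each web server log line matching a naive
// attack signature; a sum reducer aggregates the counts per attack type.
public class AttackSignatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Illustrative signatures only; real detectors need far richer rules.
    private static final Pattern SQLI =
            Pattern.compile("(?i)(union\\s+select|or\\s+1=1|'--)");
    private static final Pattern XSS =
            Pattern.compile("(?i)(<script|%3Cscript|javascript:)");

    private static final IntWritable ONE = new IntWritable(1);
    private final Text attackType = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (SQLI.matcher(line).find()) {
            attackType.set("query-injection");
            context.write(attackType, ONE);
        }
        if (XSS.matcher(line).find()) {
            attackType.set("xss");
            context.write(attackType, ONE);
        }
    }
}
```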
Introducing Cloud Computing Topics in Curricula
The demand for graduates with exposure to Cloud Computing is on the rise. For many educational institutions, the challenge is to decide how to incorporate appropriate cloud-based technologies into their curricula. In this paper, we describe our design and experiences of integrating Cloud Computing components into seven third/fourth-year undergraduate-level information systems, computer science, and general science courses related to large-scale data processing and analysis at the University of Queensland, Australia. For each course, we aimed to find the best-available and most cost-effective cloud technologies that fit well into the existing curriculum. The cloud-related technologies discussed in this paper include open-source distributed computing tools such as Hadoop, Mahout, and Hive, as well as cloud services such as Windows Azure and Amazon Elastic Compute Cloud (EC2). We anticipate that our experiences will prove useful and of interest to fellow academics wanting to introduce Cloud Computing modules into existing courses.
A semi-automatic parallelization tool for Java based on fork-join synchronization patterns
Because of the increasing availability of multi-core machines, clusters, Grids, and combinations of these environments, there is now plenty of computational power available for executing compute-intensive applications. However, because of the overwhelming and rapid advances in distributed and parallel hardware and environments, today's programmers are not fully prepared to exploit distribution and parallelism. In this sense, the Java language has helped in handling the heterogeneity of such environments, but there is a lack of facilities and tools for easily distributing and parallelizing applications. One solution that mitigates this problem and makes some progress towards producing general tools seems to be the synthesis of semi-automatic parallelism and Parallelism as a Concern (PaaC), which allows parallelizing applications with as few modifications to sequential code as possible. In this paper, we discuss a new approach that aims at overcoming the drawbacks of current Java-based parallel and distributed development tools by exploiting precisely these new concepts.
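The fork-join synchronization pattern that such a tool targets has been available in the standard library since Java 7. The sketch below is a plain java.util.concurrent illustration of the pattern, not the semi-automatic tool itself: a recursive array sum that forks subtasks and joins their results; the threshold value is an arbitrary assumption.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums an array by recursively forking halves: the classic fork-join
// pattern that a parallelizing tool would synthesize from sequential code.
public class ArraySum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this, run sequentially
    private final long[] data;
    private final int lo, hi;

    ArraySum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        ArraySum left = new ArraySum(data, lo, mid);
        ArraySum right = new ArraySum(data, mid, hi);
        left.fork();                       // run left half asynchronously
        long rightSum = right.compute();   // compute right half in this thread
        return left.join() + rightSum;     // wait for left, then combine
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1L);
        long total = ForkJoinPool.commonPool()
                .invoke(new ArraySum(data, 0, data.length));
        System.out.println(total); // 1000000
    }
}
```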
The impact of I/O-intensive applications on job scheduling in non-dedicated clusters
With the growing capacity of processing nodes relative to their compute power, more and more data-intensive applications, such as bioinformatics workloads, will be run on non-dedicated clusters. Non-dedicated clusters are characterized by their ability to combine the execution of local users' applications with scientific or commercial applications executed in parallel. Knowing what effect I/O-intensive applications have when mixed with other workload types (batch, interactive, SRT, etc.) in non-dedicated environments enables the development of more efficient scheduling policies. Some I/O-intensive applications are based on the MapReduce paradigm, whose runtime environments, such as Hadoop, handle data locality and load balancing automatically and work with distributed file systems. Hadoop's performance can be improved without increasing hardware costs by tuning several key configuration parameters to the cluster's specifications, the input data size, and the complexity of the processing. Tuning these parameters can be too complex for the user and/or administrator, but it aims to guarantee more adequate performance. This work proposes evaluating the impact of I/O-intensive applications on job scheduling in non-dedicated clusters under the MPI and MapReduce paradigms.
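As a hedged illustration of the kind of parameter tuning referred to above, the sketch below sets a few well-known Hadoop 2.x configuration knobs from Java; the specific values are placeholders that would have to be derived from the actual cluster, input size, and processing complexity.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative tuning of a few standard Hadoop 2.x parameters; the values
// are placeholders, not recommendations for any particular cluster.
public class TunedJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Buffer used to sort map output before spilling to disk (MB).
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Compress intermediate map output to trade CPU for I/O.
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "io-intensive-workload");
        // Reducer count is one of the key knobs for fitting a job to a cluster.
        job.setNumReduceTasks(8);
        return job;
    }
}
```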
A resource aware distributed LSI algorithm for scalable information retrieval
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Latent Semantic Indexing (LSI) is one of the popular techniques in the information retrieval field. Unlike traditional information retrieval techniques, LSI is not simply based on keyword matching; it uses statistical and algebraic computations. Based on Singular Value Decomposition (SVD), the high-dimensional term-document matrix is converted into a lower-dimensional approximate matrix from which noise is filtered out. The issues of synonymy and polysemy that affect traditional techniques can also be overcome by examining how terms relate to documents. However, it is notable that LSI suffers from a scalability issue due to the computational complexity of SVD.
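For reference, the standard rank-k truncated SVD underlying LSI (the textbook formulation, not anything specific to this thesis) can be written as follows.

```latex
% Rank-k truncated SVD behind LSI: the term-document matrix A is approximated
% by its k largest singular triplets, filtering noise and conflating synonyms.
\[
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
\]
% A query vector q is folded into the same k-dimensional concept space:
\[
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
\]
```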
This thesis presents MR-LSI, a resource-aware distributed LSI algorithm that solves the scalability issue using the Hadoop framework, based on the MapReduce distributed computing model. It also addresses the overhead introduced by the clustering algorithm involved. The evaluations indicate that MR-LSI gains significant speedup over the other strategies when processing large collections of documents. One remarkable advantage of Hadoop is that it supports heterogeneous computing environments, which makes load imbalance among nodes a prominent issue. Therefore, a genetic-algorithm-based load balancing algorithm for static environments is proposed. The results show that it can improve the performance of a cluster across heterogeneity levels.
Considering dynamic Hadoop environments, a dynamic load balancing strategy with a varying window size is also proposed. The algorithm works by making data-selection decisions and by modeling Hadoop's parameters and working mechanisms. Employing an improved genetic algorithm to derive an optimized schedule, it enhances the performance of a cluster with given heterogeneity levels.
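To make the genetic-algorithm load balancing idea concrete, here is a toy sketch (not the thesis algorithm) that evolves an assignment of equal-sized data blocks to nodes of differing speeds so as to minimize the makespan; the population size, mutation rate, and node speeds are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Toy genetic algorithm assigning equal-sized data blocks to nodes of varying
// speed so that the slowest node finishes as early as possible (min makespan).
public class GaLoadBalancer {
    static final Random RNG = new Random(42);

    // Makespan of an assignment: blocks mapped to nodes, speed in blocks/sec.
    static double makespan(int[] assign, double[] speed) {
        double[] load = new double[speed.length];
        for (int node : assign) load[node]++;
        double worst = 0;
        for (int n = 0; n < speed.length; n++)
            worst = Math.max(worst, load[n] / speed[n]);
        return worst;
    }

    public static int[] evolve(int blocks, double[] speed,
                               int popSize, int generations) {
        int[][] pop = new int[popSize][blocks];
        for (int[] ind : pop)
            for (int b = 0; b < blocks; b++) ind[b] = RNG.nextInt(speed.length);

        for (int g = 0; g < generations; g++) {
            int[][] next = new int[popSize][];
            for (int i = 0; i < popSize; i++) {
                int[] p1 = tournament(pop, speed);
                int[] p2 = tournament(pop, speed);
                int[] child = p1.clone();
                int cut = RNG.nextInt(blocks);           // one-point crossover
                System.arraycopy(p2, cut, child, cut, blocks - cut);
                if (RNG.nextDouble() < 0.2)              // mutation: move a block
                    child[RNG.nextInt(blocks)] = RNG.nextInt(speed.length);
                next[i] = child;
            }
            pop = next;
        }
        return Arrays.stream(pop)
                .min((a, b) -> Double.compare(makespan(a, speed),
                                              makespan(b, speed)))
                .orElseThrow();
    }

    static int[] tournament(int[][] pop, double[] speed) {
        int[] a = pop[RNG.nextInt(pop.length)];
        int[] b = pop[RNG.nextInt(pop.length)];
        return makespan(a, speed) <= makespan(b, speed) ? a : b;
    }

    public static void main(String[] args) {
        double[] speed = {1.0, 2.0, 4.0};   // heterogeneous node capacities
        int[] best = evolve(70, speed, 60, 200);
        System.out.println("makespan: " + makespan(best, speed));
    }
}
```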
Computing resources sensitive parallelization of neural networks for large scale diabetes data modelling, diagnosis and prediction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Diabetes has become one of the most severe diseases due to an increasing number of diabetes patients globally. A large amount of digital data on diabetes has been collected through various channels. How to utilize these data sets to help doctors make decisions on the diagnosis, treatment and prediction of diabetic patients poses many challenges to the research community. The thesis investigates mathematical models, with a focus on neural networks, for large-scale diabetes data modelling and analysis, utilizing modern computing technologies such as grid computing and cloud computing. These technologies give users inexpensive access to extensive computing resources over the Internet for solving data- and computationally intensive problems. The thesis evaluates the performance of seven representative machine learning techniques in classifying diabetes data; the results show that the neural network produces the best classification accuracy but incurs a high overhead in data training. As a result, the thesis develops MRNN, a parallel neural network model based on the MapReduce programming model, which has become an enabling technology in support of data-intensive applications in the clouds.
By partitioning the diabetes data set into a number of equally sized data blocks, the training workload is distributed among a number of computing nodes for speedup. MRNN is first evaluated in small-scale experimental environments using 12 mappers and subsequently in large-scale simulated environments using up to 1000 mappers. Both the experimental and simulation results show the effectiveness of MRNN in classification and its high scalability in data training.
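One simple way such partitioned training results can be combined, shown here as an assumption-laden sketch rather than MRNN's actual merge rule, is parameter averaging in a reducer: each mapper trains on its block and emits a comma-separated weight vector keyed by layer, and the reducer averages them.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Averages per-block weight vectors emitted by mappers that each trained a
// neural network on one data partition. Parameter averaging is one common
// way to combine partitioned training; the thesis' exact rule may differ.
public class WeightAverageReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text layer, Iterable<Text> vectors, Context ctx)
            throws IOException, InterruptedException {
        double[] sum = null;
        int count = 0;
        for (Text v : vectors) {
            String[] parts = v.toString().split(",");
            if (sum == null) sum = new double[parts.length];
            for (int i = 0; i < parts.length; i++)
                sum[i] += Double.parseDouble(parts[i]);
            count++;
        }
        StringBuilder avg = new StringBuilder();
        for (int i = 0; i < sum.length; i++) {
            if (i > 0) avg.append(',');
            avg.append(sum[i] / count);
        }
        ctx.write(layer, new Text(avg.toString()));
    }
}
```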
MapReduce does not have a sophisticated job scheduling scheme for heterogeneous computing environments in which the computing nodes may have varied computing capabilities. For this purpose, the thesis develops a load balancing scheme based on genetic algorithms that aims to balance the training workload among heterogeneous computing nodes, so that nodes with more computing capacity receive more MapReduce jobs for execution. Divisible load theory is employed to guide the evolutionary process of the genetic algorithm towards fast convergence. The proposed load balancing scheme is evaluated in large-scale simulated MapReduce environments with varied levels of heterogeneity using data sets of different sizes. All the results show that the genetic-algorithm-based load balancing scheme significantly reduces the job-execution makespan in comparison with the time consumed without load balancing. This work is funded by the EPSRC and the China Market Association.
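In its simplest form, divisible load theory assigns each node a share of the workload proportional to its capacity, so that all nodes finish at roughly the same time; the small sketch below computes those shares for illustrative capacities, as a minimal reading of the principle rather than the thesis' guidance scheme.

```java
// Divisible load theory, simplest form: each node's share of the workload is
// proportional to its processing capacity. Capacities here are illustrative.
public class DivisibleLoad {
    // Returns the fraction of the total workload assigned to each node.
    public static double[] shares(double[] capacity) {
        double total = 0;
        for (double c : capacity) total += c;
        double[] alpha = new double[capacity.length];
        for (int i = 0; i < capacity.length; i++)
            alpha[i] = capacity[i] / total;
        return alpha;
    }

    public static void main(String[] args) {
        double[] alpha = shares(new double[] {1.0, 2.0, 4.0});
        // The fastest node gets 4/7 of the data, the slowest gets 1/7.
        for (double a : alpha) System.out.printf("%.3f%n", a);
    }
}
```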
Running parallel applications on a heterogeneous environment with accessible development practices and automatic scalability
Grid computing makes it possible to gather large quantities of resources to work on a problem. In order to exploit this potential, a framework is needed that presents the resources to the user programmer in a form that maintains productivity. The framework must not only provide accessible development but must also make efficient use of the resources. The Seeds framework is proposed. It uses current Grid and distributed computing middleware to provide a parallel programming environment to a wider community of programmers. The framework was used to investigate the feasibility of scaling skeleton/pattern parallel programming to Grid computing. The research accomplished two goals: it made parallel programming on the Grid more accessible to domain-specific programmers, and it made parallel programs scale on a heterogeneous resource environment. Programming is made easier for the programmer by using skeleton- and pattern-based programming approaches that effectively isolate the program from the environment. To extend the pattern approach, the pattern adder operator is proposed, implemented and tested. The results show the pattern operator can reduce the number of lines of code compared with an MPJExpress implementation of a stencil algorithm, while having an overhead of at most ten microseconds per iteration. The research on scalability involved adapting existing load-balancing techniques to skeletons and patterns, requiring little additional configuration on the part of the programmer. The hierarchical dependency concept is also proposed, which uses a streamed data-flow programming model. The concept introduces data-flow computation hibernation and dependencies that can split to accommodate additional processors. The results from implementing skeletons/patterns on hierarchical dependencies show that an 18.23% increase in code is necessary to enable automatic scalability. The concept can increase speedup depending on the algorithm and grain size.
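As a rough illustration of the skeleton/pattern separation that frameworks such as Seeds aim for (this is not the Seeds API), the sketch below shows a minimal "farm" skeleton in plain Java: the programmer supplies only the per-item function, and the skeleton hides all thread management.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// A tiny "farm" skeleton: the user code is isolated from the execution
// environment, mirroring the separation skeleton frameworks provide.
public class FarmSkeleton {
    public static <I, O> List<O> farm(List<I> inputs, Function<I, O> worker,
                                      int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I in : inputs)
                futures.add(pool.submit(() -> worker.apply(in)));
            List<O> results = new ArrayList<>();
            for (Future<O> f : futures) results.add(f.get());
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> squares = farm(List.of(1, 2, 3, 4), x -> x * x, 4);
        System.out.println(squares); // [1, 4, 9, 16]
    }
}
```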