An evolutionary algorithm for online, resource constrained, multi-vehicle sensing mission planning
Mobile robotic platforms are an indispensable tool for various scientific and
industrial applications. Robots are used to undertake missions whose execution
is constrained by various factors, such as the allocated time or their
remaining energy. Existing solutions for resource constrained multi-robot
sensing mission planning provide optimal plans at a prohibitive computational
complexity for online application [1],[2],[3]. A heuristic approach exists for
an online, resource constrained sensing mission planning for a single vehicle
[4]. This work proposes a Genetic Algorithm (GA) based heuristic for the
Correlated Team Orienteering Problem (CTOP) that is used for planning sensing
and monitoring missions for robotic teams that operate under resource
constraints. The heuristic is compared against optimal Mixed Integer Quadratic
Programming (MIQP) solutions. Results show that, in the worst case, the
heuristic solution is within 5% of the optimal solution. The heuristic also
proved to be at least 300 times more time-efficient in the worst tested case,
and its execution took less than a second in the worst case, making it
suitable for online execution.
Comment: 8 pages, 5 figures, accepted for publication in Robotics and Automation Letters (RA-L)
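The abstract does not spell out the paper's CTOP encoding; as a hedged illustration of the general approach, the sketch below runs a small genetic algorithm over a budget-constrained site-selection problem, a simplified stand-in for resource-constrained sensing mission planning with hypothetical reward/cost inputs:

```python
import random

def ga_select_sites(rewards, costs, budget, pop_size=40, generations=100, seed=0):
    """Evolve a subset of sensing sites that maximizes collected reward
    under a resource budget (a knapsack-like stand-in for CTOP)."""
    rng = random.Random(seed)
    n = len(rewards)

    def fitness(bits):
        cost = sum(c for b, c in zip(bits, costs) if b)
        if cost > budget:          # infeasible individual: penalize by its cost
            return -cost
        return sum(r for b, r in zip(bits, rewards) if b)

    def tournament(pop):
        a, b = rng.sample(pop, 2)  # binary tournament selection
        return max(a, b, key=fitness)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randrange(1, n)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child[rng.randrange(n)] ^= 1       # single bit-flip mutation
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness)  # keep the best-ever solution
    return best, fitness(best)
```

The real problem adds routing between sites and inter-site reward correlations, which this sketch deliberately omits.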
Compiler Transformations to Generate Reentrant C Programs to Assist Software Parallelization
As we move through the multi-core era into the many-core era it becomes obvious that thread-based programming is here to stay. This trend in the development of general purpose hardware is augmented by the fact that while writing sequential programs is considered a non-trivial task, writing parallel applications to take advantage of the advances in the number of cores in a processor severely complicates the process.
Writing parallel applications requires programs and functions to be reentrant. Therefore, we cannot use globals and statics. However, globals and statics are useful in certain contexts. Globals allow an easy programming mechanism to share data between several functions. Statics provide the only mechanism of data hiding in C for variables that are global in scope. Writing parallel programs restricts users from using globals and statics in their programs, as doing so would make the program non-reentrant.
Moreover, there is a large existing legacy code base of sequential programs that are non-reentrant, since they rely on statics and globals. Several of these sequential programs display significant amounts of data parallelism by operating on independent chunks of input data, and therefore can be easily converted into parallel versions to exploit multi-core processors. Indeed, several such programs have been manually converted into parallel versions. However, manually eliminating all globals and statics to make the program reentrant is tedious, time-consuming, and error-prone.
In this paper we describe a system to provide a semi-automated mechanism for users to still be able to use statics and globals in their programs, and to let the compiler automatically convert them into their semantically-equivalent reentrant versions, enabling their parallelization later.
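The transformation described above targets C, where globals and statics become thread-local or are threaded through as parameters. As a minimal cross-language sketch of the idea (not the paper's actual compiler pass), the Python fragment below contrasts a non-reentrant global counter with a per-thread version using `threading.local`, the analogue of C11 `_Thread_local` storage:

```python
import threading

# Non-reentrant version: a shared module-level global, as in legacy C code.
counter = 0

def bump_shared():
    global counter
    counter += 1            # races when called from several threads
    return counter

# Reentrant version: the transformation moves each global/static into
# per-thread storage; threading.local is Python's analogue of the
# compiler-generated thread-local copies.
_tls = threading.local()

def bump_reentrant():
    if not hasattr(_tls, "counter"):
        _tls.counter = 0    # each thread gets its own private copy
    _tls.counter += 1
    return _tls.counter
```

With per-thread copies, each thread counts independently, so the function can be called safely from data-parallel workers.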
Sort-based grouping and aggregation
Database query processing requires algorithms for duplicate removal,
grouping, and aggregation. Three algorithms exist: in-stream aggregation is
most efficient by far but requires sorted input; sort-based aggregation relies
on external merge sort; and hash aggregation relies on an in-memory hash table
plus hash partitioning to temporary storage. Cost-based query optimization
chooses which algorithm to use based on several factors including input and
output sizes, the sort order of the input, and the need for sorted output. For
example, hash-based aggregation is ideal for small output (e.g., TPC-H Query
1), whereas sorting the entire input and aggregating after sorting are
preferable when both aggregation input and output are large and the output
needs to be sorted for a subsequent operation such as a merge join.
Unfortunately, the size information required for a sound choice is often
inaccurate or unavailable during query optimization, leading to sub-optimal
algorithm choices. To address this challenge, this paper introduces a new
algorithm for sort-based duplicate removal, grouping, and aggregation. The new
algorithm always performs at least as well as both traditional hash-based and
traditional sort-based algorithms. It can serve as a system's only aggregation
algorithm for unsorted inputs, thus preventing erroneous algorithm choices.
Furthermore, the new algorithm produces sorted output that can speed up
subsequent operations. Google's F1 Query uses the new algorithm in production
workloads that aggregate petabytes of data every day.
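The paper's algorithm itself is not reproduced in this abstract; as a minimal illustration of the sort-based strategy it builds on, sorting first and then folding adjacent equal keys in a single pass both aggregates and yields sorted output for downstream operators:

```python
from itertools import groupby
from operator import itemgetter

def sort_based_aggregate(rows, key_idx=0, val_idx=1):
    """Group rows by key and sum values: sort, then aggregate in one pass.
    After sorting, equal keys are adjacent, so each group can be folded as
    it streams by (in-stream aggregation), and the output comes out sorted."""
    rows = sorted(rows, key=itemgetter(key_idx))  # an external merge sort in a real engine
    return [(k, sum(r[val_idx] for r in grp))
            for k, grp in groupby(rows, key=itemgetter(key_idx))]
```

A hash aggregation would instead fold rows into an in-memory table and emit unsorted groups, which is why the choice between the two depends on output size and ordering requirements.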
Hadoop performance modeling and job optimization for big data analytics
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
Big data has gained momentum in both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics. Hadoop, an open source implementation of the MapReduce model, has been widely taken up by the community, and cloud service providers such as Amazon EC2 now support Hadoop user applications. However, a key challenge is that the cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for a job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model employs a Locally Weighted Linear Regression (LWLR) model to estimate the execution time of a job and the Lagrange multiplier technique for resource provisioning to satisfy user jobs with a given deadline. The performance of the proposed model is extensively evaluated both on an in-house Hadoop cluster and on the Amazon EC2 cloud. Experimental results show that the proposed model is highly accurate in job execution estimation, and that jobs are completed within the required deadlines when following the resource provisioning scheme of the proposed model. In addition, the Hadoop framework has over 190 configuration parameters, some of which have significant effects on the performance of a Hadoop job. Manually setting optimal values for these parameters is a challenging and time-consuming task. This thesis presents optimization work that enhances the performance of Hadoop by automatically tuning its parameter values.
It employs the Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlation among the configuration parameters. For the purpose of optimization, Particle Swarm Optimization (PSO) is employed to automatically find optimal or near-optimal configuration settings. The performance of the proposed work is intensively evaluated on a Hadoop cluster, and the experimental results show that it enhances the performance of Hadoop significantly compared with the default settings.
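As an illustrative sketch (not the thesis's actual multi-variable model), the following shows the Locally Weighted Linear Regression step in its simplest one-dimensional form, with a Gaussian kernel bandwidth `tau` as an assumed parameter:

```python
import math

def lwlr_predict(x_query, xs, ys, tau=1.0):
    """Locally Weighted Linear Regression for one query point (1-D case).
    Each training point is weighted by a Gaussian kernel of its distance
    to x_query; a weighted least-squares line is then fitted and evaluated."""
    w = [math.exp(-(x - x_query) ** 2 / (2 * tau ** 2)) for x in xs]
    # Weighted normal equations for y = b0 + b1 * x
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0 + b1 * x_query
```

The local weighting is what lets the model track a non-linear relationship between, say, input data size and job execution time, by fitting a fresh line near each query point.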
Fuzzy Differential Evolution Algorithm
The Differential Evolution (DE) algorithm is a powerful search technique for solving global optimization problems over continuous space. The search initialization of this algorithm does not adequately capture vague preliminary knowledge from the problem domain. This thesis proposes a novel Fuzzy Differential Evolution (FDE) algorithm as an alternative approach, in which vague information about the search space can be represented and used to deliver a more efficient search. The proposed FDE algorithm utilizes fuzzy set theory concepts to modify the search initialization and mutation components of the traditional DE algorithm. FDE, alongside other key DE features, is implemented in a convenient decision support system software package. Four benchmark functions are used to demonstrate the performance of the new FDE and its practical utility. Additionally, the application of the algorithm is illustrated through a water management case study. The new algorithm shows faster convergence on most of the benchmark functions.
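The fuzzy initialization and mutation components are the thesis's contribution and are not reproduced here; for reference, a minimal classic DE/rand/1/bin baseline, the algorithm that FDE modifies, can be sketched as:

```python
import random

def differential_evolution(f, bounds, pop_size=20, F=0.8, CR=0.9, iters=200, seed=1):
    """Classic DE/rand/1/bin minimizer. The FDE variant described above would
    additionally bias initialization and mutation with fuzzy membership
    functions (not shown here)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    for _ in range(iters):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jrand = rng.randrange(dim)          # force at least one mutated dim
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == jrand:   # binomial crossover
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])  # rand/1 mutation
                    lo, hi = bounds[j]
                    trial.append(min(max(v, lo), hi))
                else:
                    trial.append(pop[i][j])
            ft = f(trial)
            if ft <= fit[i]:                    # greedy one-to-one selection
                pop[i], fit[i] = trial, ft
    best = min(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]
```

Replacing the uniform initialization with samples drawn from fuzzy membership functions over the search space is where domain knowledge would enter.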
Balancer genetic algorithm: a novel task scheduling optimization approach in cloud computing
Task scheduling is one of the core issues in cloud computing. Tasks are heterogeneous and have intensive computational requirements. They need to be scheduled on Virtual Machines (VMs), the resources of a cloud environment. Due to the immensity of the search space of possible mappings of tasks to VMs, meta-heuristics have been introduced for task scheduling. In scheduling, makespan and load balancing are crucial Quality of Service (QoS) parameters. This research contributes a novel load-balancing scheduler, the Balancer Genetic Algorithm (BGA), which improves makespan and load balancing. Insufficient load balancing wastes resources, as some of them remain idle. BGA incorporates a load balancing mechanism in which the actual load, in terms of millions of instructions assigned to VMs, is considered. The need for multi-objective optimization to improve both load balancing and makespan is also emphasized. Skewed, normal and uniform distributions of workload and different batch sizes are used in the experimentation. BGA exhibits significant improvement in makespan, throughput and load balancing compared with various state-of-the-art approaches.
Improving Data Locality in Distributed Processing of Multi-Channel Remote Sensing Data with Potentially Large Stencils
Distributing multi-channel remote sensing data processing with potentially large stencils
is a difficult challenge. The goal of this master's thesis was to investigate the
performance impact of such processing on a distributed system and to evaluate whether
the total execution time can be improved by exploiting data locality or memory alignment.
The thesis also gives a brief overview of the current state of the art in distributed
processing of remote sensing data and points out why distributed computing will become
more important for it in the future. For the experimental part of this thesis, an application to process huge
arrays on a distributed system was implemented with DASH, a C++ Template Library for
Distributed Data Structures with Support for Hierarchical Locality for High Performance
Computing and Data-Driven Science. Based on the first results, an optimization model
was developed with the goal of reducing network traffic while initializing a distributed
data structure and executing computations on it with potentially large stencils. Furthermore,
software was implemented to estimate the memory layouts with the least network communication
cost for a given multi-channel remote sensing data processing workflow. The results of this
optimization were then executed and evaluated. They show that it is possible to improve the
initialization speed of a large image by 25% by taking brick locality into account. The
optimization model also generates valid decisions for the initialization of the PGAS memory
layouts. However, for a real implementation the optimization model has to be modified to
reflect implementation-dependent sources of overhead. This thesis presented some approaches
towards solving challenges of distributed computing that can be used for real-world remote
sensing imaging applications, and contributed towards solving the challenges of the modern
Big Data world for future scientific data exploitation.
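As a back-of-the-envelope illustration of why layout matters for large stencils (not the thesis's optimization model), the helper below counts the halo ("ghost") cells a block distribution must fetch from neighbouring units, assuming an evenly divisible grid:

```python
def halo_cells(grid_h, grid_w, tiles_y, tiles_x, radius):
    """Total halo (ghost) cells that must be fetched from neighbouring units
    when a grid_h x grid_w image is split into tiles_y x tiles_x blocks and
    a stencil needs `radius` cells in every direction past each tile edge."""
    th, tw = grid_h // tiles_y, grid_w // tiles_x
    total = 0
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            h = 0
            if ty > 0:           h += radius * tw  # fetch from tile above
            if ty < tiles_y - 1: h += radius * tw  # fetch from tile below
            if tx > 0:           h += radius * th  # fetch from tile left
            if tx < tiles_x - 1: h += radius * th  # fetch from tile right
            total += h
    return total
```

For a 1024x1024 image split 16 ways with a stencil radius of 8, 4x4 square tiles transfer 98,304 halo cells per sweep versus 245,760 for 16 row stripes, which is the kind of layout trade-off a communication-cost model can decide automatically.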
Development of Parameter Estimation Methods and Empirical Formulae for the Transient Storage Model for Analysis of Pollutant Mixing in Rivers
Thesis (Master's), Seoul National University Graduate School: College of Engineering, Department of Civil and Environmental Engineering, August 2019. Il Won Seo.
Analyses of solute transport and retention mechanisms are essential to manage water quality and river ecosystems. Tracer injection studies conducted to identify solute transport mechanisms report that concentration curves measured in natural streams have steep rising limbs and long tails. This phenomenon is due to the solute exchange process between transient storage zones and the main river stream. The transient storage model (TSM) is one of the most widely used models for describing solute transport in natural streams, taking the transient storage exchange process into consideration. Using this model requires calibration of four TSM parameters. Inverse modelling using breakthrough curves (BTCs) measured in tracer injection tests is the general method for TSM parameter calibration. However, it is not feasible to perform a tracer injection test for every parameter calibration. For this reason, empirical formulae based on hydraulic data, which are comparatively easier to obtain, have been proposed for parameter estimation. This study presents two methods for TSM parameter estimation. First, an inverse modelling method is proposed that employs the global optimization framework Shuffled Complex-Self Adaptive Hybrid EvoLution (SC-SAHEL), which incorporates well-known evolutionary algorithms from the water resource management field. Second, empirical equations for the TSM parameters are derived using the Multigene Genetic Programming (MGGP) based symbolic regression library GPTIPS and Principal Components Regression (PCR). In terms of general performance, the equations of this study were superior to previously published empirical equations. The resulting parameter estimation framework and empirical equations are practically applicable and are expected to be useful for determining TSM parameters with or without tracer test data.
Chapter 1. Introduction
1.1 Necessity and Background of Research
1.2 Objectives
Chapter 2. Theoretical Background
2.1 Transient Storage Model
2.1.1. Mechanisms of Transient Storage
2.1.2. Models Accounting for Transient Storage
2.1.2.1 The One Zone Transient Storage Model (1Z-TSM)
2.1.2.2 The Two Zone Transient Storage Model (2Z-TSM)
2.1.2.3 The Continuous Time Random Walk Approach (CTRW)
2.1.2.4 The Modified Advection Dispersion Model (MADE)
2.1.2.5 The Fractional Advection Dispersion Equation Model (FADE)
2.1.2.6 The Multirate Mass Transfer Model (MRMT)
2.1.2.7 The Advective Storage Path Model (ASP)
2.1.2.8 The Solute Transport in Rivers Model (STIR)
2.1.2.9 The Aggregate Dead Zone Model (ADZ)
2.2 Empirical Equations for Predicting Transient Storage Model Parameters
2.3 Parameter Estimation
2.3.1. The SC-SAHEL Framework
2.3.1.1 Modified Competitive Complex Evolution (MCCE)
2.3.1.2 Modified Frog Leaping (MFL)
2.3.1.3 Modified Grey Wolf Optimizer (GWO)
2.3.1.4 Modified Differential Evolution (DE)
2.4 Regression Method
2.4.1. The Multi-Gene Genetic Programming (MGGP)
2.4.1.1 The Simple Genetic Programming
2.4.1.2 Scaled Symbolic Regression via Multi-Gene Genetic Programming
2.4.2. Evolutionary Polynomial Regression (EPR)
2.4.2.1 Main Flow of EPR Procedure
Chapter 3. Model Development
3.1 Numerical Model
3.1.1. Model Validation
3.2 Merger of TSM-SC-SAHEL
3.3 Further Assessments for the Parameter Estimation Framework
3.3.1. Tracer Test Description
3.3.2. Grid Independency of Estimation
3.3.3. Choice of Optimization Setting
Chapter 4. Development of Formulae for Predicting TSM Parameters
4.1 Dimensional Analysis
4.2 Data Collection via Meta Analysis
4.3 Formulae Development
Chapter 5. Results and Discussion
5.1 Model Performances
5.2 Sensitivity Analysis
5.3 In-stream Application of Empirical Equations
Chapter 6. Conclusion
References
Appendix I. The mean, minimum, and maximum values of the model fitness value and number of evolutions using SC-SAHEL with single-EA and multi-EA
Appendix II. Dimensionless datasets used for development of empirical equations
Abstract in Korean
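The one-zone TSM referenced in the outline couples an advection-dispersion equation to a storage zone through a first-order exchange term. Dropping advection and dispersion, the exchange subsystem alone can be integrated with explicit Euler to show the relaxation that produces the long breakthrough-curve tails described above (an illustrative sketch, not the thesis's numerical model):

```python
def tsm_exchange(c0, cs0, alpha, area_ratio, dt, steps):
    """Explicit Euler integration of the storage-exchange terms of the
    one-zone TSM, with advection and dispersion dropped:
        dC/dt  = alpha * (Cs - C)
        dCs/dt = alpha * (A/As) * (C - Cs)
    where C is the main-channel and Cs the storage-zone concentration, and
    area_ratio = A/As. Mass A*C + As*Cs is conserved, and both zones relax
    to a common value at rate alpha*(1 + A/As)."""
    c, cs = c0, cs0
    history = [(c, cs)]
    for _ in range(steps):
        dc = alpha * (cs - c)
        dcs = alpha * area_ratio * (c - cs)
        c, cs = c + dt * dc, cs + dt * dcs
        history.append((c, cs))
    return history
```

The slow release of tracer back from the storage zone is what stretches the tail of the measured concentration curve; calibrating alpha and the area ratio against measured BTCs is the inverse problem the SC-SAHEL framework addresses.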
Forecasting Government Bond Spreads with Heuristic Models: Evidence from the Eurozone Periphery
This study investigates the predictability of European long-term government bond spreads through the application of heuristic and metaheuristic support vector regression (SVR) hybrid structures. Genetic, krill herd and sine–cosine algorithms are applied to the parameterization process of the SVR and locally weighted SVR (LSVR) methods. The inputs of the SVR models are selected from a large pool of linear and non-linear individual predictors. The statistical performance of the main models is evaluated against a random walk, an Autoregressive Moving Average, the best individual prediction model and the traditional SVR and LSVR structures. All models are applied to forecast daily and weekly government bond spreads of Greece, Ireland, Italy, Portugal and Spain over the sample period 2000–2017. The results show that the sine–cosine LSVR outperforms its counterparts in terms of statistical accuracy, while metaheuristic approaches seem to benefit the parameterization process more than heuristic ones.
Parallelizing Set Similarity Joins
One of today's major challenges in data science is to compare and relate data of similar nature. Using the join operation known from relational databases can help solve this problem. Given a collection of records, the join operation finds all pairs of records which fulfill a user-chosen predicate. Real-world problems can require complex predicates, such as similarity. A common way to measure similarity is with set similarity functions. In order to use set similarity functions as predicates, we assume records to be represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation.
The amount of data to be processed today is typically large and grows continually. On the other hand, the SSJ is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable implementations for SSJ. In this thesis, we focus on parallelization. We make the following three major contributions to SSJ.
First, we elaborate on the state of the art in parallelizing SSJ. We compare ten MapReduce-based approaches from the literature, both analytically and experimentally. Surprisingly, their main limitation is low scalability, caused by excessive and/or skewed data replication. None of the approaches can compute the join on large datasets.
Second, we leverage the abundant CPU parallelism of modern commodity hardware, which has not yet been considered for scaling SSJ. We propose a novel data-parallel multi-threaded SSJ. Our approach provides significant speedups compared to single-threaded execution.
Third, we propose a novel, highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism, we avoid data replication and recomputation. A heuristic assigns similar shares of the compute costs to each node. Our approach significantly scales up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far.
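As a minimal single-threaded reference for the operation being parallelized (not any of the thesis's algorithms), a Jaccard-threshold set similarity join with the standard length filter looks like this:

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: |a ∩ b| / |a ∪ b|."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def ssj(records, threshold):
    """Naive all-pairs set similarity join with a Jaccard predicate.
    Distributed approaches like the ones above partition this quadratic
    loop across nodes; the length filter skips pairs whose sizes already
    rule out reaching the threshold (Jaccard <= min(|a|,|b|)/max(|a|,|b|))."""
    out = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            if min(len(a), len(b)) < threshold * max(len(a), len(b)):
                continue
            if jaccard(a, b) >= threshold:
                out.append((i, j))
    return out
```

The quadratic pair loop is exactly what makes data replication and skew so damaging at scale: naive partitioning schemes replicate each record to every partition it might match against.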