    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions. Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author

    Transactional support for adaptive indexing

    Adaptive indexing initializes and optimizes indexes incrementally, as a side effect of query processing. The goal is to achieve the benefits of indexes while hiding or minimizing the costs of index creation. However, index-optimizing side effects seem to turn read-only queries into update transactions that might, for example, create lock contention. This paper studies concurrency contr

    Automatic Rescaling and Tuning of Big Data Applications on Container-Based Virtual Environments

    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01. [Abstract] Current Big Data applications have significantly evolved from their origins, moving from mostly batch workloads to more complex ones that may involve many processing stages using different technologies or even working in real time. Moreover, to deploy these applications, commodity clusters have in some cases been replaced by newer and more flexible paradigms such as the Cloud, or even emerging ones such as serverless computing, both of which usually involve virtualization techniques. This Thesis proposes two frameworks that provide alternative ways to perform in-depth analysis and improved resource management for Big Data applications deployed on virtual environments based on software containers. On the one hand, the BDWatchdog framework is capable of performing real-time, fine-grain analysis in terms of system resource monitoring and code profiling. On the other hand, a framework for the dynamic and real-time scaling of resources according to several tuning policies is described. The first proposed policy revolves around the automatic scaling of the containers’ resources according to the real usage of the applications, thus providing a serverless environment. Furthermore, an alternative policy focused on energy management is presented in a scenario where power capping and budgeting functionalities are implemented for containers, applications or even users. Overall, the frameworks proposed in this Thesis aim to showcase how novel ways of analyzing and tuning the resources given to Big Data applications in container clusters are possible, even in real time. The supported use cases that were presented are examples of this, and show how Big Data applications can be adapted to newer technologies or paradigms without having to lose their distinctive characteristics.

    Crossing the ocean by feeling for the BITs: Investor‐state arbitration in China’s bilateral investment treaties

    This repository item contains a working paper from the Boston University Global Economic Governance Initiative. The Global Economic Governance Initiative (GEGI) is a research program of the Center for Finance, Law & Policy, the Frederick S. Pardee Center for the Study of the Longer-Range Future, and the Frederick S. Pardee School of Global Studies. It was founded in 2008 to advance policy-relevant knowledge about governance for financial stability, human development, and the environment. Although China began to sign bilateral investment treaties (BITs) in the 1970s, it refused to grant foreign investors the right to sue their host government in international arbitration tribunals. Few realize that China’s treaty negotiators have in fact abandoned this restriction in almost every Chinese BIT signed since 1998, including those with Latin America. Scholars have suggested that China reversed its policy in order to support Chinese overseas investors or to fit its general economic liberalization strategy. However, China’s BITs with Mexico, Peru, and Colombia as well as its arbitration case with Peru contradict these theories. I argue that China began signing open BITs to test the risks of granting open access to European countries and the United States, for whom open access is a key condition. China experimented gradually with open arbitration, just as it has experimented gradually with many economic changes since Reform and Opening began in 1978. This theory has interesting implications for China’s future BITs: as international arbitration tribunals threaten to make this experiment permanent, China has added new restrictions that bring China’s BITs closer to the US model and make a US-China BIT more likely. However, the US avoids BITs with capital-exporting countries, and China is now a large capital-exporter. The main obstacle to US-China BIT negotiations may no longer be the two nations’ differences, but rather their similarities.

    A high-speed linear algebra library with automatic parallelism

    Parallel or distributed processing is key to getting the highest performance from workstations. However, designing and implementing efficient parallel algorithms is difficult and error-prone. It is even more difficult to write code that is both portable to and efficient on many different computers. Finally, it is harder still to satisfy the above requirements and include the reliability and ease of use required of commercial software intended for use in a production environment. As a result, the application of parallel processing technology to commercial software has been extremely limited, even though there are numerous computationally demanding programs that would benefit significantly from it. This paper describes DSSLIB, a library of subroutines that perform many of the time-consuming computations in engineering and scientific software. DSSLIB combines the high efficiency and speed of parallel computation with a serial programming model that eliminates many undesirable side-effects of typical parallel code. The result is a simple way to incorporate the power of parallel processing into commercial software without compromising maintainability, reliability, or ease of use. This gives significant advantages over less powerful non-parallel entries in the market.

    Letter from the Special Issue Editor

    Editorial work for DEBULL on a special issue on data management on Storage Class Memory (SCM) technologies

    BDWatchdog: real-time monitoring and profiling of Big Data applications and frameworks

    This is a post-peer-review, pre-copyedit version of an article published in Future Generation Computer Systems. The final authenticated version is available online at: https://doi.org/10.1016/j.future.2017.12.068. [Abstract] Current Big Data applications are characterized by a heavy use of system resources (e.g., CPU, disk) generally distributed across a cluster. To effectively improve their performance there is a critical need for an accurate analysis of both Big Data workloads and frameworks. This means fully understanding how the system resources are being used in order to identify potential bottlenecks, from resource to code bottlenecks. This paper presents BDWatchdog, a novel framework that allows real-time and scalable analysis of Big Data applications by combining time series for resource monitoring and flame graphs for code profiling, focusing on the processes that make up the workload rather than the underlying instances on which they are executed. This shift from traditional system-based monitoring to a process-based analysis is interesting for new paradigms such as software containers or serverless computing, where the focus is put on applications and not on instances. BDWatchdog has been evaluated on a Big Data cloud-based service deployed at the CESGA supercomputing center. The experimental results show that a process-based analysis allows for a more effective visualization and overall improves the understanding of Big Data workloads. BDWatchdog is publicly available at http://bdwatchdog.dec.udc.es. Funding: Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P. Ministerio de Educación; FPU15/0338

    A Survey on the Contributions of Software-Defined Networking to Traffic Engineering

    Since the appearance of OpenFlow back in 2008, software-defined networking (SDN) has gained momentum. Although the standards developing organizations working with SDN disagree to some extent about what SDN is and how it is defined, they all outline traffic engineering (TE) as a key application. One of the most common objectives of TE is congestion minimization, where techniques such as traffic splitting among multiple paths or advance reservation systems are used. In this context, this manuscript surveys the role of a comprehensive list of SDN protocols in TE solutions, in order to assess how these protocols can benefit TE. The SDN protocols have been categorized using the SDN architecture proposed by the Open Networking Foundation, which differentiates among data-controller plane interfaces, application-controller plane interfaces, and management interfaces, in order to state how the interface type in which they operate influences TE. In addition, the impact of the SDN protocols on TE has been evaluated by comparing them with the path computation element (PCE)-based architecture. The PCE-based architecture has been selected to measure the impact of SDN on TE because it is the most novel TE architecture to date, and because it already defines a set of metrics to measure the performance of TE solutions. We conclude that using the three types of interfaces simultaneously will result in more powerful and enhanced TE solutions, since they benefit TE in complementary ways. Funding: European Commission through the Horizon 2020 Research and Innovation Programme (GN4) under Grant 691567; Spanish Ministry of Economy and Competitiveness under the Secure Deployment of Services Over SDN and NFV-based Networks Project S&NSEC under Grant TEC2013-47960-C4-3-

    Rice with reduced stomatal density conserves water and has improved drought tolerance under future climate conditions

    Much of humanity relies on rice (Oryza sativa) as a food source, but cultivation is water intensive and the crop is vulnerable to drought and high temperatures. Under climate change, periods of reduced water availability and high temperature are expected to become more frequent, leading to detrimental effects on rice yields. We engineered the high-yielding rice cultivar ‘IR64’ to produce fewer stomata by manipulating the level of a developmental signal. We overexpressed the rice epidermal patterning factor OsEPF1, creating plants with substantially reduced stomatal density and correspondingly low stomatal conductance. Low stomatal density rice lines were more able to conserve water, using c. 60% of the normal amount between weeks 4 and 5 post germination. When grown at elevated atmospheric CO2, rice plants with low stomatal density were able to maintain their stomatal conductance and survive drought and high temperature (40°C) for longer than control plants. Low stomatal density rice gave equivalent or even improved yields, despite a reduced rate of photosynthesis in some conditions. Rice plants with fewer stomata are drought tolerant and more conservative in their water use, and they should perform better in the future when climate change is expected to threaten food security