    Efficient HTTP based I/O on very large datasets for high performance computing with the libdavix library

    Remote data access for data analysis in high performance computing is commonly done with specialized data access protocols and storage systems. These protocols are highly optimized for high throughput on very large datasets, multi-streams, high availability, low latency and efficient parallel I/O. The purpose of this paper is to describe how we have adapted a generic protocol, the Hypertext Transfer Protocol (HTTP), to make it a competitive alternative for high performance I/O and data analysis applications in a global computing grid: the Worldwide LHC Computing Grid. In this work, we first analyze the design differences between the HTTP protocol and the most common high performance I/O protocols, pointing out the main performance weaknesses of HTTP. Then, we describe in detail how we solved these issues. Our solutions have been implemented in a toolkit called davix, available through several recent Linux distributions. Finally, we describe the results of our benchmarks, where we compare the performance of davix against an HPC-specific protocol for a data analysis use case. Comment: Presented at: Very Large Data Bases (VLDB) 2014, Hangzho

    BI (XML) Publisher Conversion from Third Party Software in E-Business Suite: An ERP (Enterprise Resource Planning) Reporting Framework Conversion Model

    Oracle Business Intelligence Publisher (BI Publisher) is an enterprise reporting framework to develop, manage, and deliver all types of highly formatted documents. It eliminates the need for costly point solutions. End users can easily design report layouts directly in a Web browser or using familiar desktop tools, dramatically reducing the time and cost needed to develop and maintain reports. In addition, it is extremely efficient and highly scalable because it can generate tens of thousands of documents per hour with minimal impact on transactional systems. Furthermore, it is “a template-based publishing solution delivered with the Oracle E-Business Suite, PeopleSoft, Enterprise and JD Edwards EnterpriseOne” (“Business Intelligence Publisher Core Components Guide,” 2008). Today, many companies that use a third-party ERP reporting framework want to convert it to the Oracle BI Publisher framework in order to reduce the overall cost of development, customization, and ongoing maintenance of their ERP reports. However, converting third-party software to BI Publisher is not easy. It is technically challenging, can be costly, and can even fail if the project lacks a thorough plan and careful implementation. This paper presents a case study and constructs a step-by-step conversion model for others to follow. The intended audience is companies that are planning to convert their reporting framework to BI Publisher in the Oracle EBS (E-Business Suite) environment.

    HaDeS: a Scalable Service Oriented Deployment System for Large Scale Installations

    Building large computational facilities requires scalable and flexible deployment tools that can cope with massive loads. Classical installation methods are not very flexible, since they are usually limited in the number of OS supported, rely on transfer solutions that impose constraints on network topology, and do not scale very well. Here we describe HaDeS (Hardware Deployment System), a new deployment system for large scale installation designed to be agnostic with respect to the network topology and the OS deployed and to scale with the number of nodes being deployed.

    An implementation of the Linux software repository model for other operating systems

    Year after year, the frequency of updated releases of software continues to increase. Without an automated install process, the result is either that a system installs software with known defects and/or vulnerabilities, or systems require increased manual labor to maintain up-to-date software installations. Linux packages, in conjunction with repositories, fill this need for automation to reduce both undesirable situations. This model can be modified to a generic operating system environment, such as Windows, which currently lacks the capability to update arbitrary software applications. Our application, Appupdater, demonstrates this concept of detecting, downloading, and installing upgrades automatically. This provides a completely automated upgrade cycle.
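
    The detection step of such an upgrade cycle can be sketched as a comparison between installed versions and a repository index. This is a minimal illustration under stated assumptions, not Appupdater's actual code: the names `parse_version` and `find_upgrades`, the dictionary layout, and the sample package data are all hypothetical, and versions are assumed to be purely numeric dotted strings.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.10.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def find_upgrades(installed, repository):
    """Return {name: (installed_version, available_version)} for outdated packages.

    Both arguments map a package name to a version string. Packages that are
    not listed in the repository index are left alone.
    """
    upgrades = {}
    for name, current in installed.items():
        available = repository.get(name)
        if available is not None and parse_version(available) > parse_version(current):
            upgrades[name] = (current, available)
    return upgrades


# Hypothetical sample data, for illustration only.
installed = {"editor": "1.9.0", "browser": "3.2", "player": "2.0.1"}
repository = {"editor": "1.10.0", "browser": "3.2", "archiver": "4.0"}
upgrades = find_upgrades(installed, repository)
print(upgrades)
```

    Comparing tuples of integers rather than raw strings matters here: as strings, "1.9.0" would sort after "1.10.0", and the upgrade would be missed.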

    DPMbox: An interactive user-friendly web interface for a disk-based grid storage system

    Disk Pool Manager (DPM) is a lightweight storage management system for grid sites. It was developed at CERN (European Organization for Nuclear Research) and is the most widely adopted solution in the Worldwide LHC Computing Grid infrastructure. Attracting less technical users has been an objective in recent years. As part of an effort to move towards standard protocols that remove the need for special tools, DPM started offering a WebDAV interface (WebDAV is an extension of the HTTP protocol), which facilitates access through commonly available tools such as web browsers or WebDAV clients. However, this interface only provides basic functionality, especially when accessed from a web browser, so some specific tools are still necessary. DPMbox is a project for a friendly web interface that allows both technical and non-technical users to manage their data from and into the grid by accessing it through their web browsers. The project builds on the implemented WebDAV front-end and, as a web development, uses standard and mature web technologies: HTML, CSS, and JavaScript/ECMAScript as its core language, through jQuery or other supporting libraries. As a collaboration with CERN, the development has focused on the functionality required by DPM, but one of the objectives is to make DPMbox easily extensible and flexible, enabling its use with other systems that offer the WebDAV protocol.

    GJXDM Documents and Small Law Enforcement Agencies

    The purpose of this paper is to demonstrate that, while the Global Justice XML Data Model (GJXDM) is a complete and effective solution for criminal justice agencies, it is complex to implement and difficult for smaller law enforcement agencies to put into practice. The paper presents the current implementation steps for a new GJXDM document and describes the process of implementing an existing GJXDM document. The paper also presents a tool that agencies can use to start using and processing GJXDM documents. Also offered within the paper is a design for a central repository for increasing GJXDM information sharing and the dissemination of GJXDM software artifacts.

    Evaluating the impact of traffic sampling in network analysis

    Integrated master's dissertation in Informatics Engineering. Sampling network traffic is a very effective way to understand the behaviour and flow of a network, and it is essential for building network management tools that monitor Service Level Agreements (SLAs), control Quality of Service (QoS), and support traffic engineering, capacity planning, and network security. With the exponential rise in traffic caused by the growing number of devices connected to the Internet, it is increasingly hard and expensive to understand the behaviour of a network by analyzing the total volume of traffic. Sampling techniques, or selective analysis, which consist of selecting a small number of packets in order to estimate the expected behaviour of a network, therefore become essential. Even though these techniques drastically reduce the amount of data to be analyzed, the sampling tasks have to be performed in the network equipment, which can significantly affect the performance of that equipment and reduce the accuracy of the estimated network state. This dissertation evaluates the impact of selective analysis of network traffic, both on the performance of network state estimation and on statistical properties such as self-similarity and Long-Range Dependence (LRD) that exist in the original network traffic, allowing a better understanding of the behaviour of sampled network traffic.
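
    The idea of selecting a small number of packets and estimating network behaviour from them can be illustrated with count-based systematic sampling. This is a minimal sketch and not the evaluation method of the dissertation: the function names, the sampling interval, and the packet trace are hypothetical, and total byte volume is used as the only estimated quantity.

```python
def systematic_sample(packet_sizes, interval):
    """Keep every `interval`-th packet, starting with the first one."""
    return packet_sizes[::interval]


def estimate_total_bytes(sample, interval):
    """Scale the sampled byte count by the sampling interval."""
    return sum(sample) * interval


# Hypothetical trace: packet sizes in bytes, for illustration only.
packet_sizes = [1500, 40, 40, 1500, 576, 40, 1500, 1500, 40, 576, 1500, 40]
interval = 4

sample = systematic_sample(packet_sizes, interval)
estimate = estimate_total_bytes(sample, interval)
actual = sum(packet_sizes)
relative_error = abs(estimate - actual) / actual
print(sample, estimate, actual, round(relative_error, 3))
```

    The sample is a quarter of the trace, and the gap between `estimate` and `actual` is the kind of accuracy loss that has to be weighed against the reduced processing load.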

    Virtual Machine Lifecycle Management in Grid and Cloud Computing

    Virtualization technology is the foundation of two important concepts: Virtualized Grid Computing and Cloud Computing. The former is an extension of classical Grid Computing. Its goal is to meet the requirements of commercial Grid users regarding the isolation of concurrently executed batch jobs and the security of the associated data. Applications are executed in virtual machines in order to isolate them from one another and to protect the data they process from other users. In addition, Virtualized Grid Computing solves the problem of software deployment, one of the open problems of classical Grid Computing. Cloud Computing is another concept for using remote resources. With respect to Cloud Computing, this dissertation focuses on the "Infrastructure as a Service" model, which combines ideas from (Virtualized) Grid Computing with a novel business model. This model consists of the on-demand provisioning of virtual machines and a pricing scheme in which only actual usage is charged. The use of virtualization technology increases the utilization of the (physical) computer systems involved and simplifies their administration. For example, it is possible to clone a virtual machine or to take a snapshot of a virtual machine in order to return to a defined state later. However, not all problems related to virtualization technology have been solved. In particular, its use in the highly dynamic environments of Virtualized Grid Computing and Cloud Computing creates new challenges for virtualization technology. This dissertation addresses several aspects of the use of virtualization technology in Virtualized Grid and Cloud Computing environments. First, the lifecycle of virtual machines in these environments is examined, and models of this lifecycle are developed. Based on these models, problems are identified and solutions to them are developed. The focus is on the storage, deployment, and execution of virtual machines. Virtual machines are usually stored in so-called disk images, i.e., images of virtual hard disks. This format affects not only the storage of large numbers of virtual machines but also their deployment. In the environments studied, it has two concrete drawbacks: it wastes storage space and it prevents efficient deployment of virtual machines. Measures to increase the security of virtual machines affect all three areas. For example, before a virtual machine is deployed, it should be checked whether the software installed in it is still up to date. Furthermore, the execution environment should provide means to monitor the virtual infrastructure effectively. The first solution presented in this dissertation is the concept of Image Composition. It describes the composition of a combined disk image from several layers. Parts of the individual layers that are used by several virtual machines can thus be shared among them, reducing the storage required for the virtual machines as a whole. The Marvin Image Compositor is the implementation of this concept. The second solution is the Marvin Image Store, a storage system for virtual machines that is not based on the traditionally used disk images but instead stores the data and metadata they contain separately from each other in an efficient way. Furthermore, four solutions are presented that can improve the security of virtual machines. The Update Checker makes it possible to identify outdated software in virtual machines, regardless of whether the virtual machine in question is currently running. The second security solution makes it possible to centrally update multiple virtual machines that are based on the Image Composition concept. This means that installing a new software version once is sufficient to bring several virtual machines up to date. The third security solution, called Online Penetration Suite, makes it possible to scan virtual machines for vulnerabilities automatically. Monitoring the virtual infrastructure at all levels is the purpose of the fourth security solution. In addition to monitoring, this solution also enables an automatic response to security-relevant events. Finally, a method for migrating virtual machines is presented that enables efficient migration even without a central storage system.
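
    The storage saving behind composing disk images from shared layers can be illustrated with a toy model. This is only a sketch under stated assumptions and does not reflect the actual format or interface of the Marvin Image Compositor: layers are modelled as plain dictionaries from file path to content, and the names `compose` and `layer_size` as well as all sample files are hypothetical.

```python
def compose(layers):
    """Overlay layers in order; files in later layers replace earlier ones."""
    image = {}
    for layer in layers:
        image.update(layer)
    return image


def layer_size(layer):
    """Total number of content bytes stored in one layer or composed image."""
    return sum(len(content) for content in layer.values())


# Hypothetical layers: file path -> file content.
base_os = {"/bin/sh": b"shell" * 100, "/etc/hostname": b"template"}
python_stack = {"/usr/bin/python": b"interp" * 50}
vm_a_private = {"/etc/hostname": b"vm-a"}
vm_b_private = {"/etc/hostname": b"vm-b"}

vm_a = compose([base_os, python_stack, vm_a_private])
vm_b = compose([base_os, python_stack, vm_b_private])

# Monolithic storage keeps one full copy per virtual machine.
monolithic = layer_size(vm_a) + layer_size(vm_b)
# Layered storage keeps each distinct layer exactly once.
layered = sum(
    layer_size(layer)
    for layer in (base_os, python_stack, vm_a_private, vm_b_private)
)
print(monolithic, layered)
```

    In this model, updating a file in `base_os` once changes every image composed from it, which mirrors the central update of several virtual machines described above.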

    Security plane for data authentication in information-centric networks

    Advisors: Maurício Ferreira Magalhães, Jussi Kangasharju. Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: Information security is responsible for protecting information against unauthorized access, use, modification or destruction. In order to protect data against such attacks, many security protocols have been developed, for example, Internet Protocol Security (IPSec) and Transport Layer Security (TLS), providing mechanisms for data authentication, integrity and confidentiality for users. These protocols use the IP address as the host identifier on the Internet, making it the reference and identifier during the establishment of secure connections for data exchange between applications on the network. With the advent of the Web and the exponential increase in content consumption (e.g., video and audio), there is evidence of a gradual migration of the predominant usage of the Internet, moving the emphasis from the connection between hosts to content retrieval from the network, a paradigm known as information-centric networking. In this paradigm, users look for documents and resources on the Internet without caring about the explicit knowledge of the location of the content. As a result, the IP address that was previously used as the reference point of a data provider becomes merely an ephemeral identifier of where the content is stored, with implications for the correct authentication of data. In this context, the simple authentication of an IP address does not guarantee the authenticity of the data, because a hosting server identified by a given IP address is not necessarily the one that produced the requested content.
In the context of information-oriented networks, some proposals in the literature offer authentication mechanisms based on the content itself, for example, digital signatures over each data block or hash trees over the data blocks. The main idea of these approaches is to attach some information from the original provider to the transported data blocks, for example, a digital signature, enabling data authentication directly with the original provider, regardless of the host from which the data was obtained. Although such mechanisms allow for this verification, the procedure is very costly in terms of processing, especially when the number of blocks is large, making it unfeasible in practice. This thesis proposes a new authentication mechanism using hash trees in order to authenticate data efficiently and explicitly with the original provider, independently of the host from which the data was obtained. We propose two techniques for data authentication based on hash trees, called skewed hash tree (SHT) and composite hash tree (CHT), for data authentication in information-oriented networks. Once created, part of the authentication data is stored in a security plane and another part remains attached to the data itself, allowing for verification based on content and not on the source host. In addition, this thesis presents the formal model, specification and implementation of the two hash tree techniques for data authentication in information-centric networks through a security plane. Finally, this thesis details the instantiation of the security plane model in two scenarios of data authentication: 1) Peer-to-Peer networks and 2) parallel data authentication over HTTP. Doctorate; Computer Engineering; Doctor in Electrical Engineering.
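
    The general principle of authenticating individual data blocks with a hash tree can be sketched as follows. This is a plain binary hash tree (Merkle tree) over SHA-256, shown for orientation only: it is not the skewed hash tree (SHT) or composite hash tree (CHT) proposed in the thesis. All function names and the sample blocks are hypothetical, and padding odd levels by duplicating the last node is a simplification chosen for brevity.

```python
import hashlib


def h(data):
    """SHA-256 digest of a byte string."""
    return hashlib.sha256(data).digest()


def build_levels(blocks):
    """Return all tree levels, from the leaf hashes up to the single root."""
    levels = [[h(block) for block in blocks]]
    while len(levels[-1]) > 1:
        current = levels[-1]
        if len(current) % 2 == 1:
            current.append(current[-1])  # pad in place so every node has a sibling
        levels.append(
            [h(current[i] + current[i + 1]) for i in range(0, len(current), 2)]
        )
    return levels


def auth_path(levels, index):
    """Collect the sibling hashes from leaf to root for the block at `index`."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the neighbour within the same pair
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(block, path, root):
    """Recompute the root from one block and its path, then compare."""
    digest = h(block)
    for sibling, sibling_is_left in path:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root


# Hypothetical content split into five blocks.
blocks = [b"block-0", b"block-1", b"block-2", b"block-3", b"block-4"]
levels = build_levels(blocks)
root = levels[-1][0]  # the only value that needs to come from the original provider

path = auth_path(levels, 3)
print(verify(blocks[3], path, root), verify(b"tampered", path, root))
```

    A receiver that trusts only `root` can check any block obtained from any host using a path of logarithmic length, instead of verifying one digital signature per block.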