523 research outputs found
On the construction of decentralised service-oriented orchestration systems
Modern science relies on workflow technology to capture, process, and analyse data obtained from scientific instruments. Scientific workflows are precise descriptions of experiments in which multiple computational tasks are coordinated according to the dataflows between them. These workflows are commonly composed of services that perform computation over geographically distributed resources and involve the management of dataflows between them. Orchestrating such workflows presents a significant research challenge: they are typically executed so that all data pass through a centralised server known as the engine, which causes unnecessary network traffic and creates a performance bottleneck. Centralised orchestration is clearly not a scalable approach for coordinating services dispersed across distant geographical locations. This thesis presents a scalable decentralised service-oriented orchestration system that relies on a high-level data coordination language for the specification and execution of workflows. The system’s architecture consists of distributed engines, each responsible for executing part of the overall workflow. It exploits parallelism in the workflow by decomposing it into smaller sub-workflows and uses computation placement analysis to determine the most appropriate engines to execute them. This allows the workflow logic to be distributed closer to the services providing the data, which reduces the overall data transfer in the workflow and improves its execution time. The thesis provides an evaluation of the presented system which concludes that decentralised orchestration offers scalability benefits over centralised orchestration and improves the overall performance of executing a service-oriented workflow.
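The placement idea can be pictured with a small, purely hypothetical sketch: each sub-workflow is assigned to the engine with the lowest estimated data-transfer cost to the services it invokes. The function and cost model below are illustrative assumptions, not the thesis' actual algorithm.

```python
# Hypothetical computation placement analysis: assign each sub-workflow to the
# engine that minimises the estimated data transferred from its services.
def place_subworkflows(subworkflows, engines, transfer_cost):
    """Return a mapping sub-workflow id -> engine minimising estimated data movement.

    subworkflows:  dict mapping a sub-workflow id to the services it invokes
    engines:       list of engine identifiers
    transfer_cost: function (service, engine) -> estimated bytes moved
    """
    placement = {}
    for wf_id, services in subworkflows.items():
        placement[wf_id] = min(
            engines,
            key=lambda engine: sum(transfer_cost(s, engine) for s in services),
        )
    return placement

# Toy usage: two engines, transfer is free when a service is co-located with an engine.
cost = lambda service, engine: 0 if service.startswith(engine) else 100
print(place_subworkflows({"wf1": ["eu-blast", "eu-align"], "wf2": ["us-stats"]},
                         ["eu", "us"], cost))
```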
Big Data Pipelines on the Computing Continuum: Tapping the Dark Data
The computing continuum opens new opportunities for managing big data pipelines, in particular for the efficient management of heterogeneous and untrustworthy resources. We discuss the lifecycle of big data pipelines on the computing continuum and its associated challenges, and we outline a future research agenda in this area.
Servitized Enterprises for Distributed Collaborative Commerce
DOI: 10.4018/jssmet.2010010105. Agility and innovation are essential for survival in today’s business world. Mergers and acquisitions, new regulations, and rapidly changing technology …
Runtime Adaptation of Scientific Service Workflows
Software landscapes are subject to continual change rather than being complete once they have been built. Changes may be caused by modified customer behavior, a shift to new hardware resources, or otherwise changed requirements. In such situations, several challenges arise. New architectural models have to be designed and implemented, existing software has to be integrated, and, finally, the new software has to be deployed, monitored, and, where appropriate, optimized at runtime under realistic usage scenarios. All of these situations often demand manual intervention, which makes them error-prone.
This thesis addresses these types of runtime adaptation. Based on service-oriented architectures, an environment is developed that enables the integration of existing software (i.e., the wrapping of legacy software as web services). A workflow modeling tool is provided that aims at ease of use by separating the role of the workflow expert from that of the domain expert. For the phase after workflow development, tools are presented that observe the executing infrastructure and perform automatic scale-in and scale-out operations. Infrastructure-as-a-Service providers are used to scale the infrastructure in a transparent and cost-efficient way, and the necessary middleware is deployed automatically.
The use of a distributed infrastructure can lead to communication problems. In order to keep workflows robust, these exceptional cases need to be treated. Handled naively, however, the process logic of a workflow becomes entangled and bloated with infrastructural details, which increases its complexity. In this work, a module is presented that deals automatically with infrastructural faults and thereby keeps these two layers separate.
When services or their components are hosted in a distributed environment, some requirements need to be addressed at each service separately. Techniques such as object-oriented programming or design patterns like the interceptor pattern ease the adaptation of service behavior or structure, but they still require the configuration or the implementation of each individual service to be modified. Aspect-oriented programming, on the other hand, allows functionality to be woven into existing code even without access to its source. Since the functionality is woven into the code, it depends on the specific implementation; in a service-oriented architecture, where the implementation of a service is unknown, this approach clearly has its limitations. The request/response aspects presented in this thesis overcome this obstacle and provide a new, SOA-compliant method to weave functionality into the communication layer of web services.
The main contributions of this thesis are the following:
Shifting towards a service-oriented architecture: The generic and extensible Legacy Code Description Language and the corresponding framework allow existing software to be wrapped, e.g., as web services, which can afterwards be composed into a workflow with SimpleBPEL without overburdening the domain expert with technical details; these are handled by a workflow expert instead.
Runtime adaptation: Based on the standardized Business Process Execution Language, an automatic scheduling approach is presented that monitors all used resources and is able to automatically provision new machines when a scale-out becomes necessary. If a resource's load drops, e.g., because of fewer workflow executions, a scale-in is performed automatically as well. The scheduling algorithm takes the data transfer between the services into account in order to prevent scheduling decisions that would increase the workflow's makespan through unnecessary or disadvantageous data transfers. Furthermore, a multi-objective scheduling algorithm based on a genetic algorithm can additionally consider cost, so that a user can define her own preferences, trading off optimized workflow execution times against minimized costs. Possible communication errors are automatically detected and, according to certain constraints, corrected.
Adaptation of communication: The presented request/response aspects allow functionality to be woven into the communication of web services. Because the pointcut language relies only on the exchanged documents, the implementation of the services must neither be known nor be available. The weaving process itself is modeled using web services. In this way, the concept of request/response aspects is naturally embedded into a service-oriented architecture.
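To make the idea of document-based pointcuts concrete, the following is a minimal, purely illustrative sketch (not the thesis' actual implementation): the pointcut predicate inspects only the exchanged XML document, and the advice is woven into the request path by an interceptor sitting in the communication layer.

```python
# Illustrative sketch of a document-based request/response aspect.
# The pointcut inspects only the exchanged XML document, so no knowledge
# of the service implementation is required; all names are hypothetical.
import xml.etree.ElementTree as ET

def pointcut_matches(request_xml: str) -> bool:
    """Pointcut: match requests carrying an order amount above 1000."""
    root = ET.fromstring(request_xml)
    amount = root.findtext(".//amount")
    return amount is not None and float(amount) > 1000

def audit_advice(request_xml: str) -> str:
    """Advice: add an audit flag to the document before it is forwarded."""
    root = ET.fromstring(request_xml)
    ET.SubElement(root, "audit").text = "required"
    return ET.tostring(root, encoding="unicode")

def intercept(request_xml: str) -> str:
    """Interceptor in the communication layer between client and service."""
    if pointcut_matches(request_xml):
        request_xml = audit_advice(request_xml)
    return request_xml  # forwarded to the actual web service

# Example: this request triggers the advice and gains an <audit> element.
print(intercept("<order><amount>2500</amount></order>"))
```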
Next Generation Cloud Computing: New Trends and Research Directions
The landscape of cloud computing has changed significantly over the last decade. Not only have more providers and service offerings crowded the space, but cloud infrastructure that was traditionally limited to single-provider data centers is now evolving. In this paper, we first discuss the changing cloud infrastructure and consider the use of infrastructure from multiple providers and the benefit of decentralising computing away from data centers. These trends have resulted in the need for a variety of new computing architectures that will be offered by future cloud infrastructure. These architectures are anticipated to impact areas such as connecting people and devices, data-intensive computing, the service space, and self-learning systems. Finally, we lay out a roadmap of challenges that will need to be addressed for realising the potential of next generation cloud systems.
Autonomous Incident Response
Master's project report, Information Security, 2022, Universidade de Lisboa, Faculdade de Ciências.
Information security is a must-have for any organization willing to stay relevant and grow; it plays an important role as a business enabler, be it from a regulatory perspective or a reputation perspective. Having the people, processes, and technology to resolve the ever-growing number of security incidents as fast as possible and with the least amount of impact is a challenge for small and big companies alike.
To address this challenge, companies started investing in Security Orchestration, Automation, and Response (SOAR) [39, 68, 70]. Security orchestration is the planning, integration, cooperation, and coordination of the activities of security tools and experts to produce and automate required actions in response to any security incident across multiple technology paradigms [40]. In other words, SOAR is a way to translate the manual procedures followed by security analysts into automated actions, making the process faster and more scalable while saving on the human resources budget.
This project proposes a low-cost, cloud-native SOAR platform based on serverless computing and presents the underlying details of its design. The performance of the proposed solution was evaluated through 364 real-world incidents related to 11 use cases in a large multinational enterprise. The results show that the solution is able to decrease the duration of the tasks by an average of 98.81% while having an operating expense of less than $65/month.
Prior to the SOAR, it took the analyst 75.84 hours to perform the manual tasks related to the 11 use cases. Additionally, an estimated 450 hours of the analyst’s time would have been needed to run the Update threat intelligence database use case. After the SOAR, the same tasks were run automatically in 31.2 minutes, and the Update threat intelligence database use case ran 9,000 times in 5.3 hours.
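As a purely illustrative sketch (not the platform described above), a serverless playbook step of this kind could look roughly as follows, assuming an AWS-Lambda-style handler; the enrich_ip and block_ip helpers are hypothetical placeholders for threat-intelligence and firewall integrations.

```python
# Hypothetical serverless SOAR playbook step: triage an alert and contain it.
# All names (handler, enrich_ip, block_ip) are illustrative assumptions.
import json

def enrich_ip(ip: str) -> dict:
    # Placeholder for a threat-intelligence reputation lookup.
    return {"ip": ip, "malicious": ip.startswith("203.0.113.")}

def block_ip(ip: str) -> None:
    # Placeholder for a firewall or EDR API call that blocks the address.
    print(f"blocking {ip}")

def handler(event, context=None):
    """Triggered on a new incident; automates the analyst's manual triage steps."""
    alert = json.loads(event["body"]) if isinstance(event.get("body"), str) else event
    verdict = enrich_ip(alert["source_ip"])
    if verdict["malicious"]:
        block_ip(alert["source_ip"])
        return {"status": "contained", "ip": alert["source_ip"]}
    return {"status": "benign", "ip": alert["source_ip"]}

# Example invocation with a synthetic alert document.
print(handler({"source_ip": "203.0.113.7"}))
```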
SmartRegio – Employing Spatial Data to Provide Decision Support for SMEs and City Administrations
When decisions have to be made that are based on the characteristics and expected developments of specific spatial environments (such as finding the best place for a new production site or for a new shop), geo data and the information that can be derived from it play a crucial role. While larger companies can typically afford to set up the required organisational units and to access relevant data from commercial providers, smaller organisations such as SMEs or city administrations are at a disadvantage. The aim of the SmartRegio project was to develop solutions for such organisations that combine freely available (mass) spatial data from many different sources into a decision-making basis for governmental and private actors operating with a focus on a specific region. The data sources include data from infrastructures such as energy and mobility, data from public entities, and data from social media and media channels. The SmartRegio project successfully identified and tackled major technical and legal challenges in exploiting such data, while at the same time realising a generic infrastructure that supports the required processes within the given context.
Programming models to support data science workflows
Data Science workflows have become a must to progress in many scientific areas such as the life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous, combining binary executions, MPI simulations, multi-threaded applications, custom analyses (possibly written in Java, Python, C/C++ or R), and real-time processing. Furthermore, whereas in the past field experts were capable of programming and running small simulations themselves, nowadays simulations requiring hundreds or thousands of cores are widely used, and at this scale programming them efficiently becomes a challenge even for computer scientists. Thus, programming languages and models make a considerable effort to ease programmability while maintaining acceptable performance.
This thesis contributes to the adaptation of High-Performance frameworks to support the needs and challenges of Data Science workflows by extending COMPSs, a mature, general-purpose, task-based, distributed programming model. First, we enhance our prototype to orchestrate different frameworks inside a single programming model, so that non-expert users can build complex workflows in which some steps require highly optimised, state-of-the-art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows.
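For illustration, a task backed by an external binary might be declared in the Python binding roughly as follows; the import paths follow the public PyCOMPSs documentation, but the example is only a sketch and details may differ between versions.

```python
# Sketch of a COMPSs task backed by an external binary (PyCOMPSs-style annotations).
from pycompss.api.binary import binary
from pycompss.api.task import task

@binary(binary="sleep")   # the task is executed by invoking an external program
@task()
def wait(seconds):
    pass  # the body stays empty; the runtime launches the binary with the given arguments
```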
Second, we integrate container technologies to enable developers to easily port, distribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential code, along with efficient image management and application deployment that ease the packaging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos.
Third, we design, implement, and integrate AutoParallel, a Python module to automatically find an appropriate task-based parallelisation of affine loop nests and execute them in parallel on a distributed computing infrastructure. It is based on sequential programming and requires a single annotation (the @parallel Python decorator), so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores.
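Usage is intended to look roughly like the sketch below: one decorator on a function containing an affine loop nest, which AutoParallel then converts into tasks. The import path is taken from the AutoParallel/PyCOMPSs documentation but should be read as illustrative.

```python
# Sketch of AutoParallel usage: a single decorator over an affine loop nest.
from pycompss.api.parallel import parallel  # assumed import path

@parallel()
def matmul(a, b, c, n):
    # The affine loop nest below is analysed and turned into distributed tasks.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
```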
Finally, we propose a way to extend task-based management systems to support continuous input and output data, enabling the combination of task-based workflows and dataflows (Hybrid Workflows) using a single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements, without the effort of combining several frameworks at the same time. Also, to illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python, enabling complex workflows to handle any data type without dealing directly with the streaming back-end.
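The stream abstraction can be pictured with the short sketch below; the publish/poll/close interface mirrors the description of the Distributed Stream Library, but the class here is a tiny in-memory stand-in written for illustration, not the library's verified API.

```python
# Illustrative object-stream sketch (assumed interface, in-memory stand-in
# replacing the real streaming back-end).
class ObjectStream:
    def __init__(self):
        self._items, self._closed = [], False
    def publish(self, obj):           # continuous output from a producer task
        self._items.append(obj)
    def poll(self):                   # continuous input for a consumer task
        items, self._items = self._items, []
        return items
    def close(self):
        self._closed = True
    def is_closed(self):
        return self._closed

def producer(stream, n):
    for i in range(n):
        stream.publish({"sample": i})
    stream.close()

def consumer(stream):
    results = []
    while not stream.is_closed():
        results.extend(stream.poll())
    results.extend(stream.poll())     # drain items published before closing
    return results

s = ObjectStream()
producer(s, 3)
print(consumer(s))
```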
- …