Search CORE

27 research outputs found

An Efficient Architecture for Information Retrieval in P2P Context Using Hypergraph

Author: Durand Nicolas
Hajjar Mohammad
Ismail Anis
Nachouki Gilles
Quafafou Mohamed
Publication venue
Publication date: 01/01/2011
Field of study

Peer-to-peer (P2P) Data-sharing systems now generate a significant portion of Internet traffic. P2P systems have emerged as an accepted way to share enormous volumes of data. Needs for widely distributed information systems supporting virtual organizations have given rise to a new category of P2P systems called schema-based. In such systems each peer is a database management system in itself, ex-posing its own schema. In such a setting, the main objective is the efficient search across peer databases by processing each incoming query without overly consuming bandwidth. The usability of these systems depends on successful techniques to find and retrieve data; however, efficient and effective routing of content-based queries is an emerging problem in P2P networks. This work was attended as an attempt to motivate the use of mining algorithms in the P2P context may improve the significantly the efficiency of such methods. Our proposed method based respectively on combination of clustering with hypergraphs. We use ECCLAT to build approximate clustering and discovering meaningful clusters with slight overlapping. We use an algorithm MTMINER to extract all minimal transversals of a hypergraph (clusters) for query routing. The set of clusters improves the robustness in queries routing mechanism and scalability in P2P Network. We compare the performance of our method with the baseline one considering the queries routing problem. Our experimental results prove that our proposed methods generate impressive levels of performance and scalability with with respect to important criteria such as response time, precision and recall.Comment: 2o pages, 8 figure

arXiv.org e-Print Archive

HAL AMU

Recommended from our members

A Paradigm for Scalable, Transactional, and Efficient Spatial Indexes

Author: Gao Ning
Publication venue: University of Colorado Boulder
Publication date: 01/01/2018
Field of study

With large volumes of geo-tagged data collected in various applications, spatial query pro- cessing becomes essential. Query engines depend on efficient indexes to expedite processing. There are three main challenges: scaling out to accommodate large volumes of spatial data, support- ing transactional primitives for strong consistency guarantees, and adapting to highly dynamic workloads. This thesis proposes a paradigm for scalable, transactional, and efficient spatial indexes to significantly reduce development efforts in designing and comparing multiple spatial indexes.This thesis first introduces a distributed and transactional key value store called DTranx to persist the spatial indexes. DTranx follows the SEDA architecture to exploit high concurrency in multi-core environments and it adopts a hybrid of optimistic concurrency control and two-phase commit protocols to narrow down the critical sections of distributed locking during transaction com- mits. Moreover, DTranx integrates a persistent memory based write-ahead log to reduce durability overhead and combines a garbage collection mechanism without affecting normal transactions. To maintain high throughput for search workloads when databases are constantly updated, snapshot transactions are introduced.Then, a paradigm is presented with a set of intuitive APIs and a Mempool runtime to re- duce development efforts. Mempool transparently synchronizes local states of data structures with DTranx and it handles two critical tasks: address translation and transparent server synchroniza- tion, of which the latter includes transaction construction and data synchronization. Furthermore, a dynamic partitioning strategy is integrated into DTranx to generate partitioning and replication plans that reduce inter-server communications and balance resource usage.Lastly, single-threaded data structures BTree and RTree are converted into distributed versions within two weeks. The BTree and RTree achieve 253.07 kops/sec and 77.83 kops/sec through- put respectively for pure search operations in a 25-server cluster

CU Scholar Institutional Repository

A framework for multidimensional indexes on distributed and highly-available data stores

Author: Cugnasco Cesare
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019
Field of study

Spatial Big Data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of peta bytes of spatial data per year. However, as many authors have pointed out, the lack of specialized frameworks dealing with such kind of data is limiting possible applications and probably precluding many scientific breakthroughs. In this thesis, we describe three HPC scientific applications, ranging from molecular dynamics, neuroscience analysis, and physics simulations, where we experience first hand the limits of the existing technologies. Thanks to our experience, we define the desirable missing functionalities, and we focus on two features that when combined significantly improve the way scientific data is analyzed. On one side, scientific simulations generate complex datasets where multiple correlated characteristics describe each item. For instance, a particle might have a space position (x,y,z) at a given time (t). If we want to find all elements within the same area and period, we either have to scan the whole dataset, or we must organize the data so that all items in the same space and time are stored together. The second approach is called Multidimensional Indexing (MI), and it uses different techniques to cluster and to organize similar data together. On the other side, approximate analytics has been often indicated as a smart and flexible way to explore large datasets in a short period. Approximate analytics includes a broad family of algorithms which aims to speed up analytical workloads by relaxing the precision of the results within a specific interval of confidence. For instance, if we want to know the average age in a group with 1-year precision, we can consider just a random fraction of all the people, thus reducing the amount of calculation. But if we also want less I/O operations, we need efficient data sampling, which means organizing data in a way that we do not need to scan the whole data set to generate a random sample of it. According to our analysis, combining Multidimensional Indexing with efficient data Sampling (MIS) is a vital missing feature not available in the current distributed data management solutions. This thesis aims to solve such a shortcoming and it provides novel scalable solutions. At first, we describe the existing data management alternatives; then we motivate our preference for NoSQL key-value databases. Secondly, we propose an analytical model to study the influence of data models on the scalability and performance of this kind of distributed database. Thirdly, we use the analytical model to design two novel multidimensional indexes with efficient data sampling: the D8tree and the AOTree. Our first solution, the D8tree, improves state of the art for approximate spatial queries on static and mostly read dataset. Later, we enhanced the data ingestion capability or our approach by introducing the AOTree, an algorithm that enables the query performance of the D8tree even for HPC write-intensive applications. We compared our solution with PostgreSQL and plain storage, and we demonstrate that our proposal has better performance and scalability. Finally, we describe Qbeast, the novel distributed system that implements the D8tree and the AOTree using NoSQL technologies, and we illustrate how Qbeast simplifies the workflow of scientists in various HPC applications providing a scalable and integrated solution for data analysis and management.La gestión de BigData con información espacial está considerada como una tendencia esencial en el futuro de las aplicaciones científicas y de negocio. De hecho, se generan cientos de petabytes de datos espaciales por año mediante instrumentos de investigación, dispositivos médicos y redes sociales. Sin embargo, tal y como muchos autores han señalado, la falta de entornos especializados en manejar este tipo de datos está limitando sus posibles aplicaciones y está impidiendo muchos avances científicos. En esta tesis, describimos 3 aplicaciones científicas HPC, que cubren los ámbitos de dinámica molecular, análisis neurocientífico y simulaciones físicas, donde hemos experimentado en primera mano las limitaciones de las tecnologías existentes. Gracias a nuestras experiencias, hemos podido definir qué funcionalidades serían deseables y no existen, y nos hemos centrado en dos características que, al combinarlas, mejoran significativamente la manera en la que se analizan los datos científicos. Por un lado, las simulaciones científicas generan conjuntos de datos complejos, en los que cada elemento es descrito por múltiples características correlacionadas. Por ejemplo, una partícula puede tener una posición espacial (x, y, z) en un momento dado (t). Si queremos encontrar todos los elementos dentro de la misma área y periodo, o bien recorremos y analizamos todo el conjunto de datos, o bien organizamos los datos de manera que se almacenen juntos todos los elementos que comparten área en un momento dado. Esta segunda opción se conoce como Indexación Multidimensional (IM) y usa diferentes técnicas para agrupar y organizar datos similares. Por otro lado, se suele señalar que las analíticas aproximadas son una manera inteligente y flexible de explorar grandes conjuntos de datos en poco tiempo. Este tipo de analíticas incluyen una amplia familia de algoritmos que acelera el tiempo de procesado, relajando la precisión de los resultados dentro de un determinado intervalo de confianza. Por ejemplo, si queremos saber la edad media de un grupo con precisión de un año, podemos considerar sólo un subconjunto aleatorio de todas las personas, reduciendo así la cantidad de cálculo. Pero si además queremos menos operaciones de entrada/salida, necesitamos un muestreo eficiente de datos, que implica organizar los datos de manera que no necesitemos recorrerlos todos para generar una muestra aleatoria. De acuerdo con nuestros análisis, la combinación de Indexación Multidimensional con Muestreo eficiente de datos (IMM) es una característica vital que no está disponible en las soluciones actuales de gestión distribuida de datos. Esta tesis pretende resolver esta limitación y proporciona unas soluciones novedosas que son escalables. En primer lugar, describimos las alternativas de gestión de datos que existen y motivamos nuestra preferencia por las bases de datos NoSQL basadas en clave-valor. En segundo lugar, proponemos un modelo analítico para estudiar la influencia que tienen los modelos de datos sobre la escalabilidad y el rendimiento de este tipo de bases de datos distribuidas. En tercer lugar, usamos el modelo analítico para diseñar dos novedosos algoritmos IMM: el D8tree y el AOTree. Nuestra primera solución, el D8tree, mejora el estado del arte actual para consultas espaciales aproximadas, cuando el conjunto de datos es estático y mayoritariamente de lectura. Después, mejoramos la capacidad de ingestión introduciendo el AOTree, un algoritmo que conserva el rendimiento del D8tree incluso para aplicaciones HPC intensivas en escritura. Hemos comparado nuestra solución con PostgreSQL y almacenamiento plano demostrando que nuestra propuesta mejora tanto el rendimiento como la escalabilidad. Finalmente, describimos Qbeast, el sistema que implementa los algoritmos D8tree y AOTree, e ilustramos cómo Qbeast simplifica el flujo de trabajo de los científicos ofreciendo una solución escalable e integraPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

A framework for multidimensional indexes on distributed and highly-available data stores

Author: Cugnasco Cesare
Publication venue: Universitat Politècnica de Catalunya
Publication date: 22/03/2019
Field of study

Tesis Doctorals en Xarxa

Techniques efficaces basées sur des vues matérialisées pour la gestion des données du Web (algorithmes et systèmes)

Author: KATSIFODIMOS Asterios
MANOLESCU GOUJOT Ioana Gabriela
Publication venue
Publication date: 01/01/2013
Field of study

Le langage XML, proposé par le W3C, est aujourd hui utilisé comme un modèle de données pour le stockage et l interrogation de grands volumes de données dans les systèmes de bases de données. En dépit d importants travaux de recherche et le développement de systèmes efficace, le traitement de grands volumes de données XML pose encore des problèmes des performance dus à la complexité et hétérogénéité des données ainsi qu à la complexité des langages courants d interrogation XML. Les vues matérialisées sont employées depuis des décennies dans les bases de données afin de raccourcir les temps de traitement des requêtes. Elles peuvent être considérées les résultats de requêtes pré-calculées, que l on réutilise afin d éviter de recalculer (complètement ou partiellement) une nouvelle requête. Les vues matérialisées ont fait l objet de nombreuses recherches, en particulier dans le contexte des entrepôts des données relationnelles.Cette thèse étudie l applicabilité de techniques de vues matérialisées pour optimiser les performances des systèmes de gestion de données Web, et en particulier XML, dans des environnements distribués. Dans cette thèse, nos apportons trois contributions.D abord, nous considérons le problème de la sélection des meilleures vues à matérialiser dans un espace de stockage donné, afin d améliorer la performance d une charge de travail des requêtes. Nous sommes les premiers à considérer un sous-langage de XQuery enrichi avec la possibilité de sélectionner des noeuds multiples et à de multiples niveaux de granularités. La difficulté dans ce contexte vient de la puissance expressive et des caractéristiques du langage des requêtes et des vues, et de la taille de l espace de recherche de vues que l on pourrait matérialiser.Alors que le problème général a une complexité prohibitive, nous proposons et étudions un algorithme heuristique et démontrer ses performances supérieures par rapport à l état de l art.Deuxièmement, nous considérons la gestion de grands corpus XML dans des réseaux pair à pair, basées sur des tables de hachage distribuées. Nous considérons la plateforme ViP2P dans laquelle des vues XML distribuées sont matérialisées à partir des données publiées dans le réseau, puis exploitées pour répondre efficacement aux requêtes émises par un pair du réseau. Nous y avons apporté d importantes optimisations orientées sur le passage à l échelle, et nous avons caractérisé la performance du système par une série d expériences déployées dans un réseau à grande échelle. Ces expériences dépassent de plusieurs ordres de grandeur les systèmes similaires en termes de volumes de données et de débit de dissémination des données. Cette étude est à ce jour la plus complète concernant une plateforme de gestion de contenus XML déployée entièrement et testée à une échelle réelle.Enfin, nous présentons une nouvelle approche de dissémination de données dans un système d abonnements, en présence de contraintes sur les ressources CPU et réseau disponibles; cette approche est mise en oeuvre dans le cadre de notre plateforme Delta. Le passage à l échelle est obtenu en déchargeant le fournisseur de données de l effort de répondre à une partie des abonnements. Pour cela, nous tirons profit de techniques de réécriture de requêtes à l aide de vues afin de diffuser les données de ces abonnements, à partir d autres abonnements.Notre contribution principale est un nouvel algorithme qui organise les vues dans un réseau de dissémination d information multi-niveaux ; ce réseau est calculé à l aide d outils techniques de programmation linéaire afin de passer à l échelle pour de grands nombres de vues, respecter les contraintes de capacité du système, et minimiser les délais de propagation des information. L efficacité et la performance de notre algorithme est confirmée par notre évaluation expérimentale, qui inclut l étude d un déploiement réel dans un réseau WAN.XML was recommended by W3C in 1998 as a markup language to be used by device- and system-independent methods of representing information. XML is nowadays used as a data model for storing and querying large volumes of data in database systems. In spite of significant research and systems development, many performance problems are raised by processing very large amounts of XML data. Materialized views have long been used in databases to speed up queries. Materialized views can be seen as precomputed query results that can be re-used to evaluate (part of) another query, and have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized views techniques to optimize the performance of Web data management tools, in particular in distributed settings, considering XML data and queries. We make three contributions.We first consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery. The challenges we face stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.Second, we consider the management of large XML corpora in peer-to-peer networks, based on distributed hash tables (or DHTs, in short). We consider a platform leveraging distributed materialized XML views, defined by arbitrary XML queries, filled in with data published anywhere in the network, and exploited to efficiently answer queries issued by any network peer. This thesis has contributed important scalability oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments outgrow by orders of magnitude similar competitor systems in terms of data volumes and data dissemination throughput. Thus, they are the most advanced in understanding the performance behavior of DHT-based XML content management in real settings.Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub, in short) in the presence of constraints on the available computational resources of data publishers. We achieve scalability by off-loading subscriptions from the publisher, and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm for organizing subscriptions in a multi-level dissemination network in order to serve large numbers of subscriptions, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN.PARIS11-SCD-Bib. électronique (914719901) / SudocSudocFranceF

OpenGrey Repository

Sustainable Value Co-Creation in Welfare Service Ecosystems : Transforming temporary collaboration projects into permanent resource integration

Author: Danielsson Pernilla
Westrup Ulrika
Publication venue: VinUniversity
Publication date: 01/06/2023
Field of study

The aim of this paper is to discuss the unexploited forces of user-orientation and shared responsibility to promote sustainable value co-creation during service innovation projects in welfare service ecosystems. The framework is based on the theoretical field of public service logic (PSL) and our thesis is that service innovation seriously requires a user-oriented approach, and that such an approach enables resource integration based on the service-user’s needs and lifeworld. In our findings, we identify prerequisites and opportunities of collaborative service innovation projects in order to transform these projects into sustainable resource integration once they have ended

Lund University Publications

Seventh Biennial Report : June 2003 - March 2005

Author
Publication venue: Max-Planck-Institut für Informatik
Publication date: 01/01/2005
Field of study

MPG.PuRe

An ontology-based approach toward the configuration of heterogeneous network devices

Author: Martínez Anny
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2015
Field of study

Despite the numerous efforts of standardization, semantic issues remain in effect in many subfields of networking. The inability to exchange data unambiguously between information systems and human resources is an issue that hinders technology implementation, semantic interoperability, service deployment, network management, technology migration, among many others. In this thesis, we will approach the semantic issues in two critical subfields of networking, namely, network configuration management and network addressing architectures. The fact that makes the study in these areas rather appealing is that in both scenarios semantic issues have been around from the very early days of networking. However, as networks continue to grow in size and complexity current practices are becoming neither scalable nor practical. One of the most complex and essential tasks in network management is the configuration of network devices. The lack of comprehensive and standard means for modifying and controlling the configuration of network elements has led to the continuous and extended use of proprietary Command Line Interfaces (CLIs). Unfortunately, CLIs are generally both, device and vendor-specific. In the context of heterogeneous network infrastructures---i.e., networks typically composed of multiple devices from different vendors---the use of several CLIs raises serious Operation, Administration and Management (OAM) issues. Accordingly, network administrators are forced to gain specialized expertise and to continuously keep knowledge and skills up to date as new features, system upgrades or technologies appear. Overall, the utilization of proprietary mechanisms allows neither sharing knowledge consistently between vendors' domains nor reusing configurations to achieve full automation of network configuration tasks---which are typically required in autonomic management. Due to this heterogeneity, CLIs typically provide a help feature which is in turn an useful source of knowledge to enable semantic interpretation of a vendor's configuration space. The large amount of information a network administrator must learn and manage makes Information Extraction (IE) and other forms of natural language analysis of the Artificial Intelligence (AI) field key enablers for the network device configuration space. This thesis presents the design and implementation specification of the first Ontology-Based Information Extraction (OBIE) System from the CLI of network devices for the automation and abstraction of device configurations. Moreover, the so-called semantic overload of IP addresses---wherein addresses are both identifiers and locators of a node at the same time---is one of the main constraints over mobility of network hosts, multi-homing and scalability of the routing system. In light of this, numerous approaches have emerged in an effort to decouple the semantics of the network addressing scheme. In this thesis, we approach this issue from two perspectives, namely, a non-disruptive (i.e., evolutionary) solution to the current Internet and a clean-slate approach for Future Internet. In the first scenario, we analyze the Locator/Identifier Separation Protocol (LISP) as it is currently one of the strongest solutions to the semantic overload issue. However, its adoption is hindered by existing problems in the proposed mapping systems. Herein, we propose the LISP Redundancy Protocol (LRP) aimed to complement the LISP framework and strengthen feasibility of deployment, while at the same time, minimize mapping table size, latency time and maximize reachability in the network. In the second scenario, we explore TARIFA a Next Generation Internet architecture and introduce a novel service-centric addressing scheme which aims to overcome the issues related to routing and semantic overload of IP addresses.A pesar de los numerosos esfuerzos de estandarización, los problemas de semántica continúan en efecto en muchas subáreas de networking. La inabilidad de intercambiar data sin ambiguedad entre sistemas es un problema que limita la interoperabilidad semántica. En esta tesis, abordamos los problemas de semántica en dos áreas: (i) la gestión de configuración y (ii) arquitecturas de direccionamiento. El hecho que hace el estudio en estas áreas de interés, es que los problemas de semántica datan desde los inicios del Internet. Sin embargo, mientras las redes continúan creciendo en tamaño y complejidad, los mecanismos desplegados dejan de ser escalabales y prácticos. Una de las tareas más complejas y esenciales en la gestión de redes es la configuración de equipos. La falta de mecanismos estándar para la modificación y control de la configuración de equipos ha llevado al uso continuado y extendido de interfaces por líneas de comando (CLI). Desafortunadamente, las CLIs son generalmente, específicos por fabricante y dispositivo. En el contexto de redes heterogéneas--es decir, redes típicamente compuestas por múltiples dispositivos de distintos fabricantes--el uso de varias CLIs trae consigo serios problemas de operación, administración y gestión. En consecuencia, los administradores de red se ven forzados a adquirir experiencia en el manejo específico de múltiples tecnologías y además, a mantenerse continuamente actualizados en la medida en que nuevas funcionalidades o tecnologías emergen, o bien con actualizaciones de sistemas operativos. En general, la utilización de mecanismos propietarios no permite compartir conocimientos de forma consistente a lo largo de plataformas heterogéneas, ni reutilizar configuraciones con el objetivo de alcanzar la completa automatización de tareas de configuración--que son típicamente requeridas en el área de gestión autonómica. Debido a esta heterogeneidad, las CLIs suelen proporcionar una función de ayuda que fundamentalmente aporta información para la interpretación semántica del entorno de configuración de un fabricante. La gran cantidad de información que un administrador debe aprender y manejar, hace de la extracción de información y otras formas de análisis de lenguaje natural del campo de Inteligencia Artificial, potenciales herramientas para la configuración de equipos en entornos heterogéneos. Esta tesis presenta el diseño y especificaciones de implementación del primer sistema de extracción de información basada en ontologías desde el CLI de dispositivos de red, para la automatización y abstracción de configuraciones. Por otra parte, la denominada sobrecarga semántica de direcciones IP--en donde, las direcciones son identificadores y localizadores al mismo tiempo--es una de las principales limitaciones sobre mobilidad, multi-homing y escalabilidad del sistema de enrutamiento. Por esta razón, numerosas propuestas han emergido en un esfuerzo por desacoplar la semántica del esquema de direccionamiento de las redes actuales. En esta tesis, abordamos este problema desde dos perspectivas, la primera de ellas una aproximación no-disruptiva (es decir, evolucionaria) al problema del Internet actual y la segunda, una nueva propuesta en torno a futuras arquitecturas del Internet. En el primer escenario, analizamos el protocolo LISP (del inglés, Locator/Identifier Separation Protocol) ya que es en efecto, una de las soluciones con mayor potencial para la resolucion del problema de semántica. Sin embargo, su adopción está limitada por problemas en los sistemas de mapeo propuestos. En esta tesis, proponemos LRP (del inglés, LISP Redundancy Protocol) un protocolo destinado a complementar LISP e incrementar la factibilidad de despliegue, a la vez que, reduce el tamaño de las tablas de mapeo, tiempo de latencia y maximiza accesibilidad. En el segundo escenario, exploramos TARIFA una arquitectura de red de nueva generación e introducimos un novedoso esquema de direccionamiento orientado a servicios

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa

Secure Information Sharing with Distributed Ledgers

Author: Putz Benedikt
Publication venue
Publication date: 21/09/2022
Field of study

In 2009, blockchain technology was first introduced as the supporting database technology for digital currencies. Since then, more advanced derivations of the technology have been developed under the broader term Distributed Ledgers, with improved scalability and support for general-purpose application logic. As a distributed database, they are able to support interorganizational information sharing while assuring desirable information security attributes like non-repudiation, auditability and transparency. Based on these characteristics, researchers and practitioners alike have begun to identify a plethora of disruptive use cases for Distributed Ledgers in existing application domains. While these use cases are promising significant efficiency improvements and cost reductions, practical adoption has been slow in the past years. This dissertation focuses on improving three aspects contributing to slow adoption. First, it attempts to identify application areas and substantiated use cases where Distributed Ledgers can considerably advance the security of information sharing. Second, it considers the security aspects of the technology itself, identifying threats to practical applications and detection approaches for these threats. And third, it investigates success factors for successful interorganizational collaborations using Distributed Ledgers

University of Regensburg Publication Server