426 research outputs found

    Structuring visual exploratory analysis of skill demand

    No full text
The analysis of increasingly large and diverse data for meaningful interpretation and question answering is handicapped by human cognitive limitations. Consequently, semi-automatic abstraction of complex data within structured information spaces becomes increasingly important if its knowledge content is to support intuitive, exploratory discovery. Exploration of skill demand is an area where regularly updated, multi-dimensional data may be exploited to assess capability within the workforce to manage the demands of the modern, technology- and data-driven economy. The knowledge derived may be employed by skilled practitioners in defining career pathways, to identify where, when and how to update their skill sets in line with advancing technology and changing work demands. This same knowledge may also be used to identify the combination of skills essential in recruiting for new roles. To address the challenges inherent in exploring the complex, heterogeneous, dynamic data that feeds into such applications, we investigate the use of an ontology to guide the structuring of the information space, allowing individuals and institutions to interactively explore and interpret the dynamic skill demand landscape for their specific needs. As a test case we consider the relatively new and highly dynamic field of Data Science, where insightful, exploratory data analysis and knowledge discovery are critical. We employ context-driven and task-centred scenarios to explore our research questions and guide iterative design, development and formative evaluation of our ontology-driven, visual exploratory discovery and analysis approach, to measure where it adds value to users' analytical activity. Our findings reinforce the potential of our approach and point us to future paths to build on.

    Semantic Data Management in Data Lakes

    Full text link
    In recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics. One way to prevent data lakes from turning into inoperable data swamps is semantic data management. Some approaches propose the linkage of metadata to knowledge graphs based on the Linked Data principles to provide more meaning and semantics to the data in the lake. Such a semantic layer may be utilized not only for data management but also to tackle the problem of data integration from heterogeneous sources, in order to make data access more expressive and interoperable. In this survey, we review recent approaches with a specific focus on their application within data lake systems and their scalability to Big Data. We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontology-based data access. In each category, we cover the main techniques and their background, and compare the latest research. Finally, we point out challenges for future work in this research area, which needs a closer integration of Big Data and Semantic Web technologies.
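The semantic-layer idea described in this abstract can be sketched in a few lines: metadata about raw files in the lake is expressed as triples that link datasets to knowledge-graph concepts, which then makes concept-based retrieval possible. This is a minimal illustration only; the dataset names, vocabulary URIs, and triple layout below are assumptions, not the API of any surveyed system.

```python
# Minimal sketch of a semantic metadata layer over a data lake.
# All names and URIs are illustrative assumptions.

DCT = "http://purl.org/dc/terms/"

# Metadata about raw files, as (subject, predicate, object) triples
# linking lake files to knowledge-graph (kg:) concepts.
triples = [
    ("lake:sales_2023.csv", DCT + "title", "Sales 2023"),
    ("lake:sales_2023.csv", DCT + "subject", "kg:Revenue"),
    ("lake:crm_dump.json", DCT + "subject", "kg:Customer"),
    ("kg:Revenue", "rdfs:seeAlso", "kg:Customer"),
]

def datasets_about(concept):
    """Find lake files whose metadata links them to a given concept."""
    return sorted({s for s, p, o in triples
                   if p == DCT + "subject" and o == concept})

print(datasets_about("kg:Revenue"))  # ['lake:sales_2023.csv']
```

In a real system the triples would live in an RDF store and the concepts in a curated knowledge graph, but the lookup pattern is the same.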

    A graph-based meta-model for heterogeneous data management

    Get PDF
    The wave of interest in data-centric applications has spawned a wide variety of data models, making it extremely difficult to evaluate, integrate or access them in a uniform way. Moreover, many recent models are too specific to allow immediate comparison with the others and do not easily support incremental model design. In this paper, we introduce GSMM, a meta-model based on the use of a generic graph that can be instantiated to a concrete data model by simply providing values for a restricted set of parameters and some high-level constraints, themselves represented as graphs. In GSMM, the concept of data schema is replaced by that of constraint, which allows the designer to impose structural restrictions on data in a very flexible way. GSMM includes GSL, a graph-based language for expressing queries and constraints that, besides being applicable to data represented in GSMM, can in principle be specialised and used for existing models where no language was defined. We show some sample applications of GSMM for deriving and comparing classical data models like the relational model, plain XML data, XML Schema, and time-varying semistructured data. We also show how GSMM can represent more recent modelling proposals: triple stores, the BigTable model and Neo4j, a graph-based model for NoSQL data. A prototype showing the potential of the approach is also described.
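The central idea of this abstract, that a concrete data model is a generic graph plus a set of constraints rather than a schema, can be illustrated with a toy sketch. The class names and the "relational" constraint below are invented for the example and do not reproduce GSMM or GSL themselves.

```python
# Hedged sketch of the graph-meta-model idea: data lives in one generic
# labelled graph, and a "model" is a set of constraint predicates over it.

class LabelledGraph:
    def __init__(self):
        self.nodes = {}   # node id -> label
        self.edges = []   # (source id, edge label, target id)

    def add_node(self, nid, label):
        self.nodes[nid] = label

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

def relational_constraint(g):
    """Toy 'relational model' instantiation: every node labelled 'row'
    must be linked to exactly one 'table' node by an 'in' edge."""
    rows = [n for n, lbl in g.nodes.items() if lbl == "row"]
    return all(
        sum(1 for s, e, d in g.edges
            if s == r and e == "in" and g.nodes.get(d) == "table") == 1
        for r in rows)

g = LabelledGraph()
g.add_node("t1", "table")
g.add_node("r1", "row")
g.add_edge("r1", "in", "t1")
print(relational_constraint(g))  # True
```

Swapping in a different constraint set (e.g. "every element node has at most one parent" for XML trees) would instantiate a different model over the same generic graph, which is the flexibility the paper describes.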

    Web technologies for environmental big data

    Get PDF
    Recent evolutions in computing science and web technology provide the environmental community with continuously expanding resources for data collection and analysis that pose unprecedented challenges to the design of analysis methods, workflows, and interaction with data sets. In the light of the recent UK Research Council funded Environmental Virtual Observatory pilot project, this paper gives an overview of currently available implementations of web-based technologies for processing large and heterogeneous datasets and discusses their relevance within the context of environmental data processing, simulation and prediction. We found that the processing of the simple datasets used in the pilot proved to be relatively straightforward using a combination of R, RPy2, PyWPS and PostgreSQL. However, the use of NoSQL databases and more versatile frameworks such as OGC standard based implementations may provide a wider and more flexible set of features that particularly facilitate working with larger volumes and more heterogeneous data sources.

    Application-agnostic Personal Storage for Linked Data

    Get PDF
    Recent advances in cloud-based applications and services have led to the continuous replacement of traditional desktop applications with corresponding SaaS solutions. These cloud applications are provided by different service providers and typically manage the identity and personal data of their users, such as contact details, by their own means. As a result, the identities and personal data of users have been spread over different applications and servers, each capturing a partial snapshot of the user's data at a certain moment in time. This has made the maintenance of personal data difficult and resource-consuming for service providers. Furthermore, this kind of data segregation has an overall negative effect on the user experience of end users, who need to repeatedly re-enter and maintain the same data in parallel to gain the maximum benefit from their applications. Finally, from an integration point of view, the sealing of user data has led to the adoption of point-to-point integration models between service providers, which limits the evolution of application ecosystems compared to models with content aggregators and brokers. In this thesis, we develop an application-agnostic personal storage which allows sharing user data among applications. This is achieved by extending the AppScale app store identity infrastructure with a personal data storage that can easily be accessed by any application in the cloud and that remains under the user's control. Usability of the data is leveraged through the adoption of linked data principles.
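The storage pattern this abstract describes, one user-controlled store that several applications read through shared vocabulary URIs and explicit access grants, can be sketched as follows. The class, the app names, and the FOAF predicate are illustrative assumptions; this is not the thesis's actual AppScale-based implementation.

```python
# Illustrative sketch of an application-agnostic personal store:
# each attribute is keyed by a shared vocabulary URI (linked data
# principle), so independent apps read the same fact instead of
# keeping private copies, and the user grants access per attribute.

FOAF = "http://xmlns.com/foaf/0.1/"

class PersonalStore:
    def __init__(self):
        self._data = {}       # predicate URI -> value
        self._grants = set()  # (app, predicate) pairs the user allowed

    def put(self, predicate, value):
        self._data[predicate] = value

    def grant(self, app, predicate):
        self._grants.add((app, predicate))

    def get(self, app, predicate):
        if (app, predicate) not in self._grants:
            raise PermissionError(f"{app} may not read {predicate}")
        return self._data[predicate]

store = PersonalStore()
store.put(FOAF + "mbox", "alice@example.org")
store.grant("calendar-app", FOAF + "mbox")
print(store.get("calendar-app", FOAF + "mbox"))  # alice@example.org
```

Because the predicate is a shared URI rather than an app-specific field name, a second application granted the same predicate reads the identical, up-to-date value, which is the de-duplication benefit the thesis targets.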

    Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web

    Get PDF
    If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore the GDP per capita from several data sources. However, the heterogeneity and size of the data remain a problem. This work presents methods to query a uniform view - the Global Cube - of available datasets from the Web and builds on Linked Data query approaches.
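The "uniform view" idea can be illustrated with a tiny sketch: observations from different Web sources share the same dimensions (here country, year, measure), so they can be merged into one cube and queried uniformly, with conflicting estimates kept visible side by side. The sources and figures below are made up for illustration and are not the thesis's Global Cube implementation.

```python
# Sketch: merge multidimensional observations from several sources
# into one cube keyed by shared dimensions. Values are illustrative.

source_a = {("DE", 2020, "gdp_per_capita"): 46200}
source_b = {("FR", 2020, "gdp_per_capita"): 39000,
            ("DE", 2020, "gdp_per_capita"): 46500}  # conflicting estimate

def build_cube(*sources):
    """Merge sources; keep every reported value so conflicts stay visible."""
    cube = {}
    for src in sources:
        for key, value in src.items():
            cube.setdefault(key, []).append(value)
    return cube

cube = build_cube(source_a, source_b)
print(cube[("DE", 2020, "gdp_per_capita")])  # [46200, 46500]
```

A real implementation would resolve dimension values against shared code lists (Linked Data URIs) before merging; the toy keys stand in for that step.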

    Physical database design in document stores

    Get PDF
    Thesis in cotutelle between Universitat PolitĂšcnica de Catalunya and UniversitĂ© libre de Bruxelles.
    NoSQL is an umbrella term used to classify alternative storage systems to the traditional Relational Database Management Systems (RDBMSs). Among these, document stores have gained popularity mainly due to their semi-structured data storage model and rich query capabilities. They encourage users to take a data-first approach as opposed to a design-first one. Database design in document stores is mainly carried out in a trial-and-error or ad-hoc rule-based manner rather than through a formal process such as normalization in an RDBMS. However, these approaches can easily lead to a non-optimal design, resulting in additional costs in the long run. This PhD thesis aims to provide a novel multi-criteria approach to database design in document stores. Most existing approaches are based on optimizing query performance; other factors, however, include the storage requirement and the complexity of the stored documents, which are specific to each use case. There is a large solution space of alternative designs due to the different combinations of referencing and nesting of data. Thus, we believe multi-criteria optimization is ideal to solve this problem. To achieve this, we need to address several issues that will enable us to apply multi-criteria optimization to the data design problem. First, we evaluate the impact of alternative storage representations of semi-structured data. There are multiple, equivalent ways to physically represent semi-structured data, but there is a lack of evidence about the potential impact on space and query performance. Thus, we embark on the task of quantifying that precisely for document stores. We empirically compare multiple ways of representing semi-structured data, allowing us to derive a set of guidelines for efficient physical database design considering both JSON and relational options in the same palette.
    Then, we need a formal canonical model that can represent alternative designs. We propose a hypergraph-based approach for representing heterogeneous data store designs: we extend and formalize an existing common programming interface to NoSQL systems as hypergraphs, define design constraints and query transformation rules for representative data store types, propose a simple query rewriting algorithm, and provide a prototype implementation together with a storage statistics estimator. Next, we require a formal query cost model to estimate and evaluate query performance on alternative document store designs. Document stores use primitive approaches to query processing, such as relying on the end user to specify the usage of indexes instead of a formal cost model, but we need a reliable approach to compare how alternative designs perform on a specific query. For this, we define a generic storage and query cost model based on disk access and memory allocation. As all document stores carry out data operations in memory, we first estimate memory usage by considering the characteristics of the stored documents, their access patterns, and the memory management algorithms. Then, using this estimation and the metadata storage size, we introduce a cost model for random-access queries. We validate our work on two well-known document store implementations, MongoDB and Couchbase. The results show that the memory usage estimates have an average precision of 91% and that the predicted costs are highly correlated with the actual execution times. During this work, we also managed to suggest several improvements to document stores. Finally, we implement the automated database design solution using multi-criteria optimization. We introduce an algebra of transformations that can systematically modify a design in our canonical representation. Then, using them, we implement a local search algorithm driven by a loss function that can propose near-optimal designs with high probability. We compare our prototype against an existing document store data design solution. Our proposed designs have better performance and are more compact, with less redundancy.
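The final step the abstract describes, a local search over alternative designs driven by a weighted multi-criteria loss, can be sketched very compactly. Everything below is invented for illustration: the design encoding (each field either nested or referenced), the toy cost terms, and the weights bear no relation to the thesis's actual cost model.

```python
# Toy sketch of multi-criteria design search for a document store:
# a design maps each field to "nest" (embed) or "ref" (reference),
# and a weighted loss trades query cost against space and complexity.

import random

random.seed(0)

FIELDS = ["orders", "address", "reviews"]

def loss(design, w_query=1.0, w_space=0.5, w_complex=0.2):
    nested = sum(1 for v in design.values() if v == "nest")
    query_cost = len(design) - nested   # referenced fields need extra lookups
    space_cost = nested                 # nesting duplicates shared data
    complexity = nested * 0.5           # deeply nested documents are harder
    return w_query * query_cost + w_space * space_cost + w_complex * complexity

def neighbours(design):
    """All designs reachable by flipping one field's representation."""
    for f in FIELDS:
        alt = dict(design)
        alt[f] = "ref" if design[f] == "nest" else "nest"
        yield alt

def local_search(start, steps=50):
    best = start
    for _ in range(steps):
        cand = min(neighbours(best), key=loss)
        if loss(cand) >= loss(best):
            break                       # local optimum reached
        best = cand
    return best

start = {f: random.choice(["nest", "ref"]) for f in FIELDS}
best = local_search(start)
print(best)
```

With these particular weights the loss always favours nesting, so the search converges to an all-nested design; changing the weights shifts the optimum, which is the point of a multi-criteria formulation.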

    The potential of semantic paradigm in warehousing of big data

    Get PDF
    Big data have analytical potential that was hard to realize with previously available technologies. After new storage paradigms intended for big data, such as NoSQL databases, emerged, traditional systems were pushed out of focus. Current research concentrates on reconciling the two on different levels, or on replacing one paradigm with the other. In particular, the emergence of NoSQL databases has started to push traditional (relational) data warehouses out of both research and practical focus. Data warehousing is known for its strict modelling process, capturing the essence of the business processes. For that reason, a mere integration to bridge the NoSQL gap is not enough; it is necessary to deal with this issue at a higher abstraction level, during the modelling phase. NoSQL databases generally lack a clear, unambiguous schema, making the comprehension of their contents difficult and their integration and analysis harder. This motivated involving semantic web technologies to enrich NoSQL database contents with additional meaning and context. This paper reviews the application of semantics in data integration and data warehousing and analyses its potential for integrating NoSQL data and traditional data warehouses, with some focus on document stores. It also proposes future research directions for the modelling phases of big data warehousing.
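The enrichment step this paper reviews, giving schema-less NoSQL contents enough semantics to be integrated with a warehouse, can be sketched as lifting a JSON document to triples through a small field-to-vocabulary mapping. The mapping, the vocabulary prefixes, and the document are assumptions made for this example only.

```python
# Illustrative sketch: a schema-less JSON document is lifted to
# (subject, predicate, object) triples via a hand-written mapping from
# ad-hoc field names to shared concepts, giving warehouse tooling a
# stable semantic view. All names are invented for the example.

import json

doc = json.loads('{"cust": "C42", "amt": 19.9, "when": "2024-03-01"}')

mapping = {
    "cust": "schema:customer",
    "amt":  "schema:amount",
    "when": "schema:date",
}

def lift(doc_id, document, field_mapping):
    """Turn one document into triples, skipping unmapped fields."""
    return [(doc_id, field_mapping[k], v)
            for k, v in document.items() if k in field_mapping]

triples = lift("order:1", doc, mapping)
print(triples)
```

Two document collections with different field names ("amt" vs "total") but the same mapping targets become directly joinable on the shared predicates, which is the integration benefit the paper argues for.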
    • 
