193 research outputs found

    Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web

    Get PDF
    If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore GDP per capita from several data sources. However, the heterogeneity and size of the data remain a problem. This work presents methods to query a uniform view - the Global Cube - of the datasets available on the Web, and builds on Linked Data query approaches.
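
    The abstract above describes querying a uniform view (the Global Cube) over statistical datasets published on the Web as Linked Data. As a hedged illustration of the kind of query involved, the Python sketch below asks a hypothetical SPARQL endpoint for RDF Data Cube observations; the endpoint URL is an assumption and not part of the original work, while the qb and sdmx-measure vocabularies are the standard W3C ones.

        # Illustrative sketch only: fetch RDF Data Cube observations from a
        # hypothetical SPARQL endpoint. The endpoint URL is assumed.
        from SPARQLWrapper import SPARQLWrapper, JSON

        ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint

        QUERY = """
        PREFIX qb: <http://purl.org/linked-data/cube#>
        PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
        SELECT ?obs ?dataset ?value WHERE {
          ?obs a qb:Observation ;
               qb:dataSet ?dataset ;
               sdmx-measure:obsValue ?value .
        }
        LIMIT 10
        """

        def fetch_observations():
            client = SPARQLWrapper(ENDPOINT)
            client.setQuery(QUERY)
            client.setReturnFormat(JSON)
            results = client.query().convert()
            # One binding per observation: the dataset it belongs to and its value.
            return [(b["dataset"]["value"], b["value"]["value"])
                    for b in results["results"]["bindings"]]

        if __name__ == "__main__":
            for dataset, value in fetch_observations():
                print(dataset, value)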

    Automating the multidimensional design of data warehouses

    Get PDF
    Previous experience in the data warehouse field has shown that the multidimensional conceptual schema of a data warehouse must be derived from a hybrid approach, i.e., one that considers both the end-user requirements and the data sources as first-class citizens. As in any other system, requirements guarantee that the system devised meets the end-user needs. In addition, since data warehouse design is a reengineering process, it must take the organization's underlying data sources into account: (i) to guarantee that the data warehouse can be populated with data available within the organization, and (ii) to allow the end-user to discover additional, previously unknown analysis capabilities.

    Several methods for supporting the data warehouse modeling task have been proposed, but they suffer from significant drawbacks. In short, requirement-driven approaches assume that the requirements are exhaustive and therefore do not consider that the data sources may contain additional interesting evidence for analysis, whereas data-driven approaches (those leading the design task from a thorough analysis of the data sources) try to derive as much multidimensional knowledge as possible from the sources and consequently generate too many results, which mislead the user. Automating the design task is essential in this scenario: it removes the dependency on an expert's ability to properly apply the chosen method and frees the designer from analyzing the data sources, a tedious and time-consuming task that can become unfeasible for large databases. Current automatable methods, however, follow a data-driven approach, while current requirement-driven approaches overlook automation because they work with requirements expressed at a level of abstraction that a computer cannot handle. The same situation recurs in the data-driven and requirement-driven stages of current hybrid approaches, which suffer from the same drawbacks as the pure approaches.

    This thesis introduces two approaches for automating the multidimensional design of the data warehouse: MDBE (Multidimensional Design Based on Examples) and AMDO (Automating the Multidimensional Design from Ontologies). Both were devised to overcome the limitations of current approaches; they start from opposite initial assumptions, but both treat the end-user requirements and the data sources as first-class citizens. 1. MDBE follows a classical approach in which the end-user requirements are well known beforehand. It benefits from the knowledge captured in the data sources but guides the design task from the requirements, and it can therefore handle semantically poorer data sources: given high-quality end-user requirements, the process can be driven by the knowledge they contain, overcoming data sources that do not properly capture the domain. 2. AMDO, conversely, assumes a scenario in which the available data sources are semantically rich. The process is therefore guided by a thorough analysis of the data sources, whose output is then shaped and adapted to the end-user requirements. In this context, high-quality data sources compensate for the lack of expressive end-user requirements.

    Together, the two methods establish a combined framework that can be used to decide, according to the inputs available in each scenario, which approach to follow. For example, one cannot follow the same approach in a scenario where the end-user requirements are clear and well known and in one where they are not evident or cannot easily be elicited (which typically happens when users are not aware of the analysis capabilities of their own sources). Interestingly, the need for requirements up front is reduced by having semantically rich data sources; when such sources are lacking, requirements gain relevance for extracting the multidimensional knowledge. We therefore claim to provide two approaches whose combination is exhaustive with regard to the scenarios discussed in the literature.
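
    Neither MDBE nor AMDO is described above in enough detail to reproduce. Purely as an illustration of the hybrid idea (letting both the data sources and the requirements constrain the multidimensional schema), the sketch below applies two common design heuristics to a toy relational catalog: numeric non-key columns become candidate measures, foreign keys become candidate dimensions, and stated requirements filter the candidates. All table, column and requirement names are hypothetical, and the heuristics are generic rather than the thesis' algorithms.

        # Illustrative sketch of a hybrid (requirements + data sources) heuristic for
        # proposing a multidimensional schema. It is NOT the MDBE or AMDO algorithm.
        from dataclasses import dataclass, field

        @dataclass
        class Table:
            name: str
            columns: dict[str, str]                              # column name -> SQL type
            primary_key: set[str] = field(default_factory=set)
            foreign_keys: dict[str, str] = field(default_factory=dict)  # column -> referenced table

        NUMERIC_TYPES = {"int", "bigint", "decimal", "float"}

        def candidate_schema(fact: Table, requirements: set[str]):
            """Return (measures, dimensions) suggested by the sources and filtered by requirements."""
            measures = [
                c for c, t in fact.columns.items()
                if t in NUMERIC_TYPES and c not in fact.primary_key and c not in fact.foreign_keys
            ]
            dimensions = list(fact.foreign_keys.values())
            # Requirement-driven filtering: keep only concepts the user asked to analyze,
            # falling back to the data-driven candidates if no requirement mentions them.
            wanted_measures = [m for m in measures if m in requirements] or measures
            wanted_dims = [d for d in dimensions if d in requirements] or dimensions
            return wanted_measures, wanted_dims

        # Hypothetical sales fact table and user requirements.
        sales = Table(
            name="sales",
            columns={"sale_id": "int", "amount": "decimal", "quantity": "int",
                     "date_id": "int", "product_id": "int", "store_id": "int"},
            primary_key={"sale_id"},
            foreign_keys={"date_id": "date", "product_id": "product", "store_id": "store"},
        )
        print(candidate_schema(sales, {"amount", "product", "store"}))
        # -> (['amount'], ['product', 'store'])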

    THE ROLE OF BOUNDARY OBJECTS AND BOUNDARY SPANNING IN DATA WAREHOUSING – A RESEARCH-IN-PROGRESS REPORT

    Get PDF
    Data warehouse projects bring together different communities of practice, with the primary objective of producing a single body of information that can yield comparative advantages in business analysis. Due to the number of communities involved and the complexity of their collaboration, data warehouse projects are costly. In this paper we take a closer look at communication problems on the boundaries between the communities participating in data warehouse projects. Our analysis highlights a potential relation between the early creation of shared language communities among the participants and lower data warehouse project development costs. As of today, hardly any methodology is available for analyzing and aligning mutual understanding between data warehouse project participants. Based on this discussion, we propose a data warehouse development scheme for project improvement as a first step in a design science project.

    DataOps for Societal Intelligence: a Data Pipeline for Labor Market Skills Extraction and Matching

    Full text link
    Big Data analytics supported by AI algorithms can enable skill localization and retrieval in the context of a labor market intelligence problem. We formulate and solve this problem through specific DataOps models, blending data sources from administrative and technical partners in several countries into a cooperation that creates shared knowledge to support policy and decision-making. We then focus on the critical task of extracting skills from resumes and vacancies using state-of-the-art machine learning models. We showcase preliminary results with applied machine learning on real data from the employment agencies of the Netherlands and the Flemish region in Belgium. The final goal is to match these skills to standard ontologies of skills, jobs and occupations.
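
    The abstract mentions matching extracted skills to standard ontologies of skills, jobs and occupations. The sketch below shows one generic way such matching is often approximated, using TF-IDF cosine similarity from scikit-learn; it is an assumption for illustration rather than the pipeline's actual model, and the skill labels are invented stand-ins for an ESCO-like taxonomy.

        # Illustrative sketch: match free-text skill mentions to canonical ontology
        # labels via TF-IDF cosine similarity. Not the paper's actual matching model.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Hypothetical canonical labels (stand-ins for an ESCO-like skills ontology).
        ontology_labels = [
            "data analysis",
            "project management",
            "customer service",
            "machine learning",
        ]

        def match_skills(extracted, labels=ontology_labels):
            vectorizer = TfidfVectorizer()
            # Fit on labels and mentions together so both share one vocabulary.
            matrix = vectorizer.fit_transform(labels + extracted)
            label_vecs, mention_vecs = matrix[:len(labels)], matrix[len(labels):]
            similarities = cosine_similarity(mention_vecs, label_vecs)
            # For each extracted mention, return the best-scoring ontology label.
            return [(mention, labels[row.argmax()], round(float(row.max()), 3))
                    for mention, row in zip(extracted, similarities)]

        print(match_skills(["statistical data analysis", "management of projects"]))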

    An Approach To Publish a Data Warehouse Content as Linked Data

    Get PDF
    Master's dissertation in Informatics Engineering, specialization in Knowledge and Decision Technologies.

    Organizations keep gathering huge amounts of data and information and storing them in data warehouses (DW) for reporting and data-analysis purposes. Most of those DW rely on relational database (RDB) management systems and are structured by a schema (e.g. star schema, snowflake schema). On the other hand, with the advent of the Semantic Web, organizations are being pushed to add semantics (i.e. metadata) to their own data in order to find, share, combine and reuse information more easily across applications, organizations and community boundaries. The goal of the Semantic Web is to enable computers to perform more complex jobs through the principles of Linked Data. To that end, the W3C proposes the adoption of standards such as RDF, OWL and SPARQL, which help to expose and access data and its semantics by using logical structures called ontologies. Simply put, an ontology captures the vocabulary and interpretation restrictions of a particular application domain (i.e. its concepts, their relations and restrictions), which is then used to describe a set of specific data (instances) for that domain.

    In this context, the work described in this document explores and analyzes (i) the vocabulary recommended by the W3C to describe a data cube represented in RDF and (ii) the languages, also recommended by the W3C, for mapping relational databases to RDF, in order to propose their application in a semi-automatic process that publishes semantically, quickly and easily, the content of an existing DW stored in a relational database, in accordance with the principles of Linked (Open) Data. The semi-automatic process can save time and money in creating a data repository with an ontology, which can be used as a standard "facade" over the content of the data warehouse for use with Semantic Web technologies.

    The semi-automatic process consists of four sub-processes (cf. chapter 6). The first, the Setup and Configuration Process (cf. section 6.2.2), selects and categorizes the data warehouse tables (cf. chapter 2) from which the data will be extracted. The second, the RDF Data Cube Ontology Structure Definition Process (cf. section 6.2.3), creates an ontology structure, without data, based both on the vocabulary recommended by the W3C for describing data cubes (cf. chapter 5) and on the results of the Setup and Configuration Process. The third, the Mappings Specification Process (cf. section 6.2.4), creates a mapping between the data warehouse and the resulting ontology using the W3C recommendation RDB2RDF R2RML. The fourth and last, the Mapping Execution Process (cf. section 6.2.5), exposes the data warehouse data according to that ontology through the mapping generated by the Mappings Specification Process.

    The thesis is organized into seven chapters. Chapter 1 introduces the context and goal of the document. Chapter 2 gives an overview of data warehouses, whose structure and data are used by the semi-automatic process to create the data repository. Chapter 3 analyzes Linked Data: its concept, its principles and the languages that can express it, one of which (RDF or OWL), combined with a serialization (e.g. XML, N-Triples), is used to describe the data repository the process can create. Chapter 4 surveys languages and technologies for mapping RDB to RDF, of which R2RML is used to create the mappings between a data warehouse and the data repository. Chapter 5 presents the W3C-recommended vocabulary for describing a data cube, which is used to classify the data repository created by the process. Chapter 6 presents and describes the proposed semi-automatic process with a running example that evolves through each implemented step. Chapter 7 contains the conclusions of this work, some of its limitations, and suggestions for future work that could be added to the semi-automatic process.
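
    As a rough illustration of the second sub-process (defining an RDF Data Cube structure for a fact table), the hedged sketch below uses rdflib to emit a minimal qb:DataStructureDefinition. The table and column names are invented, and in the actual process the link to the relational data (sub-processes three and four) is expressed with R2RML rather than built this way.

        # Illustrative sketch: describe a hypothetical sales fact table as an RDF Data
        # Cube structure (W3C qb vocabulary) with rdflib. Not the thesis' actual tool.
        from rdflib import Graph, Namespace
        from rdflib.namespace import RDF

        QB = Namespace("http://purl.org/linked-data/cube#")
        EX = Namespace("http://example.org/dw#")   # hypothetical base namespace

        g = Graph()
        g.bind("qb", QB)
        g.bind("ex", EX)

        dsd = EX["salesStructure"]
        g.add((dsd, RDF.type, QB.DataStructureDefinition))

        # One component per dimension column and one per measure column of the fact table.
        for dimension in ("date", "product", "store"):
            component = EX[f"comp_{dimension}"]
            g.add((dsd, QB.component, component))
            g.add((component, QB.dimension, EX[dimension]))

        measure_component = EX["comp_amount"]
        g.add((dsd, QB.component, measure_component))
        g.add((measure_component, QB.measure, EX["amount"]))

        print(g.serialize(format="turtle"))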

    Services for the automatic evaluation of matching tools

    Get PDF
    In this deliverable we describe a SEALS evaluation service for ontology matching that is based on a web service interface to be implemented by the tool vendor. Following this approach, we can offer an evaluation service before many components of the SEALS platform have been finished. We describe both the system architecture of the evaluation service from a general point of view and the specific components and their relation to the modules of the SEALS platform.
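
    The deliverable relies on a web service interface that tool vendors implement so the evaluation service can invoke their matcher remotely. The exact SEALS interface is not reproduced here; the sketch below is only a hypothetical minimal HTTP endpoint (using Flask) that accepts two ontology URLs and returns an alignment, to convey the idea.

        # Hypothetical sketch of a matcher web service a tool vendor might expose so an
        # evaluation service can call it; the route and payload are NOT the SEALS interface.
        from flask import Flask, jsonify, request

        app = Flask(__name__)

        def run_matcher(source_url, target_url):
            # Placeholder for the vendor's actual matching logic.
            return [{"entity1": source_url + "#Person",
                     "entity2": target_url + "#Human",
                     "relation": "=",
                     "confidence": 0.9}]

        @app.post("/align")
        def align():
            payload = request.get_json(force=True)
            correspondences = run_matcher(payload["source"], payload["target"])
            # The evaluation service would compare this alignment with a reference one.
            return jsonify({"correspondences": correspondences})

        if __name__ == "__main__":
            app.run(port=8080)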

    The POESIA approach to data and service integration on the Semantic Web

    Get PDF
    Advisor: Claudia Bauzer Medeiros. Doctoral thesis in Computer Science, Universidade Estadual de Campinas, Instituto de Computação. POESIA (Processes for Open-Ended Systems for Information Analysis), the approach proposed in this work, supports the construction of complex processes that involve the integration and analysis of data from several sources, particularly in scientific applications. The approach is centered on two types of Semantic Web mechanisms: scientific workflows, to specify and compose Web services; and domain ontologies, to enable semantic interoperability and management of data and processes. The main contributions of this thesis are: (i) a theoretical framework to describe, discover and compose data and services on the Web, including rules to check the semantic consistency of compositions of these resources; (ii) ontology-based methods to help data integration and estimate data provenance in cooperative processes on the Web; (iii) partial implementation and validation of the proposal in a real application in the domain of agricultural planning, analyzing the benefits and the efficiency and scalability limitations of current Semantic Web technology when faced with large volumes of data.
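
    The thesis lists rules for checking the semantic consistency of compositions of data and services. As a generic, hedged illustration of one such rule (not POESIA's actual formalism), the sketch below checks that each service's output concept is subsumed by the input concept expected by the next service in a workflow, using a toy ontology with hypothetical concepts loosely inspired by the agricultural-planning example.

        # Toy illustration of ontology-based composition checking; the concepts and
        # services are hypothetical and the rule is generic, not POESIA's own.
        SUBCLASS_OF = {            # child concept -> parent concept
            "SatelliteImage": "Image",
            "Image": "Data",
            "CropYieldEstimate": "Data",
        }

        def subsumed_by(concept, ancestor):
            """True if `concept` equals `ancestor` or is a (transitive) subclass of it."""
            while concept is not None:
                if concept == ancestor:
                    return True
                concept = SUBCLASS_OF.get(concept)
            return False

        def composition_is_consistent(workflow):
            """Each service's output must be acceptable as the next service's input."""
            for producer, consumer in zip(workflow, workflow[1:]):
                if not subsumed_by(producer["output"], consumer["input"]):
                    return False
            return True

        workflow = [
            {"name": "acquire_imagery", "input": "Data", "output": "SatelliteImage"},
            {"name": "classify_crops", "input": "Image", "output": "CropYieldEstimate"},
            {"name": "plan_harvest", "input": "Data", "output": "Data"},
        ]
        print(composition_is_consistent(workflow))  # True: SatelliteImage < Image, CropYieldEstimate < Data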

    Design and implementation of an autonomous, proactive, and reactive software infrastructure to help improving the management level of projects

    Get PDF
    Dissertation presented at the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, for the degree of Master in Electrical and Computer Engineering. Over the years, collaboration between humans and organizations has been increasing and has become vital to face new challenges and achieve common goals. The development of new technologies and internet capabilities has promoted the emergence of new types of collaboration, i.e., collaboration using software connected through the internet (collaborative workspace software). The use of the internet widens the range of action and speeds up communication among the actors involved in a collaboration. Collaboration among organizations is project-oriented (the common goal is to carry out projects), with the actors involved sharing their knowledge with each other. These actors are the knowledge holders, and the system supporting the collaboration has to collect and assess the knowledge they produce. For this reason, this thesis aims to design and implement a software infrastructure to capture and capitalize on the knowledge created over several projects. The software is human-centered and has an autonomous, proactive and reactive behaviour to handle users' needs. It promotes its own continuous learning by analysing human behaviour over several projects, extracting information from that behaviour, and being context-aware. Additionally, it relies on data mining technologies and semantic services to provide continuous monitoring of the whole project during its life cycle. The software developed is called "Companion" and has been assessed as part of the CoSpaces Integrated Project.

    A knowledge management architecture for information technology services delivery

    Get PDF
    Knowledge Management is a scientific area concerned with the organizational value of knowledge and is understood as a multidisciplinary field of research. Its notions and practices are emerging and being incorporated into organizations in different areas, as is the case of IT Service Management. Today's business environment is increasingly unstable, characterized by uncertainty and change: technology changes rapidly, competitors multiply, and products and services quickly become obsolete. In this context, management is increasingly focused not only on managing people, but also on the knowledge they hold and how to capture it. An information system aligned with Knowledge Management and Intellectual Capital aims to represent and explicitly manage the different dimensions associated with an organizational competence. If organizations integrate knowledge competencies, knowledge engineering, information systems and organizational memories, they will improve the organization's knowledge and subsequently the quality of the service provided to users and customers. This research uses the Design Science Research methodology to create an artifact to be applied in a case study of an organization aligned with ITIL best practices, whose laptop repair process is supported by an intranet and an ERP system. The outcome of this dissertation aims to demonstrate whether Knowledge Management improves IT service delivery.