
    UK utility data integration: overcoming schematic heterogeneity

    In this paper we discuss syntactic, semantic and schematic issues which inhibit the integration of utility data in the UK. We then focus on the techniques employed within the VISTA project to overcome schematic heterogeneity. A Global Schema based architecture is employed. Although automated approaches to Global Schema definition were attempted, the heterogeneities of the sector proved too great, so a manual approach to Global Schema definition was employed. The techniques used to define this schema and subsequently map source utility data models to it are discussed in detail. In order to ensure a coherent integrated model, sub-domain and cross-domain validation issues are then highlighted. Finally, the proposed framework and data flow for schematic integration are introduced.
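
    As a rough illustration of the kind of manual source-to-global mapping the paper describes, the following Python sketch maps records from two hypothetical utility sources onto a small global schema; all field names and units are invented for the example and are not taken from VISTA.

```python
# Minimal sketch (not the VISTA implementation): mapping records from two
# hypothetical utility sources onto a manually defined global schema.
# Field names and units here are illustrative assumptions.

GLOBAL_SCHEMA = ["asset_id", "asset_type", "depth_m", "material"]

# Manually authored mappings from each source data model to the global schema.
SOURCE_MAPPINGS = {
    "water_utility": {
        "asset_id": lambda r: r["PIPE_REF"],
        "asset_type": lambda r: "water_main",
        "depth_m": lambda r: float(r["DEPTH_MM"]) / 1000.0,   # mm -> m
        "material": lambda r: r["MATERIAL"].lower(),
    },
    "gas_utility": {
        "asset_id": lambda r: r["AssetNo"],
        "asset_type": lambda r: "gas_main",
        "depth_m": lambda r: float(r["CoverDepth"]),          # already metres
        "material": lambda r: r["Mat"].lower(),
    },
}

def to_global(source: str, record: dict) -> dict:
    """Apply the manual mapping for one source record."""
    mapping = SOURCE_MAPPINGS[source]
    return {field: mapping[field](record) for field in GLOBAL_SCHEMA}

if __name__ == "__main__":
    water = {"PIPE_REF": "W-001", "DEPTH_MM": "900", "MATERIAL": "PE"}
    gas = {"AssetNo": "G-17", "CoverDepth": "0.75", "Mat": "Steel"}
    print(to_global("water_utility", water))
    print(to_global("gas_utility", gas))
```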

    Methodological aspects in cross-national research

    Most of the contributions in this volume originate from several conferences of Research Committee 33 (Logic and Methodology) of the International Sociological Association. They centre on questions of measurement as well as comparability, reliability and validity in internationally comparative empirical research. The contributions are grouped into four thematic sections. The first part deals with the design and implementation of cross-cultural studies (instrumentation, the Question Appraisal System, EU projects, questionnaire comprehension, interpretation of results). The second part is devoted to various aspects of the "equivalence" problem, above all with regard to the International Social Survey Programme (ISSP) and the European Social Survey (ESS). The third part addresses the harmonisation of socio-demographic information in different types of surveys (official statistics, ESS, ISSP). The concluding fourth part discusses socio-economic variables in cross-national perspective (income, education, occupation, ethnicity, religion). (ICE)
    "Cross-national and cross-cultural survey research has been growing apace for several decades and interest in how best to do them has possibly never been greater. At the International Sociological Association Research Committee 33 (Logic and Methodology) several sessions were dedicated to cross-cultural cross-national survey methodology and the vast majority of the papers in this volume were presented at that conference. Researchers involved in comparative research have always been worried about measurement issues, comparability, reliability and validity of their data. But the design and execution of comparative studies has changed markedly since the early cross-national projects of the nineteen sixties and nineteen seventies." (excerpt)
    Contents: Jürgen H.P. Hoffmeyer-Zlotnik, Janet A. Harkness: Methodological aspects in cross-national research: foreword (5-10).
    I. Designing and implementing cross-cultural surveys - Johnny Blair, Linda Piccinino: The development and testing of instruments for cross-cultural and multi-cultural surveys (13-30); Elizabeth Dean, Rachel Caspar, Georgina McAvinchey, Leticia Reed, Rosanna Quiroz: Developing a low-cost technique for parallel cross-cultural instrument development: the Question Appraisal System (QAS-04) (31-46); Felizitas Sagebiel: Using a mixed international comparable methodological approach in a European Commission project on gender and engineering (47-64); Timothy P. Johnson, Young Ik Cho, Allyson Holbrook, Diane O'Rourke, Richard Warnecke, Noel Chávez: Cultural variability in the effects of question design features on respondent comprehension (65-78); Kristen Miller, Gordon Willis, Connie Eason, Lisa Moses, Beth Canfield: Interpreting the results of cross-cultural cognitive interviews: a mixed-method approach (79-92).
    II. Different issues of comparability or "equivalence" - Michael Braun, Janet A. Harkness: Text and context: challenges to comparability in survey questions (95-108); Nina Rother: Measuring attitudes towards immigration across countries with the ESS: potential problems of equivalence (109-126); Vlasta Zucha: The level of equivalence in the ISSP 1999 and its implications on further analysis (127-146).
    III. Harmonising socio-demographic information in different types of surveys - Thomas Körner, Iris Meyer: Harmonising socio-demographic information in household surveys of official statistics: experiences from the Federal Statistical Office Germany (149-162); Kirstine Kolsrud, Knut Kalgraff Skjak: Harmonising background variables in the European Social Survey (163-182); Evi Scholz: Harmonisation of survey data in the International Social Survey Programme (ISSP) (183-200).
    IV. Socio-economic variables in cross-national perspective - Uwe Warner, Jürgen H.P. Hoffmeyer-Zlotnik: Measuring income in comparative social survey research (203-222); Jürgen H.P. Hoffmeyer-Zlotnik, Uwe Warner: How to measure education in cross-national comparison: Hoffmeyer-Zlotnik/Warner-Matrix of Education as a new instrument (223-240); Harry B.G. Ganzeboom: On the cost of being crude: a comparison of detailed and coarse occupational coding in the ISSP 1987 data (241-258); Paul S. Lambert: Ethnicity and the comparative analysis of contemporary survey data (259-278); Christof Wolf: Measuring religious affiliation and religiosity in Europe (279-294).

    Development of the BIRD: a metadata modelling approach for the purpose of harmonising supervisory reporting at the European Central Bank - Directorate of general statistics: master and metadata

    Internship report presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Information Analysis and Management. The work presented is a report documenting the work completed during an internship at the European Central Bank (ECB), located in Frankfurt, Germany, from 15 March 2019 to 15 March 2020. The internship took place in the Directorate of General Statistics (DG-S), specifically in the Master and Metadata section of the Analytical Credit and Master Data division (MAM). The work was a continuation of the ECB's internal Banks' Integrated Reporting Dictionary (BIRD) project as well as the management of the ECB's centralised metadata repository, known as the Single Data Dictionary (SDD). The purpose of the dictionary and the BIRD project is to provide banks with a harmonized data model that describes precisely the data that should be extracted from the banks' internal IT systems to derive the reports demanded by supervisory authorities such as the ECB. In this report, I provide a basis for understanding the work undertaken in the team, focusing on the technical aspects of relational database modelling and metadata repositories and their role in big data analytical processing systems, as well as the current reporting requirements and methods used by central banking institutions, which coincide with the processes set out by the European Banking Authority (EBA). The report also provides an in-depth look into the structure of the database and the principles followed to create the data model, and it documents the process by which the SDD is maintained and updated to meet changing needs. It further describes the process undertaken by the BIRD team and supporting members of the banking community to introduce new reporting frameworks into the data model. During this period, the framework for the Financial Reporting (FinRep) standards was included through a collaborative effort between banking representatives and the Master and Metadata team.
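
    To make the idea of a relational metadata repository concrete, the sketch below builds a toy data dictionary in SQLite in which variables are constrained by domains that enumerate their valid members. The table layout and codes are illustrative assumptions, not the actual SDD or BIRD schema.

```python
# Illustrative sketch only: a toy metadata repository in the spirit of a
# data dictionary that describes variables and their allowed code lists.
# Table and column names are assumptions, not the actual SDD/BIRD schema.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE domain (
    domain_id   TEXT PRIMARY KEY,
    description TEXT
);
CREATE TABLE member (
    member_id   TEXT PRIMARY KEY,
    domain_id   TEXT REFERENCES domain(domain_id),
    description TEXT
);
CREATE TABLE variable (
    variable_id TEXT PRIMARY KEY,
    domain_id   TEXT REFERENCES domain(domain_id),
    description TEXT
);
""")

# A variable is constrained by a domain, which enumerates its valid members.
cur.execute("INSERT INTO domain VALUES ('CRRNCY', 'Currency codes')")
cur.executemany("INSERT INTO member VALUES (?, ?, ?)",
                [("EUR", "CRRNCY", "Euro"), ("USD", "CRRNCY", "US dollar")])
cur.execute("INSERT INTO variable VALUES ('OBS_CRRNCY', 'CRRNCY', 'Observation currency')")

# Look up the code list that constrains a given variable.
cur.execute("""
SELECT m.member_id, m.description
FROM variable v JOIN member m ON m.domain_id = v.domain_id
WHERE v.variable_id = 'OBS_CRRNCY'
""")
print(cur.fetchall())
conn.close()
```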

    Utilising Semantic Web Technologies for Improved Road Network Information Exchange

    Road asset data harmonisation is a challenge for the Australian road and transport authorities, given their heterogeneous data standards, data formats and tools. Classic data harmonisation techniques require huge databases with many tables, a unified metadata definition and standardised tools to share data with others. In order to find a better way to harmonise heterogeneous road network data, this dissertation uses Semantic Web technologies to investigate fast and efficient road asset data harmonisation.
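
    A minimal sketch of the general approach, assuming rdflib is installed: road-asset records originating from two different authorities are expressed against a shared RDF vocabulary so that a single SPARQL query spans both. The vocabulary and property names are invented for illustration only.

```python
# Minimal sketch, assuming rdflib is available (pip install rdflib):
# two road-asset records from different authorities are expressed against a
# shared, hypothetical vocabulary so a single SPARQL query covers both.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/road#")  # illustrative vocabulary, not a real standard
g = Graph()
g.bind("ex", EX)

# Authority A: pavement section with surface type recorded as 'asphalt'.
g.add((EX.sectionA1, RDF.type, EX.RoadSection))
g.add((EX.sectionA1, EX.surfaceType, Literal("asphalt")))
g.add((EX.sectionA1, EX.lengthKm, Literal(2.4)))

# Authority B: the same concept, originally held in a different schema,
# mapped to the same properties during conversion to RDF.
g.add((EX.sectionB7, RDF.type, EX.RoadSection))
g.add((EX.sectionB7, EX.surfaceType, Literal("concrete")))
g.add((EX.sectionB7, EX.lengthKm, Literal(1.1)))

# One query now spans data that started out in heterogeneous formats.
results = g.query("""
    PREFIX ex: <http://example.org/road#>
    SELECT ?section ?surface ?length
    WHERE { ?section a ex:RoadSection ;
                     ex:surfaceType ?surface ;
                     ex:lengthKm ?length . }
""")
for row in results:
    print(row.section, row.surface, row.length)
```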

    Mining complex trees for hidden fruit : a graph–based computational solution to detect latent criminal networks : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Technology at Massey University, Albany, New Zealand.

    The detection of crime is a complex and difficult endeavour. Public and private organisations – focusing on law enforcement, intelligence, and compliance – commonly apply the rational isolated actor approach, premised on observability and materiality. This is manifested largely as entity-level risk management that sources ‘leads’ from reactive covert human intelligence sources and/or proactive sources and applies simple rules-based models. Focusing on discrete observable and material actors ignores the fact that criminal activity exists within a complex system deriving its fundamental structural fabric from the complex interactions between actors, with the least observable actors likely to be both criminally proficient and influential. The graph-based computational solution developed to detect latent criminal networks is a response to the inadequacy of the rational isolated actor approach, which ignores the connectedness and complexity of criminality. The core computational solution, written in the R language, consists of novel entity resolution, link discovery, and knowledge discovery technology. Entity resolution enables the fusion of multiple datasets with high accuracy (mean F-measure of 0.986 versus competitors' 0.872), generating a graph-based expressive view of the problem. Link discovery comprises link prediction and link inference, enabling the high-performance detection (accuracy of ~0.8 versus relevant published models' ~0.45) of unobserved relationships such as identity fraud. Knowledge discovery takes the fused graph and applies the “GraphExtract” algorithm to create a set of subgraphs representing latent functional criminal groups, and a mesoscopic graph representing how this set of criminal groups is interconnected. Latent knowledge is generated from a range of metrics, including the “Super-broker” metric and attitude prediction. The computational solution has been evaluated on a range of datasets that mimic an applied setting, demonstrating a scalable (tested on ~18 million node graphs) and performant (~33 hours runtime on a non-distributed platform) solution that successfully detects relevant latent functional criminal groups in around 90% of cases sampled and enables contextual understanding of the broader criminal system through the mesoscopic graph and associated metadata. The augmented data assets generated provide a multi-perspective systems view of criminal activity that enables advanced, informed decision making across the microscopic, mesoscopic and macroscopic spectrum.
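
    The toy Python example below is not the thesis's entity resolution or GraphExtract algorithms; it only illustrates the general flavour of the approach, assuming networkx is installed and using invented names and thresholds: records from two small datasets are fused by fuzzy name matching into a graph, and connected components stand in for groups of related actors.

```python
# Toy illustration only (not the thesis's entity-resolution or GraphExtract
# algorithms): fuse two small record sets by fuzzy name matching, build a
# graph of resolved entities, and pull out connected groups.
from difflib import SequenceMatcher
import networkx as nx

dataset_a = [("a1", "Jon Smith"), ("a2", "Mary Jones"), ("a3", "Li Wei")]
dataset_b = [("b1", "John Smith"), ("b2", "M. Jones"), ("b3", "Petra Novak")]

def similar(x: str, y: str, threshold: float = 0.75) -> bool:
    """Crude string similarity used as a stand-in for entity resolution."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio() >= threshold

G = nx.Graph()
for rid, name in dataset_a + dataset_b:
    G.add_node(rid, name=name)

# Entity-resolution step: link records whose names are near-duplicates.
for rid_a, name_a in dataset_a:
    for rid_b, name_b in dataset_b:
        if similar(name_a, name_b):
            G.add_edge(rid_a, rid_b, kind="same_entity")

# Known interactions between records (e.g. shared transactions) as extra edges.
G.add_edge("a1", "a2", kind="interaction")
G.add_edge("b3", "a3", kind="interaction")

# Connected components stand in here for latent groups of related actors.
for group in nx.connected_components(G):
    print(sorted(group))
```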

    Automated Identification of Digital Evidence across Heterogeneous Data Resources

    Digital forensics has become an increasingly important tool in the fight against cyber and computer-assisted crime. However, with an increasing range of technologies at people’s disposal, investigators find themselves having to process and analyse many systems with large volumes of data (e.g., PCs, laptops, tablets, and smartphones) within a single case. Unfortunately, current digital forensic tools operate in an isolated manner, investigating systems and applications individually. The heterogeneity and volume of evidence place time constraints and a significant burden on investigators. Examples of heterogeneity include applications such as messaging (e.g., iMessenger, Viber, Snapchat, and WhatsApp), web browsers (e.g., Firefox and Google Chrome), and file systems (e.g., NTFS, FAT, and HFS). Being able to analyse and investigate evidence from across devices and applications in a universal and harmonised fashion would enable investigators to query all data at once. In addition, successfully prioritising evidence and reducing the volume of data to be analysed reduces the time taken and the cognitive load on the investigator. This thesis focuses on the examination and analysis phases of the digital investigation process. It explores the feasibility of dealing with big and heterogeneous data sources in order to correlate the evidence from across these evidential sources in an automated way. A novel approach was therefore developed to solve the heterogeneity issues of big data using three developed algorithms: the harmonising, clustering, and automated identification of evidence (AIE) algorithms. The harmonisation algorithm seeks to provide an automated framework to merge similar datasets by characterising similar metadata categories and then harmonising them in a single dataset. This algorithm overcomes heterogeneity issues and makes examination and analysis easier by allowing the evidential artefacts across devices and applications to be analysed and investigated on the basis of the shared categories, so that data can be queried at once. Based on the merged datasets, the clustering algorithm is used to identify the evidential files and isolate the non-related files based on their metadata. Afterwards, the AIE algorithm tries to identify the cluster holding the largest number of evidential artefacts by searching based on two methods: criminal profiling activities and some information from the criminals themselves. Then, the related clusters are identified through timeline analysis and a search of associated artefacts of the files within the first cluster. A series of experiments using real-life forensic datasets was conducted to evaluate the algorithms across five different categories of datasets (i.e., messaging, graphical files, file system, internet history, and emails), each containing data from different applications across different devices. The results of the characterisation and harmonisation process show that the algorithm can merge all fields successfully, with the exception of some binary-based data found within the messaging datasets (contained within Viber and SMS). The error occurred because of a lack of information for the characterisation process to make a useful determination. However, on further analysis, it was found that the error had a minimal impact on subsequent merged data. The results of the clustering process and the AIE algorithm showed that the two algorithms can collaborate and identify more than 92% of evidential files.
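
    The following pandas sketch illustrates the general idea behind the harmonisation step, using invented field names rather than the thesis's actual algorithm: metadata fields from two messaging sources are characterised against common categories and merged into a single table that can be queried at once.

```python
# Rough sketch (hypothetical field names, not the thesis's algorithms):
# characterise similar metadata fields from two messaging sources and merge
# them into one harmonised table so both can be queried together.
import pandas as pd

viber = pd.DataFrame({
    "msg_time": ["2021-03-01 10:02", "2021-03-01 10:05"],
    "sender_number": ["+447700900001", "+447700900002"],
    "body": ["see you at 10", "ok"],
})
whatsapp = pd.DataFrame({
    "timestamp": ["2021-03-01 11:30"],
    "from": ["+447700900003"],
    "text": ["received the files"],
})

# Manually characterised correspondence between source fields and the
# harmonised categories (timestamp, sender, content).
CATEGORY_MAP = {
    "viber": {"msg_time": "timestamp", "sender_number": "sender", "body": "content"},
    "whatsapp": {"timestamp": "timestamp", "from": "sender", "text": "content"},
}

def harmonise(frame: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename source-specific columns to the shared categories."""
    out = frame.rename(columns=CATEGORY_MAP[source])
    out["source"] = source
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    return out[["timestamp", "sender", "content", "source"]]

merged = pd.concat([harmonise(viber, "viber"), harmonise(whatsapp, "whatsapp")],
                   ignore_index=True)
# A single query now covers artefacts from both applications.
print(merged.sort_values("timestamp"))
```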

    Enriching information extraction pipelines in clinical decision support systems

    Programa Oficial de Doutoramento en Tecnoloxías da Información e as Comunicacións. 5032V01
    Multicentre health studies are important to increase the impact of medical research findings due to the number of subjects that they are able to engage. To simplify the execution of these studies, the data-sharing process should be effortless, for instance, through the use of interoperable databases. However, achieving this interoperability is still an ongoing research topic, namely due to data governance and privacy issues. In the first stage of this work, we propose several methodologies to optimise the harmonisation pipelines of health databases. This work was focused on harmonising heterogeneous data sources into a standard data schema, namely the OMOP CDM, which has been developed and promoted by the OHDSI community. We validated our proposal using data sets of Alzheimer’s disease patients from distinct institutions. In the following stage, aiming to enrich the information stored in OMOP CDM databases, we have investigated solutions to extract clinical concepts from unstructured narratives, using information retrieval and natural language processing techniques. The validation was performed through datasets provided in scientific challenges, namely in the National NLP Clinical Challenges (n2c2). In the final stage, we aimed to simplify the protocol execution of multicentre studies, by proposing novel solutions for profiling, publishing and facilitating the discovery of databases. Some of the developed solutions are currently being used in three European projects aiming to create federated networks of health databases across Europe.
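
    As a hedged illustration of what ETL into an OMOP-CDM-style schema can look like, the sketch below maps a small invented source table onto person- and condition_occurrence-style tables; the source columns and concept identifiers are placeholders, not the project's actual mappings or official OMOP vocabulary codes.

```python
# Hedged sketch of the general idea of ETL into an OMOP-CDM-style schema:
# source column names and concept identifiers below are placeholders, not
# the project's actual mappings or official OMOP vocabulary codes.
import pandas as pd

source = pd.DataFrame({
    "patient_ref": ["P001", "P002"],
    "sex": ["F", "M"],
    "birth_year": [1941, 1956],
    "diagnosis": ["Alzheimer's disease", "Mild cognitive impairment"],
    "diagnosis_date": ["2018-05-02", "2019-11-20"],
})

GENDER_CONCEPTS = {"F": 8532, "M": 8507}                    # illustrative codes
CONDITION_CONCEPTS = {"Alzheimer's disease": 1001,
                      "Mild cognitive impairment": 1002}    # placeholder codes

# Person-level table in the style of the OMOP CDM 'person' table.
person = pd.DataFrame({
    "person_id": range(1, len(source) + 1),
    "gender_concept_id": source["sex"].map(GENDER_CONCEPTS),
    "year_of_birth": source["birth_year"],
})

# Diagnosis records in the style of 'condition_occurrence'.
condition_occurrence = pd.DataFrame({
    "condition_occurrence_id": range(1, len(source) + 1),
    "person_id": person["person_id"],
    "condition_concept_id": source["diagnosis"].map(CONDITION_CONCEPTS),
    "condition_start_date": pd.to_datetime(source["diagnosis_date"]),
})

print(person)
print(condition_occurrence)
```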

    Approaches for the clustering of geographic metadata and the automatic detection of quasi-spatial dataset series

    The discrete representation of resources in geospatial catalogues affects their information retrieval performance. The performance could be improved by using automatically generated clusters of related resources, which we name quasi-spatial dataset series. This work evaluates whether a clustering process can create quasi-spatial dataset series using only textual information from metadata elements. We assess combinations of different kinds of text cleaning approaches, word- and sentence-embedding representations (Word2Vec, GloVe, FastText, ELMo, Sentence BERT, and Universal Sentence Encoder), and clustering techniques (K-Means, DBSCAN, OPTICS, and agglomerative clustering) for the task. The results demonstrate that combining word-embedding representations with agglomerative clustering creates better quasi-spatial dataset series than the other approaches. In addition, we have found that the ELMo representation with agglomerative clustering produces good results without any preprocessing step for text cleaning.
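
    A minimal sketch of the clustering pipeline, assuming scikit-learn is available: TF-IDF is used here as a simple stand-in for the word and sentence embeddings evaluated in the paper, combined with agglomerative clustering to group near-duplicate metadata titles (the titles are invented) into candidate quasi-spatial dataset series.

```python
# Minimal sketch of the clustering idea, assuming scikit-learn is installed.
# TF-IDF stands in here for the word/sentence embeddings (Word2Vec, ELMo,
# Sentence BERT, ...) evaluated in the paper; the titles are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

titles = [
    "Land cover map of region X, 2018",
    "Land cover map of region X, 2019",
    "Land cover map of region X, 2020",
    "Road network of city Y",
    "Road network of city Y, update",
]

# Represent each metadata record (here just its title) as a vector.
X = TfidfVectorizer().fit_transform(titles).toarray()

# Agglomerative clustering groups near-duplicate descriptions into
# candidate quasi-spatial dataset series.
labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)
for title, label in zip(titles, labels):
    print(label, title)
```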

    A semantic and agent-based approach to support information retrieval, interoperability and multi-lateral viewpoints for heterogeneous environmental databases

    PhD thesis. Data stored in individual autonomous databases often needs to be combined and interrelated. For example, in the Inland Water (IW) environment monitoring domain, the spatial and temporal variation of measurements of different water quality indicators stored in different databases is of interest. Data from multiple data sources is more complex to combine when there is a lack of metadata in a computational form and when the syntax and semantics of the stored data models are heterogeneous. The main types of information retrieval (IR) requirements are query transparency and data harmonisation for data interoperability, and support for multiple user views. A combined Semantic Web based and agent based distributed system framework has been developed to support the above IR requirements. It has been implemented using the Jena ontology and JADE agent toolkits. The semantic part supports the interoperability of autonomous data sources by merging their intensional data, using a Global-As-View (GAV) approach, into a global semantic model, represented in DAML+OIL and in OWL. This is used to mediate between different local database views. The agent part provides the semantic services to import, align and parse semantic metadata instances, to support data mediation and to reason about data mappings during alignment. The framework has been applied to support information retrieval, interoperability and multi-lateral viewpoints for four European environmental agency databases. An extended GAV approach has been developed and applied to handle queries that can be reformulated over multiple user views of the stored data. This allows users to retrieve data in a conceptualisation that is better suited to them, rather than having to understand the entire detailed global view conceptualisation. User viewpoints are derived from the global ontology or from existing viewpoints of it. This has the advantage of reducing the number of potential conceptualisations and their associated mappings to something more computationally manageable. Whereas an ad hoc framework based upon a conventional distributed programming language and a rule framework could be used to support user views and adaptation to user views, a more formal framework has the benefit that it can support reasoning about consistency, equivalence, containment and conflict resolution when traversing data models. A preliminary formulation of the formal model has been undertaken; it is based upon extending a Datalog-type algebra with hierarchical, attribute and instance value operators. These operators can be applied to support compositional mapping and consistency checking of data views. The multiple viewpoint system was implemented as a Java-based application consisting of two sub-systems, one for viewpoint adaptation and management, the other for query processing and query result adjustment.
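
    The short Python sketch below illustrates the Global-As-View idea in its simplest form, not the thesis's DAML+OIL/OWL implementation: each global relation is defined as a view over the local sources, so a query posed against the global model is answered by unfolding those view definitions. Source and field names are hypothetical.

```python
# Simplified sketch of the Global-As-View (GAV) idea, not the thesis's
# ontology-based implementation. Source names and fields are hypothetical.

# Two autonomous sources with heterogeneous schemas.
source_a = [{"stn": "R1", "det": "nitrate", "val": 2.3, "when": "2004-06-01"}]
source_b = [{"site_code": "R2", "parameter": "nitrate", "reading": 1.7,
             "sample_date": "2004-06-03"}]

def global_measurement():
    """GAV definition: the global 'measurement' relation is the union of
    per-source views mapped onto a shared vocabulary."""
    for r in source_a:
        yield {"station": r["stn"], "indicator": r["det"],
               "value": r["val"], "date": r["when"]}
    for r in source_b:
        yield {"station": r["site_code"], "indicator": r["parameter"],
               "value": r["reading"], "date": r["sample_date"]}

def query(indicator):
    """A query over the global model is answered by unfolding the views."""
    return [m for m in global_measurement() if m["indicator"] == indicator]

print(query("nitrate"))
```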