62 research outputs found

    Relational databases digital preservation

    Get PDF
    Doctoral thesis - Programa Doutoral em Informática. With the expansion and growth of information technologies, much of human knowledge is now recorded on digital media. This shift began in the 20th century, has been occurring continuously, and there appears to be no turning back. The new paradigm brings scenarios where humans need mediators, computer platforms, to understand digital information. These platforms are constantly changing and evolving, and nothing can guarantee continued access to digital artifacts in their absence. A new problem in the digital universe arises: Digital Preservation. Huge volumes of information are stored digitally, spanning a panoply of different classes, formats and types of digital objects. Our work addresses the problem of Digital Preservation and focuses on the logical and conceptual models within a specific class of digital objects: Relational Databases. This family of digital objects is used by organizations to record the data produced daily by their information systems at the operational and other levels. These structures are complex, and the relational database software that supports them may differ from one organization to another; it can be proprietary, free or open source. Previously, a neutral format, the Database Markup Language (DBML), was adopted to pursue the goal of platform independence and to standardize the format used in the digital preservation of relational databases. This format is able to describe both data and structure (the logical model). The key strategies we adopt are migration and normalization with refreshment. From this first approach, we evolved the work to address the preservation of relational databases with a focus on the conceptual model of the database. The conceptual model corresponds to the ideas and concepts at the basis of the designed and/or modeled database, conceived to support a certain information system. We are referring to the semantics of the database, which we consider an important preservation "property". To represent this higher layer of abstraction present in databases we use an ontology-based approach: at this higher abstraction level there is inherent knowledge associated with the database semantics, which we tentatively represent using the Web Ontology Language (OWL). From the initial prototype, we developed a framework (supported by case studies) and established a mapping algorithm for the conversion between the database and OWL. The ontology approach is adopted to formalize the knowledge associated with the conceptual model of the database and also serves as a methodology to create an abstract representation of it. The system is based on the functional axes (ingestion, administration, dissemination and preservation) of the Open Archival Information System (OAIS) reference model and its information packages, where we include the two levels/layers of abstraction within the digital objects that are the subject of our research: Relational Databases. The framework offers a set of web interfaces where it is possible to migrate a database into normalized and neutral formats (DBML + OWL) and perform minor administration tasks on the repository. The system also enables navigation or browsing through the database (concepts) without losing technical details of the database relational model. End consumers thus have at their disposal a broad overview of the preserved object: a) the lower-level data and structure of the relational database logical model, and b) the higher-level semantics and knowledge of the database conceptual model. Considering the unpredictable future access to a preserved database's content and structure, our preservation policy tries to capture the significant properties of databases that should enable the future interpretability and understanding of the digital object.
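    The table-to-class mapping described above can be made concrete with a small sketch. The following is not the thesis's DBML+OWL algorithm; the toy schema, the naming scheme, and the mapping rules shown here (table to owl:Class, column to datatype property, foreign key to object property) are assumptions chosen only for illustration.

    schema = {
        "Author": {"columns": {"name": "string"}, "fks": {}},
        "Book":   {"columns": {"title": "string", "year": "integer"},
                   "fks": {"author_id": "Author"}},   # foreign key -> object property
    }

    def schema_to_owl(schema, base="http://example.org/db#"):
        """Emit a minimal OWL ontology (Turtle) for a toy relational schema."""
        lines = [f"@prefix :     <{base}> .",
                 "@prefix owl:  <http://www.w3.org/2002/07/owl#> .",
                 "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
                 "@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> ."]
        for table, spec in schema.items():
            lines.append(f":{table} a owl:Class .")                         # table -> class
            for col, dtype in spec["columns"].items():
                lines.append(f":{table}_{col} a owl:DatatypeProperty ; "    # column -> datatype property
                             f"rdfs:domain :{table} ; rdfs:range xsd:{dtype} .")
            for fk, target in spec["fks"].items():
                lines.append(f":{table}_{fk} a owl:ObjectProperty ; "       # foreign key -> object property
                             f"rdfs:domain :{table} ; rdfs:range :{target} .")
        return "\n".join(lines)

    print(schema_to_owl(schema))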

    Using natural language processing for question answering in closed and open domains

    Get PDF
    Given the growth in the amount of social, environmental, and biomedical information available digitally, there is a growing need for Question Answering (QA) systems that can empower users to master this new wealth of information. Despite recent progress in QA, the quality of interpretation and extraction of the desired answer is not yet adequate. We believe that striving for higher accuracy in QA systems remains an open research problem; that is, it is better to return no answer than a wrong one. There are, however, diverse queries that state-of-the-art QA systems cannot interpret and answer properly. The problem of interpreting a question in a way that preserves its syntactic-semantic structure is considered one of the most important challenges in this area. In this work we focus on the problems of semantic-based QA systems and analyze the effectiveness of NLP techniques, query mapping, and answer inferencing in both closed (first scenario) and open (second scenario) domains. For this purpose, the architecture of a Semantic-based closed and open domain Question Answering System (hereafter "ScoQAS") over ontology resources is presented with two different prototypes: an ontology-based closed domain and an open domain over Linked Open Data (LOD) resources. ScoQAS is based on NLP techniques combining semantic-based structure-feature patterns for question classification with the creation of a question syntactic-semantic information structure (QSiS). The QSiS builds constraints to formulate the related terms along syntactic-semantic aspects and generates a question graph (QGraph), which facilitates inference for obtaining a precise answer in the closed domain. In addition, our approach provides a convenient method to map the formulated information into a SPARQL query template used to query LOD resources in the open domain. The main contributions of this dissertation are as follows: 1. Developing the ScoQAS architecture, integrating common and specific components compatible with closed and open domain ontologies. 2. Analysing the user's question and building a question syntactic-semantic information structure (QSiS), constituted by several processes of the methodology: question classification, Expected Answer Type (EAT) determination, and constraint generation. 3. Presenting an empirical semantic-based structure-feature pattern for question classification and generalizing heuristic constraints to formulate the relations between the features in the recognized pattern in syntactic and semantic terms. 4. Developing a syntactic-semantic QGraph for representing the core components of the question. 5. Presenting an empirical graph-based answer inference method for the closed domain. In a nutshell, a semantic-based QA system is presented together with experimental results over the closed and open domains. The efficiency of ScoQAS is evaluated using measures such as precision, recall, and F-measure on LOD challenges in the open domain. In the closed domain scenario we focus on quantitative evaluation; due to the lack of predefined benchmarks in this scenario, we define measures that demonstrate the actual complexity of the problem and the actual efficiency of the solutions.
The results of the analysis corroborate the performance and effectiveness of our approach in achieving reasonable accuracy.
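    As a rough illustration of the open-domain step, the sketch below fills a SPARQL template from an expected answer type, a question focus, and a relation constraint. It is not ScoQAS's actual template or constraint set; the example question and the DBpedia-style IRIs are assumptions for illustration only.

    # Hypothetical mapping of question-analysis output to a SPARQL query.
    TEMPLATE = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?answer WHERE {{
      ?entity rdfs:label "{focus}"@en ;
              {relation} ?answer .
      ?answer a {answer_type} .
    }} LIMIT 10
    """

    def build_query(answer_type, focus, relation):
        """Fill the template with the EAT class, the question focus, and the relation."""
        return TEMPLATE.format(answer_type=answer_type, focus=focus, relation=relation)

    # "Who is the mayor of Berlin?" -> focus "Berlin", relation dbo:mayor, EAT dbo:Person
    print(build_query("dbo:Person", "Berlin", "dbo:mayor"))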

    Dynamic exploitation of production data to improve DFM methods in the microelectronics industry

    Get PDF
    Design for Manufacturing (DFM) methods are used during technology alignment and adoption processes in the semiconductor industry (SI) for manufacturability and yield assessments. These methods worked well up to the 250nm technology node, transforming systematic variations into rules and/or models based on single-source data analyses, but beyond this node they have turned into ineffective R&D efforts. The reason is our inability to capture newly emerging spatial variations, which has led to an exponential increase in technology lead times and costs that must be addressed; hence, this thesis focuses on identifying and removing the causes of DFM ineffectiveness. The fabless, foundry and traditional integrated device manufacturer (IDM) business models are first analyzed for coherence against a recent shift in business objectives from time-to-market (T2M) and time-to-volume (T2V) towards ramp-up rate. The increasing technology lead times and costs are identified as a major challenge in achieving quick ramp-up rates; hence, an extended IDM (e-IDM) business model is proposed to support quick ramp-up, based on removing the DFM ineffectiveness and smoothly integrating the improved methods. We found (i) single-source analyses and (ii) the inability to exploit huge manufacturing data volumes to be the core limiting factors (failure modes) behind DFM ineffectiveness during technology alignment and adoption efforts within an IDM. The causes of single-source root cause analysis are identified as (i) varying metrology reference frames and (ii) test structure orientations that require wafer rotation prior to measurement, resulting in varying metrology coordinates (die/site level mismatches).
A generic coordinates mapping and alignment model (MAM) is proposed to remove these die/site level mismatches; however, to accurately capture the emerging spatial variations, we propose a spatial positioning model (SPM) that performs multi-source parametric correlation based on the shortest distance between the respective test structures used to measure the parameters. The (i) unstructured model evolution, (ii) ontology issues and (iii) missing links among production databases are found to be the causes of our inability to exploit huge manufacturing data volumes. The ROMMII (referential ontology meta-model for information integration) framework is then proposed to remove these issues and enable dynamic and efficient multi-source root cause analyses. An interdisciplinary failure mode effect analysis (i-FMEA) methodology is also proposed to find cyclic failure modes and causes across business functions, which require generic solutions rather than operational fixes. The proposed e-IDM, MAM, SPM, and ROMMII framework results in accurate analysis and modeling of emerging spatial variations based on dynamic exploitation of huge manufacturing data volumes.
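    A toy sketch of the two spatial steps described above follows: mapping rotated metrology coordinates back to a common reference frame, then pairing parameters from two sources by the shortest distance between their test structures. The coordinate convention, rotation handling, and data layout are assumptions for illustration; the thesis's MAM and SPM are considerably richer.

    import math

    def rotate_site(x, y, rotation_deg):
        """Rotate a site's (x, y) coordinates about the wafer centre (assumed at the origin)."""
        theta = math.radians(rotation_deg)
        return (x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta))

    def align(measurements, rotation_deg):
        """Map measurements taken on a rotated wafer back to the common reference frame."""
        return [{**m, "xy": rotate_site(*m["xy"], -rotation_deg)} for m in measurements]

    def correlate_by_nearest(source_a, source_b):
        """Pair each parameter in source_a with the spatially closest test structure in source_b."""
        pairs = []
        for a in source_a:
            b = min(source_b, key=lambda s: math.dist(a["xy"], s["xy"]))
            pairs.append((a["param"], b["param"], round(math.dist(a["xy"], b["xy"]), 2)))
        return pairs

    # Source A was measured with the wafer rotated by 90 degrees; source B was not.
    src_a = align([{"param": "CD", "xy": (-9.8, 0.2)}], rotation_deg=90)
    src_b = [{"param": "Rs", "xy": (0.5, 9.6)}]
    print(correlate_by_nearest(src_a, src_b))   # CD and Rs are paired as nearest neighbours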

    Knowledge-centric autonomic systems

    Get PDF
    Autonomic computing revolutionised the commonplace understanding of proactiveness in the digital world by introducing self-managing systems. Built on top of IBM's structural and functional recommendations for implementing intelligent control, autonomic systems are meant to pursue high-level goals, while adequately responding to changes in the environment, with a minimum amount of human intervention. One of the lead challenges in implementing this type of behaviour in practical situations stems from the way autonomic systems manage their inner representation of the world. Specifically, all the components involved in the control loop have shared access to the system's knowledge, which, for seamless cooperation, needs to be kept consistent at all times. A possible solution lies with another popular technology of the 21st century, the Semantic Web, and the knowledge representation media it fosters: ontologies. These formal yet flexible descriptions of the problem domain are equipped with reasoners, inference tools that, among other functions, check knowledge consistency. The immediate application of reasoners in an autonomic context is to ensure that all components share and operate on a logically correct and coherent "view" of the world. At the same time, ontology change management is a difficult task to complete with semantic technologies alone, especially if little to no human supervision is available. This invites the idea of delegating change management to an autonomic manager, as the intelligent control loop it implements is engineered specifically for that purpose. Despite the inherent compatibility between autonomic computing and semantic technologies, their integration is non-trivial and insufficiently investigated in the literature. This gap represents the main motivation for this thesis. Moreover, existing attempts at provisioning autonomic architectures with semantic engines are bespoke solutions for specific problems (load balancing in autonomic networking, deconflicting high-level policies, and informing the process of correlating diverse enterprise data are just a few examples). The main drawback of these efforts is that they provide only limited scope for reuse and cross-domain analysis: design guidelines, architectural models that would scale well across different applications, and modular components that could be integrated in other systems are poorly represented. This work proposes KAS (Knowledge-centric Autonomic System), a hybrid architecture combining semantic tools, namely an ontology to capture domain knowledge, a reasoner to keep domain knowledge consistent and infer new knowledge, a semantic querying engine, and a tool for semantic annotation analysis, with a customised autonomic control loop featuring a novel algorithm for extracting knowledge authored by the domain expert, "software sensors" to monitor user requests and environment changes, a new algorithm for analysing the monitored changes, matching them against known patterns and producing plans for the necessary actions, and "software effectors" to implement the planned changes and modify the ontology accordingly. The purpose of KAS is to act as a blueprint for the implementation of autonomic systems harnessing semantic power to improve self-management. To this end, two KAS instances were built and deployed in two different problem domains, namely self-adaptive document rendering and autonomic decision support for career management.
The former case study is intended as a desktop application, whereas the latter is a large-scale, web-based system built to capture and manage knowledge sourced by an entire (relevant) community. The two problems are representative of their respective application classes, namely desktop tools required to respond in real time and online decision support platforms expected to process large volumes of data undergoing continuous transformation; they were therefore selected to demonstrate the cross-domain applicability (which state-of-the-art approaches tend to lack) of the proposed architecture. Moreover, analysing KAS behaviour in these two applications enabled the distillation of design guidelines and of lessons learnt from practical implementation experience while building on and adapting state-of-the-art tools and methodologies from both fields. KAS is described and analysed from design through to implementation. The design is evaluated using ATAM (Architecture Tradeoff Analysis Method), whereas the performance of the two practical realisations is measured both globally and deconstructed in an attempt to isolate the impact of each autonomic and semantic component. This last type of evaluation employs state-of-the-art metrics for each of the two domains. The experimental findings show that both instances of the proposed hybrid architecture successfully meet the prescribed high-level goals and that the semantic components have a positive influence on the system's autonomic behaviour.
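    A minimal sketch of such a knowledge-centric control loop is shown below. It assumes a hypothetical Ontology wrapper with a trivial consistency check and stubbed sensor/planner callables; KAS's actual sensors, planning algorithm, and reasoner integration are far more elaborate.

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class Ontology:
        facts: set = field(default_factory=set)
        def apply(self, change):                  # effector entry point
            self.facts.add(change)
        def is_consistent(self) -> bool:          # stand-in for a reasoner call
            return not ("isDesktopApp(x)" in self.facts and "isWebApp(x)" in self.facts)

    def control_loop(ontology: Ontology, sense: Callable[[], list], plan: Callable[[str], str]):
        """Monitor -> analyse -> plan -> execute, keeping the shared knowledge consistent."""
        for event in sense():                      # Monitor: software sensors
            change = plan(event)                   # Analyse + Plan: match event to an action
            ontology.apply(change)                 # Execute: software effectors
            if not ontology.is_consistent():       # reasoner guards the shared knowledge
                ontology.facts.discard(change)     # roll back the offending change

    onto = Ontology({"isDesktopApp(x)"})
    control_loop(onto,
                 sense=lambda: ["resize", "deploy_web"],
                 plan=lambda e: "isWebApp(x)" if e == "deploy_web" else "rendersAt(x,90dpi)")
    print(onto.facts)   # the inconsistent assertion was rolled back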

    Developing a Formal Navy Knowledge Management Process

    Get PDF
    Prepared for: Chief of Naval Operations, N1. Organizational tacit and explicit knowledge are required for high performance, and it is imperative for such knowledge to be managed so that it flows rapidly, reliably and energetically. The Navy N1 organization has yet to develop a formal process for knowledge management (KM). This places N1 at a competitive disadvantage, particularly as thousands of people change jobs every day, often taking their hard-earned job knowledge out the door with them and leaving their replacements to learn such knowledge anew. Building upon initial efforts to engage with industry and conceptualize a Navy KM strategy, the research described in this study employs a combination of Congruence Model analysis, Knowledge Flow Theory, and qualitative methods to outline an approach for embedding a formal Navy KM process. This work involves surveying the best tools and practices in the industry, government and nonprofit sectors, augmented by in-depth field research examining two specific Navy organizations in detail. The results are highly promising, and they illuminate a path toward improving Navy knowledge flows as well as toward continued research along these lines. Approved for public release; distribution is unlimited.

    24th International Conference on Information Modelling and Knowledge Bases

    Get PDF
    In the last three decades, information modelling and knowledge bases have become essential subjects, not only in academic communities related to information systems and computer science but also in business areas where information technology is applied. The series of European-Japanese Conferences on Information Modelling and Knowledge Bases (EJC) originally started as a cooperation initiative between Japan and Finland in 1982. The practical operations were then organised by Professor Ohsuga in Japan and Professors Hannu Kangassalo and Hannu Jaakkola in Finland (Nordic countries); the geographical scope has since expanded to cover Europe and other countries. The workshop character - discussion, ample time for presentations, and a limited number of participants (50) and papers (30) - is typical of the conference. Suggested topics include, but are not limited to: 1. Conceptual modelling: Modelling and specification languages; Domain-specific conceptual modelling; Concepts, concept theories and ontologies; Conceptual modelling of large and heterogeneous systems; Conceptual modelling of spatial, temporal and biological data; Methods for developing, validating and communicating conceptual models. 2. Knowledge and information modelling and discovery: Knowledge discovery, knowledge representation and knowledge management; Advanced data mining and analysis methods; Conceptions of knowledge and information; Modelling information requirements; Intelligent information systems; Information recognition and information modelling. 3. Linguistic modelling: Models of HCI; Information delivery to users; Intelligent informal querying; Linguistic foundations of information and knowledge; Fuzzy linguistic models; Philosophical and linguistic foundations of conceptual models. 4. Cross-cultural communication and social computing: Cross-cultural support systems; Integration, evolution and migration of systems; Collaborative societies; Multicultural web-based software systems; Intercultural collaboration and support systems; Social computing, behavioral modeling and prediction. 5. Environmental modelling and engineering: Environmental information systems (architecture); Spatial, temporal and observational information systems; Large-scale environmental systems; Collaborative knowledge base systems; Agent concepts and conceptualisation; Hazard prediction, prevention and steering systems. 6. Multimedia data modelling and systems: Modelling multimedia information and knowledge; Content-based multimedia data management; Content-based multimedia retrieval; Privacy and context enhancing technologies; Semantics and pragmatics of multimedia data; Metadata for multimedia information systems. Overall we received 56 submissions. After careful evaluation, 16 papers were selected as long papers, 17 as short papers, 5 as position papers, and 3 for presentation of perspective challenges. We thank all colleagues for their support of this issue of the EJC conference, especially the programme committee, the organising committee, and the programme coordination team. The long and short papers presented at the conference are revised after the conference and published in the series "Frontiers in Artificial Intelligence and Applications" by IOS Press (Amsterdam). The books "Information Modelling and Knowledge Bases" are edited by the Editing Committee of the conference. We believe that the conference will be productive and fruitful in advancing the research and application of information modelling and knowledge bases. Bernhard Thalheim, Hannu Jaakkola, Yasushi Kiyoki

    Traditional and hybrid leadership styles in Rwanda: examining the common leadership styles, influencing factors, and culture in post-genocide Rwanda.

    Get PDF
    For most of Rwanda's post-independence past, the country has been marked by ethnic feuding, mass population movements and long exiles in neighbouring countries, and civil wars that culminated in the genocide of 1994. As this research shows in its review of the literature on Rwanda's post-independence period, the civil wars among those with ethnically differentiated access to power and wealth have had social, cultural and economic effects. How has foreign culture, acquired by "Rwandaliens", affected indigenous Rwandan culture, and how does it in turn influence present leadership styles? This thesis assesses the most common leadership styles in companies and organisations in Rwanda, in order to build a theory of the predominant leadership styles and culture in Rwanda in the context of the post-genocide era.

    Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

    Get PDF
    Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized with respect to the subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, the main analytical goals in HDD settings are outlined. For each of these goals, basic explanations of some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
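    As a small worked example for one of the subtopics above (multiple testing), the sketch below implements the standard Benjamini-Hochberg step-up procedure for controlling the false discovery rate. This is generic textbook material, not code from the TG9 paper; the p-values are invented for illustration.

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        """Return a boolean mask of hypotheses rejected at FDR level alpha."""
        p = np.asarray(pvals)
        m = p.size
        order = np.argsort(p)
        thresholds = alpha * (np.arange(1, m + 1) / m)   # BH step-up thresholds i*alpha/m
        below = p[order] <= thresholds
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()               # largest rank i with p_(i) <= i*alpha/m
            reject[order[:k + 1]] = True                 # reject all hypotheses up to rank k
        return reject

    # e.g. p-values from testing 8 genomic features against an outcome
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(pvals))   # only the smallest p-values are declared discoveries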