980 research outputs found

    Data Mining Techniques for Complex User-Generated Data

    Get PDF
    Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. Vast amount of User-Generated Data (UGD) is being created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise for the UGD analysis process such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified frame- work. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., health care, social network, and ur- ban environment domains. Experimental results show the effectiveness of the proposed approaches to discover useful knowledge from different domains

    Machine Learning Framework for Real-World Electronic Health Records Regarding Missingness, Interpretability, and Fairness

    Get PDF
    Machine learning (ML) and deep learning (DL) techniques have shown promising results in healthcare applications using Electronic Health Records (EHRs) data. However, their adoption in real-world healthcare settings is hindered by three major challenges. Firstly, real-world EHR data typically contains numerous missing values. Secondly, traditional ML/DL models are typically considered black-boxes, whereas interpretability is required for real-world healthcare applications. Finally, differences in data distributions may lead to unfairness and performance disparities, particularly in subpopulations. This dissertation proposes methods to address missing data, interpretability, and fairness issues. The first work proposes an ensemble prediction framework for EHR data with large missing rates using multiple subsets with lower missing rates. The second method introduces the integration of medical knowledge graphs and double attention mechanism with the long short-term memory (LSTM) model to enhance interpretability by providing knowledge-based model interpretation. The third method develops an LSTM variant that integrates medical knowledge graphs and additional time-aware gates to handle multi-variable temporal missing issues and interpretability concerns. Finally, a transformer-based model is proposed to learn unbiased and fair representations of diverse subpopulations using domain classifiers and three attention mechanisms

    Structured and unstructured data integration with electronic medical records

    Get PDF
    In recent years there has been a great population and technological evolution all over the world. At the same time, more areas beyond technology and information technology have also developed, namely medicine, which has led to an increase in average life expectancy which in turn, leads to a greater need for healthcare. In order to provide the best possible treatments and healthcare services, nowadays the hospitals store large amounts of data regarding patients and diseases (in the form of electronic medical records) or the logistics of some departments in their storage systems. Therefore, computer science techniques such as data mining and natural language processing have been used to extract knowledge and value from these information-rich sources in order not only to develop, for example, new models for disease prediction, as well as improving existing processes in healthcare centres and hospitals. This data storage can be done in one of three ways: structured, unstructured or semi-structured. In this paper, the author tested the integration of structured and unstructured data from two different departments of the same Portuguese hospital, in order to extract knowledge and improve hospital processes. Aiming to reduce the value loss of loading data that is not used in the healthcare providers systems.Nos últimos anos tem-se assistido a uma grande evolução populacional e tecnológica por todo o mundo. Paralelamente, mais áreas para além da tecnologia e informática têm-se também desenvolvido, nomeadamente a área da medicina, o que tem permitido um aumento na esperança média de vida que por sua vez leva a uma maior necessidade de cuidados de saúde. Com o intuito de fornecer os melhores serviços de saúde possíveis, nos dias que hoje os hospitais guardam nos seus sistemas informáticos grandes quantidades de dados relativamente aos pacientes e doenças (sobre a forma de registos médicos eletrónicos) ou relativos à logística de alguns departamentos dos hospitais, etc. Por conseguinte, a estes dados têm vindo a ser utilizadas técnicas da área das ciências da computação como o data mining e o processamento da língua natural para extrair conhecimento e valor dessas fontes ricas em informação com o intuito não só de desenvolver, por exemplo, novos modelos de predição de doenças, como também de melhorar processos já existentes em centros de saúde e hospitais. Este armazenamento de dados pode ser feito em uma de três formas: de forma estruturada, não estruturada ou semi-estruturada. Neste trabalho o autor testou a integração de dados estruturados e não estruturados de dois departamentos diferentes do mesmo hospital português, com o intuito de extrair conhecimento e melhorar os processos do hospital. Com o intuito de reduzir a perda do armazenamento de dados que não são utilizados

    Iterative annotation to ease neural network training: Specialized machine learning in medical image analysis

    Get PDF
    Neural networks promise to bring robust, quantitative analysis to medical fields, but adoption is limited by the technicalities of training these networks. To address this translation gap between medical researchers and neural networks in the field of pathology, we have created an intuitive interface which utilizes the commonly used whole slide image (WSI) viewer, Aperio ImageScope (Leica Biosystems Imaging, Inc.), for the annotation and display of neural network predictions on WSIs. Leveraging this, we propose the use of a human-in-the-loop strategy to reduce the burden of WSI annotation. We track network performance improvements as a function of iteration and quantify the use of this pipeline for the segmentation of renal histologic findings on WSIs. More specifically, we present network performance when applied to segmentation of renal micro compartments, and demonstrate multi-class segmentation in human and mouse renal tissue slides. Finally, to show the adaptability of this technique to other medical imaging fields, we demonstrate its ability to iteratively segment human prostate glands from radiology imaging data.Comment: 15 pages, 7 figures, 2 supplemental figures (on the last page

    Deep learning for precision medicine

    Get PDF
    As a result of the recent trend towards digitization, an increasing amount of information is recorded in clinics and hospitals, and this increasingly overwhelms the human decision maker. This issue is one of the main reasons why Machine Learning (ML) is gaining attention in the medical domain, since ML algorithms can make use of all the available information to predict the most likely future events that will occur to each individual patient. Physicians can include these predictions in their decision processes which can lead to improved outcomes. Eventually ML can also be the basis for a decision support system that provides personalized recommendations for each individual patient. It is also worth noticing that medical datasets are becoming both longer (i.e. we have more samples collected through time) and wider (i.e. we store more variables). There- fore we need to use ML algorithms capable of modelling complex relationships among a big number of time-evolving variables. A kind of models that can capture very complex relationships are Deep Neural Networks, which have proven to be successful in other areas of ML, like for example Language Modelling, which is a use case that has some some similarities with the medical use case. However, the medical domain has a set of characteristics that make it an almost unique scenario: multiple events can occur at the same time, there are multiple sequences (i.e. multiple patients), each sequence has an associated set of static variables, both inputs and outputs can be a combination of different data types, etc. For these reasons we need to develop approaches specifically designed for the medical use case. In this work we design and develop different kind of models based on Neural Networks that are suitable for modelling medical datasets. Besides, we tackle different medical tasks and datasets, showing which models work best in each case. The first dataset we use is one collected from patients that suffered from kidney failure. The data was collected in the Charité hospital in Berlin and it is the largest data collection of its kind in Europe. Once the kidney has failed, patients face a lifelong treatment and periodic visits to the clinic for the rest of their lives. Until the hospital finds a new kidney for the patient, he or she must attend to the clinic multiple times per week in order to receive dialysis, which is a treatment that replaces many of the functions of the kidney. After the transplant has been performed, the patient receives immunosuppressive therapy to avoid the rejection of the transplanted kidney. Patients must be periodically controlled to check the status of the kidney, adjust the treatment and take care of associated diseases, such as those that arise due to the immunosuppressive therapy. This dataset started being recorded more than 30 years ago and it is composed of more than 4000 patients that underwent a renal transplantation or are waiting for it. The database has been the basis for many studies in the past. Our first goal with the nephrology dataset is to develop a system to predict the next events that will be recorded in the electronic medical record of each patient, and thus to develop the basis for a future clinical decision support system. Specifically, we model three aspects of the patient evolution: medication prescriptions, laboratory tests ordered and laboratory test results. Besides, there are a set of endpoints that can happen after a transplantation and it would be very valuable for the physicians to be able to know beforehand when one of these is going to happen. Specifically, we also predict whether the patient will die, the transplant will be rejected, or the transplant will be lost. For each visit that a patient makes to the clinic, we anticipate which of those three events (if any) will occur both within 6 months and 12 months after the visit. The second dataset that we use in this thesis is the one collected by the MEmind Wellness Tracker, which contains information related to psychiatric patients. Suicide is the second leading cause of death in the 15-29 years age group, and its prevention is one of the top public health priorities. Traditionally, psychiatric patients have been assessed by self-reports, but these su↵er from recall bias. To improve data quantity and quality, the MEmind Wellness Tracker provides a mobile application that enables patients to send daily reports about their status. Thus, this application enables physicians to get information about patients in their natural environments. Therefore this dataset contains sequential information generated by the MEmind application, sequential information generated during medical visits and static information of each patient. Our goal with this dataset is to predict the suicidal ideation value that each patient will report next. In order to model both datasets, we have developed a set of predictive Machine Learning models based on Neural Networks capable of integrating multiple sequences of data withthe background information of each patient. We compare the performance achieved by these approaches with the ones obtained with classical ML algorithms. For the task of predicting the next events that will be observed in the nephrology dataset, we obtained the best performance with a Feedforward Neural Network containing a representation layer. On the other hand, for the tasks of endpoint prediction in nephrology patients and the task of suicidal ideation prediction, we obtained the best performance with a model that combines a Feedforward Neural Network with one or multiple Recurrent Neural Networks (RNNs) using Gated Recurrent Units. We hypothesize that this kind of models that include RNNs provide the best performance when the dataset contains long-term dependencies. To our knowledge, our work is the first one that develops these kind of deep networks that combine both static and several sources of dynamic information. These models can be useful in many other medical datasets and even in datasets within other domains. We show some examples where our approach is successfully applied to non-medical datasets that also present multiple variables evolving in time. Besides, we installed the endpoints prediction model as a standalone system in the Charit ́e hospital in Berlin. For this purpose, we developed a web based user interface that the physicians can use, and an API interface that can be used to connect our predictive system with other IT systems in the hospital. These systems can be seen as a recommender system, however they do not necessarily generate valid prescriptions. For example, for certain patient, a system can predict very high probabilities for all antibiotics in the dataset. Obviously, this patient should not take all antibiotics, but only one of them. Therefore, we need a human decision maker on top of our recommender system. In order to model this decision process, we used an architecture based on a Generative Adversarial Network (GAN). GANs are systems based on Neural Networks that make better generative models than regular Neural Networks. Thus we trained one GAN that works on top of a regular Neural Network and show how the quality of the prescriptions gets improved. We run this experiment with a synthetic dataset that we created for this purpose. The architectures that we developed, are specially designed for modelling medical data, but they can be also useful in other use cases. We run experiments showing how we train them for modelling the readings of a sensor network and also to train a movie recommendation engine
    corecore