4 research outputs found

    Visualization of medical concepts represented using word embeddings: a scoping review.

    No full text
    Background: Analyzing the unstructured textual data contained in electronic health records (EHRs) has always been a challenging task. Word embedding methods have become an essential foundation for neural network-based approaches in natural language processing (NLP), learning dense, low-dimensional word representations from large unlabeled corpora that capture the implicit semantics of words. Models such as Word2Vec, GloVe and FastText have been broadly applied and reviewed in the bioinformatics and healthcare fields, most often to embed clinical notes or activity and diagnostic codes. Visualization of the learned embeddings has been used in a subset of these works, for either exploratory or evaluation purposes. However, visualization practices tend to be heterogeneous and lack overall guidelines. Objective: This scoping review aims to describe the methods and strategies used to visualize medical concepts represented using word embedding methods. We aim to understand the objectives of the visualizations and their limits. Methods: This scoping review summarizes different methods used to visualize word embeddings in healthcare. We followed the methodology proposed by Arksey and O’Malley (Int J Soc Res Methodol 8:19–32, 2005) and by Levac et al. (Implement Sci 5:69, 2010) to better analyze the data and provide a synthesis of the literature on the matter. Results: We first obtained 471 unique articles from a search of the PubMed, MedRxiv and arXiv databases. Thirty of these were reviewed in full, based on our inclusion and exclusion criteria; 23 were excluded at the full-review stage, leaving 7 papers that fully met our inclusion criteria. The included papers pursued a variety of objectives and used distinct methods to evaluate and visualize their embeddings. Visualization also served heterogeneous purposes: it was used to explore the embeddings, to evaluate them, or merely to illustrate properties that were otherwise formally assessed. Conclusions: Visualization helps to explore embedding results (further dimensionality reduction, synthetic representation). However, it neither exhausts the information conveyed by the embeddings nor constitutes a self-sufficient method for evaluating their pertinence.

    Master’s Degree in Health Data Science: Implementation and Assessment After Five Years

    No full text
    Health data science is an emerging discipline that bridges computer science, statistics and health domain knowledge. It consists of exploiting large volumes of often complex data to extract information that improves decision-making. We have created a Master’s degree in Health Data Science to meet the growing need for data scientists in companies and institutions. Over two years, the program offers courses covering computer science, mathematics and statistics, and health and biology. With more than 60 professors and lecturers and a total of 835 hours of classes (not including the mandatory 5 months of internship per year), this curriculum has enrolled a total of 53 students to date. Feedback from students and alumni allowed us to identify new training needs, which may help us adapt the program for the coming academic years. In particular, we will offer an additional module covering data management, from the drafting of the clinical report form to the implementation of a data warehouse with an ETL process. Git and application lifecycle management will be included in programming courses or multidisciplinary projects.

    Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models

    No full text
    Background: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents, based on a distantly supervised method, to significantly reduce the cost of manual annotation and to facilitate the transfer of the pipeline to other clinical centers. Methods: We proposed an automated annotation process for French clinical de-identification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost of creating a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods. Results: A French de-identification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with an F1 score of 96.96%, demonstrating the effectiveness of our automatic approach for de-identifying sensitive information. Conclusions: This study provides an automatic de-identification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective de-identification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.