74 research outputs found

    Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis

    Blockchain has become one of the core technologies in Industry 4.0. To help decision-makers establish action plans based on blockchain, it is an urgent task to analyze trends in blockchain technology. However, most existing studies on blockchain trend analysis are based on effort-demanding full-text investigation or traditional bibliometric methods whose study scope is limited to a frequency-based statistical analysis. Therefore, in this paper, we propose a new topic modeling method called Word2vec-based Latent Semantic Analysis (W2V-LSA), which is based on Word2vec and Spherical k-means clustering to better capture and represent the context of a corpus. We then used W2V-LSA to perform an annual trend analysis of blockchain research by country and time for 231 abstracts of blockchain-related papers published over the past five years. The performance of the proposed algorithm was compared to Probabilistic LSA, one of the common topic modeling techniques. The experimental results confirmed the usefulness of W2V-LSA in terms of the accuracy and diversity of topics by quantitative and qualitative evaluation. The proposed method can be a competitive alternative for better topic modeling to provide direction for future research in technology trend analysis and it is applicable to various expert systems related to text mining. (C) 2020 The Authors. Published by Elsevier Ltd
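
The abstract above names Spherical k-means as the clustering step applied to Word2vec vectors. As a rough illustration only, and not the authors' implementation, the sketch below clusters unit-normalized vectors by cosine similarity. The function name `spherical_kmeans`, its parameters, and the toy vectors are assumptions made for this example; real input would be word embeddings trained on the corpus.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X by cosine similarity (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Project every vector onto the unit sphere (rows assumed non-zero).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Start from k distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to the centroid with the highest cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # The new centroid is the re-normalized sum of its members.
                centroids[j] = total / norm
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

# Toy stand-in for word vectors: two directions in 2-D.
toy_vectors = np.array([[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.1, 3.0]])
labels, centroids = spherical_kmeans(toy_vectors, k=2)
```

Because all vectors are normalized first, vector length plays no role and only direction matters, which is the usual reason to prefer this variant for embedding spaces.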

    Bilingual Autoencoder-based Efficient Harmonization of Multi-source Private Data for Accurate Predictive Modeling

    Sharing electronic health record data is essential for advanced analysis, but may put sensitive information at risk. Several studies have attempted to address this risk using contextual embedding, but with many hospitals involved, they are often inefficient and inflexible. Thus, we propose a bilingual autoencoder-based model to harmonize local embeddings in different spaces. Cross-hospital reconstruction of embeddings makes encoders map embeddings from hospitals to a shared space and align them spontaneously. We also suggest two-phase training to prevent distortion of embeddings during harmonization with hospitals that have biased information. In experiments, we used medical event sequences from the Medical Information Mart for Intensive Care-III dataset and simulated the situation of multiple hospitals. For evaluation, we measured the alignment of events from different hospitals and the prediction accuracy of a patient's diagnosis in the next admission in three scenarios in which local embeddings do not work. The proposed method efficiently harmonizes embeddings in different spaces, increases prediction accuracy, and gives flexibility to include new hospitals, so is superior to previous methods in most cases. It will be useful in predictive tasks to utilize distributed data while preserving private information

    Privacy-Preserving Predictive Modeling: Harmonization of Contextual Embeddings From Different Sources

    Background: Data sharing has been a big challenge in biomedical informatics because of privacy concerns. Contextual embedding models have demonstrated a very strong representative capability to describe medical concepts (and their context), and they have shown promise as an alternative way to support deep-learning applications without the need to disclose original data. However, contextual embedding models acquired from individual hospitals cannot be directly combined because their embedding spaces are different, and naive pooling renders combined embeddings useless. Objective: The aim of this study was to present a novel approach to address these issues and to promote sharing representation without sharing data. Without sacrificing privacy, we also aimed to build a global model from representations learned from local private data and synchronize information from multiple sources. Methods: We propose a methodology that harmonizes different local contextual embeddings into a global model. We used Word2Vec to generate contextual embeddings from each source and Procrustes to fuse different vector models into one common space by using a list of corresponding pairs as anchor points. We performed prediction analysis with harmonized embeddings. Results: We used sequential medical events extracted from the Medical Information Mart for Intensive Care III database to evaluate the proposed methodology in predicting the next likely diagnosis of a new patient using either structured data or unstructured data. Under different experimental scenarios, we confirmed that the global model built from harmonized local models achieves a more accurate prediction than local models and global models built from naive pooling. Conclusions: Such aggregation of local models using our unique harmonization can serve as the proxy for a global model, combining information from a wide range of institutions and information sources. It allows information unique to a certain hospital to become available to other sites, increasing the fluidity of information flow in health care
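
The Methods part of the abstract above describes using Procrustes with anchor pairs to bring separately trained embedding spaces into one common space. The sketch below shows the orthogonal variant of that idea on synthetic data, assuming the two spaces differ only by a rotation. The function name `procrustes_rotation` and the toy matrices are assumptions for this example, and the abstract does not say which Procrustes variant the authors used.

```python
import numpy as np

def procrustes_rotation(source_anchors, target_anchors):
    """Orthogonal R minimizing ||source_anchors @ R - target_anchors||_F.

    Row i of both matrices must describe the same anchor concept.
    """
    u, _, vt = np.linalg.svd(source_anchors.T @ target_anchors)
    return u @ vt

# Synthetic stand-in: the "source" space is a rotated copy of the "target".
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 8))                 # 50 anchors, 8 dimensions
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))      # random orthogonal matrix
source = target @ q.T

R = procrustes_rotation(source, target)
aligned = source @ R                              # source mapped into target space
```

In practice the anchors would be the embeddings of concepts present at both sites, and the learned `R` would then be applied to every vector from the source site, not only the anchors.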

    An efficient multivariate feature ranking method for gene selection in high-dimensional microarray data

    Classification of microarray data plays a significant role in the diagnosis and prediction of cancer. However, its high dimensionality (>tens of thousands) compared to the number of observations (<tens of hundreds) may lead to poor classification accuracy. In addition, only a fraction of genes is really important for the classification of a certain cancer, and thus feature selection is essential in this field. Due to the time and memory burden of processing high-dimensional data, univariate feature ranking methods are widely used in gene selection. However, most of them are not very accurate because they only consider the relevance of features to the target without considering the redundancy among features. In this study, we propose a novel multivariate feature ranking method to improve the quality of gene selection and ultimately to improve the accuracy of microarray data classification. The method can be efficiently applied to high-dimensional microarray data. We embedded the formal definition of relevance into a Markov blanket (MB) to create a new feature ranking method. Using a few microarray datasets, we demonstrated that MB-based feature ranking is practical, with high accuracy and good efficiency. The method outperformed commonly used univariate ranking methods and also yielded better results than the other multivariate feature ranking method, owing to its data efficiency

    Four Human Cases of Diphyllobothrium latum Infection

    Diphyllobothrium latum infections in 4 young Korean men detected from 2008 to 2012 are presented. Three were diagnosed based on spontaneously discharged strobila of the adult worm in their feces, and 1 case was diagnosed by finding the worm at colonoscopy examination in a local clinic. The morphologic characteristics of the gravid proglottid and eggs were consistent with D. latum. All patients were treated with praziquantel 15 mg/kg, and follow-up stool examinations were done at 2 months after the medication. The main clinical complaints were intermittent gastrointestinal troubles such as indigestion, abdominal distension, and spontaneous discharge of tapeworm's segments in their feces. The most probable source of infection was the flesh of salmon or trout according to a patient's past history. These are the 45th to 48th recorded cases diagnosed by the adult worm in the Republic of Korea since 1971

    Prediction of type 2 diabetes using genome-wide polygenic risk score and metabolic profiles: A machine learning analysis of population-based 10-year prospective cohort study

    Background: Previous work on predicting type 2 diabetes by integrating clinical and genetic factors has mostly focused on the Western population. In this study, we use genome-wide polygenic risk score (gPRS) and serum metabolite data for type 2 diabetes risk prediction in the Asian population. Methods: Data of 1425 participants from the Korean Genome and Epidemiology Study (KoGES) Ansan-Ansung cohort were used in this study. For gPRS analysis, genotypic and clinical information from KoGES health examinee (n = 58,701) and KoGES cardiovascular disease association (n = 8105) sub-cohorts were included. Linkage disequilibrium analysis identified 239,062 genetic variants that were used to determine the gPRS, while the metabolites were selected using the Boruta algorithm. We used bootstrapped cross-validation to evaluate logistic regression and random forest (RF)-based machine learning models. Finally, associations of gPRS and selected metabolites with the values of homeostatic model assessment of beta-cell function (HOMA-B) and insulin resistance (HOMA-IR) were further estimated. Findings: During the follow-up period (8.3 ± 2.8 years), 331 participants (23.2%) were diagnosed with type 2 diabetes. The areas under the curves of the RF-based models were 0.844, 0.876, and 0.883 for the model using only demographic and clinical factors, model including the gPRS, and model with both gPRS and metabolites, respectively. Incorporation of additional parameters in the latter two models improved the classification by 11.7% and 4.2% respectively. While gPRS was significantly associated with HOMA-B value, most metabolites had a significant association with HOMA-IR value. Interpretation: Incorporating both gPRS and metabolite data led to enhanced type 2 diabetes risk prediction by capturing distinct etiologies of type 2 diabetes development. An RF-based model using clinical factors, gPRS, and metabolites predicted type 2 diabetes risk more accurately than the logistic regression-based model

    Nanozyme Based on Porphyrinic Metal-Organic Framework for Electrocatalytic CO2 Reduction

    Mimicry of natural enzyme systems is an important approach for catalyst design. To create an enzyme-inspired catalyst, it is essential to mimic both the active center and the second coordination sphere. Metal-organic frameworks (MOFs), an emerging class of porous materials, are ideal candidates for heterogeneous catalysts because their versatile building blocks confer a high level of structural tunability, and the chemical environment surrounding the active center can be controlled at the molecular level. Herein, a new 2D porphyrinic MOF, PPF-100, constructed from a nonplanar saddle-distorted porphyrin linker and a Cu paddle-wheel metal node is reported. The strategic introduction of ethyl substituents not only mimics the active center and second coordination sphere but also increases the catalytic selectivity while completely inhibiting H2 generation in the CO2 reduction reaction

    Patient-specific molecular response dynamics can predict the possibility of relapse during the second treatment-free remission attempt in chronic myelogenous leukemia

    In chronic myelogenous leukemia (CML), treatment-free remission (TFR) is defined as maintaining a major molecular response (MMR) without a tyrosine kinase inhibitor (TKI), such as imatinib (IM). Several studies have investigated the safety of the first TFR (TFR1) attempt and suggested recommendation guidelines for such an attempt. However, the plausibility and predictive factors for a second TFR (TFR2) have yet to be reported. The present study included 21 patients with chronic myeloid leukemia who participated in twice-repeated treatment stop attempts. We develop a mathematical model to analyze and explain the outcomes of TFR2. Our mathematical model framework can explain patient-specific molecular response dynamics. Fitting the model to longitudinal BCR-ABL1 transcripts from the patients generated patient-specific parameters. Binary tree decision analyses of the model parameters suggested a model-based predictive binary classification factor that separated patients into low- and high-risk groups of TFR2 attempts with an overall accuracy of 76.2% (sensitivity of 81.1% and specificity of 69.9%). The low-risk group maintained a median TFR2 of 28.2 months, while the high-risk group relapsed at a median time of 3.25 months. Further, our model predicted a patient-specific optimal IM treatment duration before the second IM stop that could achieve the desired TFR2 (e.g., 5 years)