
    Approximating Word Ranking and Negative Sampling for Word Embedding

    CBOW (Continuous Bag-Of-Words) is one of the most commonly used techniques for generating word embeddings in NLP tasks. However, it falls short of optimal performance because it weights all positive words uniformly and draws negative words from a simplistic sampling distribution. To resolve these issues, we propose OptRank, which optimizes word ranking and approximates negative sampling to improve word embeddings. Specifically, we first formalize word embedding as a ranking problem. We then weight the positive words by their ranks, so that highly ranked words carry more importance, and adopt a dynamic sampling strategy to select informative negative words. In addition, we design an approximation method to compute word ranks efficiently. Experiments show that OptRank consistently outperforms its counterparts on a benchmark dataset across different sampling scales, especially when the sampled subset is small. The code and datasets are available at https://github.com/ouououououou/OptRank
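    As a rough illustration of the two ideas in the abstract, the sketch below approximates a positive word's rank by sampling (WARP-style) and picks the currently highest-scoring candidate from a uniform pool as a hard negative. The names, weighting function, and pool size are illustrative assumptions, not the authors' implementation (which is in the linked repository).

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 1000, 50                              # vocabulary size, embedding dimension
W_out = rng.normal(scale=0.1, size=(V, dim))   # output (target-word) vectors

def approx_rank(ctx_vec, pos_id, max_trials=100):
    """WARP-style rank approximation: sample random words until one scores
    above the positive word; the number of trials needed gives a rank
    estimate, rank ~ V / trials."""
    pos_score = W_out[pos_id] @ ctx_vec
    for trials in range(1, max_trials + 1):
        cand = rng.integers(V)
        if W_out[cand] @ ctx_vec > pos_score:
            return max(1, V // trials)
    return 1                                   # positive is already near the top

def rank_weight(rank):
    # One plausible weighting (an assumption): highly ranked positives
    # (small rank) get more importance, so the weight decays with rank.
    return 1.0 / np.log2(1.0 + rank)

def dynamic_negative(ctx_vec, pool_size=10):
    """Dynamic negative sampling: draw a uniform candidate pool, then keep
    the most informative negative, i.e. the candidate the current model
    scores highest against this context."""
    cands = rng.integers(V, size=pool_size)
    return int(cands[np.argmax(W_out[cands] @ ctx_vec)])

# Toy usage with a random context vector and positive word id 42.
ctx = rng.normal(scale=0.1, size=dim)
r = approx_rank(ctx, pos_id=42)
print("approx rank:", r, "weight:", rank_weight(r), "hard negative:", dynamic_negative(ctx))
```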

    Learning Graph Embeddings from WordNet-based Similarity Measures

    We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarity. The model learns dense representations for nodes that approximate a user-defined graph distance measure, such as the shortest path distance or measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than direct computation of graph-based distances.
    Comment: Accepted to StarSem 2019
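    The core objective can be illustrated with a small sketch: learn node vectors whose dot products match given pairwise similarity values by minimizing squared error with SGD. The function name, learning rate, and toy similarities below are assumptions for illustration, not the paper's exact training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_embeddings(pairs, sims, n_nodes, dim=32, lr=0.05, epochs=200):
    """Learn node vectors whose dot products approximate the given pairwise
    similarity values, by minimizing squared error with plain SGD."""
    E = rng.normal(scale=0.1, size=(n_nodes, dim))
    for _ in range(epochs):
        for (i, j), s in zip(pairs, sims):
            err = E[i] @ E[j] - s              # prediction error for this pair
            # simultaneous gradient step on both endpoints
            E[i], E[j] = E[i] - lr * err * E[j], E[j] - lr * err * E[i]
    return E

# Toy usage: three nodes and hypothetical target similarities.
pairs = [(0, 1), (1, 2), (0, 2)]
sims = [0.9, 0.8, 0.3]
E = fit_embeddings(pairs, sims, n_nodes=3)
print(E[0] @ E[1], E[1] @ E[2], E[0] @ E[2])   # should approach 0.9, 0.8, 0.3
```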

    Relaxed Softmax for learning from Positive and Unlabeled data

    In recent years, the softmax model and its fast approximations have become the de facto loss functions for deep neural networks in multi-class prediction. This loss has been extended to language modeling and recommendation, two fields that fall into the framework of learning from Positive and Unlabeled (PU) data. In this paper, we highlight the drawbacks of the current family of softmax losses and sampling schemes when applied in a Positive and Unlabeled learning setup. We propose both a Relaxed Softmax (RS) loss and a new negative sampling scheme based on a Boltzmann formulation. We show that the new training objective is better suited to density estimation, item similarity, and next-event prediction, driving performance uplifts on textual and recommendation datasets over the classical softmax.
    Comment: 9 pages, 5 figures, 2 tables; published at RecSys 2019
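    A minimal sketch of what a Boltzmann-formulation negative sampler might look like: negatives are drawn with probability proportional to exp(score / T), so a lower temperature concentrates sampling on hard, high-scoring negatives while a higher one approaches uniform sampling. The function name, temperature, and toy scores are assumptions, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_negatives(scores, pos_ids, n_samples=3, temperature=1.0):
    """Sample negatives with probability proportional to exp(score / T),
    excluding observed positives."""
    logits = np.asarray(scores, dtype=float) / temperature
    logits[list(pos_ids)] = -np.inf            # never sample observed positives
    p = np.exp(logits - logits[np.isfinite(logits)].max())  # stable softmax
    p /= p.sum()
    return rng.choice(len(logits), size=n_samples, replace=False, p=p)

# Hypothetical model scores over 8 items; item 2 is an observed positive.
scores = [0.1, 2.0, 3.5, -1.0, 0.5, 1.8, -0.2, 0.0]
print(boltzmann_negatives(scores, pos_ids={2}, temperature=0.5))
```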

    Multimodal Machine Learning for Automated ICD Coding

    This study presents a multimodal machine learning model for predicting ICD-10 diagnostic codes. We developed separate machine learning models that handle data from different modalities, including unstructured text, semi-structured text, and structured tabular data. We then employed an ensemble method to integrate all modality-specific models and generate ICD-10 codes. Key evidence was also extracted to make our predictions more convincing and explainable. We used the Medical Information Mart for Intensive Care III (MIMIC-III) dataset to validate our approach. For ICD code prediction, our best-performing model (micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms baseline models, including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and a Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability, our approach achieves a Jaccard Similarity Coefficient (JSC) of 0.1806 on text data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780 and 0.5002, respectively.
    Comment: Machine Learning for Healthcare 2019
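    The ensembling step can be illustrated with a short sketch that averages per-modality code probabilities and thresholds them for multi-label prediction. The combination rule, weights, and threshold are illustrative assumptions; the paper's actual ensemble may differ.

```python
import numpy as np

def ensemble_icd(modality_probs, weights=None, threshold=0.5):
    """Weighted average of per-modality probabilities for each ICD-10 code,
    thresholded for multi-label prediction."""
    P = np.stack(list(modality_probs.values()))   # shape (n_modalities, n_codes)
    if weights is None:
        weights = np.full(len(P), 1.0 / len(P))   # default: simple average
    combined = weights @ P                        # combined probability per code
    return combined, combined >= threshold

# Hypothetical probabilities over four ICD codes from three modality models.
probs = {
    "unstructured_text": np.array([0.9, 0.2, 0.6, 0.1]),
    "semi_structured":   np.array([0.7, 0.4, 0.5, 0.2]),
    "tabular":           np.array([0.8, 0.1, 0.3, 0.6]),
}
scores, predicted = ensemble_icd(probs)
print(scores, predicted)   # e.g. code 0 predicted, code 1 not
```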