2,648 research outputs found

    Using J-K-fold Cross Validation to Reduce Variance When Tuning NLP Models

    K-fold cross validation (CV) is a popular method for estimating the true performance of machine learning models, allowing model selection and parameter tuning. However, the very process of CV requires random partitioning of the data, so our performance estimates are in fact stochastic, with variability that can be substantial for natural language processing tasks. We demonstrate that these unstable estimates cannot be relied upon for effective parameter tuning. The resulting tuned parameters are highly sensitive to how our data is partitioned, meaning that we often select sub-optimal parameter choices and face serious reproducibility issues. Instead, we propose to use the less variable J-K-fold CV, in which J independent K-fold cross validations are used to assess performance. Our main contributions are extending J-K-fold CV from performance estimation to parameter tuning and investigating how to choose J and K. We argue that variability is more important than bias for effective tuning and so advocate lower choices of K than are typically seen in the NLP literature, instead using the saved computation to increase J. To demonstrate the generality of our recommendations we investigate a wide range of case studies: sentiment classification (both general and target-specific), part-of-speech tagging, and document classification.
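The J-K-fold procedure described above, J independent K-fold cross validations whose fold scores are averaged, can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation; the `evaluate` callback and all names here are hypothetical placeholders for whatever model fitting and scoring the caller supplies.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def jk_fold_cv(n, j, k, evaluate, seed=0):
    """Average evaluate(train_idx, test_idx) over J independent K-fold CVs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(j):                          # J independent repetitions,
        for fold in k_fold_indices(n, k, rng):  # each a full K-fold CV
            test = set(fold)
            train = [i for i in range(n) if i not in test]
            scores.append(evaluate(train, fold))
    return statistics.mean(scores)              # mean over all J*K fold scores
```

Averaging over J repetitions reduces the partition-induced variance of the estimate, which is the paper's motivation for spending computation on J rather than on large K.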

    Machine learning and electronic health records

    In this work, we investigate the benefits and complications of using machine learning on EHR data. We survey some recent literature and conduct experiments on real data collected from hospital EHR systems. Master's thesis in Informatics (INF399, MAMN-INF, MAMN-PRO).

    Simple Recurrent Units for Highly Parallelizable Recurrence

    Common recurrent neural architectures scale poorly due to the intrinsic difficulty of parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence and enable a highly parallelized implementation, and it comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves a 5--9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average 0.7 BLEU improvement over the Transformer model on translation by incorporating SRU into the architecture. Comment: EMNL
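The parallelizability claim above comes from the shape of the SRU recurrence: the input projections depend only on the inputs, leaving only a cheap elementwise state update on the sequential path. The scalar toy version below is a simplified variant for illustration only, not the released implementation (which batches the matrix projections across time steps on GPU); the parameters `wf`, `wr`, `w` and the biases are hypothetical scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_cell(xs, wf=0.5, bf=0.0, wr=0.5, br=0.0, w=1.0):
    """Scalar toy SRU: only the elementwise c update is sequential."""
    c, hs = 0.0, []
    for x in xs:
        # The input projections (w*x, wf*x, wr*x) depend only on x, so
        # across time steps they can all be computed in parallel up front.
        f = sigmoid(wf * x + bf)              # forget gate
        r = sigmoid(wr * x + br)              # reset / highway gate
        c = f * c + (1.0 - f) * (w * x)       # light sequential state update
        h = r * math.tanh(c) + (1.0 - r) * x  # highway connection to input
        hs.append(h)
    return hs
```

By contrast, an LSTM's gates depend on the previous hidden state, so its matrix multiplications cannot be hoisted out of the sequential loop.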

    Analysis and Detection of Information Types of Open Source Software Issue Discussions

    Most modern Issue Tracking Systems (ITSs) for open source software (OSS) projects allow users to add comments to issues. Over time, these comments accumulate into discussion threads embedded with rich information about the software project, which can potentially satisfy the diverse needs of OSS stakeholders. However, discovering and retrieving relevant information from the discussion threads is a challenging task, especially when the discussions are lengthy and the number of issues in ITSs is vast. In this paper, we address this challenge by identifying the information types presented in OSS issue discussions. Through qualitative content analysis of 15 complex issue threads across three projects hosted on GitHub, we uncovered 16 information types and created a labeled corpus containing 4656 sentences. Our investigation of supervised, automated classification techniques indicated that, when prior knowledge about the issue is available, Random Forest can effectively detect most sentence types using conversational features such as the sentence length and its position. When classifying sentences from new issues, Logistic Regression can yield satisfactory performance using textual features for certain information types, while falling short on others. Our work represents a nontrivial first step towards tools and techniques for identifying and obtaining the rich information recorded in ITSs to support various software engineering activities and to satisfy the diverse needs of OSS stakeholders. Comment: 41st ACM/IEEE International Conference on Software Engineering (ICSE 2019)
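The conversational features the abstract highlights, sentence length and position in the thread, are simple to compute. The sketch below is an illustrative assumption of how such a feature extractor might look; the function name and feature encoding are hypothetical, not the paper's code.

```python
def conversational_features(thread):
    """For each sentence in an issue thread, compute the two conversational
    features named in the abstract: token length and relative position."""
    n = len(thread)
    return [
        {
            "length": len(sentence.split()),  # whitespace tokens in the sentence
            "position": pos / max(n - 1, 1),  # 0.0 = first comment, 1.0 = last
        }
        for pos, sentence in enumerate(thread)
    ]
```

Features like these are model-agnostic, which is consistent with the abstract's finding that they work when issue-specific prior knowledge is available but transfer less well to entirely new issues.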

    Dataset Splitting Techniques Comparison For Face Classification on CCTV Images

    The performance of classification models in machine learning is influenced by many factors, one of which is the dataset splitting method. To avoid overfitting, it is important to apply a suitable dataset splitting strategy. This study presents a comparison of four dataset splitting techniques, namely Random Sub-sampling Validation (RSV), k-Fold Cross Validation (k-FCV), Bootstrap Validation (BV), and Moralis Lima Martin Validation (MLMV). The comparison is carried out on face classification from CCTV images using the Convolutional Neural Network (CNN) and Support Vector Machine (SVM) algorithms, and is applied to two image datasets. The results are reviewed using model accuracy on the training, validation, and test sets, as well as the bias and variance of the model. The experiments show that the k-FCV technique has more stable performance, providing high accuracy on the training set as well as good generalization on the validation and test sets. Meanwhile, data splitting using the MLMV technique performs worse than the other three techniques, yielding lower accuracy. This technique also shows higher bias and variance values and builds overfitting models, especially when applied to the validation set.
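Two of the compared splitting strategies, RSV and k-FCV, can be sketched directly. This is an illustrative pure-Python version (the function names are assumptions, not the study's code); the key contrast is that RSV draws independent random splits, so a sample may never appear in a test set, while k-FCV tests every sample exactly once.

```python
import random

def random_subsampling_splits(n, test_frac, repeats, seed=0):
    """RSV: repeated, independent random train/test splits."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(n * test_frac)
        splits.append((idx[cut:], idx[:cut]))  # (train, test)
    return splits

def k_fold_splits(n, k, seed=0):
    """k-FCV: every sample lands in the test fold exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([x for f in folds if f is not fold for x in f], fold)
        for fold in folds
    ]
```

The exhaustive coverage of k-FCV is one plausible reason for the more stable accuracy the study reports for it: every sample contributes to exactly one test-set evaluation.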

    A Machine Learning Approach to Identify the Preferred Representational System of a Person

    Whenever people think about something or engage in activities, internal mental processes are engaged. These processes consist of sensory representations, such as visual, auditory, and kinesthetic representations, which are in constant use and can affect a person's performance. Each person has a preferred representational system they use most when speaking, learning, or communicating, and identifying it can explain a large part of their exhibited behaviours and characteristics. This paper proposes a machine learning-based automated approach to identify the representational system a person unconsciously prefers. A novel methodology has been used to create a specific labelled conversational dataset, four different machine learning models (support vector machine, logistic regression, random forest, and k-nearest neighbour) have been implemented, and the performance of these models has been evaluated and compared. The results show that the support vector machine model performs best at identifying a person's preferred representational system, achieving a higher mean accuracy score than the other approaches under 10-fold cross-validation. The automated model proposed here can assist Neuro Linguistic Programming practitioners and psychologists in better understanding their clients' behavioural patterns and the relevant cognitive processes. It can also be used by people and organisations to achieve their goals in personal development and management. The two main knowledge contributions of this paper are the creation of the first labelled dataset for representational systems, which is now publicly available, and the first use of machine learning techniques to identify a person's preferred representational system in an automated way.

    DATA SCIENCE METHODS FOR STANDARDIZATION, SAFETY, AND QUALITY ASSURANCE IN RADIATION ONCOLOGY

    Radiation oncology is the field of medicine that treats cancer patients with ionizing radiation. The clinical modality or technique used to treat cancer patients in the radiation oncology domain is referred to as radiation therapy. Radiation therapy aims to deliver a precisely measured dose of irradiation to a defined tumor volume (the target) with as little damage as possible to the surrounding healthy tissue (organs-at-risk), resulting in eradication of the tumor, high quality of life, and prolonged survival. A typical radiotherapy process requires the use of different clinical systems at various stages of the workflow. The data generated in these stages is stored in unstructured and non-standard formats, which hinders the interoperability and interconnectivity of data, making it difficult to translate these datasets into knowledge that supports decision-making in routine clinical practice. In this dissertation, we present an enterprise-level informatics platform that can automatically extract and efficiently store clinical, treatment, imaging, and genomics data from radiation oncology patients. Additionally, we propose data science methods for data standardization, safety, and treatment quality analysis in radiation oncology. We demonstrate that our data standardization methods using word embeddings and machine learning are robust and highly generalizable on real-world clinical datasets collected from the nationwide radiation therapy centers administered by the US Veterans' Health Administration. We also present different heterogeneous data integration approaches to enhance the data standardization process. For patient safety, we analyze radiation oncology incident reports and propose an integrated natural language processing and machine learning pipeline to automate the incident triage and prioritization process. We demonstrate that a deep learning based transfer learning approach helps in the automated incident triage process. Finally, we address the issue of treatment quality in terms of automated treatment planning in clinical decision support systems. We show that supervised machine learning methods can efficiently generate clinical hypotheses from radiation oncology treatment plans, and we demonstrate our framework's data analytics capability.