8 research outputs found

    Cross Lingual Sentiment Analysis: A Clustering-Based Bee Colony Instance Selection and Target-Based Feature Weighting Approach

    The lack of sentiment resources in low-resource languages poses challenges for machine-learning-based sentiment analysis. Cross-lingual and semi-supervised learning are the most common approaches to overcoming this issue; however, the performance of existing methods degrades due to the poor quality of translated resources, data sparseness and, more specifically, language divergence. We propose an integrated learning model that combines semi-supervised learning with an ensemble model, utilizing the available sentiment resources to tackle language-divergence issues. Additionally, to reduce the impact of translation errors and address the instance selection problem, we propose a clustering-based bee colony sample selection method for the optimal selection of the most distinguishing features representing the target data. To evaluate the proposed model, experiments are conducted on an English-Arabic cross-lingual data set. Simulation results demonstrate that the proposed model outperforms the baseline approaches in classification performance. Furthermore, the statistical outcomes indicate that the proposed training data sampling and target-based feature selection reduce the negative effect of translation errors. These results highlight that the proposed approach achieves performance close to in-language supervised models.
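
    For intuition, here is a minimal, hypothetical sketch of the kind of clustering-based bee colony instance selection the abstract describes: training instances are clustered, and a simplified artificial-bee-colony search looks for the subset of clusters whose instances train the best classifier. The fitness definition, cluster count, and search schedule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: clustering-based instance selection with a simplified
# artificial-bee-colony (ABC) search. Assumes y holds integer class labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    """Mean CV accuracy of a classifier trained on the kept instances."""
    if mask.sum() < 10 or np.bincount(y[mask]).min() < 3:
        return 0.0  # reject near-empty or single-class selections
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[mask], y[mask], cv=3).mean()

def abc_instance_selection(X, y, n_clusters=20, n_bees=10, n_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    # Cluster the (translated) training data; selection then operates on
    # whole clusters, so noisy translation regions can be dropped together.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    # Each "bee" carries a binary keep/drop mask over clusters.
    bees = rng.integers(0, 2, size=(n_bees, n_clusters)).astype(bool)
    scores = np.array([fitness(b[labels], X, y) for b in bees])
    for _ in range(n_iters):
        for i in range(n_bees):
            trial = bees[i].copy()
            trial[rng.integers(n_clusters)] ^= True  # flip one cluster bit
            s = fitness(trial[labels], X, y)
            if s > scores[i]:  # greedy replacement (employed-bee phase)
                bees[i], scores[i] = trial, s
        worst = scores.argmin()  # scout phase: restart the worst bee
        bees[worst] = rng.integers(0, 2, size=n_clusters).astype(bool)
        scores[worst] = fitness(bees[worst][labels], X, y)
    return bees[scores.argmax()][labels]  # boolean mask over instances

# Usage: keep = abc_instance_selection(X_train, y_train); train on X_train[keep].
```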

    Efficiently Reusing Natural Language Processing Models for Phenotype Identification in Free-text Electronic Medical Records: Methodological Study

    Background: Many efforts have been put into the use of automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records in order to construct comprehensive patient profiles for delivering better health care. Reusing NLP models in new settings, however, remains cumbersome, as it requires validation and/or retraining on new data iteratively to achieve convergent results. Objective: The aim of this work is to minimise the effort involved in reusing NLP models on free-text medical records. Methods: We formally define and analyse the model adaptation problem in phenotype identification tasks. We identify “duplicate waste” and “imbalance waste”, which collectively impede efficient model reuse. We propose a concept-embedding-based approach to minimise these sources of waste without the need for labelled data from new settings. Results: We conduct experiments on data from a large mental health registry to reuse NLP models in four phenotype identification tasks. The proposed approach can choose the best model for a new task, identifying up to 76% of phenotype mentions without the need for validation and model retraining, and with very good performance (93-97% accuracy). It can also provide guidance for validating and retraining the selected model for novel language patterns in new tasks, saving around 80% of the effort required in “blind” model-adaptation approaches. Conclusions: Adapting pre-trained NLP models for new tasks can be more efficient and effective if the language pattern landscapes of old and new settings can be made explicit and comparable. Our experiments show that the phenotype embedding approach is an effective way to model language patterns for phenotype identification tasks and that its use can guide efficient NLP model reuse.
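
    A minimal sketch of the model-selection idea, assuming the "language pattern landscape" of a setting can be summarised by averaging embeddings of its phenotype mentions; the cosine criterion and data layout are illustrative assumptions rather than the paper's exact method.

```python
# Hypothetical sketch: choose a pre-trained model for a new setting by
# comparing embedding "landscapes". Assumes each setting provides an
# (n_mentions, dim) array of phenotype-mention embeddings.
import numpy as np

def landscape(mention_vectors):
    """Summarise a setting's language patterns as a centroid embedding."""
    return np.asarray(mention_vectors).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_model(new_mentions, trained_models):
    """trained_models maps name -> (model, source_mention_vectors).
    Returns the model whose source setting best matches the new one."""
    target = landscape(new_mentions)
    name = max(trained_models,
               key=lambda k: cosine(landscape(trained_models[k][1]), target))
    return name, trained_models[name][0]
```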

    Efficient Reuse of Natural Language Processing Models for Phenotype-Mention Identification in Free-text Electronic Medical Records: A Phenotype Embedding Approach.

    Background: Many efforts have been put into the use of automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records in order to construct comprehensive patient profiles for delivering better health care. Reusing NLP models in new settings, however, remains cumbersome, as it requires validation and/or retraining on new data iteratively to achieve convergent results. Objective: The aim of this work is to minimize the effort involved in reusing NLP models on free-text medical records. Methods: We formally define and analyze the model adaptation problem in phenotype-mention identification tasks. We identify "duplicate waste" and "imbalance waste", which collectively impede efficient model reuse. We propose a phenotype-embedding-based approach to minimize these sources of waste without the need for labelled data from new settings. Results: We conduct experiments on data from a large mental health registry to reuse NLP models in four phenotype-mention identification tasks. The proposed approach can choose the best model for a new task, identifying up to 76% of phenotype mentions (the duplicate waste) without the need for validation and model retraining, and with very good performance (93-97% accuracy). It can also provide guidance for validating and retraining the selected model for novel language patterns in new tasks, saving around 80% of the effort (the imbalance waste) required in "blind" model-adaptation approaches. Conclusions: Adapting pre-trained NLP models for new tasks can be more efficient and effective if the language pattern landscapes of old and new settings can be made explicit and comparable. Our experiments show that the phenotype-mention embedding approach is an effective way to model language patterns for phenotype-mention identification tasks and that its use can guide efficient NLP model reuse.
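
    The duplicate/imbalance-waste split might be estimated roughly as below; the pattern key (lemma plus negation flag) is a hypothetical simplification of how mentions could be matched across settings.

```python
# Hypothetical sketch: splitting new-setting mentions into those covered by
# already-validated patterns (duplicate waste avoided) and novel ones that
# still need validation (the imbalance-waste target).
def pattern_key(mention):
    # Mentions assumed to be dicts with a lemma and a negation flag.
    return (mention["lemma"], mention["negated"])

def split_by_reusability(old_mentions, new_mentions):
    seen = {pattern_key(m) for m in old_mentions}
    reusable = [m for m in new_mentions if pattern_key(m) in seen]
    novel = [m for m in new_mentions if pattern_key(m) not in seen]
    return reusable, novel  # e.g. 76% reusable: most mentions need no re-validation
```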

    Addressing modern and practical challenges in machine learning: a survey of online federated and transfer learning

    Online federated learning (OFL) and online transfer learning (OTL) are two collaborative paradigms for overcoming modern machine learning challenges such as data silos, streaming data, and data security. This survey explores OFL and OTL along their major evolutionary routes to enhance understanding of both paradigms. Practical aspects of popular datasets and cutting-edge applications for online federated and transfer learning are also highlighted. Furthermore, this survey provides insight into potential future research areas and aims to serve as a resource for professionals developing online federated and transfer learning frameworks.
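
    As a rough illustration of the OFL setting surveyed here, the following sketch shows one round of online federated averaging on streaming mini-batches; the linear model and learning rate are assumptions for demonstration.

```python
# Hypothetical sketch: one round of online federated averaging. Each client
# takes an SGD step on its newest mini-batch of a linear-regression stream,
# then the server averages the weights; raw data never leaves a client.
import numpy as np

def client_update(w, X_batch, y_batch, lr=0.1):
    grad = 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)
    return w - lr * grad

def federated_round(w_global, client_batches):
    """client_batches: list of (X_batch, y_batch), one per client this round."""
    local = [client_update(w_global.copy(), X, y) for X, y in client_batches]
    return np.mean(local, axis=0)  # FedAvg-style aggregation

# Streaming usage: w = np.zeros(d); at each time step,
# w = federated_round(w, latest_batches)
```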

    Data-Efficient Machine Learning with Focus on Transfer Learning

    Machine learning (ML) has attracted a significant amount of attention from the artificial intelligence community. ML has shown state-of-the-art performance in various fields, such as signal processing, healthcare systems, and natural language processing (NLP). However, most conventional ML algorithms suffer from three significant difficulties: 1) insufficient high-quality training data, 2) a costly training process, and 3) domain discrepancy. It is therefore important to develop solutions to these problems so that the future of ML will be more sustainable. Recently, a new concept, data-efficient machine learning (DEML), has been proposed to deal with the current bottlenecks of ML, and transfer learning (TL) has been considered an effective solution to the three shortcomings of conventional ML. TL is one of the most active areas in DEML, and significant progress has been made in TL over the past ten years. In this dissertation, I propose to address the three problems by developing a software-oriented framework and TL algorithms. I first present a well-defined DEML framework with an evaluation system and explain how it addresses the challenges in ML. After that, I give an updated overview of the state of the art and open challenges in TL. I then introduce two novel algorithms for two of the most challenging TL topics: distant-domain TL and cross-modality TL (image-text), with detailed algorithm descriptions and preliminary results on real-world applications (Covid-19 diagnosis and image classification). I then discuss current trends in TL algorithms and real-world applications. Lastly, I present the conclusion and future research directions.
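
    The basic transfer-learning recipe underlying this line of work can be sketched as follows: freeze a pretrained backbone and train only a small task head on limited labelled data. The backbone choice (ResNet-18) and head are illustrative assumptions, not the dissertation's models.

```python
# Hypothetical sketch of the basic transfer-learning recipe: freeze a
# pretrained backbone and train only a small head on limited labelled data.
import torch
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes):
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in backbone.parameters():
        p.requires_grad = False           # keep source-domain features fixed
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new head
    return backbone

model = build_transfer_model(num_classes=2)   # e.g. Covid-19 vs. normal scans
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
# Only the new head updates, so far fewer labelled examples are needed.
```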

    Machine Learning Modeling for Image Segmentation in Manufacturing and Agriculture Applications

    Doctor of Philosophy, Department of Industrial & Manufacturing Systems Engineering, Shing I Chang.
    This dissertation focuses on applying machine learning (ML) modelling to image segmentation tasks in various applications such as additive manufacturing monitoring, agricultural soil cover classification, and laser scribing quality control. The proposed ML framework uses models such as a gradient boosting classifier and a deep convolutional neural network to improve and automate image segmentation tasks. In recent years, supervised ML methods have been widely adopted for image processing applications in various industries. Cameras installed in production processes have generated a vast amount of image data that can potentially be used for process monitoring, and deep supervised machine learning models have been successfully implemented to build automatic tools for filtering and classifying useful information for process monitoring. However, successful implementation of deep supervised learning algorithms depends on several factors, such as the distribution and size of the training data, the selected ML models, and consistency in the target domain distribution, which may change with environmental conditions over time. The proposed framework takes advantage of general-purpose, trained supervised learning models and applies them to process monitoring applications in manufacturing and agriculture.
    In Chapter 2, a layer-wise framework is proposed to monitor the quality of 3D-printed parts based on top-view images. The proposed statistical process monitoring method starts with self-starting control charts that require only two successful initial prints. Unsupervised machine learning methods can be used for problems in which high accuracy is not required, but statistical process monitoring usually demands high classification accuracy to avoid Type I and II errors. To answer the challenges that lighting poses for unsupervised image processing, a supervised gradient boosting classifier (GBC) with 93 percent accuracy is adopted to classify each printed layer against the printing bed. Although GBC and other decision-tree-based ML models are competitive with unsupervised ML models, their capability is limited in terms of accuracy and running time for complex classification problems such as soil cover classification.
    In Chapter 3, a deep convolutional neural network (DCNN) for semantic segmentation is trained to quantify and monitor soil coverage in agricultural fields. The trained model accurately quantifies green canopy cover, counts plants, and classifies stubble. Because of the wide variety of scenarios in a real agricultural field, 3942 high-resolution images were collected and labeled for the training and test data sets. The difficulty of collecting, cleaning, and labeling this dataset motivated the search for an approach that alleviates the data-wrangling burden of ML model training. One of the most influential factors is the need for a high volume of labeled data from the exact problem domain in terms of feature space and the distribution of data across all classes. Image data preparation for deep learning model training is expensive in labeling time due to tedious manual processing, and while multiple human labelers can work simultaneously, inconsistent labeling generates a training data set that often compromises model performance.
    In addition, training an ML model for a complicated problem from scratch demands vast computational power. One potential approach for alleviating these data-wrangling challenges is transfer learning (TL). In Chapter 4, a TL approach is adopted to monitor three laser scribing characteristics (scribe width, straightness, and debris) in answer to these challenges. The proposed transfer deep convolutional neural network (TDCNN) model reduces the time-consuming and costly data preparation process: the framework leverages a deep learning model already trained for a similar problem and uses only 21 images gleaned from the problem domain. The TDCNN overcomes the data challenge by building on the VGG16 DCNN, which was already trained for basic geometric features using more than two million pictures. Appropriate image processing techniques are provided to measure scribe width and line straightness, as well as total scribe and debris area, using images classified with 96 percent accuracy. Besides functioning with fewer trainable parameters (5 million versus 15 million for VGG16), the TDCNN showed no significant accuracy improvement when the training size was increased to 154 images, indicating that it does not need a high volume of data to be successful. Finally, Chapter 5 summarizes the proposed work and lays out topics for future research.
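
    A minimal sketch of a transfer DCNN in the spirit of the TDCNN described above: a pretrained, frozen VGG16 convolutional base with a small trainable head. The head layout and sizes are assumptions (and, for simplicity, the head shown performs image-level classification, whereas the dissertation's model performs pixel-level segmentation).

```python
# Hypothetical sketch: a transfer DCNN with a pretrained, frozen VGG16 base
# and a small trainable head (shown as a classifier for simplicity).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # reuse the learned visual features

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),   # small head: few trainable params,
    layers.Dropout(0.5),                    # suited to a ~21-image dataset
    layers.Dense(3, activation="softmax"),  # e.g. scribe / debris / background
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# With the base frozen, only the head trains, which is why a handful of
# labelled images can suffice.
```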

    Instance-based Domain Adaptation via Multiclustering Logistic Approximation

    No full text available