
    Automatic high content screening using deep learning

    Recently, deep learning algorithms have been used successfully in a variety of domains, and deep learning has proven to be a very helpful tool for discovering complicated structures in high-dimensional, large datasets. In this work, five deep learning models inspired by AlexNet, VGG, and GoogLeNet are developed to predict mechanisms of action (MOAs) from phenotypic screens of cells in dim, noisy images. We demonstrate that our models can predict the MOA for a compendium of drugs that alter cells, using either single-cell or cell-population views, without any segmentation or feature-extraction steps. These results indicate that our models do not need full single-cell measurements to profile samples, because they exploit the morphology of specific phenomena in the cell-population samples. We used a large, imbalanced High Content Screening dataset to predict MOAs, with the main goal of understanding how to work properly with deep learning algorithms on imbalanced datasets when sampling methods such as oversampling, undersampling, and the Synthetic Minority Over-sampling Technique (SMOTE) are used to balance the dataset. Based on our findings, the SMOTE sampling algorithm should be part of the pipeline when deep learning confronts imbalanced datasets. High Content Screening (HCS) technologies must screen thousands of cells to provide a number of parameters for each cell, such as nuclear size, nuclear morphology, and DNA replication, and the success of HCS systems depends on automatic image analysis. While deep learning algorithms have overcome object-recognition challenges on tasks with a single centered object per image, they had not previously been applied to images containing multiple complex objects, such as microscopic images of many cells.
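The core of SMOTE, which the abstract singles out for imbalanced datasets, is to synthesize new minority-class samples by interpolating between a real sample and one of its nearest minority-class neighbours. A minimal pure-Python sketch of that interpolation step (not the authors' implementation; the function name and the toy points are illustrative):

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority sample and one of its k nearest minority neighbours
    (the core idea of SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy 2-D minority class; each synthetic point lies on a segment
# between two real minority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=4)
print(len(new_points))  # 4 synthetic samples
```

In practice one would use a maintained implementation (e.g. the `imbalanced-learn` package) rather than this sketch, but the interpolation logic is the same.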

    Handling Imbalanced Data through Re-sampling: Systematic Review

    Handling imbalanced data is an important issue that can affect the validity and reliability of results. One common approach to addressing it is re-sampling: balancing the class distribution of a dataset by over-sampling the minority class, under-sampling the majority class, or both. Over-sampling adds more copies of minority-class examples to the dataset, while under-sampling removes some majority-class examples; combining the two is usually called hybrid sampling. It is important to note that re-sampling can affect model performance, so the model should be evaluated with several metrics, and alternatives such as cost-sensitive learning and anomaly detection should also be considered. In addition, collecting more data, when feasible, generally improves model performance. In this systematic review, we aim to provide an overview of existing methods for re-sampling imbalanced data. We focus on methods proposed in the literature and evaluate their effectiveness through a thorough examination of experimental results. The goal of this review is to give practitioners a comprehensive understanding of the available re-sampling methods, along with their strengths and weaknesses, to help them make informed decisions when dealing with imbalanced data.
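The two basic strategies the review describes, random over-sampling and random under-sampling, can be sketched in a few lines of plain Python (a toy illustration, not any method surveyed in the review; the function name is made up):

```python
import random
from collections import Counter

def rebalance(X, y, mode="over", seed=0):
    """Balance a dataset by random over-sampling minority classes
    (drawing with replacement) or random under-sampling majority
    classes (drawing without replacement)."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    sizes = [len(v) for v in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    Xb, yb = [], []
    for c, items in by_class.items():
        if len(items) < target:    # over-sample: duplicate random examples
            items = items + [rng.choice(items) for _ in range(target - len(items))]
        elif len(items) > target:  # under-sample: keep a random subset
            items = rng.sample(items, target)
        Xb.extend(items)
        yb.extend([c] * target)
    return Xb, yb

X = list(range(10))
y = [0] * 8 + [1] * 2  # imbalanced: 8 majority vs 2 minority
Xo, yo = rebalance(X, y, "over")
print(Counter(yo))  # Counter({0: 8, 1: 8})
```

Note that over-sampling by duplication risks overfitting the repeated examples, while under-sampling discards information, which is exactly the trade-off that motivates hybrid and synthetic approaches such as SMOTE.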

    The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management. In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for various applications, promoting environmental sustainability and good resource management. However, their production remains a challenging task. Various factors contribute to the difficulty of generating accurate, up-to-date LULC maps, whether via automatic or photo-interpreted LULC mapping. Data preprocessing, a crucial step for any machine learning task, is particularly important in the remote sensing domain because of the overwhelming amount of raw, unlabeled data continuously gathered from multiple remote sensing missions. However, a significant part of the state of the art focuses on scenarios with full access to labeled training data and relatively balanced class distributions. This thesis focuses on the challenges found in automatic LULC classification tasks, specifically in data preprocessing. We focus on the development of novel Active Learning (AL) and imbalanced-learning techniques to improve ML performance in situations with limited training data and/or rare classes. We also show that many of the contributions presented succeed not only in remote sensing problems but also in various other multidisciplinary classification problems. The work presented in this thesis used open-access datasets to test the contributions made in imbalanced learning and AL. All the data pulling, preprocessing, and experiments are made available at https://github.com/joaopfonseca/publications. The algorithmic implementations are made available in the Python package ml-research at https://github.com/joaopfonseca/ml-research.
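The thesis develops novel Active Learning techniques; as background, the classic pool-based AL baseline is uncertainty sampling: given model scores for an unlabeled pool, query the examples the model is least sure about. A minimal sketch of that selection step (this is the textbook baseline, not the thesis's contribution; the probabilities below are hypothetical):

```python
def uncertainty_sample(probs, n):
    """Pool-based active learning: pick the n unlabeled samples whose
    predicted positive-class probability is closest to 0.5, i.e. the
    least confident predictions, to be sent for labeling next."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:n]

# Hypothetical model scores for an unlabeled pool of five samples.
pool_probs = [0.95, 0.48, 0.10, 0.55, 0.99]
print(uncertainty_sample(pool_probs, 2))  # [1, 3]
```

The selected indices would be labeled (e.g. by a photo-interpreter for LULC data) and added to the training set before the model is retrained, repeating until the labeling budget is spent.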

    DATA MINING AND IMAGE CLASSIFICATION USING GENETIC PROGRAMMING

    Genetic programming (GP), a capable machine learning and search method motivated by Darwinian evolution, is an evolutionary learning algorithm that automatically evolves computer programs, in the form of trees, to solve problems. This thesis studies the application of GP to data mining and image processing. Knowledge discovery and data mining have been widely used in business, healthcare, and scientific fields. In data mining, classification is supervised learning that identifies new patterns and maps data to predefined targets. A GP-based classifier is developed to perform these mappings. GP has been investigated in a series of studies to classify data; however, certain aspects had not previously been studied. We propose an optimized GP classifier based on a combination of subtree pruning and a new fitness function. An orthogonal least squares algorithm is also applied in the training phase to create a robust GP classifier. The proposed GP classifier is validated by 10-fold cross-validation. Three areas were studied in this thesis. The first investigation resulted in an optimized genetic-programming-based classifier that directly solves multiclass classification problems. Instead of defining static thresholds as boundaries to differentiate between multiple labels, our work presents a method of classification in which a GP system learns the relationships among experiential data and models them mathematically during the evolutionary process. Our approach has been assessed on six multiclass datasets. The second investigation developed a GP classifier to segment and detect brain tumors in magnetic resonance imaging (MRI) images. The findings indicated the high accuracy of brain tumor classification provided by our GP classifier and confirm the strong ability of the developed technique for complicated image classification problems. The third investigation developed a hybrid system for multiclass imbalanced data classification using GP and SMOTE, tested on satellite images. The findings showed that the proposed approach improves both training and test results when the SMOTE technique is incorporated. We also compared our approach with previous GP algorithms in terms of speed. The results show that the developed classifier provides an effective and fast method for classification that outperforms previous methods on more challenging multiclass classification problems. We tested the approaches presented in this thesis on publicly available datasets and images, and the findings were statistically tested to confirm the robustness of the developed approaches.
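GP's defining representation, evolving programs as expression trees scored by a fitness function, can be illustrated with a minimal evaluator (a generic sketch of the tree-plus-fitness idea, not the thesis's optimized classifier; the candidate program and training cases are invented):

```python
import operator

# Function set for the expression trees.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, x):
    """Recursively evaluate a GP expression tree.
    A tree is either a constant, the variable 'x', or (op, left, right)."""
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, cases):
    """Sum of squared errors over (input, target) training cases;
    lower is better, as in a typical GP regression-style fitness."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in cases)

# Candidate program x*x + 1, scored against the target f(x) = x^2 + 1.
candidate = ("+", ("*", "x", "x"), 1)
cases = [(x, x * x + 1) for x in range(-3, 4)]
print(fitness(candidate, cases))  # 0
```

Evolution then proceeds by selecting low-error trees and applying crossover (swapping subtrees) and mutation; the subtree pruning the thesis proposes would operate on the same tree representation.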

    An application of user segmentation and predictive modelling at a telecom company

    Internship report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. "The squeaky wheel gets the grease" is an American proverb conveying the notion that only those who speak up tend to be heard. This was believed to be the case at the telecom company I interned at: while customers who complain about an issue (in particular, an issue of no access to the service) get their problem resolved, others have the same issue but do not complain about it. The latter are likely to be dissatisfied customers and must be identified. This report describes the approach taken to address this problem using machine learning. Unsupervised learning was used to segment the customer base into user profiles based on their viewing behaviour, to better understand their needs; and supervised learning was used to develop a predictive model that identifies customers who have no access to the TV service and to explore which factors (or combinations of factors) are indicative of this issue.
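The report does not name the segmentation algorithm; k-means is the common choice for behaviour-based clustering, and its two alternating steps (assign each customer to the nearest centroid, then move each centroid to the mean of its cluster) can be sketched in one dimension (a toy illustration with invented viewing-hours data, not the internship's actual model):

```python
def kmeans_1d(values, centroids, iters=10):
    """Minimal 1-D k-means: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep a centroid in place if its cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Hypothetical daily TV viewing hours for six customers: the clusters
# that emerge correspond to "light" and "heavy" viewer profiles.
hours = [0.2, 0.5, 0.4, 4.0, 4.5, 5.1]
print(kmeans_1d(hours, centroids=[0.0, 5.0]))
```

Real viewing behaviour would be multi-dimensional (hours, channels, time of day), but the assign-then-update loop is identical.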

    Data mining guided process for churn prediction in retail: from descriptive to predictive analytics

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management. In recent years, the development of new technologies has permeated all industries, and their rapid introduction has brought the need to resolve uncertainty in processes. The need for companies to understand and collect data has become a central paradigm, but the journey continues in the effort to transform data into powerful insight for new processes, goods, and services. In the grocery retail industry, it has become essential to draw on academic research to understand different commercial purposes (Perloff & Denbaly, 2007). Understanding the data coming from all sources in the industry has become essential, allowing efforts to be focused on reducing the gap between vertical and horizontal relationships and among the different stakeholders in the supply chain. That is why it became relevant to understand, and maximize, the customer experience along the supply and marketing chains. The complexity of transactions and the growing number of customers pose challenges for grocery retail stores in processing data and providing a high-quality, data-driven service to their customers. The key to gaining competitive advantage is to understand, classify, and prevent customer churn to maximize profit, and to attract and retain new customers with data-driven decisions. For this, it is necessary to understand and label customers as churners. Organizations tend to focus on developing plans to deal with customers, using CRM (Customer Relationship Management) as the core strategy to handle, maintain, and build new long-lasting relationships with the customer as a critical stakeholder (Chorianopoulos, 2015). Data mining techniques help CRM achieve these goals by building tools that lead to informed decisions, creating better, stronger, and longer-lasting relationships through analysis of customer-organization interactions and the application of complex models.
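A prerequisite the abstract mentions is labeling customers as churners before any predictive modelling. In retail this is often done with an inactivity rule; a minimal sketch, assuming a hypothetical 90-day threshold and invented customer data (the dissertation's actual labeling rule may differ):

```python
from datetime import date

def label_churners(last_purchase, today, inactive_days=90):
    """Label each customer as a churner when their last purchase falls
    more than `inactive_days` before the reference date: one simple way
    to derive a churn target variable for supervised learning."""
    return {cid: (today - d).days > inactive_days
            for cid, d in last_purchase.items()}

# Hypothetical customers and their most recent purchase dates.
last_purchase = {"c1": date(2023, 1, 5), "c2": date(2023, 5, 20)}
labels = label_churners(last_purchase, today=date(2023, 6, 1))
print(labels)  # {'c1': True, 'c2': False}
```

The resulting boolean column becomes the target for a churn classifier, turning the descriptive question (who has churned?) into the predictive one (who is about to?).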