
    Data Reduction Algorithms in Machine Learning and Data Science

    Raw data usually need to be pre-processed for better representation or discrimination of classes. This pre-processing can be done by data reduction, i.e., reduction in either dimensionality or numerosity (cardinality). Dimensionality reduction can be used for feature extraction or data visualization; numerosity reduction is useful for ranking data points or finding the most and least important data points. This thesis proposes several algorithms for data reduction, i.e., dimensionality and numerosity reduction, in machine learning and data science. Dimensionality reduction covers feature extraction and feature selection methods, while numerosity reduction includes prototype selection and prototype generation approaches. This thesis focuses on feature extraction and prototype selection for data reduction.

    Dimensionality reduction methods can be divided into three categories: spectral, probabilistic, and neural network-based methods. The spectral methods take a geometrical point of view and mostly reduce to the generalized eigenvalue problem. Probabilistic and neural network-based methods have stochastic and information-theoretic foundations, respectively. Numerosity reduction methods can be divided into methods based on variance, geometry, and isolation. For dimensionality reduction, under the spectral category, I propose weighted Fisher discriminant analysis, Roweis discriminant analysis, and image quality aware embedding. I also propose quantile-quantile embedding as a probabilistic method in which the distribution of the embedding is chosen by the user. Backprojection, Fisher losses, and dynamic triplet sampling using Bayesian updating are other proposed methods in the neural network-based category. Backprojection is for training shallow networks with a projection-based perspective in manifold learning. Two Fisher losses are proposed for training Siamese triplet networks to increase and decrease the inter- and intra-class variances, respectively. Two dynamic triplet mining methods, which are based on Bayesian updating to draw triplet samples stochastically, are also proposed.

    For numerosity reduction, principal sample analysis and instance ranking by matrix decomposition are the proposed variance-based methods; these methods rank instances using inter-/intra-class variances and matrix factorization, respectively. Curvature anomaly detection, in which the points are assumed to be the vertices of a polyhedron, and isolation Mondrian forest are the proposed methods based on geometry and isolation, respectively.

    To assess the proposed tools for data reduction, I apply them to applications in medical image analysis, image processing, and computer vision. Data reduction, used as a pre-processing tool, has many applications because it provides various ways of feature extraction and prototype selection for different types of data. Dimensionality reduction extracts informative features, and prototype selection selects the most informative data instances. For example, for medical image analysis, I use Fisher losses and dynamic triplet sampling for embedding histopathology image patches and demonstrating how different the tumorous cancer tissue types are from the normal ones. I also propose offline/online triplet mining using extreme distances for this embedding. In image processing and computer vision applications, I propose Roweisfaces and Roweisposes for face recognition and 3D action recognition, respectively, using the proposed Roweis discriminant analysis method. I also introduce the concepts of anomaly landscape and anomaly path using the proposed curvature anomaly detection and use them to denoise images and video frames.

    I report extensive experiments on different datasets to show the effectiveness of the proposed algorithms. These experiments demonstrate that the proposed methods are useful for extracting informative features and instances for better accuracy, representation, prediction, class separation, data reduction, and embedding. I show that the proposed dimensionality reduction methods can extract informative features for better separation of classes; an example is obtaining an embedding space that separates cancer histopathology patches from normal patches, which helps hospitals diagnose cancers more easily in an automated way. I also show that the proposed numerosity reduction methods are useful for ranking data instances by their importance and reducing data volumes without a significant drop in the performance of machine learning and data science algorithms.
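    As an illustration of how the spectral category reduces to the generalized eigenvalue problem, the sketch below computes a classical Fisher discriminant embedding. The function name, the regularization term, and the use of NumPy/SciPy are illustrative assumptions, not the thesis's own implementation.

    # Minimal sketch: Fisher discriminant analysis as a generalized eigenvalue problem.
    import numpy as np
    from scipy.linalg import eigh

    def fisher_discriminant_analysis(X, y, n_components=2, reg=1e-6):
        """Project X (n_samples x n_features) onto directions that maximize
        between-class scatter relative to within-class scatter."""
        classes = np.unique(y)
        mean_total = X.mean(axis=0)
        d = X.shape[1]
        S_w = np.zeros((d, d))  # within-class scatter
        S_b = np.zeros((d, d))  # between-class scatter
        for c in classes:
            Xc = X[y == c]
            mean_c = Xc.mean(axis=0)
            S_w += (Xc - mean_c).T @ (Xc - mean_c)
            diff = (mean_c - mean_total).reshape(-1, 1)
            S_b += len(Xc) * (diff @ diff.T)
        S_w += reg * np.eye(d)                 # regularize for invertibility
        # Generalized eigenvalue problem: S_b v = lambda * S_w v.
        eigvals, eigvecs = eigh(S_b, S_w)
        order = np.argsort(eigvals)[::-1]      # largest eigenvalues first
        return X @ eigvecs[:, order[:n_components]]

    For example, Z = fisher_discriminant_analysis(X, y, n_components=2) would project the data onto two directions along which the classes are maximally separated in this sense.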

    Tiny CNN for feature point description for document analysis: approach and dataset

    In this paper, we study the problem of feature point description in the context of document analysis and template matching. Our study shows that task-specific training data are required, especially if we are to train a lightweight neural network that will be usable on devices with limited computational resources. We therefore construct and provide a dataset of photographed and synthetically generated images, together with a method for generating training patches from it. We demonstrate the effectiveness of these data by training a lightweight neural network and show how it performs in both general and document patch matching. Training was done on the provided dataset, in comparison with the HPatches training dataset; for testing, we solve the HPatches testing framework tasks and a template matching task on two publicly available datasets with various documents pictured on complex backgrounds: MIDV-500 and MIDV-2019. This work was supported by the Russian Foundation for Basic Research (projects 18-29-26033 and 19-29-09064).
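    A minimal sketch of what such a lightweight patch-descriptor network might look like is given below (PyTorch assumed). The layer widths, 32x32 grayscale input patches, and 64-dimensional descriptor are illustrative assumptions, not the architecture from the paper.

    # Minimal sketch: tiny CNN producing a unit-length descriptor per patch.
    import torch
    import torch.nn as nn

    class TinyDescriptor(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Linear(64, dim)

        def forward(self, patch):                        # patch: (B, 1, 32, 32)
            z = self.features(patch).flatten(1)
            return nn.functional.normalize(self.head(z), dim=1)

    Training such a network would pair it with a metric-learning objective over matching and non-matching patches, e.g. torch.nn.TripletMarginLoss, with triplets drawn from the generated training patches.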

    Out-of-Distribution Generalization of Gigapixel Image Representation

    This thesis addresses the significant challenge of improving the generalization capabilities of artificial deep neural networks in the classification of whole slide images (WSIs) in histopathology across different and unseen hospitals. This is a critical issue in AI applications for vision-based healthcare tasks, given that current standard methodologies struggle with out-of-distribution (OOD) data from varying hospital sources. In histopathology, distribution shifts can arise from image acquisition variances across scanner vendors, differences in laboratory routines and staining procedures, and diversity in patient demographics. This work investigates two critical forms of generalization within histopathology: magnification generalization and OOD generalization towards different hospitals. One chapter of this thesis is dedicated to the exploration of magnification generalization, acknowledging the variability in histopathological images due to distinct magnification levels and seeking to enhance the model's robustness by learning features that are invariant across these levels. The major part of this work, however, focuses on OOD generalization, specifically towards unseen hospital data. The objective is to leverage the knowledge encapsulated in pre-existing models to help new models adapt to diverse data scenarios and to ensure their efficient operation in different hospital environments. Additionally, the concept of Hospital-Agnostic (HA) learning regimes is introduced, focusing on characteristics that are invariant across hospitals and aiming to establish a learning model that sustains stable performance in varied hospital settings. The culmination of this research is a comprehensive method, termed ALFA (Exploiting All Levels of Feature Abstraction), that not only considers invariant features across hospitals but also extracts a broader set of features from input images, thus maximizing the model's generalization potential. The findings of this research are expected to have significant implications for the deployment of medical image classification systems using deep models in clinical settings. The proposed methods allow for more accurate and reliable diagnostic support across various hospital environments, paving the way for enhanced generalization in histopathology diagnostics using deep learning techniques. Future research may build on these investigations to further improve generalization in histopathology.
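    The sketch below illustrates the general idea of drawing on all levels of feature abstraction from a frozen pre-trained backbone. The choice of ResNet-18, forward hooks, global average pooling, and the torchvision >= 0.13 weights API are illustrative assumptions, not the ALFA method itself.

    # Minimal sketch: concatenate pooled features from every residual stage.
    import torch
    import torchvision.models as models

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    stages = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]
    captured = []
    hooks = [s.register_forward_hook(lambda m, i, o: captured.append(o)) for s in stages]

    def multi_level_features(images):                    # images: (B, 3, H, W)
        captured.clear()
        with torch.no_grad():
            backbone(images)
        pooled = [f.mean(dim=(2, 3)) for f in captured]  # global average pooling per stage
        return torch.cat(pooled, dim=1)                  # (B, 64 + 128 + 256 + 512)

    A downstream classifier trained on such multi-level features sees both low-level texture cues and high-level semantic cues, which is the intuition behind exploiting all abstraction levels.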

    Software and FPGA-Based Hardware to Accelerate Machine Learning Classifiers

    This thesis improves the accuracy and run-time of two selected machine learning algorithms, the first in software and the second on a field-programmable gate array (FPGA) device. We first implement triplet loss and triplet mining methods for large margin metric learning, inspired by Siamese networks, and analyze the proposed methods. In addition, we propose a new hierarchical approach to accelerate the optimization, in which triplets are selected by stratified sampling in hierarchical hyperspheres. The method results in faster optimization and, in almost all cases, improved accuracy. This method is further studied for high-dimensional feature spaces with the goal of finding a projection subspace that increases and decreases the inter- and intra-class variances, respectively. We also study hardware acceleration of random forests (RFs) to improve classification run-time for large datasets. RFs are a widely used classification and regression algorithm, typically implemented in software. Hardware implementations can accelerate RFs, especially on FPGA platforms, owing to concurrent memory access and parallel computation. This thesis proposes a method to decrease training time by expanding memory usage on an Intel Arria 10 (10AX115N 3F 45I2SG) FPGA, while keeping accuracy comparable with CPU implementations.
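    The sketch below shows plain triplet sampling and the triplet hinge loss on precomputed features. The simple per-class sampling stands in for the hierarchical, hypersphere-based stratified sampling proposed in the thesis, and all names are illustrative.

    # Minimal sketch: random triplet mining and a large-margin triplet loss.
    import numpy as np

    def sample_triplets(X, y, n_triplets=1000, seed=0):
        rng = np.random.default_rng(seed)
        classes = np.unique(y)                  # assumes at least two samples per class
        triplets = []
        for _ in range(n_triplets):
            c = rng.choice(classes)
            pos_idx = np.flatnonzero(y == c)
            neg_idx = np.flatnonzero(y != c)
            a, p = rng.choice(pos_idx, size=2, replace=False)
            n = rng.choice(neg_idx)
            triplets.append((a, p, n))
        return triplets

    def triplet_loss(X, triplets, margin=1.0):
        # Hinge loss encouraging d(anchor, positive) + margin <= d(anchor, negative).
        losses = [max(0.0, np.sum((X[a] - X[p]) ** 2) - np.sum((X[a] - X[n]) ** 2) + margin)
                  for a, p, n in triplets]
        return float(np.mean(losses))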

    Multi-Magnification Search in Digital Pathology

    This research study investigates the effect of magnification on content-based image search in digital pathology archives and proposes the use of multi-magnification image representations. Image search in large archives of digital pathology slides provides researchers and medical professionals with an opportunity to match records of current and past patients and to learn from evidently diagnosed and treated cases. When working with microscopes, pathologists switch between different magnification levels while examining tissue specimens to find and evaluate various morphological features. Inspired by the conventional pathology workflow, this thesis investigates several magnification levels in digital pathology and their combinations to minimize the gap between AI-enabled image search methods and clinical settings. This thesis suggests two approaches for combining magnification levels and compares their performance. The first approach obtains a single-vector deep feature representation for a whole slide image (WSI), whereas the second approach works with a multi-vector deep feature representation. The proposed content-based search framework does not rely on any pixel-level annotation and potentially applies to millions of unlabelled (raw) WSIs. This thesis proposes using binary masks generated by U-Net as the primary step of patch preparation to locate tissue regions in a WSI. As part of this thesis, a multi-magnification dataset of histopathology patches is created by applying the proposed patch preparation method to more than 8,000 WSIs of the TCGA repository. The performance of both multi-magnification search methods is evaluated by investigating the top three most similar WSIs to a query WSI found by the search. The search is considered successful if two out of the three matched cases have the same malignancy subtype as the query WSI. Experimental search results across tumors of several anatomical sites at different magnification levels, i.e., 20×, 10×, and 5× magnifications and their combinations, are reported in this thesis. The experiments verify that cell-level information at the highest magnification is essential for diagnostic search. In contrast, low-magnification information may improve this assessment depending on the tumor type. Both proposed search methods generally performed more accurately at 20× magnification or at the combination of 20× with 10×, 5×, or both. The multi-magnification search approach achieved up to an 11% increase in F1-score for searching among some tumor types, including urinary tract and brain tumor subtypes, compared to single-magnification image search.
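    The sketch below illustrates the "two-out-of-three" retrieval check described above on hypothetical per-magnification feature vectors. Concatenating the per-magnification vectors and using cosine similarity are assumptions for illustration, not the exact search pipeline of the thesis.

    # Minimal sketch: leave-one-out top-3 search with a 2-of-3 majority check.
    import numpy as np

    def search_accuracy(features_by_mag, labels, k=3, required_matches=2):
        # features_by_mag: dict such as {20: (n, d1) array, 10: (n, d2) array, ...}.
        X = np.concatenate([features_by_mag[m] for m in sorted(features_by_mag)], axis=1)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # cosine similarity via dot product
        hits = 0
        for q in range(len(labels)):
            sims = X @ X[q]
            sims[q] = -np.inf                             # exclude the query itself
            top_k = np.argsort(sims)[::-1][:k]
            if np.sum(labels[top_k] == labels[q]) >= required_matches:
                hits += 1
        return hits / len(labels)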

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images may be affected by different sources of perturbation (e.g., sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has hence attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST dataset with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
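    The sketch below shows where such a delineation step sits in a classification pipeline: the classifier sees a contour-like map instead of the raw pixels. The gradient-magnitude map used here is only a placeholder standing in for the CORF push-pull inhibition operator, which is not implemented in this sketch.

    # Minimal sketch: replace the input image with a delineation-style map before the CNN.
    import numpy as np
    from scipy import ndimage

    def delineation_map(image):
        """Placeholder contour transform for a 2-D grayscale image in [0, 1]."""
        gx = ndimage.sobel(image, axis=0)
        gy = ndimage.sobel(image, axis=1)
        grad = np.hypot(gx, gy)
        return grad / (grad.max() + 1e-8)

    def classify(image, cnn_model):
        # The CNN is trained and evaluated on the transformed representation,
        # which is intended to be less sensitive to additive noise at test time.
        return cnn_model(delineation_map(image))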

    Computational Approaches to Drug Profiling and Drug-Protein Interactions

    Despite substantial increases in R&D spending within the pharmaceutical industry, de novo drug design has become a time-consuming endeavour, and high attrition rates have led to a long period of stagnation in drug approvals. Given the extreme costs associated with introducing a drug to the market, locating and understanding the reasons for clinical failure is key to future productivity. As part of this PhD, three main contributions were made in this respect. First, the web platform LigNFam enables users to interactively explore similarity relationships between ‘drug-like’ molecules and the proteins they bind. Second, two deep-learning-based binding site comparison tools were developed, competing with the state of the art on benchmark datasets; these models can predict off-target interactions and potential candidates for target-based drug repurposing. Finally, the open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold relationships and has already been used in multiple projects, including integration into a virtual screening pipeline to increase the tractability of ultra-large screening experiments. Together with existing tools, these contributions will aid the understanding of drug-protein relationships, particularly in the fields of off-target prediction and drug repurposing, helping to design better drugs faster.

    Irish Machine Vision and Image Processing Conference Proceedings 2017


    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of such manifolds has helped describe their essential properties and how they vary in space. However, when the manifold is evolving through time, joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernova.
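    A minimal sketch of such a first-order Markovian hand-over between time steps is given below: the mixture fitted at time t initializes the fit at time t+1, so the spatial model is propagated rather than re-estimated from scratch. The use of a Gaussian mixture as the spatial model and of scikit-learn's warm-start initialization are assumptions for illustration.

    # Minimal sketch: propagate a spatial mixture model through consecutive snapshots.
    from sklearn.mixture import GaussianMixture

    def propagate_model(snapshots, n_components=10):
        """snapshots: list of (n_particles, n_dims) arrays ordered in time."""
        models, prev = [], None
        for X in snapshots:
            if prev is None:
                gmm = GaussianMixture(n_components=n_components, random_state=0)
            else:
                # First-order Markov assumption: only the previous stage informs this one.
                gmm = GaussianMixture(n_components=n_components,
                                      weights_init=prev.weights_,
                                      means_init=prev.means_,
                                      random_state=0)
            gmm.fit(X)
            models.append(gmm)
            prev = gmm
        return models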