132 research outputs found

    Classification of THz pulse signals using two-dimensional cross-correlation feature extraction and non-linear classifiers

    This work compares the performance of four machine learning classifiers, namely multinomial logistic regression with ridge estimators (MLR), k-nearest neighbours (KNN), support vector machine (SVM) and naïve Bayes (NB), as applied to terahertz (THz) transient time-domain sequences associated with pixelated images of different powder samples. Although the six substances considered have similar optical properties, their complex insertion loss in the THz part of the spectrum differs significantly because of differences in their frequency-dependent THz extinction coefficient, refractive index and scattering properties. As scattering can be unquantifiable in many spectroscopic experiments, classification based solely on differences in complex insertion loss can be inconclusive. The problem is addressed using two-dimensional (2-D) cross-correlations between background and sample interferograms; these ensure good noise suppression of the datasets and provide a range of statistical features that are subsequently used as inputs to the above classifiers. A cross-validation procedure is adopted to assess the performance of the classifiers. First, the measurements from samples with a thickness of 2 mm were classified, then those from samples with a thickness of 4 mm, and finally those from 3 mm samples, and the success rate and consistency of each classifier were recorded. In addition, mixtures having thicknesses of 2 and 4 mm as well as mixtures of 2, 3 and 4 mm were presented simultaneously to all classifiers. This approach provided further cross-validation of the classification consistency of each algorithm.
The results confirm the superior classification accuracy and robustness of the MLR (minimum accuracy 88.24%) and KNN (minimum accuracy 90.19%) algorithms, which consistently outperformed the SVM (minimum accuracy 74.51%) and NB (minimum accuracy 56.86%) classifiers for the same number of feature vectors across all studies. The work establishes a general methodology for assessing the performance of other hyperspectral dataset classifiers on the basis of 2-D cross-correlations in far-infrared spectroscopy or other parts of the electromagnetic spectrum. It also advances the wider proliferation of automated THz imaging systems across new application areas, e.g., biomedical imaging, industrial processing and quality control, where interpretation of hyperspectral images is still under development.
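
The pipeline described in this abstract (cross-correlate a background with a sample, summarise the correlation surface with statistics, feed those to a classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2x2 toy "background", the attenuation factors, the choice of mean, standard deviation and peak as features, and the function names `xcorr2d`, `features` and `knn_predict`. It is not the paper's feature set or data.

```python
import math
from collections import Counter


def xcorr2d(a, b):
    """Full 2-D cross-correlation of two small 2-D lists."""
    ra, ca, rb, cb = len(a), len(a[0]), len(b), len(b[0])
    out = [[0.0] * (ca + cb - 1) for _ in range(ra + rb - 1)]
    for i in range(ra):
        for j in range(ca):
            for k in range(rb):
                for m in range(cb):
                    # Lag (i - k, j - m), shifted so that indices start at 0.
                    out[i - k + rb - 1][j - m + cb - 1] += a[i][j] * b[k][m]
    return out


def features(corr):
    """Mean, standard deviation and peak of a correlation surface."""
    flat = [v for row in corr for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
    return (mean, std, max(flat))


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical toy data: a 2x2 "background" and samples that attenuate it.
background = [[1.0, 2.0], [3.0, 4.0]]


def sample(scale):
    return [[scale * v for v in row] for row in background]


train = [(features(xcorr2d(background, sample(s))), label)
         for s, label in [(0.4, "A"), (0.5, "A"), (0.6, "A"),
                          (0.8, "B"), (0.9, "B"), (1.0, "B")]]
query = features(xcorr2d(background, sample(0.55)))
print(knn_predict(train, query))
```

For real interferograms the quadruple loop would be replaced by an FFT-based correlation, and the features would be standardised before computing distances.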

    Redução de dimensionalidade para dados espectrais colineares (Dimensionality reduction for collinear spectral data)

    In data analysis, identifying the most relevant features for a given machine learning task can help build more accurate, robust, and explainable models. Although recent advances in neural networks, such as autoencoders and deep neural nets, have provided approaches that implicitly perform dimension reduction, they usually require large sample sizes and may not be explainable, which can restrict their applicability to several kinds of datasets. One such case is the analysis of spectroscopic data, which are characterised by collinear features (variables or wavelengths) and usually have fewer samples than features, thus suffering from the curse of dimensionality. Considering this setting, this thesis proposes feature selection methods applied to spectroscopic data with the goal of performing clustering, classification, and regression on datasets spanning different areas. The thesis comprises four articles: three applied research articles and one communication. In the first article, a feature importance index (FII) is proposed to select the most relevant wavelengths for clustering samples according to their similarities.
This FII is based on the combination of multidimensional scaling (for dimension reduction) and Procrustes analysis to derive a projection matrix. In the second article, with the goal of selecting features for a regression problem, another FII is derived from the weights of the projection matrix obtained through a Localized Sliced Inverse Regression (LSIR) dimension reduction. The third article, a communication related to a recently published article, points out design flaws in an experiment aiming to classify Raman spectra of blood plasma from COVID-positive patients and controls. This communication also established unbiased baselines for the fourth article, in which the Maximum Relevancy Minimum Redundancy (mRMR) algorithm for feature selection is improved to account for linear dependencies among the selected features. The proposed improvement, named PCA-mRMR, is applied to the same dataset as the third article for a classification task. In all three research articles, the proposed methods were compared against existing feature selection approaches and their performance was assessed.
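
As a point of reference for the mRMR algorithm named in this abstract, the greedy selection rule can be sketched as follows. This is plain mRMR, not the thesis's PCA-mRMR, and it makes a simplifying assumption: absolute Pearson correlation stands in for the relevance and redundancy measures (mutual information is the usual choice). The names `pearson` and `mrmr` and the toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr(columns, target, n_select):
    """Greedy mRMR: maximise relevance minus mean redundancy at each step."""
    relevance = [abs(pearson(col, target)) for col in columns]
    # Start from the single most relevant feature.
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < n_select:
        best, best_score = None, -math.inf
        for j in range(len(columns)):
            if j in selected:
                continue
            redundancy = sum(abs(pearson(columns[j], columns[s]))
                             for s in selected) / len(selected)
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


# Hypothetical toy "spectrum": columns 0 and 1 are nearly collinear,
# column 2 carries independent information about the target.
columns = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 7],
    [1, -2, 1, 1, -2, 1],
]
target = [2, 0, 4, 5, 3, 7]
print(mrmr(columns, target, 2))
```

In this toy case, ranking by relevance alone would keep the two collinear columns, whereas the redundancy penalty makes the second pick the independent column instead.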

    Pattern identification of biomedical images with time series: contrasting THz pulse imaging with DCE-MRIs

    Objective: We provide a survey of recent advances in biomedical image analysis and classification from emergent imaging modalities such as terahertz (THz) pulse imaging (TPI) and dynamic contrast-enhanced magnetic resonance images (DCE-MRIs) and identify their underlying commonalities. Methods: Both time and frequency domain signal pre-processing techniques are considered: noise removal, spectral analysis, principal component analysis (PCA) and wavelet transforms. Feature extraction and classification methods based on feature vectors using the above processing techniques are reviewed. A tensorial signal processing de-noising framework suitable for spatiotemporal association between features in MRI is also discussed. Validation: Examples where the proposed methodologies have been successful in classifying TPIs and DCE-MRIs are discussed. Results: Identifying commonalities in the structure of such heterogeneous datasets potentially leads to a unified multi-channel signal processing framework for biomedical image analysis. Conclusion: The proposed complex-valued classification methodology enables fusion of entire datasets from a sequence of spatial images taken at different time stamps; this is of interest from the viewpoint of inferring disease proliferation. The approach is also of interest for other emergent multi-channel biomedical imaging modalities and of relevance across the biomedical signal processing community.

    Performance-Based Quality Specifications: The Link between Product Development and Clinical Outcomes

    The design of drug delivery systems and their corresponding dosing guidelines are critical product development functions supported by clinical pharmacokinetic (PK) and pharmacodynamic (PD) data. Largely, the importance of variance and covariance in product and patient attributes is poorly understood. The existence of PK/PD diversity among myriad patient sub-populations further complicates efforts to gauge the importance of product quality variation. Nevertheless, a platform capable of evaluating the effects of product and patient variability on clinical performance was constructed. This dissertation was predicated on requests to re-define pharmaceutical quality in terms of risk by relating clinical attributes to production characteristics. To avoid in vivo studies, simulated experimental trials were conducted using the model drug, theophylline, for which data and models could be acquired from the literature. Where comprehensive data were unavailable (e.g., production variability statistics), initial estimates were acquired via laboratory-scale experiments. Model asthmatic patients were generated using Monte Carlo simulation and published population distributions of various anthropometric measurements, disease rates, and lifestyle factors. Mathematical constructs for in vitro-in vivo correlations provide a linkage between Quality by Design (QbD) product and process models, PK/PD models, and patient population statistics. The combined models formed the foundation for Monte Carlo risk assessments, which characterized the risk of inefficacy and toxicity for dosing of extended-release theophylline tablets. Sensitivity analyses revealed that patient compliance and content uniformity significantly influenced the probability of observing an adverse event. The Monte Carlo risk assessment platform defined the link between the critical quality attributes (CQAs) and clinical performance (i.e., performance-based quality specifications (PBQS)).
The PBQS were subsequently utilized to generate process independent design spaces conditioned on inefficacy and toxicity risk. These design spaces, which directly account for the conditional relationships between product quality and patient variability, can be transferred to a specific process via models that relate process critical control parameters to the CQAs. Process Analytical Technology, therefore, can be integrated into the QbD production environment to control the safety and efficacy of the final product. This work demonstrated that process and product knowledge can be used to estimate the risk that final product quality imparts to clinical performance
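
The idea of a Monte Carlo risk assessment that combines patient and product variability can be sketched in a few lines. This is an illustrative toy, not the dissertation's platform: the one-line steady-state model (average concentration = dose / (clearance x dosing interval), with complete bioavailability assumed), the 10 to 20 mg/L window, and every distribution and parameter value below are assumptions chosen only to make the example run. The function name `simulate_risk` is hypothetical.

```python
import math
import random


def simulate_risk(n_patients=10_000, seed=0):
    """Return (P(sub-therapeutic), P(toxic)) for a virtual population."""
    rng = random.Random(seed)
    low, high = 10.0, 20.0    # assumed therapeutic window, mg/L
    label_dose_mg = 300.0     # hypothetical label claim per tablet
    tau_h = 12.0              # hypothetical dosing interval, hours
    ineffective = toxic = 0
    for _ in range(n_patients):
        # Patient variability: clearance in L/h (hypothetical log-normal).
        clearance = rng.lognormvariate(math.log(2.0), 0.3)
        # Product variability: tablet content as a fraction of label claim.
        content = rng.gauss(1.0, 0.05)
        # Compliance: fraction of prescribed doses actually taken.
        compliance = rng.uniform(0.7, 1.0)
        # Average steady-state concentration, assuming complete bioavailability.
        css = label_dose_mg * content * compliance / (clearance * tau_h)
        if css < low:
            ineffective += 1
        elif css > high:
            toxic += 1
    return ineffective / n_patients, toxic / n_patients


p_ineffective, p_toxic = simulate_risk()
print(f"P(inefficacy) ~ {p_ineffective:.3f}, P(toxicity) ~ {p_toxic:.3f}")
```

Rerunning the simulation while varying one input distribution at a time (for example, the spread of `content` or the range of `compliance`) gives a crude version of the sensitivity analysis the abstract mentions.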

    Signal and data processing for machine olfaction and chemical sensing: A review

    Signal and data processing are essential elements in electronic noses as well as in most chemical sensing instruments. The multivariate responses obtained by chemical sensor arrays require signal and data processing to carry out the fundamental tasks of odor identification (classification), concentration estimation (regression), and grouping of similar odors (clustering). In the last decade, important advances have shown that proper processing can improve the robustness of the instruments against diverse perturbations, namely, environmental variables, background changes, drift, etc. This article reviews the advances made in recent years in signal and data processing for machine olfaction and chemical sensing

    Generalized Partial Least Squares Approach for Nominal Multinomial Logit Regression Models with a Functional Covariate

    Functional Data Analysis (FDA) has attracted substantial attention for the last two decades. Within FDA, classifying curves into two or more categories is consistently of interest to scientists, but multi-class prediction within FDA is challenged in that most classification tools have been limited to binary response applications. The functional logistic regression (FLR) model was developed to forecast a binary response variable in the functional case. In this study, a functional nominal multinomial logit regression (F-NM-LR) model was developed that shifts the FLR model into a multiple logit model. However, the model generates inaccurate parameter function estimates due to multicollinearity in the design matrix. A generalized partial least squares (GPLS) approach with cubic B-spline basis expansions was developed to address the multicollinearity and high dimensionality problems that preclude accurate estimates and curve discrimination with the F-NM-LR model. The GPLS method extends partial least squares (PLS) and improves upon current methodology by introducing a component selection criterion that reconstructs the parameter function with fewer predictors. The GPLS regression estimates are derived via Iteratively ReWeighted Partial Least Squares (IRWPLS), defining a set of uncorrelated latent variables to use as predictors for the F-GPLS-NM-LR model. This methodology was compared to the classic alternative estimation method of principal component regression (PCR) in a simulation study. The performance of the proposed methodology was tested via simulations and applications on a spectrometric dataset. The results indicate that the GPLS method performs well in multi-class prediction with respect to the F-NM-LR model. The main difference between the two approaches was that PCR usually requires more components than GPLS to achieve similar accuracy of parameter function estimates of the F-GPLS-NM-LR model. 
The results of this research imply that the GPLS method is preferable to the F-NM-LR model, and it is a useful contribution to FDA techniques. This method may be particularly appropriate for practical situations where accurate prediction of a response variable with fewer components is a priority

    Digital soil mapping of soil physical and chemical properties using proximal and remote sensed data in Australian cotton growing areas

    In Australian cotton-growing areas, information on soil physical and chemical properties is required because these properties determine soil structure, nutrient availability and water-holding capacity. However, using conventional laboratory methods to determine these properties is impractical as they are time-consuming and costly. This is especially the case when considering samples from different depths and across heterogeneous fields and districts. Thus, there is a need for efficient and affordable methods to enable data generation. To answer this need, digital soil mapping (DSM) can be used, in which limited laboratory-measured soil data are coupled with cheaper-to-acquire digital data through models, and the models and spatially exhaustive digital data are then used to predict soil properties at unsampled locations. This thesis evaluates DSM methods for the prediction of soil physical (e.g., clay content) and chemical (e.g., cation exchange capacity [CEC] and exchangeable [exch.] cations) properties at various depths across cotton-growing areas in south-eastern Australia, at field and district scales. Chapter 1 is the general introduction, where research problems are defined and research objectives are introduced. To point out gaps in the application of DSM to the prediction of soil properties, Chapter 2 comprehensively reviews DSM concepts, the applicability of proximally (e.g., electromagnetic induction (EM), visible near-infrared spectroscopy (vis-NIR)) or remotely (e.g., γ-ray spectrometer) sensed digital data for prediction of soil properties at various depths, and the modelling techniques. The first research chapter (Chapter 3) compares various strategies for building a vis-NIR spectral library for clay content prediction at two depths across seven cotton-growing areas using the Cubist model. The results show that the area-specific vis-NIR library achieves the best results. Model performance can be further improved using spiking.
Chapter 4 compares multivariate methods for estimating clay content and its uncertainty map at two depths, and evaluates the effect of weighted model averaging. The results show that the random forest (RF) model generally performs best and that model averaging can further improve prediction accuracy. Chapter 5 evaluates the potential of vis-NIR as a tool for the simultaneous prediction of soil physical and chemical properties across cotton-growing areas, considering two calibration models. The results show that satisfactory predictions of clay and CEC are achieved, that silt and sand predictions are moderate, and that predictions of pH and exchangeable sodium percentage (ESP) are unsatisfactory. A multi-depth vis-NIR library generally performs better than depth-specific libraries in predicting soil properties. Chapter 6 builds a topsoil (0 – 0.3 m) vis-NIR spectral library to predict topsoil exch. cations using four different calibration models and explores the applicability of the topsoil library to predicting exch. cations at greater depths, with and without spiking. The results show that vis-NIR can provide satisfactory predictions of exch. calcium and magnesium. The topsoil spectral library can be used to predict exch. cations at greater depths, with spiking further improving the result. Chapter 7 estimates the spatial variation of CEC at various depths using quasi-3D joint inversion of EM38 and EM31 data in an irrigated cotton field. The results indicate that the joint-inversion approach developed in this study can generate accurate 3D predictions of soil CEC in the cotton-growing field. This thesis explores DSM methods for the prediction of soil physical and chemical properties in Australian cotton-growing areas, and the results deliver new evidence of the potential to use proximally and remotely sensed digital data and state-of-the-art models for rapid and efficient generation of soil information.
The new findings will serve to advance existing knowledge on the application of DSM at field and district scales.