
    Feature Selection and Classifier Development for Radio Frequency Device Identification

    The proliferation of simple, low-cost devices such as IEEE 802.15.4 ZigBee and Z-Wave in Critical Infrastructure (CI) increases security concerns. Radio Frequency Distinct Native Attribute (RF-DNA) Fingerprinting facilitates biometric-like identification of electronic devices from variances in device hardware that manifest in their emissions. Developing reliable classifier models using RF-DNA fingerprints is thus important for device discrimination, enabling reliable Device Classification (a one-to-many "looks most like" assessment) and Device ID Verification (a one-to-one "looks how much like" assessment). AFIT's prior RF-DNA work focused on Multiple Discriminant Analysis/Maximum Likelihood (MDA/ML) and Generalized Relevance Learning Vector Quantized Improved (GRLVQI) classifiers. This work 1) introduces a new GRLVQI-Distance (GRLVQI-D) classifier that extends prior GRLVQI work by supporting alternative distance measures, 2) formalizes a framework for selecting competing distance measures for GRLVQI-D, 3) introduces response surface methods for optimizing GRLVQI and GRLVQI-D algorithm settings, 4) develops an MDA-based Loadings Fusion (MLF) Dimensional Reduction Analysis (DRA) method for improved classifier-based feature selection, 5) introduces the F-test as a DRA method for RF-DNA fingerprints, 6) provides a phenomenological understanding of test statistics and p-values, with KS-test and F-test statistic values proving superior to p-values for DRA, and 7) introduces quantitative dimensionality assessment methods for DRA subset selection.
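
    As an illustration of the F-test DRA idea above, the following minimal Python sketch ranks fingerprint features by their one-way ANOVA F-statistic across device classes, keeping the statistic rather than the p-value; the function and variable names are illustrative assumptions, not taken from the thesis.

        import numpy as np
        from scipy.stats import f_oneway

        def rank_features_by_f_stat(X, y):
            """Rank columns of X (n_samples x n_features) by the one-way
            ANOVA F-statistic computed across the device classes in y."""
            classes = np.unique(y)
            stats = []
            for j in range(X.shape[1]):
                groups = [X[y == c, j] for c in classes]
                f_stat, _ = f_oneway(*groups)  # keep the statistic, not the p-value
                stats.append(f_stat)
            return np.argsort(stats)[::-1]  # feature indices, best first

        # Hypothetical usage: keep the top-k ranked features as the DRA subset.
        # top_k = rank_features_by_f_stat(fingerprints, device_labels)[:k]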

    Genomic applications of statistical signal processing

    Biological phenomena in cells can be explained in terms of the interactions among biological macromolecules, e.g., DNAs, RNAs, and proteins. These interactions can be modeled by genetic regulatory networks (GRNs). This dissertation proposes to reverse engineer GRNs from heterogeneous biological data sets, including time-series and time-independent gene expression, Chromatin Immunoprecipitation (ChIP) data, gene sequences and motifs, and other possible sources of knowledge. The objective of this research is to propose novel computational methods that keep pace with the fast-evolving biological databases. Signal processing techniques are exploited to develop computationally efficient, accurate, and robust algorithms that deal individually or collectively with the various data sets. Methods of power spectral density estimation are discussed to identify genes participating in various biological processes. Information-theoretic methods are applied for non-parametric inference. Bayesian methods are adopted to incorporate several sources of prior knowledge. This work aims to construct an inference system that takes into account different sources of information, such that the absence of some components will not interfere with the rest of the system. It has been verified that the proposed algorithms achieve better inference accuracy and higher computational efficiency than other state-of-the-art schemes, e.g., REVEAL, ARACNE, Bayesian Networks, and Relevance Networks, in the presence of artificial time series and steady-state microarray measurements. The proposed algorithms are especially appealing when the sample size is small. Moreover, they are able to integrate multiple heterogeneous data sources, e.g., ChIP and sequence data, so that a unified GRN can be inferred. The analysis of biological literature and in silico experiments on real data sets for fruit fly, yeast, and human have corroborated part of the inferred GRN. The research has also produced a set of potential control targets for designing gene therapy strategies.
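
    Purely as a sketch of the power-spectral-density step mentioned above, one might flag periodically expressed genes (e.g., cell-cycle genes) by the share of power concentrated at the dominant frequency; the names and the threshold below are illustrative assumptions, not the dissertation's estimator.

        import numpy as np
        from scipy.signal import periodogram

        def dominant_peak_ratio(expression, fs=1.0):
            """Ratio of the largest PSD peak to the total power of a gene's
            time series; a high ratio suggests a strong periodic component."""
            freqs, psd = periodogram(expression, fs=fs, detrend="linear")
            psd = psd[1:]  # drop the zero-frequency (DC) bin
            return psd.max() / psd.sum()

        # Hypothetical usage: score each gene's time series and threshold.
        # periodic = [g for g, ts in genes.items() if dominant_peak_ratio(ts) > 0.5]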

    Machine learning methods for pattern analysis and clustering

    Ph.D. (Doctor of Philosophy)

    Selection and Clustering for Disease Associated Genetic Variants

    Epigenetics is the study of chemical reactions that are orchestrated for the development and maintenance of an organism. Genetic or epigenetic variants (GEVs) encompass different types of genetic measures, such as Deoxyribonucleic acid (DNA) methylation at different CpG sites, expression levels of genes, or single nucleotide polymorphisms (SNPs). With the development of technology, a huge amount of genetic and epigenetic information is produced. However, this rich information brings challenges to data analyses, so it is necessary to reduce the dimension of the data to improve efficiency. In my dissertation, I focus on two directions of dimension reduction: variable selection and clustering. The first project on dimension reduction was motivated by an epigenetic study aiming to identify GEVs that are associated with a health outcome. Because of potential non-linear interactions between GEVs, we designed a backward variable selection procedure to select informative GEVs. It is built upon a reproducing kernel-based method for evaluating the joint effect of a set of GEVs, e.g., a set of CpG sites; these GEVs may interact with each other in unknown and complex ways. Simulation studies indicate that the selection method is robust to different types of interaction effects, linear or non-linear. We demonstrate the method on two data sets, the first selecting important SNPs associated with lung function and the second identifying important CpG sites whose methylation is jointly associated with active smoking measured by cotinine levels. The second project was motivated by the potential heterogeneity in clusters identified by existing methods. Traditional approaches focus on clustering either subjects or (response) variables; however, clusters formed through these approaches may lack homogeneity. To improve the quality of clusters, we propose a joint clustering method. Specifically, the variables are first clustered based on the agreement of (unknown) relationships between variable measures and covariates of interest, and then subjects within each variable cluster are further clustered to form refined joint clusters. A Bayesian method is proposed for this purpose, in which a semi-parametric model is used to evaluate any unknown relationship between variables and covariates of interest, and a Dirichlet process is utilized in the second-step subject clustering. The proposed method is able to produce homogeneous clusters composed of subjects sharing common features in the relationship between some (response) variables and covariates. Simulation studies are used to examine the performance and efficiency of the proposed method. The method is then applied to DNA methylation measures of multiple CpG sites.
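
    The backward selection idea can be sketched as follows, with a crude HSIC-style kernel association score standing in for the reproducing kernel-based statistic developed in the dissertation; the score, the stopping rule, and all names here are illustrative assumptions.

        import numpy as np

        def rbf_kernel_score(X, y, gamma=1.0):
            """Crude HSIC-style association between an RBF kernel on X and y."""
            n = len(y)
            d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            K = np.exp(-gamma * d2)
            H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
            L = np.outer(y - y.mean(), y - y.mean())   # linear kernel on y
            return np.trace(K @ H @ L @ H) / (n - 1) ** 2

        def backward_select(X, y, min_vars=1):
            """Drop, one at a time, the GEV whose removal hurts the joint
            association score the least; stop once any removal hurts."""
            keep = list(range(X.shape[1]))
            while len(keep) > min_vars:
                current = rbf_kernel_score(X[:, keep], y)
                trials = {j: rbf_kernel_score(X[:, [k for k in keep if k != j]], y)
                          for j in keep}
                j_best = max(trials, key=trials.get)
                if trials[j_best] < current:
                    break
                keep.remove(j_best)
            return keep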

    Fusion features ensembling models using Siamese convolutional neural network for kinship verification

    Family is one of the most important entities in the community. Mining genetic information from facial images is increasingly being utilized in a wide range of real-world applications, making family-member tracing and kinship analysis remarkably easy, inexpensive, and fast compared with Deoxyribonucleic acid (DNA) profiling. However, the prospects for building reliable kinship recognition models still suffer from insufficient determination of familial features, unstable reference cues of kinship, and the genetic factors influencing family features. This research proposes enhanced methods for extracting and selecting effective familial features that provide evidence of kinship, improving kinship verification accuracy from visual facial images. First, the Convolutional Neural Network based on Optimized Local Raw Pixels Similarity Representation (OLRPSR) method is developed to improve accuracy by generating a new matrix representation that removes irrelevant information. Second, the Siamese Convolutional Neural Network and Fusion of the Best Overlapping Blocks (SCNN-FBOB) is proposed to track and identify the most informative kinship clues in order to achieve higher accuracy. Third, the Siamese Convolutional Neural Network and Ensembling Models Based on Selecting Best Combination (SCNN-EMSBC) is introduced to overcome the weak performance of individual images and classifiers. To evaluate the proposed methods, a series of experiments is conducted on two popular kinship benchmark databases, KinFaceW-I and KinFaceW-II, and the results are compared against state-of-the-art algorithms from the literature. The SCNN-EMSBC method achieves promising results, with average accuracies of 92.42% and 94.80% on KinFaceW-I and KinFaceW-II, respectively. These results significantly improve kinship verification performance and outperform the state-of-the-art algorithms for visual image-based kinship verification.
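
    A generic Siamese network of the kind referenced above can be sketched in a few lines of PyTorch; this is a minimal stand-in, not the OLRPSR, SCNN-FBOB, or SCNN-EMSBC architectures themselves, and all layer sizes are assumptions.

        import torch
        import torch.nn as nn

        class SiameseCNN(nn.Module):
            """Two face images pass through one shared encoder; a small
            embedding distance suggests the pair is kin."""
            def __init__(self, embed_dim=128):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                    nn.Linear(64 * 4 * 4, embed_dim),
                )

            def forward(self, img_a, img_b):
                za, zb = self.encoder(img_a), self.encoder(img_b)
                return torch.norm(za - zb, dim=1)  # pairwise embedding distance

        # Hypothetical usage: train with a contrastive loss on labelled
        # (parent, child) pairs, then threshold the distance for kin / not-kin.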

    Biometrics

    Biometrics uses methods for the unique recognition of humans based upon one or more intrinsic physical or behavioral traits. In computer science in particular, biometrics is used as a form of identity access management and access control. It is also used to identify individuals in groups that are under surveillance. The book consists of 13 chapters, each focusing on a certain aspect of the problem, divided into three sections: physical biometrics, behavioral biometrics, and medical biometrics. The key objective of the book is to provide a comprehensive reference and text on human authentication and people identity verification from physiological, behavioural, and other points of view. It aims to publish new insights into current innovations in computer systems and technology for biometrics development and its applications. The book was reviewed by the editor, Dr. Jucheng Yang, and by many of the guest editors, including Dr. Girija Chetty, Dr. Norman Poh, Dr. Loris Nanni, Dr. Jianjiang Feng, Dr. Dongsun Park, and Dr. Sook Yoon, who also made significant contributions to the book.

    Sparse machine learning models in bioinformatics

    The meaning of parsimony is twofold in machine learning: the structure and/or the parameters of a model can be sparse. Sparse models have many strengths. First, sparsity is an important regularization principle for reducing model complexity and therefore avoiding overfitting. Second, in many fields, for example bioinformatics, high-dimensional data may be generated by a small number of hidden factors, so it is more reasonable to use a proper sparse model than a dense one. Third, a sparse model is often easy to interpret. In this dissertation, we investigate sparse machine learning models and their applications in high-dimensional biological data analysis. We focus our research on five types of sparse models, as follows. First, sparse representation is a parsimonious principle under which a sample can be approximated by a sparse linear combination of basis vectors. We explore existing sparse representation models and propose our own sparse representation methods for high-dimensional biological data analysis. We derive different sparse representation models from a Bayesian perspective. Two generic dictionary learning frameworks are proposed, and kernel and supervised dictionary learning approaches are devised. Furthermore, we propose fast active-set and decomposition methods for the optimization of sparse coding models. Second, gene-sample-time data are promising in clinical studies but challenging in computation. We propose sparse tensor decomposition methods and kernel methods for the dimensionality reduction and classification of such data. As extensions of matrix factorization, tensor decomposition techniques can reduce the dimensionality of gene-sample-time data dramatically, and the kernel methods can run very efficiently on such data. Third, we explore two sparse regularized linear models for multi-class problems in bioinformatics. Our first method is a nearest-border classification technique for data with many classes; our second is a hierarchical model that can simultaneously select features and classify samples. Our experiment on breast tumor subtyping shows that this model outperforms the one-versus-all strategy in some cases. Fourth, we propose spectral clustering approaches for clustering microarray time-series data. The approaches are based on two recently introduced transformations designed specifically for gene expression time-series data, namely alignment-based and variation-based transformations. Both transformations were devised to take temporal relationships in the data into account, and both have been shown to increase the ability of a clustering method to detect co-expressed genes. We investigate the performance of these transformations, combined with spectral clustering, on two microarray time-series data sets and discuss their strengths and weaknesses. Our experiments on two well-known real-life data sets show the superiority of the alignment-based over the variation-based transformation for finding meaningful groups of co-expressed genes. Fifth, we propose the max-min high-order dynamic Bayesian network (MMHO-DBN) learning algorithm to reconstruct time-delayed gene regulatory networks. Because of the small sample size of the training data and the power-law nature of gene regulatory networks, the structure of the network is restricted by sparsity. We also apply qualitative probabilistic networks (QPNs) to interpret the interactions learned. Our experiments on both synthetic and real gene expression time-series data show that MMHO-DBN obtains better precision than some existing methods and runs very fast. The QPN analysis can accurately predict types of influences and synergies. Additionally, since much high-dimensional biological data is subject to missing values, we survey various strategies for learning models from incomplete data. We extend existing imputation methods, originally designed for two-way data, to gene-sample-time data, and we propose a pair-wise weighting method for computing kernel matrices from incomplete data. Computational evaluations show that both approaches work very robustly.
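
    To make the sparse-representation principle concrete, here is a minimal sketch that codes a sample as a sparse linear combination of dictionary atoms via an L1-penalised fit; it uses scikit-learn's Lasso as a stand-in for the dissertation's own active-set and Bayesian solvers, and the names are illustrative.

        import numpy as np
        from sklearn.linear_model import Lasso

        def sparse_code(D, x, alpha=0.1):
            """Solve min_w 0.5*||x - D w||^2 + alpha*||w||_1, where D is an
            (n_features, n_atoms) dictionary and x an (n_features,) sample."""
            model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000)
            model.fit(D, x)
            return model.coef_  # mostly zeros: the sparse code w

        # Hypothetical usage: reconstruct x from a handful of atoms.
        # w = sparse_code(D, x); x_hat = D @ w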

    Univariate and multivariate statistical approaches for the analyses of omics data: sample classification and two-block integration.

    The wealth of information generated by high-throughput omics technologies in the context of large-scale epidemiological studies has made a significant contribution to the identification of factors influencing the onset and progression of common diseases. Advanced computational and statistical modelling techniques are required to manipulate and extract meaningful biological information from these omics data, as several layers of complexity are associated with them. Recent research efforts have concentrated on the development of novel statistical and bioinformatic tools; however, studies thoroughly investigating the applicability and suitability of these novel methods on real data have often fallen behind. This thesis focuses on the analyses of proteomics and transcriptomics data from the EnviroGenoMarker project with the purpose of addressing two main research objectives: i) to critically appraise established and recently developed statistical approaches in their ability to appropriately accommodate the inherently complex nature of real-world omics data, and ii) to improve the current understanding of a prevalent condition by identifying biological markers predictive of disease as well as possible biological mechanisms leading to its onset. The specific disease endpoint of interest is B-cell Lymphoma, a common haematological malignancy for which many challenges related to its aetiology remain unanswered. The seven chapters comprising this thesis are structured as follows: the first two are introductory chapters describing the main omics technologies and the statistical methods employed for their analyses; the third chapter describes the epidemiological project giving rise to the study population and the disease outcome of interest. These are followed by three results chapters that address the research aims described above by applying univariate and multivariate statistical approaches for sample classification and data integration purposes. A summary of findings, concluding general remarks, and a discussion of open problems offering potential avenues for future research are presented in the final chapter.
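
    As one concrete, assumed (not thesis-specific) instance of two-block integration, canonical correlation analysis finds paired latent components capturing covariation between, say, a proteomics block and a transcriptomics block measured on the same samples:

        import numpy as np
        from sklearn.cross_decomposition import CCA

        def integrate_two_blocks(X_prot, X_trans, n_components=2):
            """Fit CCA on two sample-matched omics blocks and return the
            paired latent scores plus their per-component correlations."""
            cca = CCA(n_components=n_components)
            U, V = cca.fit_transform(X_prot, X_trans)
            corrs = [np.corrcoef(U[:, k], V[:, k])[0, 1]
                     for k in range(n_components)]
            return U, V, corrs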