15,656 research outputs found

    Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods

    Full text link
    Feature extraction and dimensionality reduction are important tasks in many fields of science dealing with signal processing and analysis. The relevance of these techniques is increasing as current sensory devices are developed with ever higher resolution, and problems involving multimodal data sources become more common. A plethora of feature extraction methods is available in the literature, collectively grouped under the field of Multivariate Analysis (MVA). This paper provides a uniform treatment of several methods: Principal Component Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis (CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions derived by means of the theory of reproducing kernel Hilbert spaces. We also review their connections to other methods for classification and statistical dependence estimation, and introduce some recent developments to deal with the extreme cases of large-scale and small-sample problems. To illustrate the wide applicability of these methods in both classification and regression problems, we analyze their performance on a benchmark of publicly available data sets, paying special attention to real applications involving audio processing for music genre prediction and hyperspectral satellite images for Earth and climate monitoring.
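
    As a rough illustration of the linear/kernel duality this tutorial covers, the sketch below contrasts linear PCA with its RBF-kernel counterpart in scikit-learn. The toy dataset, kernel choice, and parameters are illustrative assumptions, not taken from the paper's benchmark.

```python
# Minimal sketch: linear PCA vs. kernel PCA on a toy nonlinear dataset.
# scikit-learn is assumed; the circles data and RBF kernel are illustrative.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA: rotates the data but cannot separate concentric classes.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel: implicit mapping to an RKHS where the
# two rings become separable along the leading component.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

print("gap between class means on first kernel component:",
      X_kpca[y == 0, 0].mean() - X_kpca[y == 1, 0].mean())
```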

    Simultaneous clustering with mixtures of factor analysers

    Get PDF
    This work details the method of Simultaneous Model-Based Clustering and extends it by reformulating it as a model with a mixture of factor analysers. This extension, Simultaneous Model-Based Clustering with a Mixture of Factor Analysers, makes it possible to cluster high-dimensional gene-expression data. A new table of allowable and non-allowable models is formulated, along with a parameter estimation scheme for one such allowable model. Several numerical procedures are tested, and various datasets, both real and generated, are clustered. Clustering the Iris data finds a 3-component VEV model to have the lowest misclassification rate, with BIC values comparable to the best-scoring model. Clustering of the genetic data was less successful: the 2-component model uncovered the healthy tissue but partitioned the cancerous tissue in half.
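
    For readers unfamiliar with the model-based clustering setup, here is a minimal stand-in sketch. scikit-learn has no mixture-of-factor-analysers or mclust-style VEV covariance model, so a plain full-covariance Gaussian mixture on the Iris data is used purely to illustrate how the misclassification rate and BIC reported above are obtained.

```python
# Stand-in sketch: 3-component Gaussian mixture clustering of Iris.
# Not the thesis's MFA/VEV model, just the same evaluation workflow.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

# Map each cluster to its majority true class before scoring.
mapped = np.zeros_like(labels)
for k in range(3):
    mapped[labels == k] = np.bincount(y[labels == k]).argmax()

misclassification = (mapped != y).mean()
print(f"misclassification rate: {misclassification:.3f}, BIC: {gmm.bic(X):.1f}")
```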

    Statistical Methods for Integrative Analysis, Subgroup Identification, and Variable Selection Using Cancer Genomic Data

    Get PDF
    In recent years, comprehensive cancer genomics platforms, such as The Cancer Genome Atlas (TCGA), have provided access to an enormous amount of high-throughput genomic data for each patient, including gene expression, DNA copy number alteration, DNA methylation, and somatic mutation. Most existing analysis approaches focus only on gene-level analysis and suffer from limited interpretability and low reproducibility of findings. Additionally, with the increasing availability of modern compositional data, including immune cellular fraction data and high-dimensional zero-inflated microbiome data, variable selection techniques for compositional data have become of great interest because they allow inference of key immune cell types (immunology data) and key microbial species (microbiome data) associated with the development and progression of various diseases. In the first dissertation aim, we address these challenges by developing a Bayesian sparse latent factor model for pathway-guided integrative genomic data analysis. Specifically, we constructed a unified framework to simultaneously identify cancer patient subgroups (clustering) and key molecular markers (variable selection) based on the joint analysis of continuous, binary and count data. In addition, we applied Polya-Gamma mixtures of normals for binary and count data to enable exact and fully automatic posterior sampling. Moreover, pathway information was used to improve accuracy and robustness in the identification of cancer patient subgroups and key molecular features. In the second dissertation aim, we developed the R package InGRiD, comprehensive software for pathway-guided integrative genomic data analysis; the statistical model developed in Aim 1 is implemented and provided as part of this software. The third dissertation aim exploits variable selection in compositional data analysis with application to immunology and microbiome data. Specifically, we identified key immune cell types by applying a stepwise pairwise log-ratio procedure to the immune cellular fraction data, and selected key species in the microbiome data using a zero-inflated Wilcoxon rank-sum test. These approaches account for key features of these data types, such as compositionality (i.e., sum-to-one), zero inflation, and high dimensionality, among others. The proposed methods were developed and evaluated on: 1) large-scale, high-dimensional, multi-modal datasets from the TCGA database, including gene expression, DNA copy number alteration, and somatic mutation data (Aim 1); 2) cellular fraction data derived from the Colorectal Adenocarcinoma TCGA Pan-Cancer study (Aim 3); and 3) high-dimensional zero-inflated microbiome data from studies of colorectal cancer (Aim 3).
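
    To make the compositional-data idea concrete, the sketch below screens pairwise log-ratios of simulated compositions between two groups with a standard Wilcoxon rank-sum test. This is not the authors' stepwise procedure or their zero-inflated test; it only illustrates the basic building block of log-ratio analysis on sum-to-one data.

```python
# Illustrative pairwise log-ratio screen for compositional data (simulated).
import numpy as np
from itertools import combinations
from scipy.stats import ranksums

rng = np.random.default_rng(0)
n, p = 60, 5
X = rng.dirichlet(np.ones(p), size=n)        # compositions sum to one
group = rng.integers(0, 2, size=n)           # two hypothetical subgroups
X[group == 1, 0] *= 2                        # shift one component in group 1
X = X / X.sum(axis=1, keepdims=True)         # re-close to the simplex

for i, j in combinations(range(p), 2):
    lr = np.log(X[:, i] / X[:, j])           # pairwise log-ratio feature
    stat, pval = ranksums(lr[group == 0], lr[group == 1])
    print(f"log(x{i}/x{j}): p = {pval:.4f}")
```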

    Power system stability scanning and security assessment using machine learning

    Get PDF
    Future grid planning requires a major departure from conventional power system planning, in which only a handful of the most critical scenarios are analyzed. To account for a wide range of possible future evolutions, scenario analysis has been proposed in many industries. As opposed to conventional power system planning, where the aim is to find an optimal transmission and/or generation expansion plan for an existing grid, the aim of future grid scenario analysis is to analyze possible evolution pathways to inform power system planning and policy making. Future grid planning may therefore involve a large number of scenarios, for which the existing planning tools may no longer be suitable. Beyond these planning issues, the operation of future grids using conventional tools is also challenged by new features such as intermittent generation, demand response, and fast-responding power electronic plant, which lead to much more diverse operating conditions than in existing networks. Among all operational issues, monitoring the stability and security of a power system and acting with deliberate preventive or remedial adjustments is of vital importance. On-line Dynamic Security Assessment (DSA) can evaluate the security of a power system almost instantly once current or imminent operating conditions are supplied. The focus of this dissertation is, for future grid planning, to develop a framework using Machine Learning (ML) to effectively assess the security of future grids by analyzing a large number of scenarios; and, for future grid operation, to propose ML-based approaches that address the technical issues raised by future grids' diverse operating conditions. Unsupervised, supervised, and semi-supervised learning techniques are utilized in a set of proposed planning and operational security assessment tools.
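
    The core supervised-learning pattern behind ML-based security assessment is sketched below: a classifier is trained offline on labeled operating conditions so that security can be predicted almost instantly online. The features, labels, and random-forest choice are assumptions for illustration, not the dissertation's actual tools.

```python
# Toy sketch of supervised security assessment (synthetic data throughout).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                       # operating-point features
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 1).astype(int)    # synthetic secure/insecure label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))    # near-instant online prediction
```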

    Graphical Models for Representing Complex Association Structures of Functioning Data

    Get PDF

    Measurement in marketing

    Get PDF
    We distinguish three senses of the concept of measurement (measurement as the selection of observable indicators of theoretical concepts, measurement as the collection of data from respondents, and measurement as the formulation of measurement models linking observable indicators to latent factors representing the theoretical concepts), and we review important issues related to measurement in each of these senses. With regard to measurement in the first sense, we distinguish the steps of construct definition and item generation, and we review scale development efforts reported in three major marketing journals since 2000 to illustrate these steps and derive practical guidelines. With regard to measurement in the second sense, we look at the survey process from the respondent's perspective and discuss the goals that may guide participants' behavior during a survey, the cognitive resources that respondents devote to answering survey questions, and the problems that may occur at the various steps of the survey process. Finally, with regard to measurement in the third sense, we cover both reflective and formative measurement models, and we explain how researchers can assess the quality of measurement in both types of measurement models and how they can ascertain the comparability of measurements across different populations of respondents or conditions of measurement. We also provide a detailed empirical example of measurement analysis for reflective measurement models.
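
    One common quality check for a reflective measurement model is internal-consistency reliability. The sketch below computes Cronbach's alpha on simulated item responses; a full analysis like the paper's would also fit a factor model, and the data here are purely illustrative.

```python
# Minimal sketch: Cronbach's alpha for a reflective multi-item scale.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scale scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                 # latent construct scores
items = latent + 0.6 * rng.normal(size=(200, 4))   # 4 reflective indicators
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
```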

    Advanced and novel modeling techniques for simulation, optimization and monitoring chemical engineering tasks with refinery and petrochemical unit applications

    Get PDF
    Engineers predict, optimize, and monitor processes to improve safety and profitability. Models automate these tasks and determine precise solutions. This research studies and applies advanced and novel modeling techniques to automate and aid engineering decision-making. Advances in computational power have improved modeling software's ability to mimic industrial problems, and simulations are increasingly used to explore new operating regimes and design new processes. In this work, we present a methodology for creating structured mathematical models, useful tips to simplify models, and a novel repair method that improves convergence by populating quality initial conditions for the simulation's solver. A crude oil refinery application is presented, including the simulation, simplification tips, and the implementation of the repair strategy. A crude oil scheduling problem, which can be integrated with production unit models, is also presented. Recently, stochastic global optimization (SGO) has shown success in finding global optima for complex nonlinear processes. When performing SGO on simulations, model convergence can become an issue. The computational load can be decreased by 1) simplifying the model and 2) creating a synergy between the model solver's repair strategy and the optimization routine, using the formulated initial conditions as points from which to perturb the neighborhood being searched. Here, a simplification technique merging the crude oil scheduling problem with vertically integrated online refinery production optimization is demonstrated, and a stochastic global optimization technique is employed to optimize refinery production. Process monitoring has been vastly enhanced by the data-driven modeling technique Principal Component Analysis (PCA). As opposed to first-principles models, which make assumptions about the structure of the model describing the process, data-driven techniques make no assumptions about the underlying relationships; they search for a projection of the data into a space that is easier to analyze. Feature extraction techniques, commonly dimensionality reduction techniques, have been explored intensively to better capture nonlinear relationships and can extend data-driven process monitoring to nonlinear processes. Here, we employ a novel nonlinear process-monitoring scheme that utilizes Self-Organizing Maps. The novel techniques and implementation methodology are applied to the publicly studied Tennessee Eastman Process and an industrial polymerization unit.
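
    As background for the data-driven monitoring discussed above, the sketch below shows classic PCA-based monitoring: fit PCA on normal operating data, then score new samples with Hotelling's T-squared (variation inside the model) and SPE/Q (residual outside it). The data and fault are simulated, and the dissertation's SOM-based scheme is not reproduced here.

```python
# Minimal sketch of PCA-based process monitoring with T^2 and SPE statistics.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))          # normal operating data
pca = PCA(n_components=3).fit(X_train)

def monitor(x_new: np.ndarray):
    scores = pca.transform(x_new)
    t2 = (scores ** 2 / pca.explained_variance_).sum(axis=1)  # Hotelling T^2
    residual = x_new - pca.inverse_transform(scores)
    spe = (residual ** 2).sum(axis=1)                         # SPE / Q statistic
    return t2, spe

# A simulated fault: a step change in the last measured variable.
x_fault = rng.normal(size=(5, 8)) + np.array([0, 0, 0, 0, 0, 0, 0, 4.0])
t2, spe = monitor(x_fault)
print("T2:", np.round(t2, 2), "SPE:", np.round(spe, 2))
```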