
    Data Analysis of Expression with Gene Microarray and Investigation for Gene Regulatory Networks

    Gene microarray expression data can reveal gene activity under a wide range of conditions as well as the interactions between genes, so the analysis of such data has very broad application prospects. In recent years, as microarray expression data have grown rapidly, traditional analysis methods have become unable to keep pace with the rapid development of modern biology, and more efficient methods for analyzing and processing these data are urgently needed. This dissertation first reviews the key techniques of microarray expression data analysis and their research progress, and identifies problems with existing methods: insufficient ability to handle redundant and noisy data, results that lack biological interpretation, and limited accuracy in identifying regulatory networks. To address these problems, we study three new methods for microarray expression data analysis: feature selection, dimensionality reduction, and reconstruction of gene regulatory networks, providing solutions for building classification models to aid diagnosis and for identifying disease-related genes... Degree: Doctor of Engineering. Department: School of Information Science and Technology, Circuits and Systems. Student ID: 2312009015369
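    The abstract above names its three method families but is cut off before describing them. For orientation only, a generic baseline for the first two steps it mentions, filter-style feature selection followed by dimensionality reduction on an expression matrix, might look like the sketch below; the simulated data, the ANOVA filter, and PCA are illustrative stand-ins, not the dissertation's methods.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Simulated expression matrix: 60 samples (arrays) x 2000 genes, two classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2000))
y = np.repeat([0, 1], 30)
X[y == 1, :20] += 1.0  # only 20 genes are differentially expressed

# Step 1: filter-style feature selection (ANOVA F-score, a common baseline).
selector = SelectKBest(f_classif, k=50).fit(X, y)
X_sel = selector.transform(X)  # shape (60, 50)

# Step 2: dimensionality reduction on the selected genes.
X_red = PCA(n_components=5).fit_transform(X_sel)  # shape (60, 5)
print(X_red.shape)
```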

    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features change the paradigm for statistical and computational methods as well as for computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set, and point out that the exogeneity assumptions underlying most statistical methods for Big Data cannot be validated due to incidental endogeneity; they can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
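    The spurious-correlation phenomenon highlighted above is easy to reproduce: hold the sample size fixed and let the dimensionality grow, and the maximum absolute sample correlation between the response and completely independent predictors climbs steadily. A minimal simulation (the sample size and dimension grid are chosen arbitrarily, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # sample size, held fixed

for p in (10, 100, 1000, 10000):
    y = rng.standard_normal(n)        # response
    X = rng.standard_normal((n, p))   # predictors, independent of y
    yc = (y - y.mean()) / y.std()
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = yc @ Xc / n                # sample correlation with each column
    print(f"p={p:>6}: max |corr| with pure noise = {np.abs(corr).max():.3f}")
```

    Every column of X is pure noise, yet the largest observed correlation grows with p, which is why variable screening by marginal correlation becomes unreliable in high dimensions without corrections.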

    Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification

    We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. Comment: 30 pages, 2 figures
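    The two-stage structure of FANS, nonparametric marginal density-ratio estimates followed by penalized logistic regression on the transformed features, can be sketched in a few lines. This is a single-split toy version under assumed choices (kernel density estimation, a fixed bandwidth, and an L1 penalty level); the paper's actual procedure involves tuning and sample-splitting details not shown here.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def transform(X, kdes):
    """Augmented features: log f1_j(x_j) - log f0_j(x_j) for each coordinate j."""
    cols = [kde1.score_samples(X[:, j:j+1]) - kde0.score_samples(X[:, j:j+1])
            for j, (kde0, kde1) in enumerate(kdes)]
    return np.column_stack(cols)

def fans_fit(X, y, bandwidth=0.5, C=1.0):
    """Single-split FANS sketch: estimate marginal class-conditional densities
    on one half of the data, then fit an L1-penalized logistic regression on
    the log density-ratio transforms of the other half."""
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5,
                                      random_state=0, stratify=y)
    kdes = [(KernelDensity(bandwidth=bandwidth).fit(X1[y1 == 0, j:j+1]),
             KernelDensity(bandwidth=bandwidth).fit(X1[y1 == 1, j:j+1]))
            for j in range(X.shape[1])]
    clf = LogisticRegression(penalty="l1", solver="liblinear",
                             C=C).fit(transform(X2, kdes), y2)
    return kdes, clf

# Toy demo: two classes whose marginal densities differ in every coordinate.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 20)), rng.normal(0.7, 1.3, (100, 20))])
y = np.repeat([0, 1], 100)
kdes, clf = fans_fit(X, y)
print("training accuracy:", clf.score(transform(X, kdes), y))
```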

    Sparse Bilinear Logistic Regression

    In this paper, we introduce the concept of sparse bilinear logistic regression for decision problems involving explanatory variables that are two-dimensional matrices. Such problems are common in computer vision, brain-computer interfaces, style/content factorization, and parallel factor analysis. The underlying optimization problem is bi-convex; we study its solution and develop an efficient algorithm based on block coordinate descent. We provide a theoretical guarantee for global convergence and estimate the asymptotic convergence rate using the Kurdyka-{\L}ojasiewicz inequality. A range of experiments with simulated and real data demonstrate that sparse bilinear logistic regression outperforms current techniques in several important applications. Comment: 27 pages, 5 figures
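    The bi-convexity noted above is what makes block coordinate descent natural: with v fixed, u'Xv is linear in u with per-sample feature vector Xv, and symmetrically for v, so each block update reduces to an ordinary L1-penalized logistic fit. The sketch below is a naive alternating version under assumed settings (penalty level, iteration count, random initialization); the paper's algorithm and its Kurdyka-Łojasiewicz convergence analysis are more refined.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sparse_bilinear_logreg(Xs, y, n_iter=20, C=0.5, seed=0):
    """Alternating sketch of P(y=1|X) = sigmoid(u' X v + b) with L1 penalties
    on u and v; Xs has shape (N, m, n). Each block update is a plain
    L1-penalized logistic regression on the induced linear features."""
    N, m, n = Xs.shape
    v = np.random.default_rng(seed).standard_normal(n)
    for _ in range(n_iter):
        Fu = Xs @ v                                  # (N, m) features for u
        fit_u = LogisticRegression(penalty="l1", solver="liblinear",
                                   C=C).fit(Fu, y)
        u = fit_u.coef_.ravel()
        Fv = np.einsum("sij,i->sj", Xs, u)           # (N, n) features for v
        fit_v = LogisticRegression(penalty="l1", solver="liblinear",
                                   C=C).fit(Fv, y)
        v = fit_v.coef_.ravel()
    return u, v, fit_v.intercept_[0]

# Toy check: a planted rank-one signal u0' X v0 drives the labels.
rng = np.random.default_rng(1)
Xs = rng.standard_normal((300, 8, 10))
u0 = np.zeros(8); u0[:2] = 1.0
v0 = np.zeros(10); v0[:2] = 1.0
scores = np.einsum("sij,i,j->s", Xs, u0, v0)
y = (scores + 0.3 * rng.standard_normal(300) > 0).astype(int)
u, v, b = sparse_bilinear_logreg(Xs, y)
print(np.round(u, 2), np.round(v, 2))  # support should concentrate on u0, v0
```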

    Censored Data Regression in High-Dimension and Low-Sample Size Settings For Genomic Applications

    New high-throughput technologies are generating various types of high-dimensional genomic and proteomic data and meta-data (e.g., networks and pathways) in order to obtain a systems-level understanding of various complex diseases such as human cancers and cardiovascular diseases. As the amount and complexity of the data increase and as the questions being addressed become more sophisticated, we face the great challenge of how to model such data in order to draw valid statistical and biological conclusions. One important problem in genomic research is to relate these high-throughput genomic data to various clinical outcomes, including possibly censored survival outcomes such as age at disease onset or time to cancer recurrence. We review some recently developed methods for censored data regression in the high-dimension and low-sample size setting, with emphasis on applications to genomic data. These methods include dimension reduction-based methods, regularized estimation methods such as the Lasso and the threshold gradient descent method, gradient descent boosting methods, and nonparametric pathways-based regression models. These methods are demonstrated and compared by analysis of a data set of microarray gene expression profiles of 240 patients with diffuse large B-cell lymphoma together with follow-up survival information. Areas of further research are also presented.
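    Of the regularized estimation methods reviewed, the Lasso-penalized Cox model is probably the most common entry point for relating expression profiles to censored survival times. A minimal sketch on simulated data, using the lifelines package as one possible implementation (the package choice, column names, and penalty value are our assumptions, not the review's; the DLBCL data set analyzed in the paper is not reproduced here):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Simulated stand-in: 120 patients, 40 genes, right-censored survival times.
rng = np.random.default_rng(0)
n, p = 120, 40
genes = pd.DataFrame(rng.standard_normal((n, p)),
                     columns=[f"gene_{j}" for j in range(p)])
risk = genes["gene_0"] - 0.8 * genes["gene_1"]   # only two genes carry signal
event_time = rng.exponential(np.exp(-risk))      # higher risk -> earlier event
cens_time = rng.exponential(event_time.mean(), n)
df = genes.assign(T=np.minimum(event_time, cens_time),
                  E=(event_time <= cens_time).astype(int))

# Lasso-penalized Cox regression (l1_ratio=1.0 requests a pure L1 penalty).
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="T", event_col="E")
print(cph.params_.abs().nlargest(5))  # largest fitted coefficients
```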