
4 research outputs found

Forecasting of Enterprise's Credit Risk Based on Network-logistic Model

Authors: 方匡南, 范新妍, 马双鸽
Publication date: 15/04/2016

With the rapid development of computers and the Internet, especially in the era of big data, enterprises have accumulated large volumes of operational and financial data. Because these data involve many variables with complicated interrelationships, an enterprise credit risk early-warning model built with traditional logistic regression often performs poorly. This paper proposes a network-structured logistic (network-logistic) model that takes full account of the network relationships among variables and uses a penalization method to perform variable selection and parameter estimation simultaneously. Monte Carlo simulations show that the network-logistic model outperforms the other methods compared. Finally, we apply the model to credit risk early warning for Chinese enterprises, using the network structure among financial indicators to select assessment indicators and build an early-warning method better suited to China's circumstances.
Funding: National Natural Science Foundation of China General Program, "Group Variable Selection for Generalized Linear Models and Its Application in Credit Scoring" (71471152); National Social Science Fund of China Major Project, "Big Data and the Development of Statistical Theory" (13&ZD148); National Social Science Fund of China Youth Project, "High-Dimensional Variable Selection Methods for Big Data and Their Applications" (13CTJ001).
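The abstract above does not state the exact form of the penalty, so the following is only a minimal sketch of what a network-penalized logistic objective could look like. It assumes an L1 sparsity term plus a quadratic smoothness term over an adjacency matrix of variables; the function name, the parameters `lam1` and `lam2`, and the choice of penalty are illustrative assumptions, not the authors' formulation.

```python
import math

def network_logistic_objective(X, y, beta, adjacency, lam1, lam2):
    """Penalized logistic loss with an assumed network smoothness term.

    X: list of feature rows, y: list of 0/1 labels, beta: coefficients,
    adjacency: symmetric matrix of non-negative edge weights between variables.
    All names here are hypothetical and chosen for illustration only.
    """
    n = len(X)
    nll = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * v for b, v in zip(beta, xi))
        # log(1 + exp(z)) - y * z, written in a numerically stable form
        nll += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
    nll /= n

    # L1 term: encourages variable selection (assumed; the paper may use another penalty)
    sparsity = lam1 * sum(abs(b) for b in beta)

    # Network term: pulls coefficients of connected variables toward each other
    p = len(beta)
    smooth = 0.0
    for j in range(p):
        for k in range(j + 1, p):
            smooth += adjacency[j][k] * (beta[j] - beta[k]) ** 2

    return nll + sparsity + lam2 * smooth
```

Setting `lam2` to zero reduces this to ordinary L1-penalized logistic regression, which makes the contribution of the network term easy to isolate in experiments.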

Xiamen University Institutional Repository

Integrative Analysis for Big Data

Authors: 方匡南, 王小燕, 马双鸽
Publication date: 15/11/2015

Big data are characterized by heterogeneous data sources, high dimensionality, and sparsity. Mining the heterogeneity and commonality across datasets while reducing dimension and noise is one of the goals and challenges of big data analysis. Integrative analysis analyzes multiple independent datasets simultaneously, avoiding the model instability caused by sample differences arising from region, time, and similar factors, and is an effective way to study heterogeneity in big data. Its defining feature is that the coefficients of each covariate across all datasets are treated as a group, and a penalty function shrinks these coefficient groups to study associations among variables and achieve variable selection and dimension reduction. This paper reviews the principles, algorithms, and current state of research on penalized integrative analysis from three angles: integrative analysis under homogeneity, integrative analysis under heterogeneity, and integrative analysis that accounts for network structure. Simulations under weak, moderate, and strong correlation show that L1 Group Bridge, L1 Group MCP, and Composite MCP all perform well, with L1 Group Bridge producing the fewest false positives and the most stable results. Finally, integrative analysis is applied to household medical expenditure data from the New Rural Cooperative Medical Scheme, which exhibit source heterogeneity, and to cancer gene data with typical big-data features such as ultra-high dimensionality and small sample size, yielding some meaningful conclusions.
Funding: National Bureau of Statistics Major Project, "Research on Statistical Methods for Big Data" (2012LD001); National Bureau of Statistics Key Project, "Research on the Development and Innovation of Big Data Linearity, Theory, and Processing Technology" (2013LZ53); National Social Science Fund of China Major Project, "Big Data and the Development of Statistical Theory" (13&ZD148); National Social Science Fund of China Youth Project, "High-Dimensional Variable Selection Methods for Big Data and Their Applications" (13CTJ001); National Natural Science Foundation of China General Program, "Group Variable Selection for Generalized Linear Models and Its Application in Credit Scoring" (71471152).
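The grouping idea in the abstract above (one covariate's coefficients across all datasets form a group that is shrunk jointly) can be sketched with a deliberately simpler penalty than the ones the paper compares. The example below uses the group lasso soft-thresholding step as a stand-in; it is not L1 Group Bridge, L1 Group MCP, or Composite MCP, and the function names and the `coefs[m][j]` layout are assumptions made for illustration.

```python
import math

def group_soft_threshold(group, lam):
    """Shrink a coefficient group toward zero by lam in Euclidean norm."""
    norm = math.sqrt(sum(b * b for b in group))
    if norm <= lam:
        # The whole group is removed: the covariate is dropped in every dataset
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * b for b in group]

def shrink_across_datasets(coefs, lam):
    """coefs[m][j] is the coefficient of covariate j in dataset m (assumed layout)."""
    n_datasets = len(coefs)
    n_covariates = len(coefs[0])
    out = [[0.0] * n_covariates for _ in range(n_datasets)]
    for j in range(n_covariates):
        # Collect covariate j's coefficients from every dataset into one group
        group = [coefs[m][j] for m in range(n_datasets)]
        shrunk = group_soft_threshold(group, lam)
        for m in range(n_datasets):
            out[m][j] = shrunk[m]
    return out
```

Because the threshold acts on the group norm, a covariate with small coefficients in every dataset is removed from all of them at once, while a covariate with a strong signal in any dataset is kept everywhere. The penalties reviewed in the paper refine this all-in-or-all-out behavior.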

Xiamen University Institutional Repository

A Review of Biclustering Methods (双向聚类方法综述)

Authors: 张庆昭, 方匡南, 陈远星, 马双鸽
Publication venue: 'Intellect'
Publication date: 19/08/2019

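Traditional clustering methods cannot extract local correspondences between samples and variables, and they perform poorly on high-dimensional, sparse data. Researchers have therefore proposed biclustering, which uses local relationships between samples and variables to cluster both simultaneously, producing clusters in the form of submatrices. Biclustering has developed rapidly in recent years and is widely applied in gene analysis, text clustering, recommender systems, and other fields. This paper first organizes and summarizes biclustering methods, focusing on three classes (sparse biclustering, spectral biclustering, and information-based biclustering), analyzing their differences and connections, and describing the current state and trends of these three classes in integrative analysis of multi-source data, multi-layer clustering, semi-supervised learning, and ensemble learning. It then surveys applied research on biclustering in gene analysis, text clustering, recommender systems, and other fields. Finally, in light of the data characteristics of the big data era and the open problems of biclustering, it outlines future research directions.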

Xiamen University Institutional Repository

A Review of Penalized Group Variable Selection Methods in High Dimensional Data

Authors: 方匡南, 王小燕, 谢邦昌, 马双鸽
Publication date: 22/11/2015

Variable selection is an important step in statistical modeling: choosing suitable variables yields robust models that are simple in structure, clear in meaning, and accurate in prediction. In practice, some variables have a group structure. This paper reviews three types of penalized group variable selection methods: methods for handling highly correlated variables, methods that select only at the group level, and bi-level methods that select both groups and individual variables within them. We compare their statistical properties, advantages, and disadvantages, and summarize the associated algorithms and approaches to tuning parameter selection. Finally, we survey applications and discuss recent developments and remaining challenges.
Funding: National Social Science Fund of China (13&ZD148; 13CTJ001); National Natural Science Foundation of China (71471152); National Bureau of Statistics projects (2013LZ53; 2012LD001).

Xiamen University Institutional Repository