82 research outputs found

    Statistical Aggregation: Theory and Applications

    Due to their size and complexity, massive data sets bring many computational challenges for statistical analysis, such as overcoming memory limitations and improving the computational efficiency of traditional statistical methods. In this dissertation, I propose the statistical aggregation strategy to conquer such challenges posed by massive data sets. Statistical aggregation partitions the entire data set into smaller subsets, compresses each subset into certain low-dimensional summary statistics, and aggregates the summary statistics to approximate the desired computation based on the entire data. Results from statistical aggregation are required to be asymptotically equivalent to those computed from the entire data set. Statistical aggregation processes the entire data set part by part, and hence overcomes the memory limitation. Moreover, statistical aggregation can also improve the computational efficiency of statistical algorithms with computational complexity of order $O(N^m)$ ($m > 1$) or even higher, where $N$ is the size of the data. Statistical aggregation is particularly useful for online analytical processing (OLAP) in data cubes and stream data, where fast response to queries is the top priority. The “partition-compression-aggregation” strategy in statistical aggregation has been considered previously for OLAP computing in data cubes. But existing research in this area tends to overlook the statistical properties of the analysis and aims to obtain identical results from aggregation, which has limited the application of this strategy to very simple analyses. Statistical aggregation can instead support OLAP in more sophisticated statistical analyses. In this dissertation, I apply statistical aggregation to two large families of statistical methods, estimating equation (EE) estimation and U-statistics, develop proper compression-aggregation schemes, and show that statistical aggregation tremendously reduces their computational burden while maintaining their efficiency.
I further apply statistical aggregation to U-statistic based estimating equations and propose new estimating equations that need much less computational time but give asymptotically equivalent estimators.
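
The partition-compression-aggregation idea can be illustrated with a minimal sketch. It uses simple linear regression, a case where the summary statistics reproduce the full-data fit exactly, so it corresponds to the "very simple analyses" mentioned above and not to the dissertation's schemes for estimating equations or U-statistics. The function names (`partition`, `compress`, `aggregate`) and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
# Summary of one subset: (n, sum x, sum y, sum x^2, sum x*y)
Summary = Tuple[int, float, float, float, float]


def partition(data: Sequence[Point], size: int) -> List[Sequence[Point]]:
    """Split the data into consecutive subsets of at most `size` points."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress(chunk: Sequence[Point]) -> Summary:
    """Reduce one subset to five low-dimensional summary statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def aggregate(summaries: Sequence[Summary]) -> Tuple[float, float]:
    """Combine subset summaries into the least-squares (intercept, slope)."""
    n, sx, sy, sxx, sxy = (sum(s[i] for s in summaries) for i in range(5))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical toy data lying exactly on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
summaries = [compress(chunk) for chunk in partition(data, 3)]
intercept, slope = aggregate(summaries)
```

Only the five numbers per subset need to be kept in memory or stored in a data cube cell; the raw subset can be discarded after compression.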

    Community Detection by $L_0$-penalized Graph Laplacian

    Community detection in network analysis aims at partitioning nodes in a network into $K$ disjoint communities. Most currently available algorithms assume that $K$ is known, but choosing a correct $K$ is generally very difficult for real networks. In addition, many real networks contain outlier nodes not belonging to any community, but currently very few algorithms can handle networks with outliers. In this paper, we propose a novel model-free tightness criterion and an efficient algorithm to maximize this criterion for community detection. This tightness criterion is closely related to the graph Laplacian with an $L_0$ penalty. Unlike most community detection methods, our method does not require a known $K$ and can properly detect communities in networks with outliers. Both theoretical and numerical properties of the method are analyzed. The theoretical result guarantees that, under the degree-corrected stochastic block model, even for networks with outliers, the maximizer of the tightness criterion can extract communities with small misclassification rates even when the number of communities grows to infinity as the network size grows. A simulation study shows that the proposed method can recover true communities more accurately than other methods. Applications to a college football data set and a yeast protein-protein interaction data set also reveal that the proposed method performs significantly better. Comment: 40 pages, 15 Postscript figures

    Economics of ‘Tipping’ Button in Social Media: An Empirical Analysis of Content Monetization

    As the success of social media platforms heavily depends on the amount and the nature of user-generated content, content monetization has been introduced as a mechanism to incentivize users to generate content. In particular, content contributors can be paid (i.e., tipped) by readers who like the story. We adopted a difference-in-differences approach with a robustness matching estimator to examine the impact of content monetization. Our results confirm that content monetization effectively motivates content demand and supply and also improves content quality. Furthermore, such economic incentives have a spillover effect on ordinary weibo users before they are eligible to adopt the “tipping” function. However, verified users who are already experts or celebrities in the society may be discouraged after the program opens for application. This result suggests that start-ups are able to survive and earn profit even in markets that are dominated by famous celebrities because of the monetization mechanism.
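
For readers unfamiliar with the difference-in-differences approach named in this abstract, the basic two-group, two-period estimator can be sketched as follows. This is only the textbook form; the matching step used in the study is not reproduced, and the function name `did_estimate` and all numbers are hypothetical.

```python
from statistics import mean
from typing import Sequence


def did_estimate(
    treated_pre: Sequence[float],
    treated_post: Sequence[float],
    control_pre: Sequence[float],
    control_post: Sequence[float],
) -> float:
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical posting counts before and after a tipping feature is enabled.
effect = did_estimate(
    treated_pre=[10.0, 12.0],
    treated_post=[18.0, 20.0],
    control_pre=[9.0, 11.0],
    control_post=[12.0, 14.0],
)
```

Subtracting the control group's change removes any trend shared by both groups, under the usual parallel-trends assumption.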

    Spillover Effect of Content Marketing in E-commerce Platform under the Fan Economy Era

    With the proliferation of social media and live streaming, online celebrity endorsement has become a common practice of content marketing on e-commerce platforms. Despite the prevalent use of social media and online communities, empirical research investigating the economic value of user-generated content (UGC) and marketer-generated content (MGC) still lags. This study seeks to contribute theoretically and practically to an understanding of how online celebrity endorsement and fan interaction behaviors affect e-commerce sales. We adopt a cross-sectional regression to assess the economic value of online celebrity endorsement, and we employ a panel vector autoregressive model to explain the dynamic relationship between marketers’ and consumers’ content marketing behaviors and e-commerce product sales. Empirical results highlight that interaction within the fan community has a spillover effect on content marketing in the “Fan Economy” era.

    Internet Celebrity Endorsement: How Internet Celebrities Bring Referral Traffic to E-commerce Sites?

    Endorsement marketing has been widely used to generate consumer attention, interest, and purchase behaviors among the targeted audiences of celebrities. Internet celebrities, who become famous by means of the Internet, are more dependent on strategic intimacy to appeal to their followers. Limited studies have addressed the new business models in the Internet celebrity economy: content advertising and online retailing. Our study aims to examine how Internet celebrity endorsement influences consumers’ click-on behaviors and purchase behaviors in the context of e-commerce business. Results suggest that content marketing using Internet celebrity endorsement plays a significant role in bringing referral traffic to e-commerce sites but is less helpful in boosting sales. The impact of Internet celebrity endorsement on consumers’ click-on decisions is U-shaped, but the role of Internet celebrities as online retailers will “shape-flip” this relationship to a negative linear relation. Therefore, Internet celebrity endorsement provides an effective way to bring referral traffic to e-commerce sites.

    Feature screening for clustering analysis

    In this paper, we consider feature screening for ultrahigh dimensional clustering analyses. Based on the observation that the marginal distribution of any given feature is a mixture of its conditional distributions in different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important cluster-relevant features have heterogeneous components in their mixture distributions and unimportant features have homogeneous components. The well-known EM-test statistic is used to evaluate the homogeneity. Under general parametric settings, we establish the tail probability bounds of the EM-test statistic for the homogeneous and heterogeneous features, and further show that the proposed screening procedure can achieve the sure independent screening and even the consistency in selection properties. The limiting distribution of the EM-test statistic is also obtained for general parametric distributions. The proposed method is computationally efficient, can accurately screen for important cluster-relevant features, and helps to significantly improve clustering, as demonstrated in our extensive simulation and real data analyses.

    rSW-seq: Algorithm for detection of copy number alterations in deep sequencing data

    Background: Recent advances in sequencing technologies have enabled generation of large-scale genome sequencing data. These data can be used to characterize a variety of genomic features, including the DNA copy number profile of a cancer genome. A robust and reliable method for screening chromosomal alterations would allow a detailed characterization of the cancer genome with unprecedented accuracy. Results: We develop a method for identification of copy number alterations in a tumor genome compared to its matched control, based on application of the Smith-Waterman algorithm to single-end sequencing data. In a performance test with simulated data, our algorithm shows >90% sensitivity and >90% precision in detecting a single copy number change that contains approximately 500 reads for the normal sample. With 100-bp reads, this corresponds to a ~50 kb region for 1X genome coverage of the human genome. We further refine the algorithm to develop rSW-seq (recursive Smith-Waterman-seq) to identify alterations in a complex configuration, which are commonly observed in the human cancer genome. To validate our approach, we compare our algorithm with an existing algorithm using simulated and publicly available datasets. We also compare the sequencing-based profiles to microarray-based results. Conclusion: We propose rSW-seq as an efficient method for detecting copy number changes in the tumor genome. Funding: National Institute of General Medical Sciences (U.S.) (R01 GM082798)
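
One way to picture a Smith-Waterman-style scan over read data is the following sketch. It is an assumption about the general idea and is not taken from the abstract: tumor and control reads are merged and sorted by position, each read contributes a signed weight, and the highest-scoring contiguous run of reads marks a candidate gain. The weights (+1 for tumor, -1 for control), the function name `best_segment`, and the toy reads are hypothetical, and the recursive refinement of rSW-seq is not reproduced.

```python
from typing import List, Optional, Tuple

Read = Tuple[int, str]  # (genomic start position, "T" for tumor or "N" for normal)


def best_segment(
    reads: List[Read], w_tumor: float = 1.0, w_normal: float = -1.0
) -> Tuple[float, Optional[int], Optional[int]]:
    """Return (score, first index, last index) of the highest-scoring run.

    `reads` must already be sorted by position. The running score is reset
    to zero whenever it would drop to zero or below, as in local alignment.
    """
    best_score = 0.0
    best_start: Optional[int] = None
    best_end: Optional[int] = None
    running = 0.0
    start = 0
    for i, (_, source) in enumerate(reads):
        running += w_tumor if source == "T" else w_normal
        if running <= 0.0:
            running = 0.0
            start = i + 1  # the next read may open a new segment
        elif running > best_score:
            best_score = running
            best_start, best_end = start, i
    return best_score, best_start, best_end


# Hypothetical merged reads: tumor reads are enriched in the middle.
sources = "NTNTTTTNTTNN"
reads = [(100 * i, s) for i, s in enumerate(sources)]
score, first, last = best_segment(reads)
```

Swapping the signs of the two weights would scan for a candidate loss instead of a gain.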

    Online Bayesian Analysis

    In the last few years, there has been active research on aggregating advanced statistical measures in multidimensional data cubes from partitioned subsets of data. In this paper, we propose an online compression and aggregation scheme to support Bayesian estimations in data cubes based on the asymptotic properties of Bayesian statistics. In the proposed approach, we compress each data segment by retaining only the model parameters and a small amount of auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the Bayesian estimation from partitioned segments with a small approximation error. We show that the Bayesian estimates and the aggregated Bayesian estimates are asymptotically equivalent.
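
The compress-then-aggregate idea for Bayesian estimates can be sketched under a strong simplifying assumption: each segment's posterior is treated as Gaussian, so a segment is stored as a posterior mean and variance, and segments are combined by precision weighting. This is a standard construction and is not claimed to be the paper's aggregation formula. The example uses a Gaussian mean with known noise variance and a flat prior, where the combination happens to be exact; the names `compress_segment` and `aggregate_posteriors` are hypothetical.

```python
from typing import List, Sequence, Tuple

Posterior = Tuple[float, float]  # (posterior mean, posterior variance)


def compress_segment(values: Sequence[float], noise_var: float) -> Posterior:
    """Posterior of a Gaussian mean under a flat prior and known noise variance."""
    n = len(values)
    return sum(values) / n, noise_var / n


def aggregate_posteriors(segments: List[Posterior]) -> Posterior:
    """Combine segment posteriors by precision (inverse-variance) weighting."""
    precisions = [1.0 / var for _, var in segments]
    total_precision = sum(precisions)
    combined_mean = (
        sum(m * p for (m, _), p in zip(segments, precisions)) / total_precision
    )
    return combined_mean, 1.0 / total_precision


# Hypothetical data split into two segments, with noise variance 1.
segment_a = compress_segment([1.0, 2.0, 3.0], noise_var=1.0)
segment_b = compress_segment([4.0, 5.0], noise_var=1.0)
agg_mean, agg_var = aggregate_posteriors([segment_a, segment_b])

# Posterior computed directly from all five observations, for comparison.
full_mean, full_var = compress_segment([1.0, 2.0, 3.0, 4.0, 5.0], noise_var=1.0)
```

For models without this conjugate structure, the Gaussian form holds only approximately in large samples, which is where an asymptotic equivalence argument of the kind described in the abstract would be needed.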