82 research outputs found
Statistical Aggregation: Theory and Applications
Due to their size and complexity, massive data sets pose many computational challenges for statistical analysis, such as overcoming memory limitations and improving the computational efficiency of traditional statistical methods. In this dissertation, I propose the statistical aggregation strategy to conquer such challenges. Statistical aggregation partitions the entire data set into smaller subsets, compresses each subset into low-dimensional summary statistics, and aggregates the summary statistics to approximate the desired computation on the entire data set. Results from statistical aggregation are required to be asymptotically equivalent to those computed from the entire data set. Because statistical aggregation processes the data part by part, it overcomes the memory limitation. Moreover, it can also improve the computational efficiency of statistical algorithms whose computational complexity is of order O(N^m) (m > 1) or even higher, where N is the size of the data. Statistical aggregation is particularly useful for online analytical processing (OLAP) in data cubes and for stream data, where fast response to queries is the top priority. The "partition-compression-aggregation" strategy in statistical aggregation has been considered previously for OLAP computing in data cubes, but existing research in this area tends to overlook the statistical properties of the analysis and aims to obtain identical results from aggregation, which has limited the application of this strategy to very simple analyses. Statistical aggregation instead can support OLAP for more sophisticated statistical analyses. In this dissertation, I apply statistical aggregation to two large families of statistical methods, estimating equation (EE) estimation and U-statistics, develop proper compression-aggregation schemes, and show that statistical aggregation tremendously reduces their computational burden while maintaining their statistical efficiency. I further apply statistical aggregation to U-statistic-based estimating equations and propose new estimating equations that need much less computational time but give asymptotically equivalent estimators.
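As an illustration of the partition-compression-aggregation idea, the following minimal Python sketch applies it to ordinary least squares, a special case of EE estimation in which the per-block summaries (X'X, X'y) aggregate exactly; the function names and block sizes are illustrative and are not taken from the dissertation.

```python
import numpy as np

def compress_block(X, y):
    """Compress one data block into low-dimensional summary statistics."""
    return X.T @ X, X.T @ y          # p x p matrix and p-vector, independent of block size

def aggregate(summaries):
    """Aggregate the per-block summaries and solve for the global OLS estimate."""
    XtX = sum(s[0] for s in summaries)
    Xty = sum(s[1] for s in summaries)
    return np.linalg.solve(XtX, Xty)

# Usage: the data are processed block by block, never held in memory all at once.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
summaries = []
for _ in range(100):                 # 100 blocks of 1,000 observations each
    X = rng.normal(size=(1000, 3))
    y = X @ beta_true + rng.normal(size=1000)
    summaries.append(compress_block(X, y))
print(aggregate(summaries))          # close to beta_true
```

For OLS the aggregation is exact; for general estimating equations and U-statistics the point of the dissertation is that a suitable compression-aggregation scheme still yields asymptotically equivalent results.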
Community Detection by L0-penalized Graph Laplacian
Community detection in network analysis aims at partitioning the nodes of a network into disjoint communities. Most currently available algorithms assume that the number of communities K is known, but choosing a correct K is generally very difficult for real networks. In addition, many real networks contain outlier nodes that do not belong to any community, and currently very few algorithms can handle networks with outliers. In this paper, we propose a novel model-free tightness criterion and an efficient algorithm that maximizes this criterion for community detection. This tightness criterion is closely related to the graph Laplacian with an L0 penalty. Unlike most community detection methods, our method does not require a known K and can properly detect communities in networks with outliers.
Both theoretical and numerical properties of the method are analyzed. The theoretical results guarantee that, under the degree-corrected stochastic block model, the maximizer of the tightness criterion can extract communities with small misclassification rates, even for networks with outliers and even when the number of communities grows to infinity with the network size. Simulation studies show that the proposed method recovers the true communities more accurately than other methods. Applications to a college football data set and a yeast protein-protein interaction data set also reveal that the proposed method performs significantly better.
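For readers unfamiliar with Laplacian-based community detection, the sketch below shows a generic spectral-clustering baseline on the normalized graph Laplacian. It is not the L0-penalized tightness criterion of the paper, which needs neither a known number of communities nor an outlier-free network; the function name, parameters, and toy network are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_communities(A, k):
    """Generic spectral clustering on the normalized graph Laplacian
    L = I - D^{-1/2} A D^{-1/2}. A standard baseline: unlike the paper's
    criterion it requires the number of communities k and has no outlier handling."""
    d = A.sum(axis=1)
    s = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    L = np.eye(A.shape[0]) - s[:, None] * A * s[None, :]
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    U = vecs[:, :k]                             # k smallest eigenvectors carry the structure
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Toy usage: two planted 10-node communities with sparse background noise.
rng = np.random.default_rng(0)
A = (rng.random((20, 20)) < 0.05).astype(float)
A[:10, :10] = (rng.random((10, 10)) < 0.6)
A[10:, 10:] = (rng.random((10, 10)) < 0.6)
A = np.triu(A, 1)
A = A + A.T                                     # symmetric adjacency, no self-loops
print(spectral_communities(A, k=2))
```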
Economics of "Tipping" Button in Social Media: An Empirical Analysis of Content Monetization
As the success of social media platforms heavily depends on the amount and the nature of user-generated content, content monetization has been introduced as a mechanism to incentivize users to generate content. In particular, content contributors can be paid (i.e., tipped) by readers who like their stories. We adopt a difference-in-differences approach, with a matching estimator as a robustness check, to examine the impact of content monetization. Our results confirm that content monetization effectively motivates content demand and supply and also improves content quality. Furthermore, such economic incentives have a spillover effect on ordinary Weibo users before they become eligible to adopt the "tipping" function. However, verified users who are already experts or celebrities in the society may be discouraged after the program is opened for general application. This result suggests that, because of the monetization mechanism, start-ups are able to survive and earn profits even in markets dominated by famous celebrities.
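The following is a minimal sketch of the kind of two-way fixed-effects difference-in-differences regression described above, run on simulated data; the variable names (treated, post, content), the clustered-by-user standard errors, and the effect size are assumptions for illustration, not the paper's actual panel or specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated user-week panel with illustrative names (not the paper's data).
rng = np.random.default_rng(0)
n_users, n_weeks = 200, 20
df = pd.DataFrame(
    [(u, w) for u in range(n_users) for w in range(n_weeks)],
    columns=["user_id", "week"],
)
df["treated"] = (df["user_id"] < 100).astype(int)   # users who get the tipping button
df["post"] = (df["week"] >= 10).astype(int)         # weeks after the rollout
true_effect = 1.5
df["content"] = (
    0.5 * df["treated"] + 0.3 * df["post"]
    + true_effect * df["treated"] * df["post"]
    + rng.normal(size=len(df))
)

# Two-way fixed-effects DiD: the coefficient on treated:post estimates the effect.
model = smf.ols("content ~ treated:post + C(user_id) + C(week)", data=df)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["user_id"]})
print(result.params["treated:post"])   # close to true_effect
```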
Spillover Effect of Content Marketing in E-commerce Platform under the Fan Economy Era
With the proliferation of social media and live streaming, online celebrity endorsement has become a common content marketing practice on e-commerce platforms. Despite the prevalent use of social media and online communities, empirical research investigating the economic value of user-generated content (UGC) and marketer-generated content (MGC) still lags behind. This study seeks to contribute, theoretically and practically, to an understanding of how online celebrity endorsement and fan interaction behaviors affect e-commerce sales. We adopt cross-sectional regression to assess the economic value of online celebrity endorsement, and we employ a panel vector autoregressive model to explain the dynamic relationship between marketers' and consumers' content marketing behaviors and e-commerce product sales. The empirical results highlight that interaction within the fan community has a spillover effect on content marketing in the "fan economy" era.
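A rough sketch of a first-order panel vector autoregression is given below, estimated equation by equation on within-demeaned data with lags built inside each product's own series. The variable names (mgc, ugc, sales) and the simulated panel are illustrative, and this simple fixed-effects estimator is only a stand-in for the GMM-type estimators usually used for panel VARs.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical product-day panel: marketer posts (mgc), fan comments (ugc), sales.
rng = np.random.default_rng(1)
rows = []
for pid in range(50):                       # 50 products
    mgc = ugc = sales = 0.0
    for t in range(60):                     # 60 days each
        mgc = 0.4 * mgc + rng.normal()
        ugc = 0.3 * ugc + 0.2 * mgc + rng.normal()
        sales = 0.5 * sales + 0.3 * ugc + rng.normal()
        rows.append((pid, t, mgc, ugc, sales))
df = pd.DataFrame(rows, columns=["product", "day", "mgc", "ugc", "sales"])

# First-order panel VAR, estimated by OLS equation by equation on within-demeaned
# data; lags are shifted within each product so series never cross product boundaries.
variables = ["mgc", "ugc", "sales"]
demeaned = df.groupby("product")[variables].transform(lambda s: s - s.mean())
lags = demeaned.groupby(df["product"]).shift(1).add_suffix("_lag1")
panel = pd.concat([demeaned, lags], axis=1).dropna()
for y in variables:
    exog = sm.add_constant(panel[[v + "_lag1" for v in variables]])
    fit = sm.OLS(panel[y], exog).fit()
    print(y, fit.params.round(2).to_dict())
```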
Internet Celebrity Endorsement: How Do Internet Celebrities Bring Referral Traffic to E-commerce Sites?
Endorsement marketing has been widely used to generate consumer attention, interest, and purchase behavior among the targeted audiences of celebrities. Internet celebrities, who become famous by means of the Internet, rely more heavily on strategic intimacy to appeal to their followers. Few studies have addressed the new business models of the Internet celebrity economy: content advertising and online retailing. Our study examines how Internet celebrity endorsement influences consumers' click-on and purchase behaviors in the context of e-commerce. The results suggest that content marketing using Internet celebrity endorsement plays a significant role in bringing referral traffic to e-commerce sites but is less helpful in boosting sales. The impact of Internet celebrity endorsement on consumers' click-on decisions is U-shaped, but the role of Internet celebrities as online retailers will "shape-flip" such a relationship into a negative linear relation. Therefore, Internet celebrity endorsement provides an effective way to bring referral traffic to e-commerce sites.
Feature screening for clustering analysis
In this paper, we consider feature screening for ultrahigh dimensional
clustering analyses. Based on the observation that the marginal distribution of
any given feature is a mixture of its conditional distributions in different
clusters, we propose to screen clustering features by independently evaluating
the homogeneity of each feature's mixture distribution. Important
cluster-relevant features have heterogeneous components in their mixture
distributions and unimportant features have homogeneous components. The
well-known EM-test statistic is used to evaluate the homogeneity. Under general
parametric settings, we establish the tail probability bounds of the EM-test
statistic for the homogeneous and heterogeneous features, and further show that
the proposed screening procedure achieves the sure independence screening property and even selection consistency. The limiting distribution of the EM-test statistic is also obtained for general parametric distributions. The proposed method is computationally efficient, accurately screens for important cluster-relevant features, and helps to significantly improve clustering, as demonstrated in our extensive simulation and real data analyses.
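A minimal sketch of the screening idea is given below, using a two-component Gaussian-mixture likelihood-ratio statistic as a stand-in for the EM-test statistic (the actual EM-test adds a penalty term and a fixed number of EM iterations). The function names, the number of retained features, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def homogeneity_statistic(x, n_components=2):
    """Likelihood-ratio-type statistic comparing a 2-component Gaussian mixture
    with a single Gaussian for one feature; large values indicate a heterogeneous
    (cluster-relevant) feature. A stand-in for the EM-test statistic."""
    x = x.reshape(-1, 1)
    ll1 = GaussianMixture(1, random_state=0).fit(x).score(x)
    llk = GaussianMixture(n_components, n_init=3, random_state=0).fit(x).score(x)
    return 2 * len(x) * (llk - ll1)

def screen_features(X, n_keep):
    """Rank features by the statistic and keep the n_keep most heterogeneous ones."""
    stats = np.array([homogeneity_statistic(X[:, j]) for j in range(X.shape[1])])
    return np.argsort(stats)[::-1][:n_keep]

# Usage: 2 cluster-relevant features hidden among 50 noise features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 52))
X[:, 0] += 3 * labels
X[:, 1] -= 3 * labels
print(screen_features(X, n_keep=2))   # typically features 0 and 1, in some order
```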
rSW-seq: Algorithm for detection of copy number alterations in deep sequencing data
Background
Recent advances in sequencing technologies have enabled generation of large-scale genome sequencing data. These data can be used to characterize a variety of genomic features, including the DNA copy number profile of a cancer genome. A robust and reliable method for screening chromosomal alterations would allow a detailed characterization of the cancer genome with unprecedented accuracy.
Results
We develop a method for identification of copy number alterations in a tumor genome compared to its matched control, based on application of the Smith-Waterman algorithm to single-end sequencing data. In a performance test with simulated data, our algorithm shows >90% sensitivity and >90% precision in detecting a single copy number change that contains approximately 500 reads for the normal sample. With 100-bp reads, this corresponds to a ~50 kb region at 1X genome coverage of the human genome. We further refine the algorithm to develop rSW-seq (recursive Smith-Waterman-seq), which identifies alterations in complex configurations, as commonly observed in human cancer genomes. To validate our approach, we compare our algorithm with an existing algorithm using simulated and publicly available datasets. We also compare the sequencing-based profiles to microarray-based results.
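The core dynamic-programming step can be sketched as a one-dimensional, Smith-Waterman-style search for a maximum-scoring segment over position-sorted read weights. The weights and the toy read sequence below are illustrative; the full rSW-seq procedure applies this search recursively and scans for both gains and losses.

```python
def max_scoring_segment(weights):
    """Find the contiguous segment with the largest cumulative score.

    A 1-D Smith-Waterman-style recursion (the local score resets at zero) over
    position-sorted read weights: tumor reads carry a positive weight and normal
    reads a negative weight, so a high-scoring segment marks a candidate gain."""
    best, best_range = 0.0, (0, 0)
    score, start = 0.0, 0
    for i, w in enumerate(weights):
        if score <= 0:               # restart the local alignment, as in Smith-Waterman
            score, start = 0.0, i
        score += w
        if score > best:
            best, best_range = score, (start, i + 1)
    return best, best_range

# Toy example: reads labeled 'T' (tumor) or 'N' (normal), sorted by genomic position.
reads = list("TNTNTN" + "TTTTTTTTTN" + "NTNTNT")    # middle block suggests a gain
w_t, w_n = 1.0, -1.0                                # balanced weights for equal coverage
score, (lo, hi) = max_scoring_segment([w_t if r == "T" else w_n for r in reads])
print(score, (lo, hi))                              # segment covering the tumor-rich block
```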
Conclusion
We propose rSW-seq as an efficient method for detecting copy number changes in the tumor genome.
Funding: National Institute of General Medical Sciences (U.S.) (R01 GM082798)
Online Bayesian Analysis
In the last few years, there has been active research on aggregating advanced statistical measures in multidimensional data cubes from partitioned subsets of data. In this paper, we propose an online compression and aggregation scheme that supports Bayesian estimation in data cubes, based on the asymptotic properties of Bayesian statistics. In the proposed approach, we compress each data segment by retaining only the model parameters and a small amount of auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the Bayesian estimate from the partitioned segments with a small approximation error. We show that the Bayesian estimates and the aggregated Bayesian estimates are asymptotically equivalent.
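One simple way to realize such a compression-aggregation scheme is sketched below under a normal asymptotic approximation of each segment's posterior: every segment is compressed to a mean and a precision, and segments are combined by precision weighting. The model (a Gaussian mean with a flat prior) and the function names are illustrative assumptions, not necessarily the paper's exact aggregation formula.

```python
import numpy as np

def compress_segment(x):
    """Compress one data segment into a normal approximation of its posterior:
    (mean, precision). For a Gaussian mean with a flat prior these are just the
    segment mean and n / sample variance."""
    return x.mean(), len(x) / x.var(ddof=1)

def aggregate_segments(summaries):
    """Precision-weighted combination of the per-segment approximations,
    an asymptotic stand-in for the posterior based on all of the data."""
    means = np.array([m for m, _ in summaries])
    precs = np.array([p for _, p in summaries])
    total_prec = precs.sum()
    return (precs * means).sum() / total_prec, total_prec

# Streaming usage: each arriving segment is compressed to just two numbers.
rng = np.random.default_rng(0)
summaries = [compress_segment(rng.normal(2.0, 1.0, size=5000)) for _ in range(20)]
post_mean, post_prec = aggregate_segments(summaries)
print(post_mean, post_prec ** -0.5)   # approx. 2.0 and the posterior standard deviation
```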
- …