Efficient Data Management and Statistics with Zero-Copy Integration
Statistical analysts have long struggled with ever-growing
data volumes. While specialized data management
systems such as relational databases can handle
the data, statistical analysis tools are far more convenient for
expressing complex data analyses. An integration of these two
classes of systems has the potential to overcome the data
management issue while at the same time keeping analysis
convenient. However, one must keep a careful eye on implementation
overheads such as serialization. In this paper, we
propose the in-process integration of data management and
analytical tools. Furthermore, we argue that a zero-copy integration
is feasible due to the omnipresence of C-style arrays
containing native types. We discuss the general concept and
present a prototype of this integration based on the columnar
relational database MonetDB and the R environment for
statistical computing. We evaluate the performance of this
prototype in a series of micro-benchmarks of common data
management tasks.
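The zero-copy idea rests on both systems agreeing on C-style arrays of native types, so a buffer produced on one side can be viewed on the other without serialization. A minimal sketch of that principle in Python, using ctypes and NumPy as stand-ins (no MonetDB or R API is involved here):

```python
import ctypes

import numpy as np

# Hypothetical stand-in for a column handed over by a C-level database
# engine: a contiguous C array of native doubles.
n = 5
c_column = (ctypes.c_double * n)(1.0, 2.0, 3.0, 4.0, 5.0)

# Zero-copy: wrap the existing buffer as a NumPy array instead of
# serializing it into a new object.
view = np.frombuffer(c_column, dtype=np.float64)

# The view shares memory with the C array, so no bytes were copied:
# an in-place change on the C side is visible through the view.
c_column[0] = 42.0
print(view[0])                 # 42.0
print(view.base is not None)   # True: 'view' does not own its data
```

The same reasoning applies to the paper's MonetDB/R prototype: as long as a column is stored as a plain native-typed array, handing over a pointer suffices.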
Structured penalized regression for drug sensitivity prediction
Large-scale {\it in vitro} drug sensitivity screens are an important tool in
personalized oncology to predict the effectiveness of potential cancer drugs.
The prediction of the sensitivity of cancer cell lines to a panel of drugs is a
multivariate regression problem with high-dimensional heterogeneous multi-omics
data as input data and with potentially strong correlations between the outcome
variables which represent the sensitivity to the different drugs. We propose a
joint penalized regression approach with structured penalty terms which allow
us to utilize the correlation structure between drugs with group-lasso-type
penalties and at the same time address the heterogeneity between omics data
sources by introducing data-source-specific penalty factors to penalize
different data sources differently. By combining integrative penalty factors
(IPF) with tree-guided group lasso, we create the IPF-tree-lasso method. We
present a unified framework to transform more general IPF-type methods to the
original penalized method. Because the structured penalty terms have multiple
parameters, we demonstrate how the interval-search Efficient Parameter
Selection via Global Optimization (EPSGO) algorithm can be used to optimize
multiple penalty parameters efficiently. Simulation studies show that
IPF-tree-lasso can improve the prediction performance compared to other
lasso-type methods, in particular for heterogeneous data sources. Finally, we
employ the new methods to analyse data from the Genomics of Drug Sensitivity in
Cancer project.
Comment: Zhao Z, Zucknick M (2020). Structured penalized regression for drug
sensitivity prediction. Journal of the Royal Statistical Society, Series C.
19 pages, 6 figures and 2 tables.
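The data-source-specific penalty factors at the heart of IPF can be emulated with an ordinary lasso by rescaling each source's columns before fitting, since penalising coefficient j by factor pf_j is equivalent to dividing column j by pf_j. A minimal sketch on assumed synthetic data (this illustrates only the IPF idea, not the full IPF-tree-lasso method or the EPSGO tuning):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Two hypothetical omics "sources": 20 features carrying signal, 30 noise.
X1 = rng.normal(size=(100, 20))
X2 = rng.normal(size=(100, 30))
beta_true = np.concatenate([np.ones(5), np.zeros(15)])
y = X1 @ beta_true + 0.1 * rng.normal(size=100)

# Integrative penalty factors (IPF) in miniature: penalise each data source
# differently. With a plain lasso this is emulated by dividing columns by
# their penalty factor before fitting.
pf = np.concatenate([np.full(20, 1.0), np.full(30, 4.0)])  # source 2 penalised harder
X = np.hstack([X1, X2]) / pf

model = Lasso(alpha=0.05).fit(X, y)
beta = model.coef_ / pf  # undo the rescaling to recover original-scale coefficients

print("nonzero in source 1:", int(np.sum(beta[:20] != 0)))
print("nonzero in source 2:", int(np.sum(beta[20:] != 0)))
```

The equivalence holds because the l1 penalty on the rescaled problem equals the weighted penalty sum over pf_j * |beta_j| on the original scale; the paper's structured (tree-guided group) penalties require a dedicated solver.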
Why African stock markets should formally harmonise and integrate their operations
Despite experiencing rapid growth in their number and size, existing evidence suggests that
African stock markets remain highly fragmented, small, illiquid and technologically weak,
severely affecting their informational efficiency. Therefore, this study attempts to empirically
ascertain whether African stock markets can improve their informational efficiency by formally
harmonising and integrating their operations. Employing parametric and non-parametric
variance-ratios tests on 8 African continent-wide and 8 individual national daily share price
indices from 1995 to 2011, we find that, irrespective of the test employed, the returns of all 8
African continent-wide indices investigated appear to have better normal distribution properties
compared with the 8 individual national share price indices examined. We also report evidence
of statistically significant weak-form informational efficiency of the African continent-wide
share price indices over the individual national share price indices, irrespective of the test statistic
used. Our results imply that formal harmonisation and integration of African stock markets may
improve their informational efficiency.
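The statistic behind such tests is the Lo-MacKinlay variance ratio: under a random walk (weak-form efficiency), the variance of q-period returns is q times the variance of 1-period returns, so VR(q) should be near 1. A bare-bones sketch on illustrative i.i.d. data, not the paper's indices (the published tests also apply finite-sample and heteroscedasticity corrections omitted here):

```python
import numpy as np

def variance_ratio(returns, q):
    """Variance of overlapping q-period returns divided by q times the
    variance of 1-period returns. Near 1 under a random walk."""
    r = np.asarray(returns, dtype=float)
    mu = r.mean()
    var1 = np.mean((r - mu) ** 2)
    # sums of q consecutive returns (overlapping q-period returns)
    rq = np.convolve(r, np.ones(q), mode="valid")
    varq = np.mean((rq - q * mu) ** 2)
    return varq / (q * var1)

rng = np.random.default_rng(1)
iid = rng.normal(0.0, 0.01, 5000)       # i.i.d. returns: a random walk in levels
print(variance_ratio(iid, 2))            # close to 1
```

Persistent positive autocorrelation (a common symptom of thin, illiquid markets) pushes VR(q) above 1, which is how the test detects departures from weak-form efficiency.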
Consistent and efficient output-streams management in optimistic simulation platforms
Optimistic synchronization is considered an effective means for supporting Parallel Discrete Event Simulations. It relies on a speculative approach, where concurrent processes execute simulation events regardless of their safety, and consistency is ensured via proper rollback mechanisms, upon the a-posteriori detection of causal inconsistencies along the events' execution path. Interactions with the outside world (e.g. generation of output streams) are a well-known problem for rollback-based systems, since the outside world may have no notion of rollback. In this context, approaches for allowing the simulation modeler to generate consistent output rely on either the usage of ad-hoc APIs (which must be provided by the underlying simulation kernel) or temporary suspension of processing activities in order to wait for the final outcome (commit/rollback) associated with a speculatively-produced output. In this paper we present design indications and a reference implementation for an output streams' management subsystem which allows the simulation-model writer to rely on standard output-generation libraries (e.g. stdio) within code blocks associated with event processing. Further, the subsystem ensures that the produced output is consistent, namely associated with events that are eventually committed, and system-wide ordered along the simulation time axis. The above features jointly provide the illusion of a classical (simple to deal with) sequential programming model, which spares the developer from being aware that the simulation program is run concurrently and speculatively. We also show, via an experimental study, how the design/development optimizations we present lead to limited overhead, giving rise to the situation where the simulation run would have been carried out with near-to-zero or reduced output management cost. 
At the same time, the delay for materializing the output stream (making it available for any type of audit activity) is shown to be fairly limited and constant, especially for good mixtures of I/O-bound vs CPU-bound behaviors at the application level. Further, the whole output streams' management subsystem has been designed in order to provide scalability for I/O management on clusters. © 2013 ACM
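The commit-or-discard treatment of speculative output can be sketched with a toy buffer keyed by event timestamp (a deliberate simplification of the paper's subsystem: a real implementation intercepts standard stdio calls transparently and coordinates commitment via global virtual time across processes):

```python
import io
import sys

class SpeculativeOutput:
    """Buffer output produced while processing a speculative event; flush it
    only once the event commits, discard it on rollback."""

    def __init__(self, sink=sys.stdout):
        self.sink = sink
        self.pending = {}  # event timestamp -> buffered text

    def write(self, event_ts, text):
        self.pending.setdefault(event_ts, io.StringIO()).write(text)

    def rollback(self, event_ts):
        # Speculative output vanishes, as if the event never ran.
        self.pending.pop(event_ts, None)

    def commit_up_to(self, gvt):
        # Flush committed events in simulation-time order; a system-wide
        # version would additionally merge streams across processes.
        for ts in sorted(t for t in self.pending if t <= gvt):
            self.sink.write(self.pending.pop(ts).getvalue())

sink = io.StringIO()
out = SpeculativeOutput(sink)
out.write(10, "event@10\n")
out.write(5, "event@5\n")
out.rollback(10)      # the event at t=10 was undone by a rollback
out.commit_up_to(7)   # only the surviving event before GVT=7 is emitted
print(sink.getvalue(), end="")
```

This captures the two guarantees the abstract names, consistency (only committed events produce output) and simulation-time ordering, while hiding both from the model writer.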
Tupleware: Redefining Modern Analytics
There is a fundamental discrepancy between the targeted and actual users of
current analytics frameworks. Most systems are designed for the data and
infrastructure of the Googles and Facebooks of the world---petabytes of data
distributed across large cloud deployments consisting of thousands of cheap
commodity machines. Yet, the vast majority of users operate clusters ranging
from a few to a few dozen nodes, analyze relatively small datasets of up to a
few terabytes, and perform primarily compute-intensive operations. Targeting
these users fundamentally changes the way we should build analytics systems.
This paper describes the design of Tupleware, a new system specifically aimed
at the challenges faced by the typical user. Tupleware's architecture brings
together ideas from the database, compiler, and programming languages
communities to create a powerful end-to-end solution for data analysis. We
propose novel techniques that consider the data, computations, and hardware
together to achieve maximum performance on a case-by-case basis. Our
experimental evaluation quantifies the impact of our novel techniques and shows
orders-of-magnitude performance improvements over alternative systems.
The Impact of M&A on Technology Sourcing Strategies
The paper investigates the effects of Mergers and Acquisitions (M&A) on corporate research and development (R&D) strategies using Community Innovation Survey (CIS) data on the Dutch manufacturing sector. The focus of the research is whether M&A affect corporate innovation strategies, favouring in-house R&D and innovation expenses versus external technological sourcing. The results show that M&A activities have a positive and significant impact on innovation investments by firms, particularly on R&D intensity and total expenditure on innovation. M&A affect corporate innovation strategies, favouring in-house R&D versus external technological sourcing. Firm post-merger behaviour favours the consolidation of the knowledge, competences and capabilities acquired by merging with or buying another firm, confirming that the reasons for a merger or acquisition are most often related to firms' innovative performance. Following involvement in an M&A, firms tend primarily to focus on full integration of their resource bases in order to enable them to produce and sell innovative products that are new to the market.
Keywords: Technology sourcing; Innovation; M&A; Heckman two-stage; Bi-Tobit.
Forecasting Construction Tender Price Index in Ghana using Autoregressive Integrated Moving Average with Exogenous Variables Model
Prices of construction resources keep fluctuating due to the unstable economic conditions experienced over the years. Clients' knowledge of their financial commitments toward their intended project remains the basis for their final decision. The use of a construction tender price index provides a realistic estimate at the early stage of the project. The tender price index (TPI) is influenced by various economic factors, hence several statistical techniques have been employed to forecast it. Some of these include regression, time series and vector error correction models, among others. However, in recent times the integrated modelling approach has gained popularity due to its powerful predictive accuracy. Thus, in line with this assumption, the aim of this study is to apply an autoregressive integrated moving average with exogenous variables (ARIMAX) model to TPI. The results showed that the ARIMAX model has better predictive ability than the single-technique approach. The study further confirms the earlier position of previous research on the need to use integrated model techniques in forecasting TPI. This model will assist practitioners in forecasting future values of the tender price index. Although the study focuses on the Ghanaian economy, the findings can be broadly applicable to other developing countries which share similar economic characteristics.
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad-hoc fashion. Most practitioners cannot precisely
capture the effect of sampling on the quality of their model, and eventually on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimizations and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model.
Comment: 22 pages, SIGMOD 201
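The sample-versus-full-model prediction agreement that BlinkML guarantees analytically can be illustrated empirically. This is a plain comparison on synthetic data; it does not reproduce BlinkML's MLE-based probabilistic guarantee, which bounds the agreement before training the full model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification task standing in for a large training set.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# "Full model": logistic regression (an MLE-based model) on all the data.
full = LogisticRegression(max_iter=1000).fit(X, y)

# "Approximate model": the same model trained on a 5% sample.
idx = np.random.default_rng(0).choice(len(X), size=1000, replace=False)
approx = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# Fraction of points where the sample model predicts the same label as
# the full model -- the quantity BlinkML's guarantee is stated over.
agreement = np.mean(full.predict(X) == approx.predict(X))
print(f"agreement: {agreement:.3f}")
```

BlinkML's contribution is to pick the sample size that achieves a target agreement with a stated probability, using the asymptotic distribution of MLE parameter estimates rather than an after-the-fact comparison like this one.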