
    Efficient Data Management and Statistics with Zero-Copy Integration

    Statistical analysts have long been struggling with ever-growing data volumes. While specialized data management systems such as relational databases are able to handle the data, statistical analysis tools are far more convenient for expressing complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.
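
    The zero-copy idea hinges on both sides agreeing on C-style arrays of native types, so that column memory owned by the database can be handed to the analysis environment as a view rather than serialized. A minimal sketch of that idea in Python/NumPy (illustrative only, not the MonetDB/R prototype) follows:

        import ctypes
        import numpy as np

        # Stand-in for a database-owned column: a contiguous C array of doubles.
        n = 1_000_000
        column_buffer = (ctypes.c_double * n)()        # zero-initialized C-style array

        # Expose the existing memory to the analysis side as a view: no copy, no serialization.
        column_view = np.frombuffer(column_buffer, dtype=np.float64)

        # Statistical code operates directly on the database-owned memory.
        column_view[:] = np.random.default_rng(0).normal(size=n)
        print(column_view.mean(), column_view.std())

    Because the view aliases the original buffer, a real integration must pin the column for the duration of the analysis; that bookkeeping is omitted here.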

    Structured penalized regression for drug sensitivity prediction

    Large-scale in vitro drug sensitivity screens are an important tool in personalized oncology to predict the effectiveness of potential cancer drugs. The prediction of the sensitivity of cancer cell lines to a panel of drugs is a multivariate regression problem with high-dimensional, heterogeneous multi-omics data as input and with potentially strong correlations between the outcome variables, which represent the sensitivity to the different drugs. We propose a joint penalized regression approach with structured penalty terms which allow us to utilize the correlation structure between drugs with group-lasso-type penalties and, at the same time, address the heterogeneity between omics data sources by introducing data-source-specific penalty factors to penalize different data sources differently. By combining integrative penalty factors (IPF) with tree-guided group lasso, we create the IPF-tree-lasso method. We present a unified framework to transform more general IPF-type methods into the original penalized method. Because the structured penalty terms have multiple parameters, we demonstrate how the interval-search Efficient Parameter Selection via Global Optimization (EPSGO) algorithm can be used to optimize multiple penalty parameters efficiently. Simulation studies show that IPF-tree-lasso can improve the prediction performance compared to other lasso-type methods, in particular for heterogeneous data sources. Finally, we employ the new methods to analyse data from the Genomics of Drug Sensitivity in Cancer project. (Comment: Zhao Z, Zucknick M (2020). Structured penalized regression for drug sensitivity prediction. Journal of the Royal Statistical Society, Series C. 19 pages, 6 figures and 2 tables.)
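
    The data-source-specific penalty factors are the easiest part of this construction to illustrate. The sketch below is a plain lasso with per-source penalty factors solved by proximal gradient descent (ISTA), not the authors' IPF-tree-lasso implementation; the data, source split and factor values are made up for the example.

        import numpy as np

        def ipf_lasso(X, y, lam, penalty_factors, n_iter=500):
            """Minimize 0.5*||y - X b||^2 + lam * sum_j pf_j * |b_j| via ISTA."""
            n, p = X.shape
            beta = np.zeros(p)
            step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant of the gradient
            thresh = step * lam * penalty_factors
            for _ in range(n_iter):
                grad = X.T @ (X @ beta - y)
                z = beta - step * grad
                beta = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)   # soft-thresholding
            return beta

        rng = np.random.default_rng(1)
        X = rng.normal(size=(100, 40))
        y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(scale=0.5, size=100)
        # Assume the first 20 columns come from omics source A, the rest from source B.
        pf = np.r_[np.full(20, 1.0), np.full(20, 4.0)]       # penalize source B more heavily
        print(ipf_lasso(X, y, lam=5.0, penalty_factors=pf)[:5])

    The tree-guided group penalties and the EPSGO tuning of the multiple penalty parameters are the paper's actual contributions and are not reproduced in this toy example.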

    Why African stock markets should formally harmonise and integrate their operations

    Despite experiencing rapid growth in their number and size, existing evidence suggests that African stock markets remain highly fragmented, small, illiquid and technologically weak, which severely affects their informational efficiency. Therefore, this study attempts to empirically ascertain whether African stock markets can improve their informational efficiency by formally harmonising and integrating their operations. Employing parametric and non-parametric variance-ratio tests on 8 African continent-wide and 8 individual national daily share price indices from 1995 to 2011, we find that, irrespective of the test employed, the returns of all 8 African continent-wide indices investigated appear to have better normal distribution properties than the 8 individual national share price indices examined. We also report evidence of statistically significant weak-form informational efficiency of the African continent-wide share price indices over the individual national share price indices, irrespective of the test statistic used. Our results imply that formal harmonisation and integration of African stock markets may improve their informational efficiency.
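
    For readers unfamiliar with the test family used here, a basic (uncorrected) Lo-MacKinlay-style variance ratio can be computed from a daily price index as sketched below; this shows only the core of the statistic, not the paper's exact parametric and non-parametric procedures, and the price series is synthetic.

        import numpy as np

        def variance_ratio(prices, q):
            """VR(q) = Var(q-period log return) / (q * Var(1-period log return))."""
            log_p = np.log(np.asarray(prices, dtype=float))
            r1 = np.diff(log_p)                      # daily log returns
            rq = log_p[q:] - log_p[:-q]              # overlapping q-day log returns
            return rq.var(ddof=1) / (q * r1.var(ddof=1))

        # Illustrative random-walk index: VR(q) should hover around 1 under weak-form efficiency.
        rng = np.random.default_rng(2)
        prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=2500)))
        for q in (2, 5, 10):
            print(q, round(variance_ratio(prices, q), 3))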

    Consistent and efficient output-streams management in optimistic simulation platforms

    Optimistic synchronization is considered an effective means for supporting Parallel Discrete Event Simulations. It relies on a speculative approach, where concurrent processes execute simulation events regardless of their safety, and consistency is ensured via proper rollback mechanisms upon the a-posteriori detection of causal inconsistencies along the events' execution path. Interactions with the outside world (e.g. generation of output streams) are a well-known problem for rollback-based systems, since the outside world may have no notion of rollback. In this context, approaches for allowing the simulation modeler to generate consistent output rely either on ad-hoc APIs (which must be provided by the underlying simulation kernel) or on temporarily suspending processing activities in order to wait for the final outcome (commit/rollback) associated with a speculatively-produced output. In this paper we present design indications and a reference implementation for an output streams' management subsystem which allows the simulation-model writer to rely on standard output-generation libraries (e.g. stdio) within code blocks associated with event processing. Further, the subsystem ensures that the produced output is consistent, namely associated with events that are eventually committed, and system-wide ordered along the simulation time axis. These features jointly provide the illusion of a classical, simple-to-deal-with sequential programming model, which spares the developer from being aware that the simulation program is run concurrently and speculatively. We also show, via an experimental study, how the design/development optimizations we present lead to limited overhead, so that the simulation run incurs near-zero or at least reduced output-management cost. At the same time, the delay for materializing the output stream (making it available for any type of audit activity) is shown to be fairly limited and constant, especially for good mixtures of I/O-bound vs CPU-bound behaviors at the application level. Finally, the whole output streams' management subsystem has been designed to provide scalability for I/O management on clusters.
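
    The central mechanism, holding speculatively produced output until the corresponding events commit and discarding it on rollback, can be sketched in a few lines. The following toy Python version is illustrative only, not the paper's stdio-compatible subsystem; the class and method names are invented for the example.

        import sys

        class SpeculativeOutputBuffer:
            def __init__(self, sink=sys.stdout):
                self.sink = sink
                self.pending = {}                    # event id -> (timestamp, [buffered lines])

            def write(self, event_id, timestamp, line):
                # Output produced during speculative event processing is buffered, not emitted.
                self.pending.setdefault(event_id, (timestamp, []))[1].append(line)

            def rollback(self, event_id):
                self.pending.pop(event_id, None)     # speculative output simply vanishes

            def commit_up_to(self, gvt):
                # Flush output of events below the commit horizon, ordered along simulation time.
                ready = sorted((ts, eid) for eid, (ts, _) in self.pending.items() if ts <= gvt)
                for ts, eid in ready:
                    for line in self.pending.pop(eid)[1]:
                        self.sink.write(f"[t={ts}] {line}\n")

        buf = SpeculativeOutputBuffer()
        buf.write("e1", 10.0, "state dump A")
        buf.write("e2", 12.5, "state dump B")
        buf.rollback("e2")                           # causality violation detected for e2
        buf.commit_up_to(gvt=11.0)                   # only e1's output reaches stdout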

    Tupleware: Redefining Modern Analytics

    There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to a few terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware's architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders of magnitude performance improvement over alternative systems.

    The Impact of M&A on Technology Sourcing Strategies

    The paper investigates the effects of Mergers and Acquisitions (M&A) on corporate research and development (R&D) strategies using Community Innovation Survey (CIS) data on the Dutch manufacturing sector. The focus of the research is whether M&A affect corporate innovation strategies, favouring in-house R&D and innovation expenses versus external technological sourcing. The results show that M&A activities have a positive and significant impact on innovation investments by firms, and particularly on R&D intensity and total expenditure on innovation. M&A affect corporate innovation strategies, favouring in-house R&D over external technological sourcing. Firm post-merger behaviour favours the consolidation of the knowledge, competences and capabilities that have been acquired by merging with or buying another firm, confirming that the reasons for a merger or acquisition are most often related to firms' innovative performance. Following involvement in an M&A, firms tend primarily to focus on full integration of their resource bases in order to enable them to produce and sell innovative products that are new to the market.
    Keywords: Technology sourcing; Innovation; M&A; Heckman two-stage; Bi-Tobit.

    Forecasting Construction Tender Price Index in Ghana using Autoregressive Integrated Moving Average with Exogenous Variables Model

    Prices of construction resources keep fluctuating due to the unstable economic conditions experienced over the years. Clients' knowledge of their financial commitments toward their intended project remains the basis for their final decision. The use of a construction tender price index provides a realistic estimate at the early stage of the project. The tender price index (TPI) is influenced by various economic factors, and several statistical techniques have been employed to forecast it, including regression, time series and vector error correction models, among others. In recent times, however, the integrated modelling approach has been gaining popularity due to its strong predictive accuracy. Thus, in line with this assumption, the aim of this study is to apply an autoregressive integrated moving average with exogenous variables (ARIMAX) model to TPI. The results showed that the ARIMAX model has a better predictive ability than the single-model approach. The study further confirms the earlier position of previous research on the need to use the integrated model technique in forecasting TPI. This model will assist practitioners in forecasting future values of the tender price index. Although the study focuses on the Ghanaian economy, the findings can be broadly applicable to other developing countries which share similar economic characteristics.
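
    As an illustration of what an ARIMAX specification of this kind looks like in practice, here is a minimal sketch using statsmodels' SARIMAX with exogenous regressors; the synthetic data, the (1,1,1) order and the chosen regressors are assumptions for the example, not the study's fitted model or its actual economic drivers.

        import numpy as np
        from statsmodels.tsa.statespace.sarimax import SARIMAX

        rng = np.random.default_rng(3)
        n = 80                                             # e.g. quarterly observations
        # Synthetic stand-ins for exogenous economic drivers (inflation, exchange rate).
        exog = np.column_stack([rng.normal(12, 2, n), rng.normal(4, 0.5, n)])
        # Synthetic tender price index loosely driven by the first regressor plus noise.
        tpi = 100 + np.cumsum(0.08 * exog[:, 0] + rng.normal(0, 1, n))

        model = SARIMAX(tpi, exog=exog, order=(1, 1, 1))   # ARIMAX(1,1,1): illustrative order
        res = model.fit(disp=False)

        future_exog = exog[-4:]                            # assume the next year's drivers are known
        print(res.forecast(steps=4, exog=future_exog))

    Note that forecasting with exogenous variables requires future values of those regressors, which in practice must themselves be forecast or assumed.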

    BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

    The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., the full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model. (Comment: 22 pages, SIGMOD 2019.)
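
    The core premise, that an MLE model trained on a small uniform sample often agrees with the full-data model on almost all predictions, can be checked with a toy experiment; BlinkML's contribution is to bound that disagreement with a probabilistic guarantee, which the sketch below does not reproduce. The dataset and model choice are illustrative assumptions.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)

        # "Full model": logistic regression (an MLE-based model) trained on all the data.
        full_model = LogisticRegression(max_iter=1000).fit(X, y)

        # "Approximate model": the same estimator trained on a small uniform sample.
        rng = np.random.default_rng(0)
        idx = rng.choice(len(X), size=5_000, replace=False)      # roughly a 2.5% sample
        sample_model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

        # Fraction of points on which the two models make identical predictions.
        agreement = np.mean(full_model.predict(X) == sample_model.predict(X))
        print(f"prediction agreement: {agreement:.4f}")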