1,027,904 research outputs found

    Do We Price Happiness? Evidence from Korean Stock Market

    This study explores the potential of internet search volume data, specifically Google Trends, as an indicator of cross-sectional stock returns. Unlike previous studies, our research investigates the search volume of the topic 'happiness' and its impact on stock returns from the perspective of risk pricing rather than as a sentiment measure. Empirical results indicate that this 'happiness' search exposure (HSE) can explain future returns, particularly for big and value firms. This suggests that HSE may reflect a firm's ability to produce goods or services that meet societal utility needs. Our findings have significant implications for institutional investors seeking to leverage HSE-based strategies for outperformance. Additionally, our research suggests that, when selected judiciously, some search topics on Google Trends can be related to risks that impact stock prices.

    Comment: 8 pages
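    Though the abstract does not spell out the estimation procedure, a natural reading of "search exposure" is the sensitivity (beta) of a stock's returns to innovations in the Google Trends series. Below is a minimal, hypothetical sketch of that idea; the data, tickers, and the helper hse_beta are invented for illustration, not taken from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: a monthly Google Trends index for the topic
# 'happiness' and a (months x stocks) panel of stock returns.
rng = np.random.default_rng(0)
months = pd.period_range("2015-01", periods=60, freq="M")
search_index = pd.Series(rng.uniform(20, 80, 60), index=months)
returns = pd.DataFrame(rng.normal(0, 0.05, (60, 4)), index=months,
                       columns=["AAA", "BBB", "CCC", "DDD"])

# Innovation in search volume: log change of the Trends index.
search_shock = np.log(search_index).diff().dropna()

def hse_beta(stock_returns: pd.Series, shock: pd.Series) -> float:
    """Slope from regressing one stock's returns on the search shock."""
    r, s = stock_returns.align(shock, join="inner")
    design = np.column_stack([np.ones(len(s)), s.values])
    intercept, slope = np.linalg.lstsq(design, r.values, rcond=None)[0]
    return slope

# One exposure estimate per stock; sorting would support
# exposure-ranked portfolio construction.
exposures = returns.apply(hse_beta, shock=search_shock)
print(exposures.sort_values())
```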

    String and Membrane Gaussian Processes

    In this paper we introduce a novel framework for making exact nonparametric Bayesian inference on latent functions that is particularly suitable for Big Data tasks. Firstly, we introduce a class of stochastic processes we refer to as string Gaussian processes (string GPs), which are not to be mistaken for Gaussian processes operating on text. We construct string GPs so that their finite-dimensional marginals exhibit suitable local conditional independence structures, which allow for scalable, distributed, and flexible nonparametric Bayesian inference, without resorting to approximations, and while ensuring some mild global regularity constraints. Furthermore, string GP priors naturally cope with heterogeneous input data, and the gradient of the learned latent function is readily available for explanatory analysis. Secondly, we provide some theoretical results relating our approach to the standard GP paradigm. In particular, we prove that some string GPs are Gaussian processes, which provides a complementary global perspective on our framework. Finally, we derive a scalable and distributed MCMC scheme for supervised learning tasks under string GP priors. The proposed MCMC scheme has computational time complexity O(N) and memory requirement O(dN), where N is the data size and d the dimension of the input space. We illustrate the efficacy of the proposed approach on several synthetic and real-world datasets, including a dataset with 6 million input points and 8 attributes.

    Comment: To appear in the Journal of Machine Learning Research (JMLR), Volume 1
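    The O(N) claim rests on local conditional independence. As a hedged illustration of why such structure buys linear-time exactness, the sketch below samples not the authors' string GP but a simpler Markov GP, the Ornstein-Uhlenbeck process, exactly in one sequential pass.

```python
import numpy as np

def sample_ou_gp(x: np.ndarray, variance: float = 1.0,
                 lengthscale: float = 0.5, seed: int = 0) -> np.ndarray:
    """Exact draw from an Ornstein-Uhlenbeck GP at sorted inputs x.

    Because the OU process is Markov, f[i] given the past depends only
    on f[i-1], so exact sampling costs O(N) time and memory rather
    than the O(N^3) of factorising a dense N x N covariance matrix.
    """
    rng = np.random.default_rng(seed)
    f = np.empty(len(x))
    f[0] = rng.normal(0.0, np.sqrt(variance))
    for i in range(1, len(x)):
        # 1-D Gaussian conditional of f[i] given f[i-1] alone.
        rho = np.exp(-abs(x[i] - x[i - 1]) / lengthscale)
        mean = rho * f[i - 1]
        var = variance * (1.0 - rho ** 2)
        f[i] = rng.normal(mean, np.sqrt(var))
    return f

x = np.linspace(0.0, 10.0, 100_000)  # N = 100,000 inputs
f = sample_ou_gp(x)                  # one linear pass, no N x N matrix
```

    String GPs generalise this idea through local conditional independence across the input domain, which is what the abstract credits for scalable, distributed inference without approximations.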

    A Delphi study to build consensus on the definition and use of big data in obesity research

    Background: 'Big data' has great potential to help address the global health challenge of obesity. However, lack of clarity with regard to the definition of big data and frameworks for effectively using big data in the context of obesity research may be hindering progress. The aim of this study was to establish agreed approaches for the use of big data in obesity-related research. Methods: A Delphi method of consensus development was used, comprising three survey rounds. In Round 1, participants were asked to rate agreement/disagreement with 77 statements across seven domains relating to definitions of and approaches to using big data in the context of obesity research. Participants were also asked to contribute further ideas in relation to these topics, which were incorporated as new statements (n = 8) in Round 2. In Rounds 2 and 3 participants re-appraised their ratings in view of the group consensus. Results: Ninety-six experts active in obesity-related research were invited to participate. Of these, 36/96 completed Round 1 (37.5% response rate), 29/36 completed Round 2 (80.6% response rate) and 26/29 completed Round 3 (89.7% response rate). Consensus (defined as >70% agreement) was achieved for 90.6% (n = 77) of statements, with 100% consensus achieved for the Definition of Big Data, Data Governance, and Quality and Inference domains. Conclusions: Experts agreed that big data was more nuanced than the oft-cited definition of 'volume, variety and velocity', and includes quantitative, qualitative, observational or intervention data from a range of sources that have been collected for research or other purposes. Experts repeatedly called for third-party action, for example to develop frameworks for reporting and ethics, to clarify data governance requirements, to support training and skill development, and to facilitate sharing of big data. Further advocacy will be required to encourage organisations to adopt these roles.
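    For concreteness, here is a small hypothetical sketch of the consensus tally described in the Results (36 Round 1 respondents, 77 + 8 = 85 statements, consensus defined as >70% agreement); the ratings are simulated, not the study's data.

```python
import numpy as np
import pandas as pd

# Simulated ratings: rows = experts, columns = statements; True where
# the expert agreed with the statement in that round.
rng = np.random.default_rng(1)
ratings = pd.DataFrame(rng.random((36, 85)) < 0.8,
                       columns=[f"S{i + 1}" for i in range(85)])

agreement = ratings.mean(axis=0)   # fraction of experts agreeing
consensus = agreement > 0.70       # the study's >70% consensus rule
print(f"{consensus.mean():.1%} of statements reached consensus")
```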

    Using Stock Prices as Ground Truth in Sentiment Analysis to Generate Profitable Trading Signals

    The increasing availability of "big" (large-volume) social media data has motivated a great deal of research into applying sentiment analysis to predict the movement of prices within financial markets. Previous work in this field investigates how the true sentiment of text (i.e. positive or negative opinions) can be used for financial predictions, based on the assumption that sentiments expressed online are representative of the true market sentiment. Here we consider the converse idea: that the stock price itself may serve as a better ground truth for sentiment. Tweets are labelled as Buy or Sell depending on whether the discussed stock's price rose or fell over the following hour, and from this, stock-specific dictionaries are built for individual companies. A Bayesian classifier is used to generate stock predictions, which are input to an automated trading algorithm. Placing 468 trades over a 1-month period yields a return rate of 5.18%, which annualises to approximately 83% per annum. This approach performs significantly better than random chance and outperforms two baseline sentiment analysis methods tested.

    Comment: 8 pages, 6 figures. To be presented at IEEE Symposium on Computational Intelligence in Financial Engineering (CIFEr), Bengaluru, November 18-21, 201
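    A minimal sketch of the pipeline the abstract describes, with placeholder tweets and returns (the paper's actual data, dictionary construction, and trading rules are not reproduced here): label each tweet Buy or Sell from the stock's return over the following hour, then fit a Naive Bayes classifier on word counts.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder tweets about one company, each with the stock's return
# over the hour following the tweet (invented numbers, not market data).
tweets = pd.DataFrame({
    "text": ["earnings beat expectations", "ceo resigns amid probe",
             "new product launch praised", "guidance cut sharply"],
    "fwd_1h_return": [0.004, -0.012, 0.007, -0.009],
})

# Ground truth from price, not opinion: Buy if the price rose after
# the tweet, Sell if it fell.
tweets["label"] = ["Buy" if r > 0 else "Sell"
                   for r in tweets["fwd_1h_return"]]

# Stock-specific vocabulary feeding a Bayesian classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets["text"], tweets["label"])
print(model.predict(["strong earnings and raised guidance"]))
```

    Training one such model per company yields the stock-specific dictionaries the abstract mentions; its predictions would then drive the automated trading algorithm.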

    An Empirical Analysis of Predictive Machine Learning Algorithms on High-Dimensional Microarray Cancer Data

    This research evaluates pattern recognition techniques on a subclass of big data where the dimensionality of the input space p is much larger than the number of observations n. Seven gene-expression microarray cancer datasets, where the ratio κ = n/p is less than one, were chosen for evaluation. The statistical and computational challenges inherent in this type of high-dimensional low sample size (HDLSS) data were explored. The capability and performance of a diverse set of machine learning algorithms is presented and compared. The sparsity and collinearity of the data being employed, in conjunction with the complexity of the algorithms studied, demanded rigorous and careful tuning of the hyperparameters and regularization parameters. This necessitated several extensions of cross-validation to be investigated, with the purpose of culminating in the best predictive performance. For the techniques evaluated in this thesis, regularization or kernelization, and often both, produced lower classification error rates than randomized ensembles for all datasets used in this research. However, no one technique evaluated for classifying HDLSS microarray cancer data emerged as the universally best technique for predicting the generalization error.1 From the empirical analysis performed in this thesis, the following fundamentals emerged as being instrumental in consistently producing lower error rates when estimating the generalization error in this HDLSS microarray cancer data:
    • Thoroughly investigate and understand the data.
    • Stratify during all sampling, due to the uneven classes and extreme sparsity of this data.
    • Perform 3 to 5 replicates of stratified cross-validation, implementing an adaptive K-fold, to determine the optimal tuning parameters.
    • To estimate the generalization error in HDLSS data, replication is paramount. Replicate R = 500 or R = 1000 times, with training and test sets of 2/3 and 1/3 respectively, to get the best generalization error estimate (a sketch of this protocol follows this abstract).
    • Whenever possible, obtain an independent validation dataset.
    • Seed the data for a fair and unbiased comparison among techniques.
    • Define a methodology or standard set of process protocols to apply to machine learning research. This would prove very beneficial in ensuring reproducibility and would enable better comparisons among techniques.
    _____
    1. A predominant portion of this research was published in the Serdica Journal of Computing (Volume 8, Number 2, 2014) as proceedings from the 2014 Flint International Statistical Conference at Kettering University, Michigan, USA.
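    A hedged sketch of the replication protocol recommended above, using scikit-learn and synthetic stand-in data (the thesis's datasets and algorithms are not reproduced): R = 500 seeded, stratified 2/3-1/3 train/test splits around a regularized classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic HDLSS-style data: far more features than samples (p >> n),
# with uneven classes, standing in for a microarray dataset.
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, weights=[0.7, 0.3],
                           random_state=0)

# R = 500 replicated, stratified 2/3-1/3 splits; fixing the seed lets
# competing techniques be compared on identical splits, as advised.
splitter = StratifiedShuffleSplit(n_splits=500, test_size=1 / 3,
                                  random_state=42)
clf = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

errors = []
for train_idx, test_idx in splitter.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

print(f"estimated generalization error: {np.mean(errors):.3f} "
      f"+/- {np.std(errors):.3f}")
```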