2 research outputs found

    Analyzing the Fine Structure of Distributions

    One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping (i.e., hard limits in value ranges), need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density distribution (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include both the classical histogram and modern tools such as ridgeline plots, bean plots and violin plots. If density estimation parameters remain at their default settings, conventional methods pose several problems when visualizing the PDF of uniform, multimodal and skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of explorative distribution analysis. The results of the evaluation are presented using bimodal Gaussian distributions, skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.
    Comment: 66 pages, 81 figures, accepted in PLOS ON
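To make the idea of a mirrored density concrete, here is a minimal sketch in Python. It is not the MD plot implementation: the abstract does not say which density estimator the MD plot uses, so this sketch swaps in a plain Gaussian kernel density estimate with a normal-reference rule-of-thumb bandwidth. The function names (`gaussian_kde`, `mirrored_outline`, `count_modes`) and the synthetic bimodal sample are all assumptions made for illustration.

```python
import math
import random
import statistics


def gaussian_kde(sample, grid, bandwidth=None):
    """Evaluate a plain Gaussian kernel density estimate on the grid points."""
    n = len(sample)
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb, used only as a default.
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in grid
    ]


def mirrored_outline(density):
    """Mirror the density around a central axis, as violin-style plots do."""
    return [(-d / 2, d / 2) for d in density]


def count_modes(density):
    """Count strict local maxima of the estimated density on the grid."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )


# Hypothetical data: a bimodal Gaussian mixture, one of the cases the abstract names.
random.seed(0)
sample = [random.gauss(-3, 1) for _ in range(300)]
sample += [random.gauss(3, 1) for _ in range(300)]

step = 0.1
grid = [-7 + step * i for i in range(141)]
density = gaussian_kde(sample, grid)
outline = mirrored_outline(density)
```

Each pair in `outline` gives the left and right edge of the mirrored shape at one grid value, so the pairs could be handed to any plotting library to draw the shape. With well-separated modes like these, even a default bandwidth tends to show both bumps; the difficulties the abstract describes arise for less clear-cut data.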

    A Hierarchical Decision Model for Evaluating the Strategy Readiness of Quantitative Machine Learning/Data Science-Driven Investment Strategies

    Big data and computational technologies are increasingly important worldwide in asset and investment management. Many investment management firms are adopting these data science methods and technologies to improve performance across all investment processes. Researchers actively use these methods to develop more effective systematic investment strategies and to produce more reliable outcomes that are less vulnerable to human decision-making biases. However, the success of such a strategy depends heavily on the scientific rigor applied throughout the process. Best practices involve understanding how to make better decisions in the research design process, which raises the question of whether better decisions can be made in developing quantitative strategies. The decisions made in the research process are therefore crucial to developing successful quantitative strategies. Additionally, as this field is inherently multidisciplinary, it requires a systems-thinking approach that considers multiple perspectives to provide a clearer understanding of strategies often referred to as black boxes. The main objective of this research is therefore to develop a multi-criteria assessment framework and scoring decision support system to evaluate quantitative investment strategies that apply machine learning and data science techniques in their research and development. Subject matter experts will assess all framework perspectives derived from a systematic literature review to confirm their reliability. The perspectives consist of economic and financial foundations, a data perspective, a features perspective, a modeling perspective, and a performance perspective. The research methodology applied is the Hierarchical Decision Model (HDM), which provides a 360-degree view of the quantitative investment strategy and allows the concept to be improved and generalized to other asset classes and regions. Finally, this research helps investment researchers and professionals to focus on research process decisions, generating more hypotheses and developing financial theories to be tested empirically rather than cherry-picking investment strategies based on historical simulations.