A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters
Differences in modelling techniques and in how model performance is assessed typically affect the quality of knowledge extracted from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models on the Pima Indians Diabetes and the BUPA Liver Disorders datasets. Model performance is assessed using ROC curves and the Youden Index. Moving differences between sequentially fitted parameters are then extracted, and their respective probability density estimates are used to track their variability via an iterative graphical data visualisation technique developed for this purpose. Our results show that the proposed strategy separates the groups more robustly than the plain ROC/Youden approach, improves interpretability, and reduces over-fitting. Further, the algorithm can easily be understood by non-specialists and lends itself to multi-disciplinary use.
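A minimal sketch of the ROC/Youden step and the moving differences between sequentially fitted parameters, using synthetic data as a stand-in for the Pima and BUPA datasets (all names and parameter choices below are illustrative assumptions, not the authors' code):

```python
# Illustrative sketch: Youden Index selection on a ROC curve, plus moving
# differences between parameters fitted over repeated train/test splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve

# Synthetic stand-in for the Pima/BUPA data used in the paper.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

youden = []
for seed in range(20):  # repeated fits to track parameter variability
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = DecisionTreeClassifier(max_depth=3, random_state=seed).fit(Xtr, ytr)
    fpr, tpr, _ = roc_curve(yte, clf.predict_proba(Xte)[:, 1])
    youden.append(np.max(tpr - fpr))  # Youden Index J = max(TPR - FPR)

# Moving (first) differences between the sequentially fitted parameters.
diffs = np.diff(youden)
# Their probability density can then be estimated (e.g. with
# scipy.stats.gaussian_kde) and visualised iteratively, as the paper describes.
```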
A robust machine learning approach to SDG data segmentation
In light of recent technological advances in computing and the explosion of data, the complex interactions of the Sustainable Development Goals (SDGs) present both a challenge and an opportunity to researchers and decision makers across fields and sectors. The deep and wide socio-economic, cultural and technological variations across the globe call for a unified understanding of the SDG project. The complexity of SDG interactions and the dynamics of their indicators align naturally with technical and application specifics that require interdisciplinary solutions. We present a consilient approach to identifying the triggers of SDG indicators. Illustrated through data segmentation, it is designed to unify our understanding of the complex overlap of the SDGs by utilising data from different sources. The paper treats each SDG as a Big Data source node, with the potential to contribute towards a unified understanding of applications across the SDG spectrum. Data for five SDGs were extracted from the United Nations SDG indicators data repository and used to model spatio-temporal variations in search of robust and consilient scientific solutions. Based on a number of pre-determined assumptions about socio-economic and geo-political variations, the data were subjected to sequential analyses exploring distributional behaviour, component extraction and clustering. All three methods exhibit pronounced variations across samples, with initial distributional and data segmentation patterns isolating South Africa from the remaining five countries. Data randomness is handled via a specially developed algorithm for sampling, measuring and assessing, based on repeated samples of different sizes. Results exhibit consistent variations across samples, reflecting socio-economic, cultural and geo-political differences, and point towards a unified understanding across disciplines and sectors. The findings highlight novel paths towards attaining informative patterns for a unified understanding of the triggers of SDG indicators and open new paths for interdisciplinary research.
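A hedged sketch of the sequential pipeline the abstract describes (component extraction followed by clustering over repeated samples of different sizes); the data matrix and all sizes below are synthetic stand-ins for the UN SDG indicator series:

```python
# Minimal sketch of repeated-sampling segmentation: draw samples of varying
# size, extract components, cluster, and compare assignments for consistency.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))          # stand-in SDG indicator matrix

for n in (100, 200, 400):               # repeated samples of different sizes
    idx = rng.choice(len(X), size=n, replace=False)
    Z = StandardScaler().fit_transform(X[idx])
    scores = PCA(n_components=2).fit_transform(Z)   # component extraction
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
    # Cluster sizes across sample sizes are then compared for consistency,
    # mirroring the paper's sampling, measuring and assessing loop.
    print(n, np.bincount(labels))
```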
A Framework for Data-Driven Solutions with COVID-19 Illustrations
Data-driven solutions have long been keenly sought after as tools for driving the world's fast-changing business environment, with business leaders seeking to enhance decision-making processes within their organisations. In the current era of Big Data, applications of data tools to global, regional and national challenges have steadily grown in almost all fields across the globe. However, working in silos continues to impede research progress, creating knowledge gaps and challenges across geographical borders, legislations, sectors and fields. There are many examples of the challenges the world faces in tackling global issues, including the complex interactions of the 17 Sustainable Development Goals (SDGs) and the spatio-temporal variations in the impact of the ongoing COVID-19 pandemic. Both challenges are non-orthogonal, strongly correlated and require an interdisciplinary approach. We present a generic framework for filling such gaps, based on two data-driven algorithms that combine data, machine learning and interdisciplinarity to bridge societal knowledge gaps. The novelty of the algorithms derives from their robust built-in mechanics for handling data randomness. Animation applications on structured COVID-19 data obtained from the European Centre for Disease Prevention and Control (ECDC) and the UK Office for National Statistics exhibit great potential for decision-support systems. Predictive findings are based on unstructured data: a large COVID-19 X-ray dataset of 3181 image files obtained from GitHub and Kaggle. Our results exhibit consistent performance across samples, resonating with cross-disciplinary discussions on novel paths for data-driven interdisciplinary research.
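One way to picture the framework's built-in handling of data randomness is repeated resampling and assessment of a summary statistic. The sketch below is purely illustrative (hypothetical column names and synthetic counts, not the ECDC schema or the paper's algorithms):

```python
# Hedged sketch: assess the stability of a per-country estimate under
# repeated bootstrap samples of a daily-cases table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for a structured case-count table.
df = pd.DataFrame({"country": rng.choice(list("ABC"), 900),
                   "cases": rng.poisson(50, 900)})

estimates = []
for _ in range(200):                      # repeated samples of the data
    boot = df.sample(frac=1.0, replace=True)
    estimates.append(boot.groupby("country")["cases"].mean())

spread = pd.concat(estimates, axis=1).std(axis=1)  # between-sample variability
print(spread)  # low spread indicates consistent performance across samples
```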
Dealing with Randomness and Concept Drift in Large Datasets
Data-driven solutions to societal challenges continue to bring new dimensions to our daily lives. For example, while good-quality education is a well-acknowledged foundation of sustainable development, innovation and creativity, variations in student attainment and general performance remain commonplace. Developing data-driven solutions hinges on two fronts: technical and application. The former relates to the modelling perspective, where two of the major challenges are the impact of data randomness and general variations in definitions, typically referred to as concept drift in machine learning. The latter relates to devising data-driven solutions to real-life challenges such as identifying potential triggers of pedagogical performance, which aligns with Sustainable Development Goal (SDG) 4: Quality Education. A total of 3145 pedagogical data points were obtained from the central data collection platform of the United Arab Emirates (UAE) Ministry of Education (MoE). Using simple data visualisation and machine learning techniques via a generic algorithm for sampling, measuring and assessing, the paper highlights research pathways for educationists and data scientists to attain unified goals in an interdisciplinary context. Its novelty derives from its embedded capacity to address data randomness and concept drift by minimising modelling variations and yielding consistent results across samples. Results show that intricate relationships among data attributes describe the invariant conditions that practitioners in the two overlapping fields of data science and education must identify.
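The drift side of the problem can be illustrated by training on early data and measuring accuracy on later batches. Everything below (sizes, the drift mechanism, the classifier) is an assumed stand-in for exposition, not the paper's algorithm or the MoE data:

```python
# Illustrative sketch: concept drift surfaces as falling accuracy when a
# model trained on early data is assessed on sequential later batches.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n, d = 3145, 6
X = rng.normal(size=(n, d))
drift = np.linspace(0.0, 1.5, n)                  # gradually shifting concept
y = (X[:, 0] + drift * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = LogisticRegression().fit(X[:1000], y[:1000])  # train on early records
for start in range(1000, n - 500, 500):               # assess later batches
    batch = slice(start, start + 500)
    print(start, accuracy_score(y[batch], model.predict(X[batch])))
# A downward accuracy trend flags drift; resampling and retraining, as in the
# paper's sampling-measuring-assessing loop, restores consistency.
```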
A statistical downscaling framework for environmental mapping
In recent years, knowledge extraction from data has become increasingly popular, with numerical forecasting models mainly falling into two major categories: chemical transport models (CTMs) and conventional statistical methods. However, due to data and model variability, data-driven knowledge extraction from high-dimensional, multifaceted data in such applications requires generalisation from global to regional or local conditions. Typically, generalisation is achieved by mapping global conditions to local ecosystems and human habitats, which amounts to tracking and monitoring environmental dynamics in various geographical areas and their regional and global implications for human livelihood. Statistical downscaling techniques have been widely used to extract high-resolution information from regional-scale variables produced by CTMs in climate modelling. Conventional applications of these methods are predominantly dimension-reduction in nature, designed to reduce the spatial dimension of gridded model outputs without loss of essential spatial information. Their downside is twofold: complete dependence on an unlabelled design matrix and reliance on underlying distributional assumptions. We propose a novel statistical downscaling framework for dealing with data and model variability. Its power derives from training and testing multiple models on multiple samples, narrowing global environmental phenomena down to regional discordance through dimension reduction and visualisation. Hourly ground-level ozone observations were obtained from environmental stations maintained by the US Environmental Protection Agency, covering the summer period (June–August 2005). Regional patterns of ozone are related to local observations via repeated runs and performance assessment of multiple versions of empirical orthogonal functions (or principal components) and principal fitted components, via an algorithm with fully adaptable parameters. We demonstrate how the algorithm can be extended to weather-dependent and other applications with inherent data randomness and model variability via its built-in interdisciplinary computational power connecting data sources with end-users.
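A minimal sketch of the empirical orthogonal function (EOF) step, assuming an hours-by-stations matrix of ozone readings (the 2208 x 60 array below is a synthetic placeholder; the paper uses EPA station data for June–August 2005):

```python
# Sketch of EOF extraction via PCA on centred station readings: components_
# give spatial patterns (EOFs), the projections give temporal amplitudes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Stand-in: 92 days x 24 hours of readings at 60 stations.
ozone = rng.gamma(shape=2.0, scale=20.0, size=(2208, 60))

anomalies = ozone - ozone.mean(axis=0)      # remove station means
pca = PCA(n_components=4).fit(anomalies)
eofs = pca.components_                      # spatial patterns (EOFs)
pcs = pca.transform(anomalies)              # temporal amplitudes (PCs)
print(pca.explained_variance_ratio_)
# Repeating the fit over multiple samples and comparing EOF loadings is the
# route the paper takes to separate regional patterns from local noise.
```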
A sequential data mining method for modelling solar magnetic cycles
We propose an adaptive data-driven approach to modelling solar magnetic activity cycles based on a sequential link between unsupervised and supervised modelling. Monthly sunspot numbers spanning hundreds of years, from the mid-18th century to the first quarter of 2012, obtained from the Royal Greenwich Observatory, provide a reliable source of training and validation sets. An indicator variable is used to generate class labels and internal parameters that are used to separate high- from low-activity cycles. Our results show that by maximising data-dependent parameters and using them as inputs to a support vector machine model we obtain comparatively more robust and reliable predictions. Finally, we demonstrate how the method can be adapted to other unsupervised and supervised modelling applications.
Keywords: Data clustering – Data mining – Predictive modelling – Solar magnetic activity – Sunspots – Supervised modelling – Support vector machines – Unsupervised modelling
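A hedged sketch of the sequential unsupervised-to-supervised link described above, using a synthetic sunspot-like series and a median threshold as the indicator variable (the threshold, lag length and kernel are illustrative assumptions, not the paper's tuned parameters):

```python
# Illustrative sketch: derive class labels from an indicator variable on the
# series, then feed lagged counts to a support vector machine classifier.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
months = np.arange(3000)
# Synthetic stand-in for the Royal Greenwich Observatory monthly sunspots,
# with a rough 11-year (132-month) cycle plus noise.
sunspots = 80 + 60 * np.sin(2 * np.pi * months / 132) + rng.normal(0, 15, 3000)

# Indicator variable: flag high-activity months (the unsupervised step).
labels = (sunspots > np.median(sunspots)).astype(int)

# Lagged sunspot counts as features for the supervised SVM step.
lag = 12
X = np.column_stack([sunspots[i:len(sunspots) - lag + i] for i in range(lag)])
y = labels[lag:]

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
svm = SVC(kernel="rbf").fit(Xtr, ytr)
print("held-out accuracy:", svm.score(Xte, yte))
```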