1,035 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
A Comprehensive Survey on Rare Event Prediction
Rare event prediction involves identifying and forecasting events with a low
probability using machine learning and data analysis. Due to the imbalanced
data distributions, where the frequency of common events vastly outweighs that
of rare events, it requires using specialized methods within each step of the
machine learning pipeline, i.e., from data processing to algorithms to
evaluation protocols. Predicting the occurrences of rare events is important
for real-world applications, such as Industry 4.0, and is an active research
area in statistical and machine learning. This paper comprehensively reviews
the current approaches for rare event prediction along four dimensions: rare
event data, data processing, algorithmic approaches, and evaluation approaches.
Specifically, we consider 73 datasets from different modalities (i.e.,
numerical, image, text, and audio), four major categories of data processing,
five major algorithmic groupings, and two broader evaluation approaches. This
paper aims to identify gaps in the current literature and highlight the
challenges of predicting rare events. It also suggests potential research
directions, which can help guide practitioners and researchers.Comment: 44 page
Seeking multiple solutions:an updated survey on niching methods and their applications
Multi-Modal Optimization (MMO) aiming to locate multiple optimal (or near-optimal) solutions in a single simulation run has practical relevance to problem solving across many fields. Population-based meta-heuristics have been shown particularly effective in solving MMO problems, if equipped with specificallydesigned diversity-preserving mechanisms, commonly known as niching methods. This paper provides an updated survey on niching methods. The paper first revisits the fundamental concepts about niching and its most representative schemes, then reviews the most recent development of niching methods, including novel and hybrid methods, performance measures, and benchmarks for their assessment. Furthermore, the paper surveys previous attempts at leveraging the capabilities of niching to facilitate various optimization tasks (e.g., multi-objective and dynamic optimization) and machine learning tasks (e.g., clustering, feature selection, and learning ensembles). A list of successful applications of niching methods to real-world problems is presented to demonstrate the capabilities of niching methods in providing solutions that are difficult for other optimization methods to offer. The significant practical value of niching methods is clearly exemplified through these applications. Finally, the paper poses challenges and research questions on niching that are yet to be appropriately addressed. Providing answers to these questions is crucial before we can bring more fruitful benefits of niching to real-world problem solving
Can Conversations on Reddit Forecast Future Economic Uncertainty? An Interpretable Machine Learning Approach
In recent years, social media has become an indispensable source of information through which public attitudes, opinions, and concerns can be studied and quantified. This paper proposes an interpretable machine learning framework for predicting the Equity Market-related Economic Uncertainty Index using features generated from a popular discussion forum on Reddit. Our framework consists of a series of custom preprocessing and analytics methods, including BERTopic for latent topic identification and regularized linear models. Using our framework, we evaluate explanatory models with different configurations over a large corpus of Reddit posts belonging to the personal finance category. Our analysis generates valuable insights about discussion topics on Reddit and their efficacy in accurately predicting future economic uncertainty. The study demonstrates the potential of using social media data and interpretable machine learning to inform economic forecasting research
Recommended from our members
Ageneric predictive information system for resource planning and optimisation
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel UniversityThe purpose of this research work is to demonstrate the feasibility of creating a quick response decision platform for middle management in industry. It utilises the strengths of current, but more importantly creates a leap forward in the theory and practice of Supervisory and Data Acquisition (SCADA) systems and Discrete Event Simulation and Modelling (DESM). The proposed research platform uses real-time data and creates an automatic platform for real-time and predictive system analysis, giving current and ahead of time information on the performance of the system in an efficient manner. Data acquisition as the backend connection of data integration system to the shop floor faces both hardware and software challenges for coping with large scale real-time data collection. Limited scope of SCADA systems does not make them suitable candidates for this. Cost effectiveness, complexity, and efficiency-orientation of proprietary solutions leave space for more challenge. A Flexible Data Input Layer Architecture (FDILA) is proposed to address generic data integration platform so a multitude of data sources can be connected to the data processing unit. The efficiency of the proposed integration architecture lies in decentralising and distributing services between different layers. A novel Sensitivity Analysis (SA) method called EvenTracker is proposed as an effective tool to measure the importance and priority of inputs to the system. The EvenTracker method is introduced to deal with the complexity systems in real-time. The approach takes advantage of event-based definition of data involved in process flow. The underpinning logic behind EvenTracker SA method is capturing the cause-effect relationships between triggers (input variables) and events (output variables) at a specified period of time determined by an expert. The approach does not require estimating data distribution of any kind. Neither the performance model requires execution beyond the real-time. The proposed EvenTracker sensitivity analysis method has the lowest computational complexity compared with other popular sensitivity analysis methods. For proof of concept, a three tier data integration system was designed and developed by using National Instruments’ LabVIEW programming language, Rockwell Automation’s Arena simulation and modelling software, and OPC data communication software. A laboratory-based conveyor system with 29 sensors was installed to simulate a typical shop floor production line. In addition, EvenTracker SA method has been implemented on the data extracted from 28 sensors of one manufacturing line in a real factory. The experiment has resulted 14% of the input variables to be unimportant for evaluation of model outputs. The method proved a time efficiency gain of 52% on the analysis of filtered system when unimportant input variables were not sampled anymore. The EvenTracker SA method compared to Entropy-based SA technique, as the only other method that can be used for real-time purposes, is quicker, more accurate and less computationally burdensome. Additionally, theoretic estimation of computational complexity of SA methods based on both structural complexity and energy-time analysis resulted in favour of the efficiency of the proposed EvenTracker SA method. Both laboratory and factory-based experiments demonstrated flexibility and efficiency of the proposed solution.The Engineering and Physical Sciences Research Council
A Survey on Explainable Anomaly Detection
In the past two decades, most research on anomaly detection has focused on
improving the accuracy of the detection, while largely ignoring the
explainability of the corresponding methods and thus leaving the explanation of
outcomes to practitioners. As anomaly detection algorithms are increasingly
used in safety-critical domains, providing explanations for the high-stakes
decisions made in those domains has become an ethical and regulatory
requirement. Therefore, this work provides a comprehensive and structured
survey on state-of-the-art explainable anomaly detection techniques. We propose
a taxonomy based on the main aspects that characterize each explainable anomaly
detection technique, aiming to help practitioners and researchers find the
explainable anomaly detection method that best suits their needs.Comment: Paper accepted by the ACM Transactions on Knowledge Discovery from
Data (TKDD) for publication (preprint version
Scientific Workflows for Metabolic Flux Analysis
Metabolic engineering is a highly interdisciplinary research domain that interfaces biology, mathematics, computer science, and engineering. Metabolic flux analysis with carbon tracer experiments (13 C-MFA) is a particularly challenging metabolic engineering application that consists of several tightly interwoven building blocks such as modeling, simulation, and experimental design. While several general-purpose workflow solutions have emerged in recent years to support the realization of complex scientific applications, the transferability of these approaches are only partially applicable to 13C-MFA workflows. While problems in other research fields (e.g., bioinformatics) are primarily centered around scientific data processing, 13C-MFA workflows have more in common with business workflows. For instance, many bioinformatics workflows are designed to identify, compare, and annotate genomic sequences by "pipelining" them through standard tools like BLAST. Typically, the next workflow task in the pipeline can be automatically determined by the outcome of the previous step. Five computational challenges have been identified in the endeavor of conducting 13 C-MFA studies: organization of heterogeneous data, standardization of processes and the unification of tools and data, interactive workflow steering, distributed computing, and service orientation. The outcome of this thesis is a scientific workflow framework (SWF) that is custom-tailored for the specific requirements of 13 C-MFA applications. The proposed approach – namely, designing the SWF as a collection of loosely-coupled modules that are glued together with web services – alleviates the realization of 13C-MFA workflows by offering several features. By design, existing tools are integrated into the SWF using web service interfaces and foreign programming language bindings (e.g., Java or Python). Although the attributes "easy-to-use" and "general-purpose" are rarely associated with distributed computing software, the presented use cases show that the proposed Hadoop MapReduce framework eases the deployment of computationally demanding simulations on cloud and cluster computing resources. An important building block for allowing interactive researcher-driven workflows is the ability to track all data that is needed to understand and reproduce a workflow. The standardization of 13 C-MFA studies using a folder structure template and the corresponding services and web interfaces improves the exchange of information for a group of researchers. Finally, several auxiliary tools are developed in the course of this work to complement the SWF modules, i.e., ranging from simple helper scripts to visualization or data conversion programs. This solution distinguishes itself from other scientific workflow approaches by offering a system of loosely-coupled components that are flexibly arranged to match the typical requirements in the metabolic engineering domain. Being a modern and service-oriented software framework, new applications are easily composed by reusing existing components
- …