Search CORE

371 research outputs found

Structural learning for continuous data using graphical models

Author: Szili Benjamin
Publication venue
Publication date: 01/01/2022
Field of study

The field of statistics, and science as a whole, has continuously been improving the study of increasingly complex structures, focusing not only on their individual components but also on how they interact with each other and on their dependencies. This thesis is centered around learning the structure of data using graphical models. Graphical models allow the visual representation of dependence relations between a set of random variables through graphs, specified by some type of model formula. Whether the goal is probabilistic inference, mainly dealing with belief propagation, or causal inference, focusing on interventions, it is very important to be able to learn the underlying structure given a set of variables. In graph theory, this can be achieved by applying structural learning algorithms, based either on constraints or on some scoring function. While a number of such algorithms have been used extensively and are effective, they are somewhat limited by the type of relationships they can learn. Current methods excel at learning a network structure when the variables of interest are discrete or, if continuous, Gaussian. The main objective of this thesis is therefore to create a structural learning algorithm that is able to establish dependence relations between variables that are neither discrete nor Gaussian. Initially, relevant literature and key methodology on graphical models, kernel methods and information theory were reviewed. The aim was to use a measure that can reliably detect pairwise and conditional dependencies between random variables that are not necessarily Gaussian. The main contribution of the thesis is then the incorporation of kernel methods and Mutual Information to create a new structural learning algorithm, using graphs to visualize the learned structure. The resulting algorithm, with three variants, is then applied in a number of settings. 
First, a simulation setup using the post non-linear noise model on synthetic data was examined to compare the performance of the new algorithm to a current approach. As the structure was known from the setup, the focus in this context was to see whether the algorithm successfully learns the structure. The remaining two settings then present cases where the new algorithm can be applied to provide insight and improve inference by accounting for the dependence structure. One of these settings focuses on distinguishing handwritten digits, initially using Gaussian Process latent variable models (GP-LVMs). The second setting then applies the algorithm to the field of phonetics, with the task of identifying speakers based on sound data.
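
The point that Gaussian-oriented measures can miss non-linear dependence can be illustrated with a minimal sketch. It is not the thesis's algorithm: the histogram (plug-in) estimator, the function name `mutual_information`, the bin count and the toy data are all assumptions chosen for illustration. It compares Pearson correlation with an estimate of mutual information on a purely quadratic relationship.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in (histogram) estimate of I(X;Y) in nats. Illustrative only."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities per cell
    px = pxy.sum(axis=1, keepdims=True)       # marginal of X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of Y, shape (1, bins)
    mask = pxy > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

# Toy data (assumed): Y depends on X only through a symmetric non-linear map.
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2

corr = np.corrcoef(x, y)[0, 1]   # linear correlation is essentially zero
mi = mutual_information(x, y)    # mutual information is clearly positive
print(f"correlation = {corr:.3g}, mutual information = {mi:.3g} nats")
```

A histogram estimator is biased and sensitive to the bin count, which is one reason kernel-based dependence measures are often preferred in practice.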

Glasgow Theses Service

Learning Bayesian network equivalence classes using ant colony optimisation

Author: Daly Rónán
Publication venue: The University of Edinburgh
Publication date: 01/01/2009
Field of study

Bayesian networks have become an indispensable tool in the modelling of uncertain knowledge. Conceptually, they consist of two parts: a directed acyclic graph called the structure, and conditional probability distributions attached to each node, known as the parameters. As a result of their expressiveness, understandability and rigorous mathematical basis, Bayesian networks have become one of the first methods investigated when faced with an uncertain problem domain. However, a recurring problem persists in specifying a Bayesian network. Both the structure and parameters can be difficult for experts to conceive, especially if their knowledge is tacit. To counteract these problems, research has been ongoing on learning both the structure and parameters of Bayesian networks from data. Whilst there are simple methods for learning the parameters, learning the structure has proved harder. Part of this stems from the NP-hardness of the problem and the super-exponential space of possible structures. To help solve this task, this thesis seeks to employ a relatively new technique that has had much success in tackling NP-hard problems. This technique is called ant colony optimisation. Ant colony optimisation is a metaheuristic based on the behaviour of ants acting together in a colony. It uses the stochastic activity of artificial ants to find good solutions to combinatorial optimisation problems. In the current work, this method is applied to the problem of searching through the space of equivalence classes of Bayesian networks, in order to find a good match against a set of data. The system uses operators that evaluate potential modifications to a current state. Each of the modifications is scored and the results used to inform the search. In order to facilitate these steps, other techniques are also devised to speed up the learning process. 
The techniques include [...]. The techniques are tested by sampling data from gold standard networks and learning structures from this sampled data. These structures are analysed using various goodness-of-fit measures to see how well the algorithms perform. The measures include structural similarity metrics and Bayesian scoring metrics. The results are compared in depth against systems that also use ant colony optimisation and against other methods, including evolutionary programming and greedy heuristics. Comparisons are also made to well-known state-of-the-art algorithms, and a study is performed on a real-life data set. The results show favourable performance compared to the other methods and on modelling the real-life data.

Edinburgh Research Archive

Multivariate Models and Algorithms for Systems Biology

Author: Acharya Lipi Rani
Publication venue: ScholarWorks@UNO
Publication date: 17/12/2011
Field of study

Rapid advances in high-throughput data acquisition technologies, such as microarrays and next-generation sequencing, have enabled scientists to interrogate the expression levels of tens of thousands of genes simultaneously. However, challenges remain in developing effective computational methods for analyzing data generated from such platforms. In this dissertation, we address some of these challenges. We divide our work into two parts. In the first part, we present a suite of multivariate approaches for a reliable discovery of gene clusters, often interpreted as pathway components, from molecular profiling data with replicated measurements. We translate our goal into learning an optimal correlation structure from replicated complete and incomplete measurements. In the second part, we focus on the reconstruction of signal transduction mechanisms in the signaling pathway components. We propose gene set based approaches for inferring the structure of a signaling pathway. First, we present a constrained multivariate Gaussian model, referred to as the informed-case model, for estimating the correlation structure from replicated and complete molecular profiling data. The informed-case model generalizes the previously known blind-case model by accommodating prior knowledge of replication mechanisms. Second, we generalize the blind-case model by designing a two-component mixture model. Our idea is to strike an optimal balance between a fully constrained correlation structure and an unconstrained one. Third, we develop an Expectation-Maximization algorithm to infer the underlying correlation structure from replicated molecular profiling data with missing (incomplete) measurements. We utilize our correlation estimators for clustering real-world replicated complete and incomplete molecular profiling data sets. The above three components constitute the first part of the dissertation. 
For the structural inference of signaling pathways, we hypothesize a directed signal pathway structure as an ensemble of overlapping and linear signal transduction events. We then propose two algorithms to reverse engineer the underlying signaling pathway structure using unordered gene sets corresponding to signal transduction events. Throughout, we treat gene sets as variables and the associated gene orderings as random. The first algorithm has been developed under the Gibbs sampling framework and the second algorithm utilizes the framework of simulated annealing. Finally, we summarize our findings and discuss possible future directions.

Multivariate Models and Algorithms for Systems Biology

Author: Acharya Lipi Rani
Publication venue: ScholarWorks@UNO
Publication date: 01/01/2011
Field of study

CiteSeerX

University of New Orleans

Agreement among human and annotated transcriptions of global songs

Author: 22nd International Society for Music Information Retrieval Conference (ISMIR)
Benetos E
Fujii S
Fukatsu H
Kondo H
McBride J
Ozaki Y
Pfordresher PQ
Proutskova P
Sakai E
Savage PE
Six J
T. Tierney A
Publication venue: International Society for Music Information Retrieval
Publication date: 09/11/2021
Field of study

Cross-cultural musical analysis requires standardized symbolic representation of sounds such as score notation. However, transcription into notation is usually conducted manually by ear, which is time-consuming and subjective. Our aim is to evaluate the reliability of existing methods for transcribing songs from diverse societies. We had 3 experts independently transcribe a sample of 32 excerpts of traditional monophonic songs from around the world (half a cappella, half with instrumental accompaniment). 16 songs also had pre-existing transcriptions created by 3 different experts. We compared these human transcriptions against one another and against 10 automatic music transcription algorithms. We found that human transcriptions can be sufficiently reliable (~90% agreement, κ ~.7), but current automated methods are not (<60% agreement, κ <.4). No automated method clearly outperformed others, in contrast to our predictions. These results suggest that improving automated methods for cross-cultural music transcription is critical for diversifying MIR.
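
The two reliability figures quoted above (percent agreement and κ) can be made concrete with a small sketch. It assumes that κ denotes Cohen's kappa computed over aligned note labels from two transcribers. The abstract does not state the exact statistic or alignment procedure, and the note sequences `t1` and `t2` below are invented toy data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of aligned positions where both transcriptions agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(a)
    p_o = percent_agreement(a, b)                   # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each transcriber's own label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical aligned pitch labels from two transcribers (toy data).
t1 = ["C", "D", "E", "C", "G", "E", "D", "C", "G", "E"]
t2 = ["C", "D", "E", "C", "G", "E", "D", "C", "G", "D"]

agreement = percent_agreement(t1, t2)
kappa = cohens_kappa(t1, t2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement because it discounts matches that would occur by chance given how often each label is used.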

Queen Mary Research Online

Structural Influence of gene networks on their inference: Analysis of C3NET

Author: Altay Gokmen
Emmert-Streib Frank
Publication venue
Publication date: 22/06/2011
Field of study

RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.

Abstract

Background: The availability of large-scale high-throughput data poses considerable challenges for their functional analysis. For this reason, gene network inference methods have gained considerable interest. However, our current knowledge, especially about the influence of the structure of a gene network on its inference, is limited.

Results: In this paper we present a comprehensive investigation of the structural influence of gene networks on the inferential characteristics of C3NET - a recently introduced gene network inference algorithm. We employ local as well as global performance metrics in combination with an ensemble approach. The results from our numerical study for various biological and synthetic network structures and simulation conditions, also comparing C3NET with other inference algorithms, lead to a multitude of theoretical and practical insights into the working behavior of C3NET. In addition, in order to facilitate the practical usage of C3NET we provide a user-friendly R package, called c3net, and describe its functionality. It is available from https://r-forge.r-project.org/projects/c3net and from the CRAN package repository.

Conclusions: The availability of gene network inference algorithms with known inferential properties opens a new era of large-scale screening experiments that could be equally beneficial for basic biological and biomedical research with auspicious prospects. The availability of our easy-to-use software package c3net may contribute to the popularization of such methods.

Reviewers: This article was reviewed by Lev Klebanov, Joel Bader and Yuriy Gusev.

Peer Reviewed

Queen's University Belfast Research Portal

Springer - Publisher Connector

PubMed Central

Apollo (Cambridge)

Information Processing Equalities and the Information-Risk Bridge

Author: Cranko Zac
Williamson Robert C.
Publication venue
Publication date: 08/09/2023
Field of study

We introduce two new classes of measures of information for statistical experiments which generalise and subsume $\phi$-divergences, integral probability metrics, $\mathfrak{N}$-distances (MMD), and $(f,\Gamma)$ divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational $\phi$-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergence are closed under the action of Markov operators which yields an information processing equality which is a refinement and generalisation of the classical data processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization.

Comment: 48 pages; corrected some typos and added a few additional explanations
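
For orientation, the classical data processing inequality that the abstract says is refined can be written as follows for a $\phi$-divergence. The notation is an assumption for illustration and is not taken from the paper: $P \ll Q$ are two distributions, $\phi$ is convex with $\phi(1) = 0$, and $T$ is a Markov operator. The paper's information processing equality itself is not stated in the abstract and is not reproduced here.

```latex
% Classical data processing inequality for a \phi-divergence.
% Assumed notation: P \ll Q, \phi convex with \phi(1) = 0, T a Markov operator.
\[
  D_\phi(P \,\|\, Q) = \int \phi\!\left(\frac{dP}{dQ}\right) dQ,
  \qquad
  D_\phi(TP \,\|\, TQ) \;\le\; D_\phi(P \,\|\, Q).
\]
```

In words, passing both distributions through the same noisy channel cannot make them easier to tell apart.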

arXiv.org e-Print Archive

Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

Author: Aldraimli M.
Publication venue: University of Westminster
Publication date: 01/01/2023
Field of study

This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, this thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performances, in line with experts' success criteria, than the traditional imbalanced data resampling techniques. 
Finally, the use of the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, the newly developed methods within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.

WestminsterResearch


Machine Learning Methods for Cancer Immunology

Author: Chlon Leon
Publication venue: University of Cambridge
Publication date: 02/11/2017
Field of study

Tumours are highly heterogeneous collections of tissues characterised by a repertoire of heavily mutated and rapidly proliferating cells. Evading immune destruction is a fundamental hallmark of cancer, and elucidating the contextual basis of tumour-infiltrating leukocytes is pivotal for improving immunotherapy initiatives. However, progress in this domain is hindered by an incomplete characterisation of the regulatory mechanisms involved in cancer immunity. Addressing this challenge, this thesis is formulated around a fundamental line of inquiry: how do we quantitatively describe the immune system with respect to tumour heterogeneity? Describing the molecular interactions between cancer cells and the immune system is a fundamental goal of cancer immunology. The first part of this thesis describes a three-stage association study to address this challenge in pancreatic ductal adenocarcinoma (PDAC). Firstly, network-based approaches are used to characterise PDAC on the basis of transcription factor regulators of an oncogenic KRAS signature. Next, gene expression tools are used to resolve the leukocyte subset mixing proportions, stromal contamination, immune checkpoint expression and immune pathway dysregulation from the data. Finally, partial correlations are used to characterise immune features in terms of KRAS master regulator activity. The results are compared across two independent cohorts for consistency. Moving beyond associations, the second part of the dissertation introduces a causal modelling approach to infer directed interactions between signaling pathway activity and immune agency. This is achieved by anchoring the analysis on somatic genomic changes. In particular, copy number profiles, transcriptomic data, image data and a protein-protein interaction network are integrated using graphical modelling approaches to infer directed relationships. Generated models are compared between independent cohorts and orthogonal datasets to evaluate consistency. 
Finally, proposed mechanisms are cross-referenced against literature examples to test for legitimacy. In summary, this dissertation provides methodological contributions, at the levels of associative and causal inference, for inferring the contextual basis for tumour-specific immune agency. This PhD was supported by the Cancer Research UK and Engineering and Physical Sciences Research Council Imaging Centre in Cambridge and Manchester (C197/A16465).

Apollo (Cambridge)

Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

Author: Ferreira N.
Oliveira M.
Publication venue: CFE and CMStatistics networks
Publication date: 01/01/2015
Field of study

The present paper explores the technical efficiency of four hotels from Teixeira Duarte Group - a renowned Portuguese hotel chain. An efficiency ranking is established for these four hotel units located in Portugal using Stochastic Frontier Analysis. This methodology discriminates between measurement error and systematic inefficiencies in the estimation process, making it possible to investigate the main causes of inefficiency. Several suggestions concerning efficiency improvement are offered for each hotel studied.
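
The separation of measurement error from systematic inefficiency can be read off the textbook cross-sectional stochastic frontier model, sketched below. The paper's actual specification (functional form, inputs, and the distribution of the inefficiency term) is not given in the abstract, so the log-output form and the symbols $y_i$, $x_i$, $\beta$, $v_i$, $u_i$ are assumptions for illustration.

```latex
% Textbook stochastic frontier model (assumed form, output in logs).
% v_i: symmetric measurement noise; u_i: one-sided inefficiency term.
\[
  \ln y_i = x_i^{\top}\beta + v_i - u_i,
  \qquad v_i \sim \mathcal{N}(0, \sigma_v^2),
  \qquad u_i \ge 0,
\]
\[
  \mathrm{TE}_i = \exp(-u_i) \in (0, 1].
\]
```

Under this form, a unit with $u_i = 0$ lies on the frontier ($\mathrm{TE}_i = 1$), and ranking units by $\mathrm{TE}_i$ gives an efficiency ranking of the kind the abstract describes.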

Repositório Institucional do ISCTE-IUL