7,939 research outputs found
Local Binary Patterns as a Feature Descriptor in Alignment-free Visualisation of Metagenomic Data
Shotgun sequencing has facilitated the analysis of complex microbial communities. However, clustering and visualising these communities without prior taxonomic information is a major challenge. Feature descriptor methods can be utilised to extract these taxonomic relations from the data. Here, we present a novel approach consisting of local binary patterns (LBP) coupled with randomised singular value decomposition (RSVD) and Barnes-Hut t-stochastic neighbor embedding (BH-tSNE) to highlight the underlying taxonomic structure of the metagenomic data. The effectiveness of our approach is demonstrated using several simulated and a real metagenomic datasets
An Overview of the Use of Neural Networks for Data Mining Tasks
In the recent years the area of data mining has experienced a considerable demand for technologies that extract knowledge from large and complex data sources. There is a substantial commercial interest as well as research investigations in the area that aim to develop new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies, whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks
Graph-based Regularization in Machine Learning: Discovering Driver Modules in Biological Networks
Curiosity of human nature drives us to explore the origins of what makes each of us different. From ancient legends and mythology, Mendel\u27s law, Punnett square to modern genetic research, we carry on this old but eternal question. Thanks to technological revolution, today\u27s scientists try to answer this question using easily measurable gene expression and other profiling data. However, the exploration can easily get lost in the data of growing volume, dimension, noise and complexity. This dissertation is aimed at developing new machine learning methods that take data from different classes as input, augment them with knowledge of feature relationships, and train classification models that serve two goals: 1) class prediction for previously unseen samples; 2) knowledge discovery of the underlying causes of class differences. Application of our methods in genetic studies can help scientist take advantage of existing biological networks, generate diagnosis with higher accuracy, and discover the driver networks behind the differences. We proposed three new graph-based regularization algorithms. Graph Connectivity Constrained AdaBoost algorithm combines a connectivity module, a deletion function, and a model retraining procedure with the AdaBoost classifier. Graph-regularized Linear Programming Support Vector Machine integrates penalty term based on submodular graph cut function into linear classifier\u27s objective function. Proximal Graph LogisticBoost adds lasso and graph-based penalties into logistic risk function of an ensemble classifier. Results of tests of our models on simulated biological datasets show that the proposed methods are able to produce accurate, sparse classifiers, and can help discover true genetic differences between phenotypes
Systems approaches and algorithms for discovery of combinatorial therapies
Effective therapy of complex diseases requires control of highly non-linear
complex networks that remain incompletely characterized. In particular, drug
intervention can be seen as control of signaling in cellular networks.
Identification of control parameters presents an extreme challenge due to the
combinatorial explosion of control possibilities in combination therapy and to
the incomplete knowledge of the systems biology of cells. In this review paper
we describe the main current and proposed approaches to the design of
combinatorial therapies, including the empirical methods used now by clinicians
and alternative approaches suggested recently by several authors. New
approaches for designing combinations arising from systems biology are
described. We discuss in special detail the design of algorithms that identify
optimal control parameters in cellular networks based on a quantitative
characterization of control landscapes, maximizing utilization of incomplete
knowledge of the state and structure of intracellular networks. The use of new
technology for high-throughput measurements is key to these new approaches to
combination therapy and essential for the characterization of control
landscapes and implementation of the algorithms. Combinatorial optimization in
medical therapy is also compared with the combinatorial optimization of
engineering and materials science and similarities and differences are
delineated.Comment: 25 page
Recommended from our members
iBBiG: iterative binary bi-clustering of gene sets
Motivation: Meta-analysis of genomics data seeks to identify genes associated with a biological phenotype across multiple datasets; however, merging data from different platforms by their features (genes) is challenging. Meta-analysis using functionally or biologically characterized gene sets simplifies data integration is biologically intuitive and is seen as having great potential, but is an emerging field with few established statistical methods. Results: We transform gene expression profiles into binary gene set profiles by discretizing results of gene set enrichment analyses and apply a new iterative bi-clustering algorithm (iBBiG) to identify groups of gene sets that are coordinately associated with groups of phenotypes across multiple studies. iBBiG is optimized for meta-analysis of large numbers of diverse genomics data that may have unmatched samples. It does not require prior knowledge of the number or size of clusters. When applied to simulated data, it outperforms commonly used clustering methods, discovers overlapping clusters of diverse sizes and is robust in the presence of noise. We apply it to meta-analysis of breast cancer studies, where iBBiG extracted novel gene set—phenotype association that predicted tumor metastases within tumor subtypes
Evolutionary Computation and QSAR Research
[Abstract] The successful high throughput screening of molecule libraries for a specific biological property is one of the main improvements in drug discovery. The virtual molecular filtering and screening relies greatly on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with a predicted toxic effect and poor pharmacokinetic profiles, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning or artificial intelligence. QSAR modeling relies on three main steps: molecular structure codification into molecular descriptors, selection of relevant variables in the context of the analyzed activity, and search of the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. Thus, this review explains the basic of the genetic algorithms and genetic programming as evolutionary computation approaches, the selection methods for high-dimensional data in QSAR, the methods to build QSAR models, the current evolutionary feature selection methods and applications in QSAR and the future trend on the joint or multi-task feature selection methods.Instituto de Salud Carlos III, PIO52048Instituto de Salud Carlos III, RD07/0067/0005Ministerio de Industria, Comercio y Turismo; TSI-020110-2009-53)Galicia. Consellería de Economía e Industria; 10SIN105004P
Analysis and Computational Dissection of Molecular Signature Multiplicity
Molecular signatures are computational or mathematical models created to diagnose disease and other phenotypes and to predict clinical outcomes and response to treatment. It is widely recognized that molecular signatures constitute one of the most important translational and basic science developments enabled by recent high-throughput molecular assays. A perplexing phenomenon that characterizes high-throughput data analysis is the ubiquitous multiplicity of molecular signatures. Multiplicity is a special form of data analysis instability in which different analysis methods used on the same data, or different samples from the same population lead to different but apparently maximally predictive signatures. This phenomenon has far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments. Currently the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. We present a formal characterization of signature multiplicity and a new efficient algorithm that offers theoretical guarantees for extracting the set of maximally predictive and non-redundant signatures independent of distribution. The new algorithm identifies exactly the set of optimal signatures in controlled experiments and yields signatures with significantly better predictivity and reproducibility than previous algorithms in human microarray gene expression datasets. Our results shed light on the causes of signature multiplicity, provide computational tools for studying it empirically and introduce a framework for in silico bioequivalence of this important new class of diagnostic and personalized medicine modalities
Supervised wavelet method to predict patient survival from gene expression data.
In microarray studies, the number of samples is relatively small compared to the number of genes per sample. An important aspect of microarray studies is the prediction of patient survival based on their gene expression profile. This naturally calls for the use of a dimension reduction procedure together with the survival prediction model. In this study, a new method based on combining wavelet approximation coefficients and Cox regression was presented. The proposed method was compared with supervised principal component and supervised partial least squares methods. The different fitted Cox models based on supervised wavelet approximation coefficients, the top number of supervised principal components, and partial least squares components were applied to the data. The results showed that the prediction performance of the Cox model based on supervised wavelet feature extraction was superior to the supervised principal components and partial least squares components. The results suggested the possibility of developing new tools based on wavelets for the dimensionally reduction of microarray data sets in the context of survival analysis
- …