An Overview of the Use of Neural Networks for Data Mining Tasks
In recent years, the area of data mining has experienced considerable demand for technologies that extract knowledge from large and complex data sources. There is substantial commercial interest, as well as active research, aimed at developing new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, and telecommunication. This paper highlights, from a data mining perspective, the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis
BACKGROUND: Research using the model system Xenopus laevis has provided critical insights into the mechanisms of early vertebrate development and cell biology. Large scale sequencing efforts have provided an increasingly important resource for researchers. To take full advantage of the available sequence, we have analyzed 350,468 Xenopus laevis Expressed Sequence Tags (ESTs) both to identify full length protein encoding sequences and to develop a unique database system to support comparative approaches between X. laevis and other model systems. DESCRIPTION: Using a suffix array based clustering approach, we have identified 25,971 clusters and 40,877 singleton sequences. Generation of a consensus sequence for each cluster resulted in 31,353 tentative contigs and 4,801 singleton sequences. Using both BLASTX and FASTY comparison to five model organisms and the NR protein database, more than 15,000 sequences are predicted to encode full length proteins, and these have been matched to publicly available IMAGE clones when available. Each sequence has been compared to the KOG database and ~67% of the sequences have been assigned a putative functional category. Based on sequence homology to mouse and human, putative GO annotations have been determined. CONCLUSION: The results of the analysis have been stored in a publicly available database, XenDB. A unique capability of the database is the ability to batch upload cross species queries to identify potential Xenopus homologues and their associated full length clones. Examples are provided, including mapping of microarray results and application of 'in silico' analysis. The ability to quickly translate the results of various species into 'Xenopus-centric' information should greatly enhance comparative embryological approaches. Supplementary material can be found at
Integrative Model-based clustering of microarray methylation and expression data
In many fields, researchers are interested in large and complex biological
processes. Two important examples are gene expression and DNA methylation in
genetics. One key problem is to identify aberrant patterns of these processes
and discover biologically distinct groups. In this article we develop a
model-based method for clustering such data. The basis of our method involves
the construction of a likelihood for any given partition of the subjects. We
introduce cluster specific latent indicators that, along with some standard
assumptions, impose a specific mixture distribution on each cluster. Estimation
is carried out using the EM algorithm. The methods extend naturally to multiple
data types of a similar nature, which leads to an integrated analysis over
multiple data platforms, resulting in higher discriminating power.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS533 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
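The EM estimation mentioned in this abstract can be illustrated with a generic example. The sketch below is not the authors' model (it has no cluster-specific latent indicators and handles a single data type); it is a minimal two-component, one-dimensional Gaussian mixture fitted by EM, assuming well-separated clusters. The function names `normal_pdf` and `em_two_gaussians`, the median-split initialisation, and the simulated data are all illustrative assumptions.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(xs, n_iter=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative sketch).

    Requires at least two data points. Returns (weights, means, variances).
    """
    # Initialise by splitting the sorted data at the median (an assumption
    # made here for simplicity, not something taken from the abstract).
    xs_sorted = sorted(xs)
    half = len(xs_sorted) // 2
    halves = [xs_sorted[:half], xs_sorted[half:]]
    means = [sum(h) / len(h) for h in halves]
    variances = [
        max(min_var, sum((x - m) ** 2 for x in h) / len(h))
        for h, m in zip(halves, means)
    ]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])

        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                min_var,
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk,
            )
    return weights, means, variances


# Usage on simulated data: two groups centred at 0 and 5.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

Extending such a sketch to several data types would, under an independence assumption, amount to multiplying the per-platform densities inside the E-step; whether that matches the published method is not something this listing confirms.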
Asterias: a parallelized web-based suite for the analysis of expression and aCGH data
Asterias (http://www.asterias.info) is an integrated collection of
freely-accessible web tools for the analysis of gene expression and aCGH data.
Most of the tools use parallel computing (via MPI). Most of our applications
allow the user to obtain additional information for user-selected genes by
using clickable links in tables and/or figures. Our tools include:
normalization of expression and aCGH data; converting between different types
of gene/clone and protein identifiers; filtering and imputation; finding
differentially expressed genes related to patient class and survival data;
searching for models of class prediction; using random forests to search for
minimal models for class prediction or for large subsets of genes with
predictive capacity; searching for molecular signatures and predictive genes
with survival data; detecting regions of genomic DNA gain or loss. The
capability to send results between different applications, access to additional
functional information, and parallelized computation make our suite unique and
exploit features only available to web-based applications. Comment: web-based application; 3 figures
Variational Inference for Stochastic Block Models from Sampled Data
This paper deals with non-observed dyads during the sampling of a network and
the ensuing issues in the inference of the Stochastic Block Model (SBM). We
review sampling designs and recover Missing At Random (MAR) and Not Missing At
Random (NMAR) conditions for the SBM. We introduce variants of the variational
EM algorithm for inferring the SBM under various sampling designs (MAR and
NMAR) all available as an R package. Model selection criteria based on
Integrated Classification Likelihood are derived for selecting both the number
of blocks and the sampling design. We investigate the accuracy and the range of
applicability of these algorithms with simulations. We explore two real-world
networks from ethnology (seed circulation network) and biology (protein-protein
interaction network), where the interpretations depend considerably on the
sampling designs considered.
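To make the sampling setting in this abstract concrete, here is a minimal simulation sketch. It does not implement the paper's variational EM or its R package: it draws a two-block SBM, hides dyads independently of their values (one simple design that satisfies MAR), and estimates block connectivities from the observed dyads only, with block memberships assumed known. The function names, `rho`, and all parameter values are illustrative assumptions.

```python
import random


def simulate_sbm(labels, conn, rng):
    """Draw an undirected SBM: dyad (i, j) is an edge with
    probability conn[labels[i]][labels[j]]."""
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            edge = 1 if rng.random() < conn[labels[i]][labels[j]] else 0
            adj[i][j] = adj[j][i] = edge
    return adj


def sample_dyads_mar(adj, rho, rng):
    """Observe each dyad independently with probability rho.

    The choice ignores the edge value, so missingness is MAR.
    Unobserved dyads (and the diagonal) are stored as None.
    """
    n = len(adj)
    obs = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < rho:
                obs[i][j] = obs[j][i] = adj[i][j]
    return obs


def estimate_connectivity(obs, labels, n_blocks):
    """Estimate block connectivities from observed dyads only
    (block labels are assumed known in this sketch)."""
    edges = [[0] * n_blocks for _ in range(n_blocks)]
    counts = [[0] * n_blocks for _ in range(n_blocks)]
    n = len(obs)
    for i in range(n):
        for j in range(i + 1, n):
            if obs[i][j] is None:
                continue  # unobserved dyad contributes nothing
            a, b = sorted((labels[i], labels[j]))
            edges[a][b] += obs[i][j]
            counts[a][b] += 1
    est = [[None] * n_blocks for _ in range(n_blocks)]
    for a in range(n_blocks):
        for b in range(a, n_blocks):
            if counts[a][b] > 0:
                est[a][b] = est[b][a] = edges[a][b] / counts[a][b]
    return est


# Usage: two blocks of 60 nodes, half of the dyads observed.
rng = random.Random(1)
labels = [0] * 60 + [1] * 60
conn = [[0.6, 0.1], [0.1, 0.5]]
adj = simulate_sbm(labels, conn, rng)
obs = sample_dyads_mar(adj, rho=0.5, rng=rng)
print(estimate_connectivity(obs, labels, n_blocks=2))
```

Under a design like this one, dropping the unobserved dyads leaves the connectivity estimates unbiased. If the chance of observing a dyad depended on whether the edge exists (an NMAR design), the same estimator would generally be biased, which is the situation the abstract's NMAR variants are meant to address.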