175 research outputs found

    mGrid: A load-balanced distributed computing environment for the remote execution of the user-defined Matlab code

    BACKGROUND: Matlab, a powerful and productive language that allows for rapid prototyping, modeling and simulation, is widely used in computational biology. Modeling and simulation of large biological systems often require more computational resources than are available on a single computer. Existing distributed computing environments like the Distributed Computing Toolbox, MatlabMPI, Matlab*G and others allow for the remote (and possibly parallel) execution of Matlab commands with varying support for features like an easy-to-use application programming interface, load-balanced utilization of resources, extensibility over the wide area network, and minimal system administration skill requirements. However, all of these environments require some level of access to participating machines to manually distribute the user-defined libraries that the remote call may invoke. RESULTS: mGrid augments the usual process distribution seen in other similar distributed systems by adding facilities for user code distribution. mGrid's client-side interface is an easy-to-use native Matlab toolbox that transparently executes user-defined code on remote machines (i.e. the user is unaware that the code is executing somewhere else). Run-time variables are automatically packed and distributed with the user-defined code, and automated load-balancing of remote resources enables smooth concurrent execution. mGrid is an open source environment. Apart from the programming language itself, all other components are also open source, freely available tools: light-weight PHP scripts and the Apache web server. CONCLUSION: Transparent, load-balanced distribution of user-defined Matlab toolboxes and rapid prototyping of many simple parallel applications can now be done with a single easy-to-use Matlab command. Because mGrid utilizes only Matlab, light-weight PHP scripts and the Apache web server, installation and configuration are very simple. Moreover, the web-based infrastructure of mGrid makes it easy to extend over the Internet.

    HOME: A histogram based machine learning approach for effective identification of differentially methylated regions

    Background: The development of whole genome bisulfite sequencing has made it possible to identify methylation differences at single base resolution throughout an entire genome. However, a persistent challenge in DNA methylome analysis is the accurate identification of differentially methylated regions (DMRs) between samples. Sensitive and specific identification of DMRs among different conditions requires accurate and efficient algorithms, and while various tools have been developed to tackle this problem, they frequently suffer from inaccurate DMR boundary identification and high false-positive rates. Results: We present a novel Histogram Of MEthylation (HOME) based method that takes into account the inherent difference in the distribution of methylation levels between DMRs and non-DMRs to discriminate between the two using a Support Vector Machine. We show that the generated features used by HOME are dataset-independent, such that a classifier trained on, for example, a mouse methylome training set of regions of differentially accessible chromatin can be applied to any other organism’s dataset and identify accurate DMRs. We demonstrate that DMRs identified by HOME exhibit higher association with biologically relevant genes, processes, and regulatory events compared to the existing methods. Moreover, HOME provides additional functionalities lacking in most of the current DMR finders, such as DMR identification in non-CG context and time series analysis. HOME is freely available at https://github.com/ListerLab/HOME. Conclusion: HOME produces more accurate DMRs than the current state-of-the-art methods on both simulated and biological datasets. The broad applicability of HOME to identify accurate DMRs in genomic data from any organism will have a significant impact upon expanding our knowledge of how DNA methylation dynamics affect cell development and differentiation. This work was supported by the Australian Research Council (ARC) Centre of Excellence program in Plant Energy Biology (CE140100008). RL was supported by a Sylvia and Charles Viertel Senior Medical Research Fellowship, ARC Future Fellowship (FT120100862), and Howard Hughes Medical Institute International Research Scholarship (RL

    Liquid Chromatography Mass Spectrometry-Based Proteomics: Biological and Technological Aspects

    Mass spectrometry-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. Though recent years have seen a tremendous improvement in instrument performance and the computational tools used, significant challenges remain, and there are many opportunities for statisticians to make important contributions. In the most widely used "bottom-up" approach to proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage; the resulting peptide products are then separated based on chemical or physical properties and analyzed using a mass spectrometer. The two fundamental challenges in the analysis of bottom-up MS-based proteomics are as follows: (1) identifying the proteins that are present in a sample, and (2) quantifying the abundance levels of the identified proteins. Both of these challenges require knowledge of the biological and technological context that gives rise to observed data, as well as the application of sound statistical principles for estimation and inference. We present an overview of bottom-up proteomics and outline the key statistical issues that arise in protein identification and quantification. Comment: Published at http://dx.doi.org/10.1214/10-AOAS341 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Normalization and missing value imputation for label-free LC-MS analysis

    Shotgun proteomic data are affected by a variety of known and unknown systematic biases as well as high proportions of missing values. Typically, normalization is performed in an attempt to remove systematic biases from the data before statistical inference, sometimes followed by missing value imputation to obtain a complete matrix of intensities. Here we discuss several approaches to normalization and dealing with missing values, some initially developed for microarray data and some developed specifically for mass spectrometry-based data.
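
As a rough illustration of normalizing intensities that contain missing values, the sketch below median-centers each sample's log intensities and leaves missing entries untouched. The abstract does not name a specific method, so the choice of median centering, the function name `median_center`, and the use of `None` for missing values are all assumptions made for this example.

```python
from statistics import median


def median_center(samples):
    """Subtract each sample's median observed log intensity.

    `samples` maps a sample name to a list of log intensities, with None
    marking a missing value (an assumed encoding). Missing values are ignored
    when computing the median and stay None in the output.
    """
    centered = {}
    for name, values in samples.items():
        observed = [v for v in values if v is not None]
        # A sample with no observed values is left unshifted.
        shift = median(observed) if observed else 0.0
        centered[name] = [None if v is None else v - shift for v in values]
    return centered


if __name__ == "__main__":
    data = {
        "s1": [10.0, None, 12.0, 14.0],
        "s2": [11.0, 13.0, None, 15.0],
    }
    print(median_center(data))
```

Centering happens before any imputation here, so the shift is estimated from observed values only; whether that order suits a given dataset is a design choice this sketch does not settle.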

    The value and significance of corporate community relations: an Italian SME perspective

    Purpose – This paper investigates the link between community of place and small and medium-sized enterprises (SMEs) in Lombard industrial districts in Italy. Design/methodology/approach – A brief literature review of international authors from the stakeholder approach and Corporate Community Relations field is presented. This paper refers to a survey of Lombard industrial districts conducted by ALTIS. The data were collected via a telephone survey from 834 firms. Findings – The main finding is that managing Corporate Community Relations (CCR) is of major importance for company success. The results of the survey show that Italian industrial district SMEs use some tools and actions to interact with their particular communities of place and to develop effective and coherent relationships with their stakeholder groups. Moreover, the survey shows that although SMEs do implement different CCR activities, they are not able to communicate these effectively through systematic communication strategies. However, the sample is narrow and covers only some Lombard districts. Nonetheless, the findings indicate that effective CCR seems to confer competitive advantage based on stakeholder responses and rewards sought. Research limitations/implications – The framework could assist in supporting CCR developments between industrial districts, as various players would know how to improve CCR activities. One further suggestion is that University and Research Centres could have a role to play in creating and communicating codified knowledge concerning community relations in industrial districts, while other public players still have to develop specific tasks in improving infrastructures. Originality/value – This study is in line with the main focus of CCR, which is striving to meet stakeholder and societal needs. However, industrial district SMEs have to learn how to communicate their CCR activities from the examples set by large Italian companies. The paper links the notion of CCR with tools and actions to develop meaningful relationships with both community of place and interest. Moreover, considering the survey results, a new framework for local player roles is proposed.

    Review of Machine Learning Algorithms in Differential Expression Analysis

    In biological research, machine learning algorithms are part of nearly every analytical process. They are used to identify new insights into biological phenomena, interpret data, provide molecular diagnosis for diseases and develop personalized medicine that will enable future treatments of diseases. In this paper we (1) illustrate the importance of machine learning in the analysis of large scale sequencing data, (2) present an illustrative standardized workflow of the analysis process, (3) perform a Differential Expression (DE) analysis of a publicly available RNA sequencing (RNA-Seq) data set to demonstrate the capabilities of various algorithms at each step of the workflow, and (4) show a machine learning solution for reducing the computing time, storage requirements, and memory utilization in analyses of RNA-Seq datasets. The source code of the analysis pipeline and associated scripts are presented in the paper appendix to allow replication of experiments.

    An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

    Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.
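
The idea of subject-level bootstrapping can be sketched as follows: subjects, rather than individual replicates, are drawn with replacement, and every replicate of a drawn subject enters the sample. This is a minimal illustration written for this listing, not the RF++ implementation (which is in C++); the function name `subject_level_bootstrap` and the subject-level out-of-bag set are assumptions.

```python
import random
from collections import defaultdict


def subject_level_bootstrap(subject_ids, rng=None):
    """Draw one bootstrap sample at the subject (cluster) level.

    `subject_ids[i]` is the subject that row i belongs to. Returns
    (in_bag_rows, out_of_bag_rows). All replicates of a subject land on the
    same side, so correlated rows never straddle the in-bag/out-of-bag split.
    """
    rng = rng or random.Random()
    rows_by_subject = defaultdict(list)
    for row, sid in enumerate(subject_ids):
        rows_by_subject[sid].append(row)

    subjects = sorted(rows_by_subject)
    # Resample subjects with replacement, as many draws as there are subjects.
    drawn = [rng.choice(subjects) for _ in subjects]
    in_bag = [row for sid in drawn for row in rows_by_subject[sid]]

    drawn_set = set(drawn)
    out_of_bag = [
        row
        for sid in subjects
        if sid not in drawn_set
        for row in rows_by_subject[sid]
    ]
    return in_bag, out_of_bag


if __name__ == "__main__":
    ids = ["a", "a", "b", "b", "b", "c"]
    print(subject_level_bootstrap(ids, random.Random(0)))
```

Keeping whole subjects out of bag is what would let an out-of-bag error estimate avoid the optimism described in the abstract for row-level resampling of correlated replicates.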

    Lipopolysaccharide-induced interferon response networks at birth are predictive of severe viral lower respiratory infections in the first year of life

    Appropriate innate immune function is essential to limit pathogenesis and severity of severe lower respiratory infections (sLRI) during infancy, a leading cause of hospitalization and risk factor for subsequent asthma in this age group. Employing a systems biology approach to analysis of multi-omic profiles generated from a high-risk cohort (n = 50), we found that the intensity of activation of an LPS-induced interferon gene network at birth was predictive of sLRI risk in infancy (AUC = 0.724). Connectivity patterns within this network were stronger among susceptible individuals, and a systems biology approach identified IRF1 as a putative master regulator of this response. These findings were specific to the LPS-induced interferon response and were not observed following activation of viral nucleic acid sensing pathways. Comparison of responses at birth versus age 5 demonstrated that LPS-induced interferon responses, but not responses triggered by viral nucleic acid sensing pathways, may be subject to strong developmental regulation. These data suggest that the risk of sLRI in early life is in part already determined at birth, and additionally that the developmental status of LPS-induced interferon responses may be a key determinant of susceptibility. Our findings provide a rationale for the identification of at-risk infants for early intervention aimed at sLRI prevention and identify targets which may be relevant for drug development.

    The mzqLibrary – An open source Java library supporting the HUPO-PSI quantitative proteomics standard

    The mzQuantML standard has been developed by the Proteomics Standards Initiative for capturing, archiving and exchanging quantitative proteomic data derived from mass spectrometry. It is a rich XML-based format, capable of representing data about two-dimensional features from LC-MS data, and peptides, proteins or groups of proteins that have been quantified from multiple samples. In this article we report the development of an open source Java-based library of routines for mzQuantML, called the mzqLibrary, and associated software for visualising data called the mzqViewer. The mzqLibrary contains routines for mapping (peptide) identifications on quantified features, inference of protein (group)-level quantification values from peptide-level values, normalisation and basic statistics for differential expression. These routines can be accessed via the command line, via a Java programming interface or via a basic graphical user interface. The mzqLibrary also contains several file format converters, including import converters (to mzQuantML) from OpenMS, Progenesis LC-MS and MaxQuant, and exporters (from mzQuantML) to other standards or useful formats (mzTab, HTML, csv). The mzqViewer contains in-built routines for viewing the tables of data (about features, peptides or proteins), and connects to the R statistical library for more advanced plotting options. The mzqLibrary and mzqViewer packages are available from https://code.google.com/p/mzq-lib/

    Airway epithelium respiratory illnesses and allergy (AERIAL) birth cohort: Study protocol

    Introduction: Recurrent wheezing disorders including asthma are complex and heterogeneous diseases that affect up to 30% of all children, contributing to a major burden on children, their families, and global healthcare systems. It is now recognized that a dysfunctional airway epithelium plays a central role in the pathogenesis of recurrent wheeze, although the underlying mechanisms are still not fully understood. This prospective birth cohort aims to bridge this knowledge gap by investigating the influence of intrinsic epithelial dysfunction on the risk for developing respiratory disorders and the modulation of this risk by maternal morbidities, in utero exposures, and respiratory exposures in the first year of life. Methods: The Airway Epithelium Respiratory Illnesses and Allergy (AERIAL) study is nested within the ORIGINS Project and will monitor 400 infants from birth to 5 years. The primary outcome of the AERIAL study will be the identification of epithelial endotypes and exposure variables that influence the development of recurrent wheezing, asthma, and allergic sensitisation. Nasal respiratory epithelium at birth to 6 weeks, 1, 3, and 5 years will be analysed by bulk RNA-seq and DNA methylation sequencing. Maternal morbidities and in utero exposures will be identified from maternal history and their effects measured through transcriptomic and epigenetic analyses of the amnion and newborn epithelium. Exposures within the first year of life will be identified based on infant medical history as well as on background and symptomatic nasal sampling for viral PCR and microbiome analysis. Daily temperatures and symptoms recorded in a study-specific smartphone app will be used to identify symptomatic respiratory illnesses. Discussion: The AERIAL study will provide a comprehensive longitudinal assessment of factors influencing the association between epithelial dysfunction and respiratory morbidity in early life, and hopefully identify novel targets for diagnosis and early intervention.