
    A Web-based and Grid-enabled dChip version for the analysis of large sets of gene expression data

    Background: Microarray techniques are among the main methods used to investigate thousands of gene expression profiles at once, shedding light on the complex biological processes responsible for serious diseases; they have great scientific impact and a wide application area. Several standalone applications have been developed to analyze microarray data; two of the best-known free packages are the R-based Bioconductor and dChip. The part of dChip concerning the calculation and analysis of gene expression has been modified to permit its execution both on cluster environments (supercomputers) and on Grid infrastructures (distributed computing). This work is not aimed at replacing existing tools; rather, it provides researchers with a method to analyze large datasets without hardware or software constraints.

    Results: An application able to perform the computation and analysis of gene expression on large datasets has been developed using the algorithms provided by dChip. Different tests have been carried out to validate the results and to compare the performances obtained on different infrastructures. Validation tests were performed using a small dataset comparing HUVEC (Human Umbilical Vein Endothelial Cells) and fibroblasts, derived from the same donors, treated with IFN-α. Performance tests were then executed to compare the different environments using a large dataset of about 1000 samples from breast cancer patients.

    Conclusion: A Grid-enabled software application for the analysis of large microarray datasets has been proposed. The dChip software has been ported to the Linux platform and modified, using appropriate parallelization strategies, to permit its execution on both cluster environments and Grid infrastructures. The added value provided by Grid technologies is the possibility to exploit both computational and data Grid infrastructures to analyze large sets of distributed data. The software has been validated, and the performances measured on cluster and Grid environments have been compared, showing good scalability.
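    A minimal sketch of the coarse-grained parallelization strategy the abstract describes: samples are split into chunks, and each chunk is preprocessed by an independent worker (a cluster node or a Grid job). This is not the dChip port itself; process_chunk() is a hypothetical stand-in for the per-chunk expression computation.

```r
# Split CEL files into chunks and process each chunk on its own worker,
# then merge the partial results into one expression matrix.
library(parallel)

cel_files <- list.files("celfiles", pattern = "\\.CEL$", full.names = TRUE)
chunks    <- split(cel_files, cut(seq_along(cel_files), breaks = 4, labels = FALSE))

process_chunk <- function(files) {
  # Placeholder: in the real system each worker would run the ported dChip
  # code on its subset of CEL files and emit partial expression values.
  data.frame(file = basename(files), expr = NA_real_)
}

cl <- makeCluster(4)                           # one worker per chunk
partial <- parLapply(cl, chunks, process_chunk)
stopCluster(cl)

expression_matrix <- do.call(rbind, partial)   # merge partial results
```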

    Survival Online: a web-based service for the analysis of correlations between gene expression and clinical and follow-up data

    Background: Complex microarray gene expression datasets can be used for many independent analyses and are particularly interesting for the validation of potential biomarkers and multi-gene classifiers. This article presents a novel method to correlate microarray gene expression data with clinico-pathological data through a combination of available and newly developed processing tools.

    Results: We developed Survival Online (available at http://ada.dist.unige.it:8080/enginframe/bioinf/bioinf.xml), a Web-based system for the analysis of Affymetrix GeneChip microarrays that uses a parallel version of dChip. The user first selects pre-loaded datasets or single samples thereof, as well as single genes or lists of genes. Expression values of the selected genes are then correlated with sample annotation data by uni- or multivariate Cox regression and survival analyses. The system was tested on publicly available breast cancer datasets, using GO (Gene Ontology)-derived gene lists or single genes for the survival analyses.

    Conclusion: The system can be used by biomedical researchers without specific computational skills to validate potential biomarkers or multi-gene classifiers. The design of the service, the parallelization of pre-processing tasks, and the implementation on an HPC (High Performance Computing) environment make it a useful tool for validation on several independent datasets.
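    For orientation, an illustrative sketch (not the Survival Online code) of the core statistic such a service computes: a univariate Cox regression relating one gene's expression to follow-up data, plus a Kaplan-Meier fit for groups split at the median expression. The column names and the toy values are hypothetical.

```r
library(survival)

clinical <- data.frame(
  time   = c(23, 54, 12, 60, 31, 8, 45, 27),          # months of follow-up (toy data)
  status = c(1, 0, 1, 0, 1, 1, 0, 1),                 # 1 = event, 0 = censored
  expr   = c(7.2, 5.1, 8.9, 4.3, 6.8, 9.5, 5.0, 7.7)  # log2 expression of one gene
)

fit <- coxph(Surv(time, status) ~ expr, data = clinical)
summary(fit)                       # hazard ratio and p-value for the gene

clinical$group <- ifelse(clinical$expr > median(clinical$expr), "high", "low")
km <- survfit(Surv(time, status) ~ group, data = clinical)
plot(km, col = c("red", "blue"))   # Kaplan-Meier curves by expression group
```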

    μ-CS: An extension of the TM4 platform to manage Affymetrix binary data

    Background: A main goal in understanding cell mechanisms is to explain the relationship between genes and related molecular processes through the combined use of technological platforms and bioinformatics analysis. High-throughput platforms, such as microarrays, enable the investigation of the whole genome in a single experiment. Different kinds of microarray platforms exist, producing different types of binary data (images and raw data), and even a single vendor offers different chips. The analysis of microarray data requires an initial preprocessing phase (i.e. normalization and summarization) of the raw data that makes them suitable for use in existing platforms, such as the TIGR TM4 Suite. Nevertheless, annotation of the data with additional information, such as gene function, is needed to perform more powerful analyses. Raw data preprocessing and annotation are often performed in a manual and error-prone way, and many available preprocessing tools do not support annotation. Thus novel, platform-independent, and possibly open source tools enabling the semi-automatic preprocessing and annotation of microarray data are needed.

    Results: The paper presents μ-CS (Microarray Cel file Summarizer), a cross-platform tool for the automatic normalization, summarization and annotation of Affymetrix binary data. μ-CS is based on a client-server architecture. The μ-CS client, provided both as a plug-in of the TIGR TM4 platform and as a standalone Java tool, enables users to read, preprocess and analyse binary microarray data, avoiding the manual invocation of external tools (e.g. the Affymetrix Power Tools), the manual loading of preprocessing libraries, and the management of intermediate files. The μ-CS server automatically updates the references to the summarization and annotation libraries that are provided to the μ-CS client before preprocessing. The μ-CS server is based on web services technology and can easily be extended to support more microarray vendors (e.g. Illumina).

    Conclusions: μ-CS users can thus manage binary data directly, without worrying about locating and invoking the proper preprocessing tools and chip-specific libraries. Moreover, users of the μ-CS plugin for TM4 can manage Affymetrix binary files without using external tools such as APT (Affymetrix Power Tools) and related libraries. Consequently, μ-CS offers four main advantages: (i) it avoids wasting time searching for the correct libraries; (ii) it reduces possible errors in the preprocessing and further analysis phases, e.g. due to an incorrect choice of parameters or the use of old libraries; (iii) it implements the annotation of preprocessed data; and (iv) it may enhance the quality of further analysis, since it provides the most up-to-date annotation libraries. The μ-CS client is freely available as a plugin of the TM4 platform as well as a standalone application at the project web site (http://bioingegneria.unicz.it/M-CS).
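    For context, the normalization/summarization step that μ-CS automates is the same preprocessing that Bioconductor's affy package performs below. This is not μ-CS code, only a sketch of the manual workflow the tool replaces; the directory path is hypothetical.

```r
library(affy)

raw  <- ReadAffy(celfile.path = "celfiles")  # read Affymetrix binary CEL files
eset <- rma(raw)                             # background-correct, normalize, summarize

exprs_matrix <- exprs(eset)                  # probeset x sample log2 expression matrix
write.exprs(eset, file = "expression.txt")   # export for downstream analysis
```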

    Molecular Portrait of Clear Cell Renal Cell Carcinoma: An Integrative Analysis of Gene Expression and Genomic Copy Number Profiling

    Renal cell carcinoma (RCC) incidence is about 3 to 10 cases per 100,000 individuals, with a predilection for adult males over 60 years old (1.6:1 male/female ratio) (Chow, 2010; Nese, 2009). In Europe, about 60,000 individuals are affected by RCC every year, with a mortality of about 18,000 subjects and an incidence rate for all stages that has risen steadily over the last three decades. Although inherited forms occur in a number of familial cancer syndromes, such as the well-known von Hippel-Lindau (VHL) syndrome, RCC is commonly sporadic (Cohen & McGovern, 2005; Kaelin, 2007) and, as recently highlighted by the National Cancer Institute (NCI), influenced by the interplay between exposure to environmental risk factors and the genetic susceptibility of exposed individuals (Chow et al., 2010). Being poorly symptomatic in its early phases, many cases become clinically detectable only when already advanced and, as such, therapy-resistant (Motzer, 2011). Based on histology, RCC can be classified into several subtypes, i.e., clear cell (80% of cases), papillary (10%), chromophobe (5%) and oncocytoma (5%), each characterized by specific histopathological features, malignant potential and clinical outcome (Cohen & McGovern, 2005). Patient stratification is normally achieved using prognostic algorithms and nomograms based on multiple clinico-pathological factors such as TNM stage, Fuhrman nuclear grade, tumor size, performance status, necrosis and other hematological indices (Flanigan et al., 2011), although the most efficient predictors of survival and recurrence are based on nuclear grade alone (Nese et al., 2009). As recently reviewed by Brannon et al. (Brannon & Rathmell, 2010), a finer RCC subtype classification could be obtained by exploiting the vast amount of genomic and transcriptional data presented in numerous studies. For instance, several authors have proposed a molecular classification of RCC based on differential gene expression profiles, with each subtype characterized by the activation of distinct gene sets (Brannon, 2010; Furge, 2004; Skubitz, 2006; Sültmann, 2005; Zhang, 2008), while others have identified RCC-specific biomarkers (e.g. CA9, Ki67, VEGF proteins, phosphorylated AKT, PTEN, HIF-1). Lately, it has been reported that microRNAs, a small class of non-coding RNA molecules, could contribute to RCC development at different levels and may represent a new group of potential tumor biomarkers (Redova et al., 2011). Despite the numerous efforts to dissect the molecular features of RCC through functional genomics, not a single transcriptional signature or biomarker has yet gained approval for clinical application (Arsanious, 2009; Eichelberg, 2009; Lam, 2007; Yin-Goen, 2006), so the identification of novel molecular markers to improve early diagnosis and prognostic prediction, and of candidate targets for new therapeutic approaches, remains of primary importance for this pathology.

    INFORMATION VISUALIZATION DESIGN FOR MULTIDIMENSIONAL DATA: INTEGRATING THE RANK-BY-FEATURE FRAMEWORK WITH HIERARCHICAL CLUSTERING

    Interactive exploration of multidimensional data sets is challenging because: (1) it is difficult to comprehend patterns in more than three dimensions, and (2) current systems are often a patchwork of graphical and statistical methods, leaving many researchers uncertain about how to explore their data in an orderly manner. This dissertation offers a set of principles and a novel rank-by-feature framework that enables users to better understand multidimensional and multivariate data by systematically studying distributions in one (1D) or two dimensions (2D), and then discovering relationships, clusters, gaps, outliers, and other features. Users of the rank-by-feature framework can view graphical presentations (histograms, boxplots, and scatterplots), and then choose a feature detection criterion to rank 1D or 2D axis-parallel projections. By combining information visualization techniques (overview, coordination, and dynamic query) with summaries and statistical methods, users can systematically examine the most important 1D and 2D axis-parallel projections. This research provides a number of valuable contributions:
    - Graphics, Ranking, and Interaction for Discovery (GRID) principles: a set of principles for exploratory analysis of multidimensional data, summarized as (1) study 1D, study 2D, then find features, and (2) ranking guides insight, statistics confirm. The GRID principles help users organize their discovery process in an orderly manner so as to produce more thorough analyses and extract deeper insights in any multidimensional data application.
    - Rank-by-feature framework: a user interface framework based on the GRID principles. Interactive information visualization techniques are combined with statistical methods and data mining algorithms to enable users to examine multidimensional data sets in an orderly manner using 1D and 2D projections (a ranking sketch follows below).
    - The design and implementation of the Hierarchical Clustering Explorer (HCE): an information visualization tool available at www.cs.umd.edu/hcil/hce. HCE implements the rank-by-feature framework and supports interactive exploration of hierarchical clustering results to reveal one of the most important features: clusters.
    - Validation through case studies and user surveys: case studies with motivated experts in three research fields and a user survey, conducted via email with a wide range of HCE users, demonstrated the efficacy of HCE and the rank-by-feature framework. These studies also revealed potential improvements in design and implementation.
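    A minimal sketch of the rank-by-feature idea for 2D axis-parallel projections: score every pair of dimensions with a feature detection criterion (here, absolute Pearson correlation, one of several possible criteria) and rank the pairs so the most interesting scatterplots can be examined first. The function name is illustrative, not HCE's API.

```r
rank_2d_projections <- function(data) {
  pairs  <- t(combn(colnames(data), 2))       # all axis-parallel 2D projections
  scores <- apply(pairs, 1, function(p)
    abs(cor(data[[p[1]]], data[[p[2]]], use = "complete.obs")))
  ranking <- data.frame(x = pairs[, 1], y = pairs[, 2], score = scores)
  ranking[order(-ranking$score), ]            # highest-scoring projection first
}

head(rank_2d_projections(iris[, 1:4]))        # demo on a built-in dataset
```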

    A digital repository with an extensible data model for biobanking and genomic analysis management

    Motivation: Molecular biology laboratories require extensive metadata to improve data collection and analysis. The heterogeneity of the collected metadata grows as research evolves into international multi-disciplinary collaborations and data sharing among institutions increases. A single standardization is not feasible, and it becomes crucial to develop digital repositories with flexible and extensible data models, as in the case of modern integrated biobank management.

    Results: We developed a novel data model in JSON format to describe heterogeneous data in a generic biomedical science scenario. The model is built on two hierarchical entities: processes and events, roughly corresponding to research studies and analysis steps within a single study. A number of sequential events can be grouped into a process, building up a hierarchical structure to track patient and sample history. Each event can produce new data. Data are described by a set of user-defined metadata and may have one or more associated files. We integrated the model in a web-based digital repository with data grid storage to manage large datasets located in geographically distinct areas. We built a graphical interface that allows authorized users to define new data types dynamically, according to their requirements. Operators compose queries on metadata fields using a flexible search interface and run them on the database and on the grid. We applied the digital repository to the integrated management of samples, patients and medical history in the BIT-Gaslini biobank. The platform currently manages 1800 samples from over 900 patients. Microarray data from 150 analyses are stored on the grid storage and replicated on two physical resources for preservation. The system is equipped with data integration capabilities with other biobanks for worldwide information sharing.

    Conclusions: Our data model enables users to continuously define flexible, ad hoc, and loosely structured metadata, for information sharing in specific research projects and purposes. This approach can significantly improve interdisciplinary research collaboration and allows tracking of patients' clinical records, sample management information, and genomic data. The web interface allows operators to easily manage, query, and annotate files, without dealing with the technicalities of the data grid.
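    A hypothetical instance of the process/event model described above: a process (study) grouping sequential events (analysis steps), each carrying user-defined metadata and associated files. The field names are illustrative, not the repository's actual schema.

```r
library(jsonlite)

process <- list(
  process_id = "STUDY-001",
  subject    = list(patient_id = "P-0042"),
  events = list(
    list(event_id = "EV-1",
         type     = "sample_collection",
         metadata = list(tissue = "blood", volume_ml = 5),     # user-defined fields
         files    = list()),
    list(event_id = "EV-2",
         type     = "microarray_analysis",
         metadata = list(chip = "HG-U133A", operator = "lab1"),
         files    = list("array_0042.CEL"))                    # associated file
  )
)

cat(toJSON(process, pretty = TRUE, auto_unbox = TRUE))         # serialize to JSON
```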

    Genes and Gene Networks Related to Age-associated Learning Impairments

    The incidence of cognitive impairments, including age-associated spatial learning impairment (ASLI), has risen dramatically in past decades due to increasing human longevity. To better understand the genes and gene networks involved in ASLI, data from a number of past gene expression microarray studies in rats were integrated and used to perform a meta- and network analysis. Results from the data selection and preprocessing steps show that, for effective downstream analysis, both batch effects and outlier samples must be properly removed. The meta-analysis undertaken in this research identified genes significantly differentially expressed across both age and ASLI in rats. Knowledge-based gene network analysis shows that these genes affect many key functions and pathways in aged compared to young rats. The resulting changes might manifest as various neurodegenerative diseases or disorders or as syndromic memory impairments at old age; other changes might result in altered synaptic plasticity, thereby leading to normal, non-syndromic learning impairments such as ASLI. Next, I employ weighted gene co-expression network analysis (WGCNA) on the datasets. I identify several reproducible network modules, each highly significant, with genes functioning in specific biological functional categories. The analysis identifies a “learning and memory” specific module containing many potential key ASLI hub genes, whose functions link a different set of mechanisms to learning and memory formation that the meta-analysis was unable to detect. This study generates new hypotheses about these candidate genes and networks in ASLI, which could be investigated through future research.
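    A sketch of the module detection step in a WGCNA workflow like the one described above, assuming `datExpr` is a samples x genes expression matrix already cleaned of batch effects and outliers. The parameter values are illustrative, not those used in the study.

```r
library(WGCNA)

net <- blockwiseModules(datExpr,
                        power         = 6,           # soft-thresholding power
                        TOMType       = "unsigned",
                        minModuleSize = 30,
                        numericLabels = TRUE)

table(net$colors)                       # module sizes (module 0 = unassigned)

# Candidate hub genes: the most connected gene within each module
hubs <- chooseTopHubInEachModule(datExpr, net$colors)
```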

    Integration of Data Mining into Scientific Data Analysis Processes

    In recent years, the use of advanced semi-interactive data analysis algorithms, such as those from the field of data mining, has gained more and more importance in the life sciences in general, and in particular in bioinformatics, genetics, medicine and biodiversity. Today, there is a trend away from collecting and evaluating data only in the context of a specific problem or study, towards extensively collecting data from different sources in repositories where it is potentially useful for subsequent analysis, e.g. the Gene Expression Omnibus (GEO) repository of high-throughput gene expression data. At the time the data are collected, they are analysed in a specific context which influences the experimental design; however, the type of analyses that the data will be used for after deposit is not known, and content and data format are geared only to the first experiment, not to future re-use. Thus, complex process chains are needed for the analysis of the data, and such process chains must be supported by the environments used to set up analysis solutions. Building specialized software for each individual problem is not a solution, as this effort can only be justified for huge projects running for several years. Hence, data mining functionality has been bundled into toolkits that provide it as a collection of different components; depending on the research questions of the users, solutions consist of distinct compositions of these components.

    Today, existing solutions for data mining processes comprise different components that represent different steps in the analysis process, and there are graphical or script-based toolkits for combining such components. The data mining tools that can serve as components in analysis processes are based on single-computer environments, local data sources and single users. However, analysis scenarios in medical informatics and bioinformatics have to deal with multi-computer environments, distributed data sources and multiple users who have to cooperate. Users need support for integrating data mining into analysis processes in such scenarios, and this support is lacking today. Typically, analysts working with single-computer environments face the problem of large data volumes, since their tools do not address scalability or access to distributed data sources. Distributed environments such as grid environments provide scalability and access to distributed data sources, but the integration of existing components into such environments is complex, and new components often cannot be developed directly in distributed environments. Moreover, in scenarios involving multiple computers, multiple distributed data sources and multiple users, the reuse of components, scripts and analysis processes becomes more important, as more steps and more configuration are necessary and thus much bigger efforts are needed to develop and set up a solution.

    In this thesis we introduce an approach for supporting interactive and distributed data mining for multiple users, based on infrastructure principles that allow building on data mining components and processes that are already available, instead of designing a completely new infrastructure, so that users can keep working with their well-known tools. In order to achieve the integration of data mining into scientific data analysis processes, this thesis proposes a stepwise approach to supporting the user in the development of analysis solutions that include data mining.
    We see our major contributions as the following. First, we propose an approach to integrate data mining components developed for a single-processor environment into grid environments; this supports users in reusing standard data mining components with little effort. The approach is based on a metadata schema definition which is used to grid-enable existing data mining components. Second, we describe an approach for interactively developing data mining scripts in grid environments, which efficiently supports users when it is necessary to enhance available components, to develop new data mining components, and to compose these components. Third, building on that, we present an approach for facilitating the reuse of existing data mining processes based on process patterns. It supports users in scenarios that cover different steps of the data mining process, including several components or scripts. The data mining process patterns support the description of data mining processes at different levels of abstraction, between the CRISP model as the most general and executable workflows as the most concrete representation.
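    To make the first contribution concrete, here is a hypothetical illustration of the kind of metadata descriptor that could grid-enable an existing component: it names the executable and types its inputs and outputs so that middleware can stage data and schedule the component. All field names are invented for illustration; the thesis defines its own schema.

```r
library(jsonlite)

component <- list(
  name       = "kmeans-clustering",      # existing single-processor component
  version    = "1.0",
  executable = "run_kmeans.sh",
  inputs  = list(list(name = "dataset", type = "csv"),
                 list(name = "k",       type = "integer")),
  outputs = list(list(name = "clusters", type = "csv")),
  requirements = list(memory_mb = 2048, cpus = 1)   # scheduling hints
)

cat(toJSON(component, pretty = TRUE, auto_unbox = TRUE))
```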

    Widescale analysis of transcriptomics data using cloud computing methods

    This study explores the handling and analysis of big data in the field of bioinformatics. The focus has been on improving the analysis of public-domain data for Affymetrix GeneChips, a widely used technology for measuring gene expression. Methods to determine the bias in gene expression due to G-stacks, associated with runs of guanine in probes, have been explored via the use of a grid and various types of cloud computing, and an attempt has been made to find the best way of storing and analyzing the big data used in bioinformatics. The experience gained in using a grid and different clouds is reported; in the case of Windows Azure, a public cloud has been employed in a new way to demonstrate the use of the R statistical language for research in bioinformatics. This work has studied the G-stack bias in a broad range of GeneChip data from public repositories: a wide-scale survey was carried out to determine the extent of the G-stack bias in four different chips across three different species. The study commenced with the human GeneChip HG-U133A; a second human GeneChip, HG-U133 Plus 2, was then examined, followed by a plant chip (Arabidopsis thaliana) and a bacterium chip (Pseudomonas aeruginosa). Comparisons have also been made between the widely recognised algorithms RMA and PLIER for the normalization stage of extracting gene expression from GeneChip data.
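    A minimal sketch of how a G-stack effect can be probed: split probes by whether their sequence contains a run of four or more guanines and compare the intensity distributions of the two groups. The `probe_seq` and `intensity` values are toy inputs; the study's actual pipeline ran at much larger scale on grid and cloud resources.

```r
probes <- data.frame(
  probe_seq = c("ACGTGGGGAC", "ACGTACGTAC", "TTGGGGGTTA", "CATGCATGCA"),
  intensity = c(1250, 430, 1510, 390)              # toy probe-level intensities
)

probes$g_stack <- grepl("GGGG", probes$probe_seq)  # run of >= 4 guanines

# Compare typical intensities of G-stack vs other probes
tapply(probes$intensity, probes$g_stack, median)
wilcox.test(intensity ~ g_stack, data = probes)    # nonparametric comparison
```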