BioMAJ: a flexible framework for databanks synchronization and processing


HAL Id: inria-00327502
https://hal.inria.fr/inria-00327502
Submitted on 31 May 2020

To cite this version: Olivier Filangi, Yoann Beausse, Anthony Assi, Ludovic Legrand, Jean-Marc Larré, et al. BioMAJ: a flexible framework for databanks synchronization and processing. Bioinformatics, Oxford University Press (OUP), 2008, 24 (16), pp.1823-1825. 10.1093/bioinformatics/btn325. inria-00327502

BIOINFORMATICS APPLICATIONS NOTE, Vol. 24 no. 16 2008, pages 1823–1825
doi:10.1093/bioinformatics/btn325
Databases and ontologies

Olivier Filangi (1), Yoann Beausse (2), Anthony Assi (1), Ludovic Legrand (3), Jean-Marc Larré (2), Véronique Martin (3), Olivier Collin (1), Christophe Caron (3), Hugues Leroy (1) and David Allouche (2,∗)

(1) IRISA/INRIA, Symbiose Project
(2) Unité de Biométrie et Intelligence Artificielle, UR875, INRA, F-31320 Castanet-Tolosan
(3) INRA Unité Mathématique Informatique et Génome, UR1077, INRA, F-78350 Jouy-en-Josas, France

Received on March 14, 2008; revised and accepted on June 20, 2008
Advance Access publication June 30, 2008
Associate Editor: Dmitrij Frishman

ABSTRACT
Large- and medium-scale computational molecular biology projects require accurate bioinformatics software and numerous heterogeneous biological databanks, which are distributed around the world. BioMAJ provides a flexible, robust, fully automated environment for managing such massive amounts of data. The JAVA application enables automation of the data update cycle process and supervision of the locally mirrored data repository. We have developed workflows that handle some of the most commonly used bioinformatics databases. A set of scripts is also available for post-synchronization data treatment consisting of indexation or format conversion (for NCBI blast, SRS, EMBOSS, GCG, etc.). BioMAJ can be easily extended by personal homemade processing scripts. Source history can be kept via html reports containing statements of locally managed databanks.
Availability: http://biomaj.genouest.org. BioMAJ is free open software. It is freely available under the CECILL version 2 license.
Contact: biomaj@genouest.org

1 INTRODUCTION
Biological knowledge, within the context of proteomics and genomics, is mainly based on bioinformatic analyses consisting of periodic comparisons between newly produced data and the current set of known information.
This approach requires both locally installed bioinformatic programs and collections of heterogeneous biological databanks, which are available through numerous servers located around the world (NCBI, EBI, DDBJ, expasy, etc.). Local integration of all of these data begins with data mirroring and indexation. This essential, preliminary step represents a major bottleneck for data annotation and comparative genomics projects because the databanks are large (measured in terabytes) and in heterogeneous formats. Once downloaded, the data must undergo various post-processing steps before they can be used. These steps consist of intensive processing tasks corresponding to data indexation, format conversion and data concatenation or extraction. Additionally, the frequency with which various databanks require updating is not constant and, depending upon the source, can vary from daily to several times per year. In most cases, the only way to determine the existence of a new release is a periodic check of the files stored on the remote server. Data maintenance is a complex and labor-intensive task that is absolutely necessary for projects that perform intensive local analyses on that data. Thus, a flexible and robust software tool that maintains workflows associated with the various stages of local mirroring of databanks is essential to automate the numerous iterative updates that are required.

∗To whom correspondence should be addressed.

While several commercial solutions are available, free software solutions range in complexity from server-specific shell scripts to the more sophisticated application Biodownloader (Shapovalov et al., 2007). Unfortunately, this program was not designed to manage large numbers of sources, cannot monitor the global repository and is not compatible with UNIX. The best solution evaluated was from the GMOD project, called Citrina (Goodman, 2004).
While less user-friendly than Biodownloader, it provides data synchronization and the ability to launch sequential post-processing tasks on various UNIX-based operating systems. Since the Citrina project is no longer being developed, we began the process of creating a new project, called BioMAJ, using the Citrina source as inspiration. BioMAJ, where MAJ stands for ‘mise à jour’, which is French for ‘update’, is designed to automate and manage data workflows associated with updating and processing local mirrors of large biological databases. The software can be used by both large-scale bioinformatics projects and administrators of large computational infrastructures that provide services based on well-known bioinformatics suites such as EMBOSS, Sequence Retrieval System (SRS) (Etzold et al., 1996) (http://www.biowisdom.com/navigation/srs/srs) and GCG. This article describes how BioMAJ provides a flexible framework for databank synchronization and processing.

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

2 COMPATIBILITY AND PACKAGE
BioMAJ was developed in JAVA and ANT. The application is compliant with various UNIX operating systems. It provides:
(1) A reliable engine that automatically downloads remote data, provides for error correction, synchronizes local and remote data, formats/post-processes this data and publishes the data for all users and/or applications.
(2) A group of predefined templates for synchronizing common biological databanks.
(3) A processing script set that can be applied to the fetched data for indexation and/or format conversion.
(4) XSLT scripts that generate reports describing repository contents and statistics.

3 ENGINE BEHAVIOR
BioMAJ has been specifically designed to manage databank update cycles. It permits flexible data synchronization, controls execution of local post-download processing tasks and logs all activity for later use. All processing tasks are highly configurable and can be executed serially or in parallel. The engine supervises the execution of all tasks declared within each processing stage. In case of an error, only the faulty sub-parts of a treatment are re-executed, which is extremely useful when a treatment requires extensive computational resources. BioMAJ’s features have been developed to iteratively execute workflows in order to routinely update huge and/or numerous databanks in batch mode.

The engine follows a predefined template mapped onto the processes of updating and indexing. Some parts of the template are static and only need custom properties to define the remote server address, the file transfer protocol, regular expressions that select remote files and whether or not downloaded files should be uncompressed. Other stages of BioMAJ templates are more open and can utilize a meta-scheduler for deferred program execution and a basic description language that enables one to implement personalized databank processing.

Each personalized databank update cycle is divided into five stages: initialization, preprocessing, synchronization, post-processing and deployment.
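
The static part of a template described above (server address, transfer protocol, file-selection regular expression, decompression flag) can be pictured with a short Python sketch. The key names, the server and the file names below are all hypothetical and chosen for illustration only; they are not taken from BioMAJ's actual properties files.

```python
import re

# Hypothetical key names and values, for illustration only; BioMAJ's real
# properties files may use different keys.
SAMPLE_PROPERTIES = r"""
# remote source
server = ftp.example.org
protocol = ftp
remote.dir = /pub/databank/
remote.files = ^part\d+\.dat\.gz$
uncompress = true
"""


def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def select_remote_files(props, listing):
    """Keep the names from a remote listing that match the configured regex."""
    pattern = re.compile(props["remote.files"])
    return [name for name in listing if pattern.search(name)]


props = parse_properties(SAMPLE_PROPERTIES)
# A made-up remote directory listing.
listing = ["README", "part1.dat.gz", "part2.dat.gz", "part2.dat.gz.md5"]
selected = select_remote_files(props, listing)
```

With this made-up listing, only the two compressed data files are selected; the README and the checksum file fall outside the regular expression.
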
The exact steps to be executed are described in a text file, referred to as the properties file. The standard behavior of each stage is as follows:

The initialization stage consists of setting up the session, loading the workflow and checking the state of both the current and previous databank releases.

The preprocessing stage handles various actions that must be performed prior to synchronization, such as sending email alert messages and ensuring available disk space. This stage is customizable and has the same features as the post-processing stage discussed subsequently.

During synchronization, the version of the latest databank release is extracted from the remote server using regular expressions on a specified remote file or through interrogation of the remote file’s timestamp. Selected files are fetched using various protocols (ftp, http, rsync, local copy) and transfer integrity checks are performed to ensure that valid local copies have been made. Remote databank tree hierarchies can be preserved and the organization of both local and global repository contents is managed by the application. BioMAJ can perform multiple downloads or updates simultaneously. After the files are downloaded, BioMAJ can automatically uncompress files and reconstruct the desired local release. All file attributes as well as history and provenance information are stored in log files that can also be used to back trace local files and determine which files require resynchronization.

The post-processing stage consists of performing various tasks on synchronized data. Integration of processing programs is easy and flexible, as BioMAJ relies on system calls to execute shell scripts. Information about the context of the databank update cycle, such as input files, parameters and output locations, is transferred to each processing task using parameters declared in the template or shell variables, which are automatically set during system calls.
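
The mechanism of handing the update-cycle context to a processing task through environment variables can be sketched as follows. This is a minimal illustration, not BioMAJ's code: the variable names and paths are hypothetical, and a small Python child process stands in for the shell wrapper script so that the example runs anywhere.

```python
import os
import subprocess
import sys

# Hypothetical variable names and paths; the names BioMAJ actually exports
# are not given in the text.
context = {
    "BANK_NAME": "genbank",
    "INPUT_DIR": "/data/genbank/current/flat",
    "OUTPUT_DIR": "/data/genbank/current/blast",
}

# Stand-in for a wrapper script: a child process that reads its context from
# the environment. A real wrapper would typically be a shell script.
wrapper = (
    "import os; "
    "print(os.environ['BANK_NAME'], os.environ['INPUT_DIR'], "
    "os.environ['OUTPUT_DIR'], sep='|')"
)


def run_task(command, context):
    """Run one processing task with the update-cycle context in its environment."""
    env = dict(os.environ)
    env.update(context)
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


output = run_task([sys.executable, "-c", wrapper], context)
```

Because the wrapper only reads variables, the same script can serve several workflows; each workflow changes the values it exports rather than the script itself.
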
Thus, generic wrappers for bioinformatic programs can be easily developed and reused by multiple workflows. In addition, BioMAJ provides a facility whereby task execution can be organized on a local machine or on a cluster using an external scheduler system. Description of the post-processing stage is handled by three hierarchical elements. The most basic unit of processing is a task, usually a wrapper script containing a set of serial processing commands. Tasks are grouped into meta-processes, which can be further organized into blocks. Blocks, meta-processes and processing tasks allow one to describe customized topologies for data processing. They also permit one to control the order in which data processing tasks are executed. Blocks are launched serially following their declaration order. In a given block, each meta-process is associated with a specific thread so that individual processing tasks can be run in parallel. This allows one to easily design a directed acyclic graph in which each vertex is a processing task with specific attributes and edges are the chronology of execution. Unlike more sophisticated workflow engines such as Taverna (Oinn et al., 2004), neither explicit dependencies of data nor specific semantics have been formalized for the input and output channels of treatments. Users need a priori knowledge of the location of produced data, but this job is greatly facilitated by the default tree directory proposed by the application. Thus, BioMAJ makes it possible to define dependencies between different stages of data processing and to take into account relationships and inter-dependencies between treatments. Tasks can be executed either sequentially or in parallel to optimize execution time.

As an example, many sites routinely index all GenBank divisions both for EMBOSS and BLAST. One can parallelize this task with BioMAJ by defining two sequential blocks that each contains an independent meta-process for each GenBank division.
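
The block / meta-process / task hierarchy can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not BioMAJ's scheduler: task names are hypothetical labels, a task merely records that it ran instead of calling a wrapper script, and only three GenBank divisions are used for brevity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

log = []  # execution order, shared by all threads
log_lock = threading.Lock()


def run_task(name):
    """Stand-in for calling a wrapper script; records that the task ran."""
    with log_lock:
        log.append(name)


def run_meta_process(tasks):
    """Tasks inside one meta-process run one after another."""
    for task in tasks:
        run_task(task)


def run_block(meta_processes):
    """Each meta-process of a block gets its own thread; wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(meta_processes)) as pool:
        futures = [pool.submit(run_meta_process, tasks) for tasks in meta_processes]
        for future in futures:
            future.result()  # re-raises any error raised by a task


def run_workflow(blocks):
    """Blocks are launched serially, in declaration order."""
    for block in blocks:
        run_block(block)


divisions = ["bct", "pri", "vrl"]  # three GenBank divisions, for brevity
emboss_block = [[f"emboss_index:{d}"] for d in divisions]
blast_block = [[f"to_fasta:{d}", f"formatdb:{d}"] for d in divisions]
run_workflow([emboss_block, blast_block])
```

In this model the order of tasks across meta-processes of one block is not fixed, but every task of the first block finishes before the second block starts, and within a meta-process the declared order is kept.
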
In the first block, each meta-process includes an indexation task for EMBOSS. The parallel BLAST block includes a meta-process for each division that contains two sequential processing tasks: conversion from GenBank to FASTA format and formatdb indexation for BLAST. Each task can be executed on an individual cluster node via an external scheduler such as PBS or SGE. In this example, the same wrapper script is called by each process. Behaviors are changed by personalizing the call parameters in the workflow description. Details about the syntax of the current example can be found on the website. The resulting DAG organization, which has also been used in grid applications such as DAGMan (Condor Team, 1990–2007), allows one to significantly reduce the amount of time required to process a new databank release. In the previous example, i.e. the indexation of the 18 divisions of GenBank both for EMBOSS and BLAST, when 18 CPUs (Xeon 5140 Woodcrest 2.3GHz, sharing data with Network File System) are used in parallel for data processing, the elapsed time is 8h30, which is roughly 10 times faster than when run sequentially on a single processor. Due to IO contention and the unbalanced size of GenBank divisions, the speed increase does not scale linearly with the number of meta-processes. However, the benefit when processing large databanks is often measured in hours. Nevertheless, with certain tasks, such as indexation for SRS, chunk decomposition of data processing is not feasible, or difficult to perform. In the best case, the parallelization can only be incorporated in subsequent, sequential tasks.
For SRS indexation, the parallelization has been directly included into the wrapper script using a parallel makefile program.

Finally, after all post-processing tasks are complete, the deployment stage makes the new release available and removes all temporary files and obsolete releases, based upon specified retention/release parameters. Deployment concludes a successful update cycle but BioMAJ can re-execute any faulty steps through its exception handling facilities.

4 BANK ADMINISTRATION AND MONITORING
BioMAJ also has many administrative functions such as online querying tools that interrogate repository contents and management commands that import, delete, rename and move databank releases. Therefore, it is possible to manage the local repository using mainly BioMAJ administrative functions.

Each session for a specific database is registered in an XML state file, which can be exploited under different time scales for monitoring and querying/updating. Within the short time scale, XML state files are used for exception handling. This functionality enables the recovery or restarting of an update cycle. Under the long time scale, XML state files are treated as history files, containing useful information about the database life cycle. BioMAJ can generate HTML reports including graphs describing statistics for each source and global statistics for all databanks included in the repository. Web reports are structured around meta-data information included in the workflow descriptions, which are text variables that annotate properties such as bank type, file format and descriptions of the databank and its associated processing. The resulting report offers a user-friendly way to browse the current and past state of the local repository. Thus, another significant benefit of using BioMAJ is that one can achieve a good level of quality of service (QoS) by obtaining a clear and precise state of the local data repository.
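
How a state file can drive the restart of an interrupted update cycle may be easier to see in a small sketch. The XML layout below is invented for this example (the article does not describe the real schema), and recovery is modeled at stage granularity, which is coarser than the per-task recovery described in the text.

```python
import xml.etree.ElementTree as ET

STAGES = ["initialization", "preprocessing", "synchronization",
          "post-processing", "deployment"]

# Hypothetical state-file layout and values, invented for this sketch.
STATE_XML = """
<session bank="genbank" release="r1">
  <stage name="initialization" status="ok"/>
  <stage name="preprocessing" status="ok"/>
  <stage name="synchronization" status="ok"/>
  <stage name="post-processing" status="error"/>
</session>
"""


def stages_to_rerun(state_xml):
    """Return the stages still to execute: everything from the first stage
    that is missing from the state file or not marked 'ok'."""
    root = ET.fromstring(state_xml.strip())
    status = {s.get("name"): s.get("status") for s in root.findall("stage")}
    for i, stage in enumerate(STAGES):
        if status.get(stage) != "ok":
            return STAGES[i:]
    return []


todo = stages_to_rerun(STATE_XML)
```

Here the session failed during post-processing, so a restart would skip the three completed stages and resume with post-processing followed by deployment. Kept over time, the same files would serve as the history record mentioned above.
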
QoS and traceability are essential in both large and small infrastructures to ensure reproducibility of data analysis pipelines that make use of the downloaded databanks.

5 RESULTS AND PROSPECT
The BioMAJ package currently provides the required functionality to mirror over 100 public domain databases (from servers such as NCBI, EBI, Expasy, tiger, etc.). New databases can easily be added through the configuration of a single properties file. Samples for the most common bioinformatics databases are available in the package. Each sample describes a dedicated workflow of databank synchronization, including in some cases data post-processing. The project website has been specifically designed to share properties files and post-processing scripts between BioMAJ users.

Concerning data processing, multiple indexation post-processes are supported for various applications: NCBI blast, SRS, EMBOSS, GCG. Furthermore, post-processing scripts for format conversion and testing the index integrity after data processing are also available. The BioMAJ architecture is open, so that users can also easily integrate their personal homemade processing scripts (independent of programming language). Full guidelines on how to develop and integrate scripts can be found in the application manual.

In conclusion, our work was motivated by two main ideas. On one hand, academic research needs free software for managing public databanks. On the other hand, this project has been inspired by the workflow approach, an obvious way to capture knowledge for practical usage. Applied to databank synchronization and log reporting, workflows represent a way to normalize data replication and reduce the entropy associated with their dissemination. We believe that the normalization introduced by using a single application for databank synchronization is very likely the only way to efficiently control this process.

In the future, BioMAJ will integrate the BitTorrent file transfer protocol and provide Rich Site Summary support.
Together, these technologies will enable BioMAJ workflows to be executed when the software detects a torrentcast announcing a new release of the databank. Such automation will facilitate local synchronization as well as distribute the network bandwidth requirements associated with moving the large repositories from the remote centers to the local installations.

ACKNOWLEDGEMENTS
The authors would like to thank Jérome Gouzy and Jason S. Iacovoni for fruitful article comments.

Funding: This work has been funded by the RNG (Reseau National des Genopoles), the ReNaBi network, Région Bretagne, INRA and INRIA.

Conflict of Interest: none declared.

REFERENCES
Condor Team, UOW. (1990–2007) Dagman terminology. Copyright © 1990–2007. Available at http://www.cs.wisc.edu/condor/manual/v6.8/2_11DAGMan_Applications.html.
Etzold, T. et al. (1996) SRS: information retrieval system for molecular biology data banks. Meth Enzymol, 266, 114–128.
Goodman, J. (2004) Citrina. Available at http://www.gmod.org/wiki/index.php/Citrina.
Oinn, T. et al. (2004) Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20, 3045–3054.
Shapovalov, M.V. et al. (2007) Biodownloader: bioinformatics downloads and updates in a few clicks. Bioinformatics, 23, 1437–1439.
