
    Research on High-performance and Scalable Data Access in Parallel Big Data Computing

    To facilitate big data processing, many dedicated data-intensive storage systems have been developed, such as the Google File System (GFS), the Hadoop Distributed File System (HDFS), and the Quantcast File System (QFS). Currently, HDFS [20] is the state-of-the-art and most popular open-source distributed file system for big data processing. It is widely deployed as the bedrock for many big data processing systems and frameworks, such as the script-based Pig system, MPI-based parallel programs, graph processing systems, and the Scala/Java-based Spark framework. These systems and applications employ parallel processes or executors to speed up data processing within scale-out clusters. Job and task schedulers in parallel big data applications such as mpiBLAST and ParaView can maximize the usage of computing resources such as memory and CPU by tracking resource consumption and availability for task assignment. However, because these schedulers take neither the distributed I/O resources nor the global data distribution into consideration, the data requests from parallel processes or executors in big data processing are unfortunately served in an imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise for two reasons: (a) unlike conventional parallel file systems, which use striping policies to distribute data evenly among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk or block file, in several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy in HDFS, the more data a storage node contains, the higher the probability that it is selected to serve the data. Therefore, on nodes serving multiple chunk files, the data requests from different processes or executors compete for shared resources such as the hard disk head and network bandwidth. As a result, the makespan of the entire program can be significantly prolonged and the overall I/O performance degrades.
    The first part of my dissertation addresses these problems by creating an I/O middleware system and designing matching-based algorithms to optimize data access in parallel big data processing. To address the problem of remote data movement, we develop an I/O middleware system, called SLAM, which allows MPI-based analysis and visualization programs to benefit from locality reads, i.e., each MPI process can access its required data from a local or nearby storage node. This greatly improves execution performance by reducing the amount of data moved over the network. Furthermore, to address the problem of imbalanced data access, we propose a method called Opass, which models the data read requests issued by parallel applications to cluster nodes as a graph whose edge weights encode load-capacity demands. We then employ matching-based algorithms to map processes to data so that data accesses are served in a balanced fashion. The final part of my dissertation focuses on optimizing sub-dataset analyses in parallel big data processing. Our proposed methods can benefit different analysis applications with various computational requirements, and experiments on different cluster testbeds show their applicability and scalability.
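    The abstract describes Opass only at a high level. As a minimal sketch of how a matching-based process-to-data mapping can be computed, assuming a chunk-replica map as input, the snippet below casts the problem as a min-cost bipartite matching solved with SciPy's linear_sum_assignment; the function name, replica map, and remote-read penalty are hypothetical illustrations, not the dissertation's actual algorithm.

        # Hypothetical sketch: balanced process-to-data assignment as a
        # min-cost bipartite matching (in the spirit of Opass, not its
        # published algorithm). replica_nodes[p] is the set of node ids
        # holding a replica of the chunk that process p must read.
        from math import ceil
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def assign_processes(replica_nodes, num_nodes, remote_penalty=10.0):
            num_procs = len(replica_nodes)
            slots = ceil(num_procs / num_nodes)  # per-node capacity for balance
            # Columns are (node, slot) pairs; each extra slot on a node costs
            # slightly more, steering the matching toward an even spread.
            cost = np.empty((num_procs, num_nodes * slots))
            for p, replicas in enumerate(replica_nodes):
                for n in range(num_nodes):
                    base = 0.0 if n in replicas else remote_penalty  # favor local reads
                    for s in range(slots):
                        cost[p, n * slots + s] = base + s
            rows, cols = linear_sum_assignment(cost)  # optimal bipartite matching
            return {p: c // slots for p, c in zip(rows, cols)}  # process -> node

        # Example: four processes, two nodes; node 0 holds most replicas.
        print(assign_processes([{0}, {0}, {0, 1}, {1}], num_nodes=2))
        # -> {0: 0, 1: 0, 2: 1, 3: 1}: two processes per node, all reads local.

    Duplicating each node into capacity "slots" turns the load-balancing constraint into an ordinary assignment problem, which conveys the general flavor of the matching-based mapping the abstract refers to.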

    JCoast – A biologist-centric software tool for data mining and comparison of prokaryotic (meta)genomes

    Background: Current sequencing technologies give access to sequence information for genomes and metagenomes at a tremendous speed. Subsequent data processing is mainly performed by automatic pipelines provided by the sequencing centers. Although standardised workflows are desirable and useful in many respects, rational data mining, comparative genomics, and especially the interpretation of sequence information in its biological context demand intuitive, flexible, and extendable solutions.
    Results: The JCoast software tool was primarily designed to analyse and compare (meta)genome sequences of prokaryotes. Based on a pre-computed GenDB database project, JCoast offers a flexible graphical user interface (GUI), as well as an application programming interface (API) that facilitates back-end data access. JCoast offers individual, cross-genome, and metagenome analysis, and assists the biologist in the exploration of large and complex datasets.
    Conclusion: JCoast combines all functions required for the mining, annotation, and interpretation of (meta)genomic data. The lightweight software solution allows the user to easily take advantage of advanced back-end database structures by providing a programming and graphical user interface to answer biological questions. JCoast is available at the project homepage.

    Optimization of DNA extraction from human urinary samples for mycobiome community profiling.

    Introduction: Recent data suggest the urinary tract hosts a microbial community of varying composition, even in the absence of infection. Culture-independent methodologies, such as next-generation sequencing of conserved ribosomal DNA sequences, provide an expansive look at these communities, identifying both common commensals and fastidious organisms. A fundamental challenge has been the isolation of DNA representative of the entire resident microbial community, including fungi.
    Materials and methods: We evaluated multiple modifications of commonly used DNA extraction procedures on standardized male and female urine samples, comparing the resulting overall, fungal, and bacterial DNA yields by quantitative PCR. After identifying protocol modifications that increased DNA yields (lyticase/lysozyme digestion, bead beating, boil/freeze cycles, proteinase K treatment, and carrier DNA use), all modifications were combined for systematic confirmation of optimal protocol conditions. This optimized protocol was tested against commercially available methodologies to compare overall and microbial DNA yields, community representation, and diversity by next-generation sequencing (NGS).
    Results: Overall and fungal-specific DNA yields from standardized urine samples demonstrated that microbial abundances differed significantly among the eight methods used. Methodologies that included multiple disruption steps (enzymatic, mechanical, and thermal disruption plus proteinase digestion), particularly in combination with small-volume processing and pooling steps, provided more comprehensive representation of the range of bacterial and fungal species. Concentrating larger-volume urine specimens by low-speed centrifugation proved highly effective, increasing resulting DNA levels and providing greater microbial representation and diversity.
    Conclusions: Alterations in the methodology of urine storage, preparation, and DNA processing improve microbial community profiling using culture-independent sequencing methods. Our optimized protocol for DNA extraction from urine samples provided improved fungal community representation. Use of this technique resulted in equivalent representation of the bacterial populations as well, making it useful for the concurrent evaluation of bacterial and fungal populations by NGS.
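    The abstract compares community representation and diversity across protocols without naming a specific metric. Purely as a generic, hypothetical illustration (not the paper's actual analysis pipeline), the snippet below computes the Shannon diversity index from per-sample taxon read counts, one common way such NGS profiles are summarized; the protocol names and counts are invented.

        # Generic illustration, not the paper's pipeline: Shannon diversity
        # H' = -sum(p_i * ln p_i) over taxa, from hypothetical read counts.
        import math

        def shannon_diversity(counts):
            total = sum(counts)
            return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

        # Hypothetical per-taxon read counts under two extraction protocols.
        protocol_a = [900, 50, 30, 20]     # one taxon dominates the profile
        protocol_b = [300, 250, 250, 200]  # more even community capture
        print(shannon_diversity(protocol_a))  # ~0.43, lower diversity
        print(shannon_diversity(protocol_b))  # ~1.38, higher diversity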

    Some problems of iron and steelmaking in the Hindustan Steel plants

    The credit for pioneering the growth of a fully integrated and well-planned iron and steel industry in India undoubtedly goes to the house of Tata, which started from a humble beginning and grew into a mighty iron and steel complex. Under the tempo of the five year plans, the Government of India has further realized the importance of a well-knit heavy iron and steel base to feed the chain-reaction growth of secondary and processing engineering industries, which in turn form the backbone of consumer industries catering to the requirements of diverse products essential in times of both war and peace. Even though the iron and steel industry is highly capital intensive, it cannot be left to the vagaries of international trade agreements and barter arrangements to meet iron and steel requirements for almost unlimited applications in industry.

    Accelerated Profile HMM Searches

    Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
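    The MSV recurrence itself is not spelled out in the abstract. The following is a scalar, deliberately simplified sketch of the "optimal sum of multiple ungapped local alignment segments" idea, assuming a plain match/mismatch model in place of a profile HMM and omitting the striped SIMD vectorization, so it illustrates the concept rather than HMMER3's actual filter; the scoring parameters and per-segment begin penalty are hypothetical.

        # Simplified, non-vectorized sketch of multi-segment ungapped local
        # alignment scoring (the concept behind MSV, not HMMER3's striped
        # SIMD code, and a match/mismatch model instead of a profile HMM).
        # Segments are ordered along the target, and each segment may align
        # any region of the query, echoing MSV's multi-hit model.
        NEG = float("-inf")

        def msv_like_score(query, target, match=2.0, mismatch=-1.0,
                           begin_penalty=3.0):
            m = len(query)
            prev = [NEG] * (m + 1)   # diagonal scores from the previous target row
            best_chain = NEG         # best total over completed segment chains
            for t in target:
                b = max(0.0, best_chain) - begin_penalty  # cost to open a segment
                cur = [NEG] * (m + 1)
                for j in range(1, m + 1):
                    s = match if query[j - 1] == t else mismatch
                    cur[j] = s + max(prev[j - 1], b)      # extend diagonal or begin
                best_chain = max(best_chain, max(cur))    # optionally close a segment
                prev = cur
            return max(0.0, best_chain)

        # Two strong ungapped segments separated by junk chain together (12.0),
        # beating either segment alone (5.0 and 7.0 after the begin penalty).
        print(msv_like_score("ACGTTTGCA", "ACGTAAAAAAAAATTGCA"))

    HMMER3's production filter additionally runs this kind of recurrence in striped SIMD registers with reduced-precision scores, per the striped vector-parallel approach the abstract mentions.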