
    The application of the Hadoop software framework in Bioinformatics programs

    The project described in this dissertation proposal attempted to improve the efficiency and scalability as well as the usability and user experience of three Bioinformatics applications - DNA/peptide sequence similarity comparison, digital DNA library subtraction, and DNA/peptide sequence de-duplication - by 1) adopting the Hadoop MapReduce algorithms and distributed file system and 2) wrapping the fully automated Hadoop programs in a user-friendly graphical user interface (GUI). In addition, the researcher was also interested in investigating the advantages and limitations of the Hadoop software framework as a general methodology for parallelizing Bioinformatics programs. After considering the original calculation algorithms in the serial versions of the programs, the available computational resources, the nature of the MapReduce framework, and performance optimization, a processing pipeline with one pre-processing step, three mappers, two reducers, and one post-processing step was developed. A GUI that enabled users to specify input/output files and program parameters was then created, with user-friendly features such as organized instructions, detailed log files, and multi-user accessibility built in. The new, fully automated Hadoop Bioinformatics toolkit showed execution efficiency comparable with its MPI counterparts on medium- to large-scale data, and better efficiency than MPI on ultra-large datasets. In addition, good scalability was observed with test datasets of up to 20 Gb.
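    For readers unfamiliar with the MapReduce pattern the abstract relies on, the following is a minimal Hadoop Streaming sketch of sequence de-duplication in Python. It illustrates the general map/shuffle/reduce idea only, not the dissertation's actual three-mapper/two-reducer pipeline, and the tab-separated "id, sequence" record format is an assumption.

```python
#!/usr/bin/env python3
# Hadoop Streaming sketch: MapReduce-style de-duplication of DNA/peptide sequences.
# Illustration of the general map/shuffle/reduce idea only, not the dissertation's
# actual pipeline; the "id<TAB>sequence" record format is an assumption.
import sys

def mapper():
    # Emit the sequence itself as the key so identical sequences meet at one reducer.
    for line in sys.stdin:
        seq_id, sequence = line.rstrip("\n").split("\t", 1)
        print(f"{sequence}\t{seq_id}")

def reducer():
    # Hadoop sorts mapper output by key, so duplicate sequences arrive adjacently;
    # keep only the first record seen for each distinct sequence.
    last_seq = None
    for line in sys.stdin:
        sequence, seq_id = line.rstrip("\n").split("\t", 1)
        if sequence != last_seq:
            print(f"{seq_id}\t{sequence}")
            last_seq = sequence

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```

    With Hadoop Streaming, such a script would typically be launched with something like `hadoop jar hadoop-streaming.jar -input seqs -output unique -mapper "dedup.py map" -reducer "dedup.py reduce" -file dedup.py`; because identical sequences hash to the same reducer, duplicates are removed without any single node holding the whole dataset.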

    Doctor of Philosophy

    Dataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs such as an SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models only provide support for data-parallelism while ignoring task-parallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, several libraries exist that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences between executing a task graph and executing a pipeline, along with their APIs being considerably low-level, it remains a challenge to integrate these run-time libraries into current visualization systems. Thus, there is a need for a redesigned dataflow architecture to fully support and exploit the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes support for platforms ranging from multicore processors to clusters consisting of thousands of CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules along with a highly customizable scheduler that dynamically controls the assignment of tasks to elements. This creates an intuitive way to maintain multiple CPU/GPU kernels while still providing coherency and synchronization across module executions. We have implemented these techniques in HyperFlow, which consists of an API providing all basic dataflow constructs described in the dissertation and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU, and cluster-based platforms.
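    As a rough illustration of the scheduling model described above (task-oriented modules dispatched to a pool of processing elements), here is a minimal Python sketch. The class names, the thread-based "elements", and the greedy ready-queue policy are assumptions for illustration and do not reflect HyperFlow's actual API.

```python
# Sketch of a dataflow scheduler: task-oriented modules are released to a ready
# queue once their inputs exist and are executed by a pool of "virtual processing
# elements" (plain worker threads here). Names and policy are illustrative only.
import queue, threading

class Module:
    """A task-oriented module: a named function plus the modules it consumes."""
    def __init__(self, name, fn, upstream=()):
        self.name, self.fn, self.upstream = name, fn, list(upstream)

def run_pipeline(modules, num_elements=4):
    results, lock, done = {}, threading.Lock(), threading.Event()
    ready = queue.Queue()
    remaining = {m.name: len(m.upstream) for m in modules}
    for m in modules:
        if not m.upstream:
            ready.put(m)

    def element():  # one "virtual processing element"
        while not done.is_set():
            try:
                m = ready.get(timeout=0.1)
            except queue.Empty:
                continue
            out = m.fn(*(results[u] for u in m.upstream))
            with lock:
                results[m.name] = out
                for other in modules:  # release modules whose inputs are now complete
                    if m.name in other.upstream:
                        remaining[other.name] -= 1
                        if remaining[other.name] == 0:
                            ready.put(other)
                if len(results) == len(modules):
                    done.set()

    workers = [threading.Thread(target=element) for _ in range(num_elements)]
    for w in workers: w.start()
    for w in workers: w.join()
    return results

# Example: a tiny read -> filter -> render dataflow executed with task-parallelism.
mods = [Module("read",   lambda: list(range(10))),
        Module("filter", lambda d: [x for x in d if x % 2 == 0], ["read"]),
        Module("render", lambda d: f"rendered {len(d)} items", ["filter"])]
print(run_pipeline(mods)["render"])
```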

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015. As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed the social context information necessary for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms. In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges for parallel stream clustering. Due to the sparsity of the high-dimensional data, traditional synchronization methods become expensive and severely impact the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
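    The key streaming idea, broadcasting only the incremental centroid changes instead of full high-dimensional centroids, can be sketched as follows in Python. The sparse dictionary representation, the update rule, and the learning rate are illustrative assumptions, not the dissertation's algorithm.

```python
# Sketch of synchronizing parallel stream clustering with incremental deltas:
# for sparse, high-dimensional points, each worker ships only the dimensions it
# actually touched instead of whole centroid vectors. Update rule is assumed.
from collections import defaultdict

def sparse_dist(a, b):
    dims = set(a) | set(b)
    return sum((a.get(d, 0.0) - b.get(d, 0.0)) ** 2 for d in dims)

def local_updates(points, centroids):
    """Assign sparse points (dim -> value dicts) to the nearest centroid and
    accumulate per-centroid deltas over only the dimensions that occur."""
    deltas = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for p in points:
        cid = min(centroids, key=lambda c: sparse_dist(p, centroids[c]))
        counts[cid] += 1
        for dim, val in p.items():
            deltas[cid][dim] += val - centroids[cid].get(dim, 0.0)
    return deltas, counts

def apply_deltas(centroids, deltas, counts, rate=0.1):
    """What each worker applies after the (small) deltas are broadcast to it."""
    for cid, delta in deltas.items():
        for dim, change in delta.items():
            centroids[cid][dim] = centroids[cid].get(dim, 0.0) + rate * change / max(counts[cid], 1)

# Example: two centroids over a nominally million-dimensional space, yet only a
# handful of non-zero dimensions would ever cross the wire.
centroids = {0: {3: 1.0}, 1: {999_999: 1.0}}
batch = [{3: 0.9, 7: 0.2}, {999_999: 1.1}]
d, c = local_updates(batch, centroids)
apply_deltas(centroids, d, c)
print(centroids)
```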

    Overcoming Challenges in Predictive Modeling of Laser-Plasma Interaction Scenarios. The Sinuous Route from Advanced Machine Learning to Deep Learning

    The interaction of ultrashort and intense laser pulses with solid targets and dense plasmas is a rapidly developing area of physics, mostly owing to the significant advancements in laser technology. There is, thus, a growing interest in diagnosing as accurately as possible the numerous phenomena related to the absorption and reflection of laser radiation. At the same time, envisaged experiments are in high demand of simulation software of increased accuracy. As laser-plasma interaction modeling is experiencing a transition from computationally-intensive to data-intensive problems, the traditional codes employed so far are starting to show their limitations. It is in this context that predictive modeling of laser-plasma interaction experiments is bound to reshape the definition of simulation software. This chapter focuses on an entire class of predictive systems incorporating big data, advanced machine learning algorithms, and deep learning, with improved accuracy and speed. Making use of terabytes of already available information (literature as well as simulation and experimental data), these systems enable the discovery and understanding of various physical phenomena occurring during interaction, hence allowing researchers to set up controlled experiments at optimal parameters. A comparative discussion of laser-plasma interaction predictive systems in terms of challenges, advantages, bottlenecks, performance, and suitability is ultimately provided.

    Large-scale Data Analysis and Deep Learning Using Distributed Cyberinfrastructures and High Performance Computing

    Data in many research fields continues to grow in both size and complexity. For instance, recent technological advances have increased data throughput in various biology-related endeavors, such as DNA sequencing, molecular simulations, and medical imaging. In addition, the variety of data types (textual, signal, image, etc.) adds further complexity to analyzing the data. As such, there is a need for purpose-built applications that cater to the type of data. Several considerations must be made when attempting to create a tool for a particular dataset. First, we must consider the type of algorithm required for analyzing the data. Next, since the size and complexity of the data impose high computation and memory requirements, it is important to select a proper hardware environment on which to build the application. By carefully developing the algorithm and selecting the hardware, we can provide an effective environment in which to analyze huge amounts of highly complex data at large scale. In this dissertation, I go into detail regarding my applications of big data and deep learning techniques to the analysis of complex and large data. I investigate how big data frameworks, such as Hadoop, can be applied to problems such as large-scale molecular dynamics simulations. Following this, many popular deep learning frameworks are evaluated and compared to find those that suit certain hardware setups and deep learning models. Then, we explore an application of deep learning to a biomedical problem, namely ADHD diagnosis from fMRI data. Lastly, I demonstrate a framework for real-time and fine-grained vehicle detection and classification. For each of these works, a unique large-scale analysis algorithm or deep learning model is implemented that caters to the problem and leverages specialized computing resources.

    Scaling out Big Data Distributed Pricing in Gaming Industry

    Game companies have millions of customers, billions of transactions and petabytes of other data related to game events. The vast volume and complexity of this data make it practically impossible to process and analyze with traditional relational database management systems (RDBMSs). This kind of data can be identified as Big Data, and in order to handle it in an efficient manner, multiple issues have to be taken into account. It is more straightforward to address these problems when developing a completely new system, which can be built from the start with the techniques and platforms that support big data handling. However, if an existing system needs to be modified to accommodate big data volumes, there are more issues to take into account. This thesis starts by clarifying the definition of 'big data'. Scalability and parallelism are key factors for handling big data, so they are explained and some of the common approaches to achieving them are reviewed. Next, different tools and platforms for parallel programming are presented. The relevance of big data in the gaming industry is briefly explained, as are the different monetization models that games have. Furthermore, price elasticity of demand is explained to give a better understanding of what a Dynamic Pricing Engine is and what it does. In this thesis, I solve a bottleneck that emerges in data transfer and processing when introducing big data to an existing system, a Dynamic Pricing Engine, by using parallel programming to scale the system. Spark is used to fetch and process the distributed data. The main focus is on the impact of using parallel programming in comparison to the current solution, which is implemented with PHP and MySQL. Furthermore, Spark implementations are run against different data storage solutions, such as MySQL and Hadoop/HDFS, and their performance is compared. The results of utilizing Spark show a significant improvement in processing time. However, the importance of choosing the right data storage cannot be overstated, as the speed of fetching the data can vary widely.
    Game companies have millions of customers, billions of payment transactions, and petabytes of data related to game events. The large volume and complexity of this data make processing and analyzing it nearly impossible with ordinary relational databases. Such data can be called Big Data, and several issues must be taken into account in order to handle it efficiently. When implementing a new system, these problems can be addressed fairly consistently, since the latest techniques and platforms can then easily be adopted. If instead an existing system needs to be changed to cope with big-data-scale volumes, the number of issues to consider grows. This thesis begins by explaining the term 'Big Data'. Working with Big Data requires scalability and parallelism, so these terms, as well as the most common practices for achieving them, are reviewed. Next, tools and platforms that make parallel programming possible are presented. The significance of Big Data in the game industry is explained briefly, as are the different monetization models that game companies use. In addition, price elasticity of demand is covered so that the reader can more easily understand what Apprien, presented next, is and what it is used for.
    In this thesis I look for a solution to a problem, arising in the transfer and processing of Big Data, for an existing system, Apprien. This bottleneck is resolved by using parallel programming with Spark. The main focus is on determining the benefit achieved with parallel programming compared to the current solution, which is implemented with PHP and MySQL. In addition, the Spark implementation is run against different data storage solutions (MySQL, Hadoop+HDFS), and their performance is compared. The results obtained with the Spark implementation show a significant improvement in the execution time of data processing. The importance of choosing the right data storage model should not be underestimated, since the time spent on transferring the data also varies considerably depending on the platform.
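    The following is a minimal PySpark sketch of the kind of comparison described above: the same aggregation run once against a MySQL table over JDBC and once against Parquet files on HDFS. The table name, columns, connection details, and paths are assumptions, not Apprien's actual schema.

```python
# Sketch: one aggregation, two data sources (MySQL via JDBC vs. Parquet on HDFS),
# so their fetch-and-process times can be compared. All names/URLs are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pricing-etl").getOrCreate()

# Source 1: fetch transaction rows straight from MySQL over JDBC
# (requires the MySQL JDBC driver on the Spark classpath).
mysql_df = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://db-host:3306/game")   # assumed host/schema
            .option("dbtable", "transactions")                  # assumed table
            .option("user", "reader").option("password", "secret")
            .load())

# Source 2: the same data exported to HDFS as Parquet.
hdfs_df = spark.read.parquet("hdfs:///data/transactions/")      # assumed path

def revenue_per_item(df):
    """Aggregate purchase events into per-item revenue, computed in parallel."""
    return (df.groupBy("item_id")
              .agg(F.sum("price").alias("revenue"), F.count("*").alias("purchases")))

revenue_per_item(mysql_df).show()
revenue_per_item(hdfs_df).show()
```

    Both paths parallelize the aggregation itself across the cluster; the difference the thesis measures comes largely from how quickly each storage backend can feed data into Spark.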