4 research outputs found

    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    In the world of Big Data analytics, a series of tools aims to simplify the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data, and execution models, for which only informal (and often confusing) semantics are generally provided, all share a common underlying model: the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects of Big Data analytics tools from a high-level perspective. This analysis can be considered a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it becomes easier for a programmer or software designer to avoid mixing low-level with high-level aspects, as is common in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we argue that clearer and simpler semantics are preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model, implemented as a Domain-Specific Language on top of a stack of layers that builds a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit at each level.
Second, we propose a programming environment based on this layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, essentially a DAG composition of processing elements. The model is intended to give the user a unique interface for both stream and batch processing, completely hiding data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared-memory and distributed parallelism, and implemented in C++11/14 with the aim of bringing C++ into the Big Data world.

    A general-purpose transformation between data transfer objects and entities

    Layered architecture in Java EE web applications is one example of a situation where parallel, non-matching class hierarchies must be maintained. Mapping between Data Transfer Objects (DTOs) and entities causes manual overhead and more code to maintain, and the lack of an automated solution may lead to architectural anti-patterns. To avoid these problems and to streamline the coding process, the mapping can be supported and partially automated. To address the problem, the solutions and techniques related to the mapping process are analyzed, and a runtime mapping component approach is chosen for further analysis. There are multiple techniques for mapping class hierarchies, such as XML, annotations, APIs, or Domain-Specific Languages. Mapping components use reflection to resolve the mapping, but for the actual copying of values, dynamic code generation and caches can be used for better performance. In this thesis, a comprehensive Business Readiness Rating (BRR) analysis was performed; the analyzed categories included features, usability, quality, performance, scalability, support, and documentation. The requirements for a general-purpose mapping component were derived from the needs of Dicode Ltd. Of the eleven implementations found, six were chosen for the complete analysis based on the feature category. Finally, a rating in the range from 1 to 5 was assigned to each component as a weighted average of the results in each category. There are notable differences between the implementations in usability, measured as the amount of configuration needed. Additionally, components using dynamic code generation perform better than the others, but no scalability concerns were noted for a real application. Overall, based on the analysis, we found that very good solutions exist to support the mapping process for Dicode Ltd, and their use can be recommended in future projects.
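
    The core mechanism of such runtime mapping components can be sketched briefly. The thesis evaluates Java components, but the idea translates; the following Python analogue is hypothetical (the class and function names are invented for illustration): copy same-named fields by reflection, with an explicit override table playing the role of annotation or XML configuration.

```python
# Illustrative sketch (hypothetical names; the thesis concerns Java EE
# components): map matching attributes between an entity and a DTO by
# reflection, with an override table standing in for annotations/XML.

class UserEntity:
    def __init__(self, user_id, full_name, password_hash):
        self.user_id = user_id
        self.full_name = full_name
        self.password_hash = password_hash  # must NOT leak into the DTO

class UserDto:
    def __init__(self):
        self.user_id = None
        self.display_name = None

def map_object(source, target, overrides=None):
    """Copy same-named attributes, applying configured renames; fields
    the target does not declare are silently skipped."""
    overrides = overrides or {}
    target_fields = set(vars(target))
    for name, value in vars(source).items():
        mapped = overrides.get(name, name)
        if mapped in target_fields:
            setattr(target, mapped, value)
    return target

dto = map_object(UserEntity(7, "Ada Lovelace", "x9f.."),
                 UserDto(),
                 overrides={"full_name": "display_name"})
print(dto.user_id, dto.display_name)  # → 7 Ada Lovelace
```

    The performance observation in the thesis maps onto this sketch directly: resolving the field pairs via reflection on every call is the slow path, so production components cache the resolved mapping and generate specialized copying code at runtime.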

    The Future of Email Preservation: Task Force Report on Technical Approaches to Email Archives (provisional translation)

    This document is a translation of a report published by the Council on Library and Information Resources (CLIR), made available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Statistical methods in intra-tumor heterogeneity

    A tumor sample from a single patient is often a conglomerate of heterogeneous cells. Although its genetic and transcriptomic sequencing data represent a mixture of signals from different cell types, these data can be deconvolved down to the level of each homogeneous component, allowing us to test the association between composition and response variables of interest. Understanding intra-tumor heterogeneity through deconvolution of genetic data may help identify useful biomarkers to guide the practice of precision medicine. While popular methods exist, they usually do not jointly consider copy number aberrations, somatic point mutations, and their timing within a valid statistical framework. Differential expression analysis of RNA sequencing data from bulk tissue samples (bulk RNA-seq) is a popular and effective approach to many biomedical problems. However, most tissue samples are composed of different cell types, and differential expression analysis that does not account for cell type composition cannot separate changes due to composition from changes in cell type-specific expression. In addition, cell type-specific signals may be masked or even misrepresented, especially for relatively rare cell types. In Chapter 2 of this dissertation, we develop a new statistical method, SHARE (Statistical method for Heterogeneity using Allele-specific REads and somatic point mutations), that reconstructs clonal evolution history from whole-exome sequencing data of matched tumor and normal samples. Our method jointly models copy number aberrations and somatic point mutations using both total and allele-specific read counts. The cellular prevalence, allele-specific copy number, and multiplicity of point mutations within each subclone can be estimated by maximizing the model likelihood. We apply our method to infer the subclonal composition of tumor samples from TCGA colon cancer patients.
In Chapter 3, we propose a new framework to address these limitations: Cell Type Aware analysis of RNA-seq (CARseq). CARseq employs a negative binomial regression approach that fully utilizes the count features of RNA-seq data to improve statistical power. After evaluating its performance in simulations, we apply CARseq to compare gene expression in schizophrenia and autism subjects versus controls. Our results show that these two neurodevelopmental disorders differ in their cell type composition changes, and that genes related to different types of neurotransmitter receptors are differentially expressed in neurons. We also discover overlapping signals of differential expression in microglia, supporting the two diseases' similarity through immune regulation.
Doctor of Philosophy
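
    The premise shared by both chapters, that a bulk measurement is a composition-weighted mixture of cell type-specific signals, can be illustrated with a minimal numerical sketch (not SHARE or CARseq code; the proportions and means below are invented for illustration):

```python
# Illustrative sketch of the deconvolution premise: the expected bulk
# expression of a gene is a mixture of cell type-specific means weighted
# by cell type proportions. A shift in composition alone can mimic
# differential expression when composition is ignored.

def bulk_mean(proportions, celltype_means):
    # E[bulk] = sum_k rho_k * mu_k, with the rho_k summing to 1
    assert abs(sum(proportions) - 1.0) < 1e-9
    return sum(r * m for r, m in zip(proportions, celltype_means))

# Hypothetical cell type-specific means for one gene (e.g. neuron, microglia):
mu = [10.0, 50.0]

controls = bulk_mean([0.8, 0.2], mu)  # 0.8*10 + 0.2*50 = 18.0
cases    = bulk_mean([0.6, 0.4], mu)  # 0.6*10 + 0.4*50 = 26.0

# Bulk expression differs between groups (26 vs 18) although neither cell
# type changed its expression: the difference is purely compositional.
print(controls, cases)
```

    A cell type-aware analysis separates the two effects by modeling the composition-weighted mean rather than the raw bulk value, which is what motivates the regression framework of Chapter 3.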