
    Genetic Sequence Matching Using D4M Big Data Approaches

    Full text link
    Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and an Apache Accumulo database to accelerate computations one hundredfold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST. Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 201
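    A rough feel for the approach can be given with a toy sketch. The snippet below is not the authors' D4M/MATLAB implementation; it only illustrates the general idea of matching sequences through an associative, k-mer-keyed index, with the sequences, ids, and the parameter k all hypothetical.

```python
# Toy sketch only: dictionary-backed "associative array" of k-mers -> sequence ids,
# standing in for the D4M/Accumulo machinery described in the abstract.
from collections import defaultdict

def kmer_index(sequences, k=8):
    """Map each k-mer to the set of reference ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def match_counts(query, index, k=8):
    """Rank reference ids by the number of k-mers shared with the query."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for seq_id in index.get(query[i:i + k], ()):
            counts[seq_id] += 1
    return sorted(counts.items(), key=lambda kv: -kv[1])

references = {"refA": "ACGTACGTGGTTACGTACGTAA", "refB": "TTGGCCAATTGGCCAATTGGCC"}
print(match_counts("ACGTACGTGGTTACGT", kmer_index(references), k=8))
```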

    SMAGEXP: a galaxy tool suite for transcriptomics data meta-analysis

    Full text link
    Background: With the proliferation of available microarray and high-throughput sequencing experiments in the public domain, the use of meta-analysis methods is increasing. In these experiments, where the sample size is often limited, meta-analysis offers the possibility to considerably enhance the statistical power and give more accurate results. For those purposes, it combines either effect sizes or results of single studies in an appropriate manner. The R packages metaMA and metaRNASeq perform meta-analysis on microarray and NGS data, respectively. They are not interchangeable, as they rely on statistical modeling specific to each technology. Results: SMAGEXP (Statistical Meta-Analysis for Gene EXPression) integrates the metaMA and metaRNASeq packages into Galaxy. We aim to propose a unified way to carry out meta-analysis of gene expression data while taking care of their specificities. We have developed this tool suite to analyse microarray data from the Gene Expression Omnibus (GEO) database or custom data from Affymetrix microarrays. These data are then combined to carry out meta-analysis using the metaMA package. SMAGEXP can also combine raw read counts from Next Generation Sequencing (NGS) experiments using DESeq2 and the metaRNASeq package. In both cases, key values, independent of the technology type, are reported to judge the quality of the meta-analysis. These tools are available on the Galaxy main tool shed. Source code, help and installation instructions are available on GitHub. Conclusion: The use of Galaxy offers an easy-to-use gene expression meta-analysis tool suite based on the metaMA and metaRNASeq packages
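    As a purely illustrative sketch of the kind of combination step such tools automate (not the metaMA/metaRNASeq statistics themselves), per-study p-values for a gene can be merged with Fisher's method; the gene names and p-values below are made up.

```python
# Illustrative only: combine per-study differential-expression p-values with
# Fisher's method; metaMA/metaRNASeq implement their own, technology-specific
# effect-size and p-value combination statistics.
from scipy.stats import combine_pvalues

study_pvalues = {          # hypothetical genes and per-study p-values
    "geneA": [0.004, 0.020, 0.110],
    "geneB": [0.300, 0.450, 0.080],
}

for gene, pvals in study_pvalues.items():
    stat, p_combined = combine_pvalues(pvals, method="fisher")
    print(f"{gene}: combined p = {p_combined:.3g}")
```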

    Non-parametric Bayesian modelling of digital gene expression data

    Full text link
    Next-generation sequencing technologies provide a revolutionary tool for generating gene expression data. Starting with a fixed RNA sample, they construct a library of millions of differentially abundant short sequence tags or "reads", which constitute a fundamentally discrete measure of the level of gene expression. A common limitation in experiments using these technologies is the low number, or even absence, of biological replicates, which complicates the statistical analysis of digital gene expression data. Analysis of this type of data has often been based on modified tests originally devised for analysing microarrays; both these and even de novo methods for the analysis of RNA-seq data are plagued by the common problem of low replication. We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression data. We begin with a hierarchical model for modelling over-dispersed count data and a blocked Gibbs sampling algorithm for inferring the posterior distribution of model parameters conditional on these counts. The algorithm compensates for the problem of low numbers of biological replicates by clustering together genes with tag counts that are likely sampled from a common distribution and using this augmented sample for estimating the parameters of this distribution. The number of clusters is not decided a priori, but it is inferred along with the remaining model parameters. We demonstrate the ability of this approach to model biological data with high fidelity by applying the algorithm to a public dataset obtained from cancerous and non-cancerous neural tissues
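    To illustrate one ingredient of such models, namely that the number of clusters is inferred rather than fixed, the sketch below draws gene-to-cluster assignments from a Chinese restaurant process prior. It is only a prior-side illustration, not the authors' hierarchical model or blocked Gibbs sampler, and alpha is a hypothetical concentration parameter.

```python
# Prior-side illustration only: a Chinese restaurant process draw, showing how
# the number of clusters can grow with the data instead of being fixed a priori.
import numpy as np

def crp_assignments(n_items, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    assignments, cluster_sizes = [], []
    for _ in range(n_items):
        probs = np.array(cluster_sizes + [alpha], dtype=float)
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        if choice == len(cluster_sizes):   # open a new cluster
            cluster_sizes.append(1)
        else:
            cluster_sizes[choice] += 1
        assignments.append(int(choice))
    return assignments

print(crp_assignments(20, alpha=2.0))
```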

    Effect of Organizational Factors on Retention of Generation Y Employees in Parastatals: A Case of Kenya Revenue Authority

    Get PDF
    The purpose of this study was to establish the effect of organizational factors on the retention of Generation Y employees in the Kenya Revenue Authority. The specific objectives were to examine the effects of career development, remuneration, employee recognition and management styles on the retention of Generation Y employees in the Kenya Revenue Authority. The target population comprised 461 employees across top-level, middle-level and lower-level management. The research used a descriptive survey design and a stratified sampling technique. The sample size was 285 respondents. The study used primary data collected with pre-determined questionnaires. The quantitative data were analyzed and presented using descriptive statistics, graphs and pie charts, while the qualitative data were analyzed through content analysis. The analysis was done using Statistical Package for Social Sciences (SPSS) version 21. Keywords: career development, remuneration, employee recognition, management styles and retention of Generation Y employees

    On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments

    Get PDF
    Background: In gene expression studies, RNA sample pooling is sometimes considered because of budget constraints or lack of sufficient input material. Using microarray technology, RNA sample pooling strategies have been reported to optimize both the cost of data generation as well as the statistical power for differential gene expression (DGE) analysis. For RNA sequencing, with its different quantitative output in terms of counts and tunable dynamic range, the adequacy and empirical validation of RNA sample pooling strategies have not yet been evaluated. In this study, we comprehensively assessed the utility of pooling strategies in RNA-seq experiments using empirical and simulated RNA-seq datasets. Results: The data generating model in pooled experiments is defined mathematically to evaluate the mean and variability of gene expression estimates. The model is further used to examine the trade-off between the statistical power of testing for DGE and the data generating costs. Empirical assessment of pooling strategies is done through analysis of RNA-seq datasets under various pooling and non-pooling experimental settings. A simulation study is also used to rank experimental scenarios with respect to the rate of false and true discoveries in DGE analysis. The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined. Conclusion: For high within-group gene expression variability, small RNA sample pools are effective to reduce the variability and compensate for the reduced number of replicates. Unlike the typical cost-saving strategies, such as reducing sequencing depth or the number of RNA samples (replicates), an adequate pooling strategy is effective in maintaining the power of testing DGE for genes with low to medium abundance levels, along with a substantial reduction of the total cost of the experiment. In general, pooling RNA samples, or pooling RNA samples in conjunction with a moderate reduction of the sequencing depth, can be good options to optimize the cost and maintain the power
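    The intuition behind pooling can be shown with a deliberately simplified simulation (this is not the paper's data-generating model): when biological variability dominates, averaging several subjects into each pool before measurement shrinks the spread of the group-mean estimate at a fixed number of sequenced libraries. All parameter values below are hypothetical.

```python
# Deliberately simplified: per-subject log-expression has biological noise, each
# sequenced library adds technical noise; a pool averages the biological signal
# of `pool_size` subjects before the technical noise is added.
import numpy as np

rng = np.random.default_rng(1)
mu, bio_sd, tech_sd = 8.0, 1.5, 0.5          # hypothetical parameters

def group_mean_sd(n_libraries, pool_size, n_sim=5000):
    estimates = []
    for _ in range(n_sim):
        pools = rng.normal(mu, bio_sd, (n_libraries, pool_size)).mean(axis=1)
        measured = pools + rng.normal(0.0, tech_sd, n_libraries)
        estimates.append(measured.mean())
    return float(np.std(estimates))

print("6 libraries, no pooling :", round(group_mean_sd(6, 1), 3))
print("6 libraries, pools of 3 :", round(group_mean_sd(6, 3), 3))
```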

    Fully Synthetic Data for Complex Surveys

    Full text link
    When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as the public use files. We use the framework of multiple imputation to enable variance estimation using two data generation strategies. In the first, we generate multiple data sets from each simple random sample, whereas in the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate each approach and the repeated sampling properties of the combining rules using simulation studies
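    A stripped-down sketch of the pipeline (pseudo-population, simple random sample, synthesis model, synthetic draws) is given below. The pseudo-population step here resamples units with probability proportional to their survey weights, which is only a crude stand-in for the weighted finite population Bayesian bootstrap used in the paper; the weights, variable, and sample sizes are hypothetical.

```python
# Crude stand-in for the paper's procedure: weight-proportional resampling builds
# a pseudo-population, an SRS is drawn from it, a simple normal synthesis model is
# fit, and simulated draws from that model form one synthetic data set.
import numpy as np

rng = np.random.default_rng(7)
n, N, m = 200, 5000, 5                        # sample size, pseudo-population size, synthetic sets
weights = rng.uniform(10, 40, n)              # hypothetical survey weights
y = rng.normal(50 + 0.1 * weights, 5)         # hypothetical confidential variable

def one_synthetic_dataset():
    idx = rng.choice(n, size=N, replace=True, p=weights / weights.sum())
    pseudo_y = y[idx]                                        # pseudo-population
    srs = rng.choice(pseudo_y, size=n, replace=False)        # simple random sample
    return rng.normal(srs.mean(), srs.std(ddof=1), size=n)   # synthetic draws

synthetic_sets = [one_synthetic_dataset() for _ in range(m)]
print([round(float(s.mean()), 2) for s in synthetic_sets])
```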

    The effect of faking on the correlation between two ordinal variables: some population and Monte Carlo results

    Get PDF
    Correlational measures are probably the most widespread statistical tools in psychological research. They are used by researchers to investigate, for example, relations between self-report measures usually collected using paper-and-pencil or online questionnaires. Like many other statistical analyses, correlational measures can be seriously affected by specific sources of bias which constitute serious threats to the final observed results. In this contribution, we focus on the impact of the fake-data threat on the interpretation of statistical results for two well-known correlational measures (the Pearson product-moment correlation and the Spearman rank-order correlation). By using the Sample Generation by Replacement (SGR) approach, we analyze uncertainty in inferences based on possible fake data and evaluate the implications of fake data for correlational results. A population-level analysis and a Monte Carlo simulation are performed to study different modulations of faking on bivariate discrete variables with finite supports and varying sample sizes. We show that by using our paradigm it is always possible, under specific faking conditions, to increase (or decrease) the original correlation between two discrete variables in a predictable and systematic manner
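    A minimal Monte Carlo sketch in the spirit of this setup (not the authors' SGR implementation) is shown below: ordinal data are generated with a known association, a fraction of responses on one variable is replaced by inflated "fake" values, and the Pearson and Spearman correlations are compared before and after. The sample size, correlation, faking rate, and cut-points are all hypothetical.

```python
# Illustration only: positive faking on one of two 5-point ordinal variables and
# its effect on the Pearson and Spearman correlations.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
n, rho, fake_rate = 500, 0.5, 0.3            # hypothetical settings

latent = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
ordinal = np.digitize(latent, bins=[-1.0, -0.3, 0.3, 1.0]) + 1   # 5-point scale

faked = ordinal.copy()
mask = rng.random(n) < fake_rate             # respondents who fake item 2 upward
faked[mask, 1] = np.minimum(faked[mask, 1] + rng.integers(1, 3, mask.sum()), 5)

for label, data in (("original", ordinal), ("faked   ", faked)):
    print(label, round(pearsonr(data[:, 0], data[:, 1])[0], 3),
          round(spearmanr(data[:, 0], data[:, 1])[0], 3))
```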

    The Power of Shopee Live Streaming on Z Generation Purchasing Decisions

    Get PDF
    The focus of this research is to determine the power of live streaming carried out by shops selling on Shopee over product purchasing decisions by Generation Z. Quantitative methods with survey data collection techniques were used to explain this phenomenon. The instrument consisted of 15 question items, nine on the live streaming variable and six on purchasing decisions, packaged in a Google Form questionnaire and distributed via social media. The sample consisted of 185 respondents aged 13-26 years (Generation Z). The data were analyzed using descriptive and inferential statistics to determine the relationship between the variables, with statistical analysis carried out in SPSS version 25. The results show that Shopee live streaming contributes 62% to Generation Z purchasing decisions. This indicates that live streaming can attract consumers to buy, help sellers' products reach a wider market, educate consumers directly about the products on offer, and increase immediate feedback

    Takeuchi's Information Criteria as a form of Regularization

    Full text link
    Takeuchi's Information Criteria (TIC) is a linearization of maximum likelihood estimator bias which shrinks the model parameters towards the maximum entropy distribution, even when the model is mis-specified. In statistical machine learning, L_2 regularization (a.k.a. ridge regression) also introduces a parameterized bias term with the goal of minimizing out-of-sample entropy, but generally requires a numerical solver to find the regularization parameter. This paper presents a novel regularization approach based on TIC; the approach does not assume a data generation process and results in a higher entropy distribution through more efficient sample noise suppression. The resulting objective function can be directly minimized to estimate and select the best model, without the need to select a regularization parameter, as in ridge regression. Numerical results applied to a synthetic high-dimensional dataset generated from a logistic regression model demonstrate superior model performance when using the TIC-based regularization over an L_1 and an L_2 penalty term
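    For reference, the standard form of TIC (the general definition, not the paper's particular linearization or regularizer) penalizes the maximized log-likelihood by a trace term that reduces to the AIC penalty 2k when the model is correctly specified:

```latex
% Standard TIC: \hat{J} is the average negative Hessian of the log-likelihood and
% \hat{K} the average outer product of the score, both evaluated at the MLE \hat{\theta}.
\mathrm{TIC} = -2 \sum_{i=1}^{n} \log f(x_i \mid \hat{\theta})
             + 2 \operatorname{tr}\!\left(\hat{J}^{-1}\hat{K}\right),
\qquad
\hat{J} = -\frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} \log f(x_i \mid \hat{\theta}),
\qquad
\hat{K} = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta} \log f(x_i \mid \hat{\theta})\,
          \nabla_{\theta} \log f(x_i \mid \hat{\theta})^{\top}
```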