Genetic Sequence Matching Using D4M Big Data Approaches
Recent technological advances in Next Generation Sequencing tools have led to
increasing speeds of DNA sample collection, preparation, and sequencing. One
instrument can produce over 600 Gb of genetic sequence data in a single run.
This creates new opportunities to efficiently handle the increasing workload.
We propose a new method of fast genetic sequence analysis using the Dynamic
Distributed Dimensional Data Model (D4M) - an associative array environment for
MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and
statistical properties, the method leverages big data techniques and the
implementation of an Apache Accumulo database to accelerate computations
one-hundred fold over other methods. Comparisons of the D4M method with the
current gold-standard for sequence analysis, BLAST, show the two are comparable
in the alignments they find. This paper will present an overview of the D4M
genetic sequence algorithm and statistical comparisons with BLAST.
Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC)
201
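The associative-array idea behind D4M can be illustrated with a toy sketch. The code below is not D4M itself (which is a MATLAB environment backed by Accumulo); it is a minimal Python analogue in which a plain dictionary plays the role of the associative array, indexing reference sequences by their k-mers. The function names, the k-mer length, and the toy sequences are all hypothetical:

```python
from collections import defaultdict

def kmers(seq, k=4):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(references, k=4):
    """Map each k-mer to the set of reference ids containing it
    (a plain-dict stand-in for an associative-array row index)."""
    index = defaultdict(set)
    for ref_id, seq in references.items():
        for km in kmers(seq, k):
            index[km].add(ref_id)
    return index

def match(sample, index, k=4):
    """Count shared k-mers between a sample read and each reference."""
    hits = defaultdict(int)
    for km in kmers(sample, k):
        for ref_id in index.get(km, ()):
            hits[ref_id] += 1
    return dict(hits)

refs = {"refA": "ACGTACGTGG", "refB": "TTTTCCCCGG"}
index = build_index(refs)
print(match("ACGTACGT", index))  # shared k-mer counts per reference
```

The point of the associative-array formulation is that both index construction and lookup reduce to sparse matrix-like operations, which is what lets a distributed store such as Accumulo parallelize them.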
SMAGEXP: a galaxy tool suite for transcriptomics data meta-analysis
Background: With the proliferation of available microarray and high throughput
sequencing experiments in the public domain, the use of meta-analysis methods
increases. In these experiments, where the sample size is often limited,
meta-analysis offers the possibility to considerably enhance the statistical
power and give more accurate results. For those purposes, it combines either
effect sizes or results of single studies in an appropriate manner. R packages
metaMA and metaRNASeq perform meta-analysis on microarray and NGS data,
respectively. They are not interchangeable as they rely on statistical modeling
specific to each technology.
Results: SMAGEXP (Statistical Meta-Analysis for Gene EXPression) integrates
metaMA and metaRNASeq packages into Galaxy. We aim to propose a unified way to
carry out meta-analysis of gene expression data, while taking care of their
specificities. We have developed this tool suite to analyse microarray data
from the Gene Expression Omnibus (GEO) database or custom data from Affymetrix
microarrays. These data are then combined to carry out meta-analysis using the
metaMA package. SMAGEXP can also combine raw read counts from Next
Generation Sequencing (NGS) experiments using DESeq2 and the metaRNASeq package. In
both cases, key values, independent from the technology type, are reported to
judge the quality of the meta-analysis. These tools are available on the Galaxy
main tool shed. Source code, help and installation instructions are available
on GitHub.
Conclusion: Built on Galaxy, SMAGEXP offers an easy-to-use gene expression
meta-analysis tool suite based on the metaMA and metaRNASeq packages.
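As a rough illustration of the combination step such meta-analysis methods perform (this is not the actual metaMA or metaRNASeq implementation), Fisher's method pools independent per-study p-values into a single test statistic. The helper below is a minimal stdlib sketch; its name and the example p-values are hypothetical:

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: combine k independent p-values.
    The statistic X = -2 * sum(log p_i) follows a chi-square
    distribution with 2k degrees of freedom under the global null."""
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    # chi-square survival function has a closed form for even df = 2k
    half = x / 2.0
    sf = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return x, sf

stat, combined_p = fisher_combine([0.04, 0.10, 0.03])
print(round(stat, 3), round(combined_p, 4))
```

Three studies that are individually borderline combine into a clearly significant result, which is the statistical-power gain the abstract refers to; metaMA and metaRNASeq use technology-specific effect-size and p-value models rather than this bare textbook rule.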
Non-parametric Bayesian modelling of digital gene expression data
Next-generation sequencing technologies provide a revolutionary tool for
generating gene expression data. Starting with a fixed RNA sample, they
construct a library of millions of differentially abundant short sequence tags
or "reads", which constitute a fundamentally discrete measure of the level of
gene expression. A common limitation in experiments using these technologies is
the low number or even absence of biological replicates, which complicates the
statistical analysis of digital gene expression data. Analysis of this type of
data has often been based on modified tests originally devised for analysing
microarrays; both these and even de novo methods for the analysis of RNA-seq
data are plagued by the common problem of low replication. We propose a novel,
non-parametric Bayesian approach for the analysis of digital gene expression
data. We begin with a hierarchical model for modelling over-dispersed count
data and a blocked Gibbs sampling algorithm for inferring the posterior
distribution of model parameters conditional on these counts. The algorithm
compensates for the problem of low numbers of biological replicates by
clustering together genes with tag counts that are likely sampled from a common
distribution and using this augmented sample for estimating the parameters of
this distribution. The number of clusters is not decided a priori, but it is
inferred along with the remaining model parameters. We demonstrate the ability
of this approach to model biological data with high fidelity by applying the
algorithm on a public dataset obtained from cancerous and non-cancerous neural
tissues.
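The key nonparametric ingredient, inferring the number of clusters rather than fixing it a priori, can be illustrated by drawing a partition from the Chinese restaurant process prior. This toy sketch is not the paper's blocked Gibbs sampler over count data; the function name, concentration parameter, and seed are illustrative:

```python
import random

def crp_partition(n_items, alpha, rng):
    """Draw a random partition from the Chinese restaurant process:
    each item joins an existing cluster with probability proportional
    to that cluster's size, or opens a new cluster with probability
    proportional to the concentration parameter alpha."""
    clusters = []
    for i in range(n_items):
        weights = [len(c) for c in clusters] + [alpha]
        r = rng.random() * sum(weights)
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if r < acc:
                if j == len(clusters):
                    clusters.append([i])   # open a new cluster
                else:
                    clusters[j].append(i)  # join an existing one
                break
    return clusters

part = crp_partition(50, alpha=2.0, rng=random.Random(1))
print(len(part), sorted(len(c) for c in part))
```

In the full model, genes assigned to the same cluster share the parameters of an over-dispersed count distribution, so the effective sample size for estimating those parameters grows even when biological replicates are scarce.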
Effect of Organizational Factors on Retention of Generation Y Employees in Parastatals: A Case of Kenya Revenue Authority
The purpose of this study was to establish the effect of organizational factors on retention of Generation Y employees in the Kenya Revenue Authority. The specific objectives were to examine the effects of career development, remuneration, employee recognition and management styles on retention of Generation Y employees in the Kenya Revenue Authority. The target population comprised 461 employees across top-level, middle-level and lower-level management. The research used a descriptive survey design and a stratified sampling technique, with a sample size of 285 respondents. The study used primary data collected through pre-determined questionnaires. Quantitative data were analyzed and presented using descriptive statistics, graphs and pie charts, while qualitative data were analyzed through content analysis. The analysis was done using the Statistical Package for Social Sciences (SPSS) version 21. Keywords: career development, remuneration, employee recognition, management styles, retention of Generation Y employees
On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments
Background: In gene expression studies, RNA sample pooling is sometimes considered because of budget constraints or lack of sufficient input material. Using microarray technology, RNA sample pooling strategies have been reported to optimize both the cost of data generation as well as the statistical power for differential gene expression (DGE) analysis. For RNA sequencing, with its different quantitative output in terms of counts and tunable dynamic range, the adequacy and empirical validation of RNA sample pooling strategies have not yet been evaluated. In this study, we comprehensively assessed the utility of pooling strategies in RNA-seq experiments using empirical and simulated RNA-seq datasets.
Results: The data generating model in pooled experiments is defined mathematically to evaluate the mean and variability of gene expression estimates. The model is further used to examine the trade-off between the statistical power of testing for DGE and the data generating costs. Empirical assessment of pooling strategies is done through analysis of RNA-seq datasets under various pooling and non-pooling experimental settings. Simulation study is also used to rank experimental scenarios with respect to the rate of false and true discoveries in DGE analysis. The results demonstrate that pooling strategies in RNA-seq studies can be both cost-effective and powerful when the number of pools, pool size and sequencing depth are optimally defined.
Conclusion: For high within-group gene expression variability, small RNA sample pools are effective to reduce the variability and compensate for the loss of the number of replicates. Unlike the typical cost-saving strategies, such as reducing sequencing depth or number of RNA samples (replicates), an adequate pooling strategy is effective in maintaining the power of testing DGE for genes with low to medium abundance levels, along with a substantial reduction of the total cost of the experiment. In general, pooling RNA samples or pooling RNA samples in conjunction with moderate reduction of the sequencing depth can be good options to optimize the cost and maintain the power.
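The core variance argument can be shown with a small simulation: averaging m biological samples into one pool shrinks between-pool variability by roughly a factor of sqrt(m). The sketch below uses a Gaussian stand-in for expression values rather than the paper's RNA-seq count model, and all parameters are hypothetical:

```python
import random
import statistics

def simulate_expression(n, mu, sigma, rng):
    """Biological expression levels for n subjects (Gaussian stand-in)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

def pool(samples, pool_size):
    """Average consecutive samples into equal-size pools."""
    return [statistics.mean(samples[i:i + pool_size])
            for i in range(0, len(samples), pool_size)]

rng = random.Random(7)
subjects = simulate_expression(120, mu=8.0, sigma=2.0, rng=rng)
pooled = pool(subjects, pool_size=4)   # 30 pools of 4 subjects each
print(statistics.stdev(subjects), statistics.stdev(pooled))
```

The pooled standard deviation comes out near sigma/sqrt(4), which is why small pools can compensate for a reduced number of sequenced replicates when within-group variability is high.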
Fully Synthetic Data for Complex Surveys
When seeking to release public use files for confidential data, statistical
agencies can generate fully synthetic data. We propose an approach for making
fully synthetic data from surveys collected with complex sampling designs.
Specifically, we generate pseudo-populations by applying the weighted finite
population Bayesian bootstrap to account for survey weights, take simple random
samples from those pseudo-populations, estimate synthesis models using these
simple random samples, and release simulated data drawn from the models as the
public use files. We use the framework of multiple imputation to enable
variance estimation using two data generation strategies. In the first, we
generate multiple data sets from each simple random sample, whereas in the
second, we generate a single synthetic data set from each simple random sample.
We present multiple imputation combining rules for each setting. We illustrate
each approach and the repeated sampling properties of the combining rules using
simulation studies.
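The pipeline described above (pseudo-population, simple random sample, synthesis model, synthetic release) might be caricatured as follows. This is a deliberately simplified stand-in: plain weighted resampling takes the place of the weighted finite population Bayesian bootstrap, a single normal model takes the place of the synthesis models, and the survey values and weights are hypothetical:

```python
import random
import statistics

def pseudo_population(values, weights, size, rng):
    """Approximate a pseudo-population by resampling units with
    probability proportional to their survey weights (a crude
    stand-in for the weighted finite population Bayesian bootstrap)."""
    return rng.choices(values, weights=weights, k=size)

def synthesize(sample, n_synth, rng):
    """Fit a simple normal synthesis model to a simple random sample
    and release draws from it as the synthetic data."""
    mu, sd = statistics.mean(sample), statistics.stdev(sample)
    return [rng.gauss(mu, sd) for _ in range(n_synth)]

rng = random.Random(3)
values = [10, 12, 15, 20, 30]      # observed survey responses
weights = [100, 80, 60, 40, 20]    # survey design weights
pop = pseudo_population(values, weights, size=300, rng=rng)
srs = rng.sample(pop, 50)          # simple random sample from it
synthetic = synthesize(srs, n_synth=50, rng=rng)
print(round(statistics.mean(synthetic), 2))
```

Because the pseudo-population absorbs the design weights, the downstream synthesis model can ignore the complex design; repeating the whole process yields the multiple synthetic data sets needed for the combining rules.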
The effect of faking on the correlation between two ordinal variables: some population and Monte Carlo results
Correlational measures are probably the most widespread statistical tools in psychological research. They are used by researchers to investigate, for example, relations between self-report measures usually collected using paper-and-pencil or online questionnaires. Like many other statistical analyses, correlational measures can be seriously affected by specific sources of bias which constitute serious threats to the final observed results. In this contribution, we focus on the impact of the fake-data threat on the interpretation of statistical results for two well-known correlational measures (the Pearson product-moment correlation and the Spearman rank-order correlation). By using the Sample Generation by Replacement (SGR) approach, we analyze uncertainty in inferences based on possible fake data and evaluate the implications of fake data for correlational results. A population-level analysis and a Monte Carlo simulation are performed to study different modulations of faking on bivariate discrete variables with finite supports and varying sample sizes. We show that under specific faking conditions it is always possible, using our paradigm, to increase (resp. decrease) the original correlation between two discrete variables in a predictable and systematic manner.
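A minimal Monte Carlo sketch of the faking threat (a simplified stand-in for the SGR approach, with hypothetical data and a crude "fake good" replacement rule) shows how replacing a fraction of one variable's responses distorts the observed correlation:

```python
import random

def pearson(x, y):
    """Pearson product-moment correlation, stdlib-only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fake_up(values, prob, max_level, rng):
    """Replacement-style faking: each response is replaced by the top
    category with probability `prob` (faking 'good')."""
    return [max_level if rng.random() < prob else v for v in values]

rng = random.Random(11)
# two positively related 5-point ordinal variables
x = [rng.randint(1, 5) for _ in range(500)]
y = [min(5, max(1, xi + rng.choice([-1, 0, 0, 1]))) for xi in x]
r_honest = pearson(x, y)
r_faked = pearson(x, fake_up(y, prob=0.4, max_level=5, rng=rng))
print(round(r_honest, 3), round(r_faked, 3))
```

Here uniform upward faking attenuates the observed correlation; the SGR framework studies such replacement processes systematically, including conditions under which faking inflates rather than attenuates the correlation.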
The Power of Shopee Live Streaming on Z Generation Purchasing Decisions
The focus of this research is to determine the power of live streaming carried out by shops selling on Shopee on product purchasing decisions by Generation Z. Quantitative methods using survey data collection techniques are used to explain this phenomenon. The instrument consisted of 15 question items: nine items on live-streaming variables and six items on purchasing decisions, packaged in a Google Form questionnaire and distributed via social media. The sample consisted of 185 respondents aged 13-26 years (Generation Z). The data were then analyzed using descriptive and inferential statistics to determine the relationship between variables, with statistical analysis performed in SPSS version 25. The results show that Shopee live streaming contributes 62% to Generation Z purchasing decisions. This indicates that live-streaming activities can attract consumers to buy, extend a seller's products to a wider market, educate consumers directly about the products offered, and increase immediate feedback.
Takeuchi's Information Criteria as a form of Regularization
Takeuchi's Information Criteria (TIC) is a linearization of maximum
likelihood estimator bias which shrinks the model parameters towards the
maximum entropy distribution, even when the model is mis-specified. In
statistical machine learning, L2 regularization (a.k.a. ridge regression)
also introduces a parameterized bias term with the goal of minimizing
out-of-sample entropy, but generally requires a numerical solver to find the
regularization parameter. This paper presents a novel regularization approach
based on TIC; the approach does not assume a data generation process and
results in a higher entropy distribution through more efficient sample noise
suppression. The resulting objective function can be directly minimized to
estimate and select the best model, without the need to select a regularization
parameter, as in ridge regression. Numerical results applied to a synthetic
high dimensional dataset generated from a logistic regression model demonstrate
superior model performance when using the TIC-based regularization over an L1
and an L2 penalty term.
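To make the criterion concrete: for a one-parameter model, TIC reduces to a scalar penalty, TIC = -2 log L + 2 J/H, where J is the summed squared score and H the observed information. The sketch below applies this to a Poisson fit, which is an illustrative choice of model, not the paper's logistic-regression setting:

```python
import math

def tic_poisson(data):
    """Takeuchi's Information Criterion for a Poisson model fit by MLE.
    TIC = -2*logL + 2*tr(J H^{-1}); with one parameter the trace is a
    scalar ratio J/H. Under a well-specified model it approaches 2*p."""
    n = len(data)
    lam = sum(data) / n                      # Poisson MLE of the rate
    loglik = sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)
    scores = [x / lam - 1.0 for x in data]   # per-observation score
    J = sum(s * s for s in scores)           # outer-product estimate
    H = sum(x / lam ** 2 for x in data)      # observed information
    return -2.0 * loglik + 2.0 * J / H

data = [2, 3, 1, 4, 2, 3, 5, 2, 1, 3]
print(round(tic_poisson(data), 3))
```

Unlike ridge regression, nothing here requires a tuning parameter: the penalty J/H is computed directly from the data, which is the property the paper's regularization approach exploits.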