Optimization of miRNA-seq data preprocessing.
The past two decades of microRNA (miRNA) research have solidified the role of these small non-coding RNAs as key regulators of many biological processes and as promising biomarkers for disease. The concurrent development of high-throughput profiling technology has further advanced our understanding of the impact of their dysregulation on a global scale. Currently, next-generation sequencing is the platform of choice for the discovery and quantification of miRNAs. Despite this, there is no clear consensus on how the data should be preprocessed before conducting downstream analyses. Although often overlooked, data preprocessing is an essential step in data analysis: unreliable features and noise can affect the conclusions drawn from downstream analyses. Using a spike-in dilution study, we evaluated the effects of several general-purpose aligners (BWA, Bowtie, Bowtie 2 and Novoalign) and normalization methods (counts-per-million, total count scaling, upper quartile scaling, trimmed mean of M-values (TMM), DESeq, linear regression, cyclic loess and quantile) on the final miRNA count data distribution, variance, bias and accuracy of differential expression analysis. We make practical recommendations on the optimal preprocessing methods for the extraction and interpretation of miRNA count data from small RNA-sequencing experiments.
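Two of the simpler normalization methods named in this abstract, counts-per-million and upper quartile scaling, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the input is one sample's raw miRNA counts as a plain list, and the upper quartile is taken over non-zero counts with linear interpolation (one common convention; the paper may use another).

```python
def counts_per_million(counts):
    """Scale one sample's raw counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts to normalize")
    return [c * 1_000_000 / total for c in counts]


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def upper_quartile_scale(counts):
    """Divide counts by the 75th percentile of the sample's non-zero counts.

    Assumption: zeros are excluded before taking the quartile.
    """
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        raise ValueError("sample has no non-zero counts")
    scale = percentile(nonzero, 0.75)
    return [c / scale for c in counts]


if __name__ == "__main__":
    raw = [10, 0, 30, 60]  # hypothetical counts for four miRNAs
    print(counts_per_million(raw))
    print(upper_quartile_scale(raw))
```

Counts-per-million depends on the total library size, so a few highly expressed miRNAs can dominate the scaling factor; upper quartile scaling is less sensitive to that, which is one reason such methods are compared in studies like this one.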
Data Preprocessing
This thesis addresses the problem of data preprocessing. The first part surveys and describes characteristic tests for describing attributes, along with methods for working with data and attributes. The second part describes working with the RapidMiner program: it covers the individual preprocessing functions in the program and explains what each one does. The third part compares the results obtained with and without the use of data preprocessing methods.
Signature extension preprocessing for LANDSAT MSS data
There are no author-identified significant results in this report.
Making Queries Tractable on Big Data with Preprocessing
A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Π-tractable queries, denoted by ΠT0Q, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Π-tractable and are feasible on big data. (3) We also study a set ΠTQ of query classes that can be effectively converted to Π-tractable queries by re-factorizing its data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for ΠTQ. (5) We also show that ΠT0Q ⊂ P unless P = NC, i.e., the set ΠT0Q of all Π-tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, ΠTQ = P, i.e., all PTIME query classes can be made Π-tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data.
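The preprocess-then-query pattern described in this abstract can be illustrated with a loose single-machine analogy: sort the data once off-line (polynomial time), then answer point and range-count queries with binary search in logarithmic time. This sketch is an assumption-laden illustration only; it uses sequential binary search rather than the parallel NC model the paper formalizes, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def preprocess(values):
    """Off-line step: build a sorted index in O(n log n) time."""
    return sorted(values)


def contains(index, key):
    """On-line membership query in O(log n) time on the sorted index."""
    i = bisect_left(index, key)
    return i < len(index) and index[i] == key


def count_in_range(index, low, high):
    """On-line query: how many values v satisfy low <= v <= high."""
    return bisect_right(index, high) - bisect_left(index, low)


if __name__ == "__main__":
    index = preprocess([7, 3, 9, 3, 1])  # pay the preprocessing cost once
    print(contains(index, 7))
    print(count_in_range(index, 3, 7))
```

Without the index, each query would scan all n values; with it, the one-time sorting cost is amortized over every later query, which is the trade-off the abstract sets out to characterize formally.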
Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras
We introduce Kapre, Keras layers for audio and music signal preprocessing.
Music research using deep neural networks requires a heavy and tedious
preprocessing stage, for which audio processing parameters are often ignored in
parameter optimisation. To solve this problem, Kapre implements time-frequency
conversions, normalisation, and data augmentation as Keras layers. We report
simple benchmark results, showing real-time on-GPU preprocessing adds a
reasonable amount of computation.
Comment: ICML 2017 machine learning for music discover
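The kind of time-frequency conversion this abstract refers to can be sketched as a framed, Hann-windowed magnitude spectrum. The code below is a plain-Python illustration of the general technique and is not Kapre's API: the function names, the frame length of 64 and the hop of 32 are all assumptions chosen for readability, and the naive DFT is far slower than the on-GPU layers the paper describes.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def hann_window(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins of a windowed frame."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann_window(n))]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_len=64, hop=32):
    """Time-frequency representation: one magnitude spectrum per frame."""
    return [magnitude_spectrum(f) for f in frame_signal(signal, frame_len, hop)]


if __name__ == "__main__":
    # A sinusoid completing 8 cycles per 64-sample frame.
    tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
    spec = spectrogram(tone)
    print(len(spec), len(spec[0]))
```

Frame length and hop size are exactly the sort of audio processing parameters the abstract says are often left out of parameter optimisation when the conversion happens in a separate off-line stage.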
Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
This paper introduces a high-throughput software tool framework called sam2bam
that enables users to significantly speed up pre-processing for
next-generation sequencing data. The sam2bam is especially efficient on
single-node multi-core large-memory systems. It can reduce the runtime of data
pre-processing in marking duplicate reads on a single node system by 156-186x
compared with de facto standard tools. The sam2bam consists of parallel
software components that can fully utilize multiple processors, available
memory, high storage bandwidth, and hardware compression accelerators if
available.
The sam2bam provides file format conversion between well-known genome file
formats, from SAM to BAM, as a basic feature. Additional features such as
analyzing, filtering, and converting the input data are provided by
plug-in tools, e.g., duplicate marking, which can be attached to sam2bam at
runtime.
We demonstrated that sam2bam could significantly reduce the runtime of NGS
data pre-processing from about two hours to about one minute for a whole-exome
data set on a 16-core single-node system using up to 130 GB of memory. The
sam2bam could reduce the runtime for whole-genome sequencing data from about 20
hours to about nine minutes on the same system using up to 711 GB of memory.
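Duplicate marking, the plug-in example mentioned in this abstract, can be sketched in a heavily simplified form: reads that share a chromosome, start position and strand are grouped, the read with the highest summed base quality is kept, and the rest are flagged. This is not sam2bam's implementation or interface; the record fields (`name`, `chrom`, `pos`, `strand`, `qual_sum`) are hypothetical, and real tools also handle paired ends and soft-clipped alignments.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with the hypothetical keys
    name, chrom, pos, strand and qual_sum.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality;
        # on ties, max() keeps the first one encountered.
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


if __name__ == "__main__":
    reads = [
        {"name": "a", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 900},
        {"name": "b", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 950},
        {"name": "c", "chrom": "chr1", "pos": 100, "strand": "-", "qual_sum": 800},
    ]
    print(mark_duplicates(reads))
```

Because each group can be processed independently once reads are bucketed by key, this step lends itself to the multi-core parallelism the abstract emphasizes.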
Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
Nonnegative matrix factorization (NMF) has become a very popular technique in
machine learning because it automatically extracts meaningful features through
a sparse and part-based representation. However, NMF has the drawback of being
highly ill-posed, that is, there typically exist many different but equivalent
factorizations. In this paper, we introduce a completely new way of obtaining
more well-posed NMF problems whose solutions are sparser. Our technique is
based on the preprocessing of the nonnegative input data matrix, and relies on
the theory of M-matrices and the geometric interpretation of NMF. This approach
provably leads to optimal and sparse solutions under the separability
assumption of Donoho and Stodden (NIPS, 2003), and, for rank-three matrices,
makes the number of exact factorizations finite. We illustrate the
effectiveness of our technique on several image datasets.
Comment: 34 pages, 11 figures
