
    Data Quality Coverage

    The quality of data is typically measured on a subset of all available data. It is of interest to know whether such a measurement, performed on a subset, is representative of the entire corpus of data. This disclosure describes techniques that use historical data and metadata of a given time series to determine the set of useful data quality checks that can exist. The set of useful data quality checks is compared to the actual set of data quality checks to provide the percentage of data quality coverage that a given data set has.
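
    As a minimal sketch of this coverage computation: the Python snippet below assumes each check is identified by the field it covers and a check type, and derives the "useful" set from a simple column-name-to-type mapping rather than from full historical time-series metadata. The QualityCheck type and the useful_checks_from_metadata heuristic are hypothetical illustrations, not part of the disclosure.

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class QualityCheck:
            """A single data quality check, identified by the field it covers and its type."""
            field: str
            check_type: str  # e.g. "null_rate", "value_range", "freshness"

        def useful_checks_from_metadata(schema: dict[str, str]) -> set[QualityCheck]:
            """Derive the set of checks that could usefully exist, based on field types.

            A stand-in for the disclosure's use of historical data and time-series
            metadata; here we only inspect a column-name -> type mapping.
            """
            checks: set[QualityCheck] = set()
            for field, dtype in schema.items():
                checks.add(QualityCheck(field, "null_rate"))
                if dtype in ("int", "float"):
                    checks.add(QualityCheck(field, "value_range"))
                if dtype == "timestamp":
                    checks.add(QualityCheck(field, "freshness"))
            return checks

        def coverage_percentage(useful: set[QualityCheck], actual: set[QualityCheck]) -> float:
            """Share of the useful checks that are actually implemented, as a percentage."""
            if not useful:
                return 100.0
            return 100.0 * len(useful & actual) / len(useful)

        # Example: a table with three columns and only one implemented check.
        schema = {"user_id": "int", "revenue": "float", "event_time": "timestamp"}
        useful = useful_checks_from_metadata(schema)
        actual = {QualityCheck("revenue", "value_range")}
        print(f"data quality coverage: {coverage_percentage(useful, actual):.1f}%")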

    Automated Generation of Data Quality Checks by Identifying Key Dimensions and Values

    Monitoring the quality and integrity of the data stored in a data warehouse is necessary to ensure correct and reliable operation. Such checks can include detecting anomalous and/or impermissible values for the metrics and dimensions present in the database tables. Determining useful and effective checks and bounds is difficult and tedious, and requires a high level of expertise and familiarity with the data. This disclosure describes techniques for automating the creation of data quality checks by examining database schema and contents to identify important dimensions and values. The techniques utilize the observation that, in practice, a subset of the values of a database field is likely of operational importance. These values are automatically identified by calculating importance-adjusted data quality coverage, which assigns importance to metrics, dimensions, and dimension values. Data quality checks are then automatically generated to effectively cover the key dimensions and values. The generation of checks can involve selecting from a repository of historically effective checks generated by experts and/or applying time series anomaly detection to metrics in their entirety or sliced by key dimension values.
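
    As a rough sketch of this pipeline, the snippet below assigns importance to dimension values by their share of a metric's total and generates simple mean +/- n-sigma range checks for each key slice. The share threshold, the key_dimension_values and generate_slice_checks helpers, and the use of pandas are assumptions standing in for the disclosure's more general importance assignment and time series anomaly detection.

        import pandas as pd

        def key_dimension_values(df: pd.DataFrame, dimension: str, metric: str,
                                 share_threshold: float = 0.05) -> dict[str, float]:
            """Assign importance to dimension values by their share of the metric total.

            Values contributing at least `share_threshold` of the metric are treated
            as operationally important (a hypothetical heuristic; the disclosure
            leaves the importance assignment open).
            """
            totals = df.groupby(dimension)[metric].sum()
            shares = totals / totals.sum()
            return {str(v): float(s) for v, s in shares.items() if s >= share_threshold}

        def generate_slice_checks(df: pd.DataFrame, dimension: str, metric: str,
                                  n_sigma: float = 3.0) -> list[dict]:
            """Generate simple range checks for the metric, sliced by key dimension values.

            Uses a mean +/- n_sigma bound per slice as a stand-in for the more general
            time series anomaly detection mentioned in the disclosure.
            """
            checks = []
            for value, importance in key_dimension_values(df, dimension, metric).items():
                slice_ = df[df[dimension].astype(str) == value][metric]
                mu, sigma = slice_.mean(), slice_.std(ddof=0)
                checks.append({
                    "dimension": dimension,
                    "value": value,
                    "metric": metric,
                    "importance": importance,
                    "lower": mu - n_sigma * sigma,
                    "upper": mu + n_sigma * sigma,
                })
            return checks

        # Example: generate checks for revenue sliced by country; the low-volume
        # JP slice falls below the importance threshold and gets no check.
        df = pd.DataFrame({
            "country": ["US", "US", "DE", "DE", "JP"],
            "revenue": [120.0, 130.0, 80.0, 85.0, 2.0],
        })
        for check in generate_slice_checks(df, "country", "revenue"):
            print(check)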