Improving quality of high-throughput sequencing reads
Rapid advances in high-throughput sequencing (HTS) technologies have led to an exponential increase in the amount of sequencing data. HTS reads, however, contain far more errors than data collected through traditional sequencing methods. Errors in HTS reads degrade the quality of downstream analyses, and correcting them has been shown to improve the quality of these analyses.
Correcting errors in sequencing data is a time-consuming and memory-intensive process. Although many methods for correcting errors in HTS data have been developed, none has achieved high accuracy while using a small amount of memory and running in a short time. A further problem is that no standard, comprehensive method is yet available for evaluating the accuracy and effectiveness of these error correction methods.
To alleviate these limitations and analyze error correction outputs, this dissertation presents three novel methods. The first, BLESS (Bloom-filter-based error correction solution for high-throughput sequencing reads), is a new error correction method that uses a Bloom filter as its main data structure. Compared to previous methods, it corrects errors with the highest accuracy while reducing memory usage by an average of 40X. BLESS is parallelized using hybrid OpenMP and MPI programming, which makes it one of the fastest error correction tools. The second method, SPECTACLE (Software Package for Error Correction Tool Assessment on Nucleic Acid Sequences), supplies a standard way to evaluate error correction methods. SPECTACLE is a comprehensive method that can (1) quantitatively analyze corrected DNA and RNA reads from any sequencing platform and (2) handle diploid genomes, differentiating heterozygous alleles from sequencing errors.
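BLESS's choice of a Bloom filter hinges on that structure's ability to test k-mer membership in a small, fixed memory footprint, at the cost of a tunable false-positive rate. The sketch below is a minimal, generic Bloom filter in Python, not BLESS's actual implementation; the filter size, hash scheme, and toy k-mer length are placeholder assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: set membership with no false negatives,
    a tunable false-positive rate, and constant memory."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from one cryptographic digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Store the "solid" (frequent, likely error-free) k-mers of a read set;
# a k-mer absent from the filter flags a candidate sequencing error.
bf = BloomFilter()
for kmer in ("ACGTA", "CGTAC", "GTACG"):  # toy 5-mers; real tools use k ~ 21-31
    bf.add(kmer)
print("ACGTA" in bf)  # True
print("TTTTT" in bf)  # False (with high probability)
```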
Lastly, this research analyzes the effect of sequencing errors on variant calling, one of the most important clinical applications of HTS data. For this, environments for tracing the effect of sequencing errors on germline and somatic variant calling were developed. Using these environments, this research studies how sequencing errors degrade variant calling results and how those results can be improved. Based on the new findings, ROOFTOP (RemOve nOrmal reads From TumOr samPles), a method that improves the accuracy of somatic variant calling by removing reads from normal cells in tumor samples, was developed.
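The abstract does not spell out ROOFTOP's algorithm, but the underlying idea of discarding reads that appear to come from contaminating normal cells can be illustrated with a toy filter. In this hypothetical sketch, reads overlapping a candidate somatic site are dropped if they show the normal allele; the function name and the single-site decision rule are illustrative assumptions, not ROOFTOP's published logic.

```python
def filter_normal_like_reads(reads, site, alt_base):
    """Toy read filter at one candidate somatic site.

    reads:    list of (start, sequence) tuples, 0-based start coordinates.
    site:     0-based genomic position of the candidate somatic variant.
    alt_base: the alternate (tumor) allele expected at that site.

    Reads covering the site but showing the normal allele are treated as
    normal-cell contamination and dropped. (Illustrative rule only.)
    """
    kept = []
    for start, seq in reads:
        offset = site - start
        if 0 <= offset < len(seq) and seq[offset] != alt_base:
            continue  # normal-like read: discard
        kept.append((start, seq))
    return kept

reads = [(100, "ACGT"), (101, "CGCT"), (102, "GTAC")]
print(filter_normal_like_reads(reads, 103, "T"))
# Reads showing 'T' at position 103 survive; the read showing 'C' is removed.
```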
The series of studies on sequencing errors in this dissertation should help readers understand how sequencing errors degrade downstream analysis outputs and how the quality of sequencing data can be improved by removing errors from the data.
Big science and big data in nephrology
There have been tremendous advances during the last decade in methods for large-scale, high-throughput data generation and in novel computational approaches to analyze these datasets. These advances have had a profound impact on biomedical research and clinical medicine. The field of genomics is rapidly developing toward single-cell analysis, and major advances in proteomics and metabolomics have been made in recent years. The developments in wearables and electronic health records are poised to change clinical trial design. This rise of ‘big data’ holds the promise to transform not only research progress, but also clinical decision making towards precision medicine. To have a true impact, it requires integrative and multi-disciplinary approaches that blend experimental, clinical, and computational expertise across multiple institutions. Cancer research has been at the forefront of the progress in such large-scale initiatives, so-called ‘big science,’ with an emphasis on precision medicine, and various other areas are quickly catching up. Nephrology is arguably lagging behind, and hence these are exciting times to start (or redirect) a research career to leverage these developments in nephrology. In this review, we summarize advances in big data generation, computational analysis, and big science initiatives, with a special focus on applications to nephrology.
Quality assurance within non-professional translation teams: action research in the non-profit sector
Volunteer and collaborative translation in diverse forms, such as the Wikipedia initiative, is growing daily and needs direction in order to guarantee quality.
And yet, non-professional translation is a field that remains largely unexplored. In the realm of professional translation, there are strict criteria related to translator training and experience, quality assurance, deadlines, and copyrights. However, in a context that involves online collaboration and hundreds of translators (often volunteers), these aspects are much less defined. This research project addresses the crucial issue of quality assurance within non-professional translation teams. This is done through the lens of action research carried out in the non-profit sector, specifically with a group called The King’s Translators, which I formed in November 2011 to meet the need for French resources within our church denomination. It was apparent to me that we could not simply apply professional translation norms and methods within our team in order to ensure quality. The nature of non-professional translation requires a customized approach. I decided to conduct research from within The King’s Translators to develop a quality assurance system designed specifically for non-professional translation. By adapting professional models to the non-professional environment, I was able to create processes for appropriate translator selection, comprehensive translation revision/editing, and focused translator training. The criteria for translator selection include skills and character traits that enable a team member to succeed within the quality assurance system. Specific translation revision/editing processes are matched to desired quality levels based on the purpose of the translated documents. The translator training component concentrates on paradigm shifts encapsulated in a set of best practices for non-professional translators. These three elements of translator selection, translation revision/editing, and translator training harmonize in an effective quality assurance system. This system can be implemented by other non-professional translation teams, as it is specifically adapted to the challenges of working with volunteer translators and is not language specific. This project makes an important contribution to Translation Studies, first by highlighting the field of non-professional translation and emphasizing the need for an approach different from that used for professional translation. I demonstrate how quality assurance is possible within a team of non-professional translators and provide an effective system for achieving it. On a broader level, my research aims to make Translation Studies scholars more aware that while new translation practices running counter to traditional mindsets will inevitably emerge, this should not prevent us from investigating and learning from them. In addition, researchers could make a greater effort to ensure that Translation Studies concepts, norms, and metalanguage are understandable and applicable in non-traditional contexts.
Breast Cancer Biomarkers with Clinical Relevance Identified by Massively-parallel DNA and RNA Sequencing
Women have a 10% lifetime risk of developing breast cancer, and the disease has surpassed lung cancer as the most frequently diagnosed type of cancer in the world. Breast cancer originates in the epithelial cells of the mammary gland, and tumor cells have undergone a series of genetic and phenotypic changes that confer tumor-promoting properties. Genomic rearrangement is a common phenomenon in cancer, involving breakage and dysfunctional repair of chromosomes. With the aim of characterizing such variants and their progression from primary to metastatic disease, we performed whole-genome sequencing of paired primary tumors and metastases (study I) and paired contralateral breast cancers (CBC) (study II). Metastasis rearrangement profiles bore a remarkable resemblance to the respective primary tumors (median 89% shared), indicating that the rearrangements were early events in tumor development, remaining stable throughout progression. Our study on CBC (study II) subsequently allowed us to identify 1 in 10 tumor pairs that likely represented metastatic spread rather than a new primary tumor (76% of rearrangements shared). One of the risk factors for breast cancer is high exposure to estrogens; signaling via estrogen receptor (ER) α is considered the most important driver for the 75% of tumors expressing this marker. Mutations in the gene for ERα are known to be common in endocrine therapy-refractory breast cancer and confer resistance to standard anti-hormonal treatment. In study III, we interrogated RNA-seq data from 3217 primary breast tumors from the SCAN-B initiative and found that 1% of tumors were positive for one of the mutations at surgery. For those patients who received adjuvant endocrine therapy, the mutations were associated with worse overall and relapse-free survival. In study IV, we further explored the SCAN-B dataset to investigate the phenotypic properties and prognosis associated with high expression of the much less well studied ERβ. We discovered that this receptor was not abundantly expressed, with one third of tumors entirely negative. Further, we saw that patients with high levels of ERβ mRNA had slightly improved overall survival and that the expression of ERβ was associated with expression of genes involved in immune cell activation. In summary, we have employed sequencing technology to study breast cancer patient material, identifying and assessing the validity of genomic and transcriptomic changes that may be of value both as potential biomarkers and in elucidating biological mechanisms that drive or suppress breast cancer progression.
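The shared-rearrangement percentages reported in studies I and II amount to a set-overlap computation between the rearrangement calls of two tumors. A minimal sketch in Python, assuming rearrangements have been reduced to comparable breakpoint tuples (the representation and exact-match rule here are simplifying assumptions; real pipelines tolerate small breakpoint offsets):

```python
def shared_fraction(primary, secondary):
    """Fraction of the secondary tumor's rearrangements already present
    in the primary (both given as sets of hashable breakpoint records)."""
    if not secondary:
        return 0.0
    return len(primary & secondary) / len(secondary)

# Rearrangements represented as (chromA, posA, chromB, posB) breakpoint pairs.
primary = {("chr1", 15230, "chr8", 99871), ("chr2", 55001, "chr2", 81200)}
metastasis = {("chr1", 15230, "chr8", 99871), ("chr5", 12000, "chr5", 47000)}
print(f"{shared_fraction(primary, metastasis):.0%}")  # 50%
```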
Characterization of the genetic architecture of dilated cardiomyopathy using families and cohorts
Cardiomyopathies are the leading cause of heart transplantation in the developed world, and dilated cardiomyopathy accounts for an important proportion of all heart failure cases in large clinical trials. Although a strong genetic basis for dilated cardiomyopathy has been widely demonstrated over the past two decades, 60% of familial cases remain unexplained. Dilated cardiomyopathy is characterized by marked genetic heterogeneity, with more than 60 individual genes reported to cause the disease, yet only one (TTN) explaining more than 10% of cases.
Here, high-throughput sequencing data, advanced imaging techniques and bioinformatics analyses were used to dissect the genetic architecture of dilated cardiomyopathy, by measuring the contribution of single genes and multi-genic variation to disease risk and severity, and by performing gene and variant discovery in affected families. Burden testing (using bespoke software developed in the R programming language for this study) and regression modelling were used to examine the genetic determinants of disease by comparing a cohort of disease cases (n=332) to ethnically matched, phenotypically characterised healthy controls (n=319). This produced a measure of the contribution of each gene to dilated cardiomyopathy, taking into account the background variation rate in the general population. Analyses of multi-genic interactions were also performed; having detected the signature of additive effects of variation in multiple genes on both disease likelihood and severity, further analyses were performed to identify specific gene-gene interactions in causing dilated cardiomyopathy.
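The burden testing described here boils down to comparing, gene by gene, the rate of rare-variant carriers in cases against controls. A minimal sketch (the thesis software was written in R; this is an illustrative re-expression in Python, and the carrier counts below are invented for a hypothetical gene):

```python
from scipy.stats import fisher_exact

def gene_burden_test(case_carriers, n_cases, control_carriers, n_controls):
    """Per-gene burden test: 2x2 Fisher's exact test comparing the number
    of rare-variant carriers in cases vs. ethnically matched controls."""
    table = [
        [case_carriers, n_cases - case_carriers],
        [control_carriers, n_controls - control_carriers],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Invented counts for illustration: 40/332 cases vs. 8/319 controls
# carry a rare protein-altering variant in the gene under test.
oratio, p = gene_burden_test(40, 332, 8, 319)
print(f"OR={oratio:.1f}, p={p:.2e}")
```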
Subsequently, variant prioritisation strategies were developed to identify, from whole-exome sequencing data, possible genetic causes of an unexplained and very severe form of early-onset dilated cardiomyopathy segregating in a family. This led to the identification of new candidate genes, which might contribute towards a genetic diagnosis in the analysed family and to new insights into the pathogenesis of dilated cardiomyopathy. Preparatory work in developing variant prioritisation pipelines from whole-exome sequencing data had been performed earlier, on families affected with various inherited arrhythmia syndromes.
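Variant prioritisation pipelines of this kind typically apply successive filters: population allele frequency, predicted protein impact, and segregation with disease in the family. A minimal sketch under those assumptions (the field names, thresholds, and dominant-inheritance rule are illustrative, not the ones used in the thesis):

```python
DAMAGING = {"stop_gained", "frameshift_variant", "missense_variant",
            "splice_acceptor_variant", "splice_donor_variant"}

def prioritise(variants, affected, unaffected, max_af=1e-4):
    """Keep rare, protein-altering variants carried by every affected
    family member and by no unaffected member (dominant model)."""
    kept = []
    for v in variants:
        if v["allele_frequency"] > max_af:
            continue  # too common in the population to explain a rare disease
        if v["consequence"] not in DAMAGING:
            continue  # unlikely to alter the protein
        carriers = v["carriers"]
        if affected <= carriers and not (unaffected & carriers):
            kept.append(v)  # segregates with disease in this pedigree
    return kept

variants = [
    {"gene": "GENE_X", "consequence": "frameshift_variant",
     "allele_frequency": 0.0, "carriers": {"II-1", "II-3"}},
    {"gene": "GENE_Y", "consequence": "synonymous_variant",
     "allele_frequency": 0.02, "carriers": {"II-1"}},
]
print(prioritise(variants, affected={"II-1", "II-3"}, unaffected={"II-2"}))
# Only GENE_X survives: rare, damaging, and carried by all affected members.
```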
Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning
Understanding the spatial organization of tissues is of critical importance for both basic and translational research. While recent advances in tissue imaging are opening an exciting new window into the biology of human tissues, interpreting the data that they create is a significant computational challenge. Cell segmentation, the task of uniquely identifying each cell in an image, remains a substantial barrier for tissue imaging, as existing approaches are inaccurate or require a substantial amount of manual curation to yield useful results. Here, we addressed the problem of cell segmentation in tissue imaging data through large-scale data annotation and deep learning. We constructed TissueNet, an image dataset containing >1 million paired whole-cell and nuclear annotations for tissue images from nine organs and six imaging platforms. We created Mesmer, a deep learning-enabled segmentation algorithm trained on TissueNet that performs nuclear and whole-cell segmentation in tissue imaging data. We demonstrated that Mesmer has better speed and accuracy than previous methods, generalizes to the full diversity of tissue types and imaging platforms in TissueNet, and achieves human-level performance for whole-cell segmentation. Mesmer enabled the automated extraction of key cellular features, such as subcellular localization of protein signal, which was challenging with previous approaches. We further showed that Mesmer could be adapted to harness cell lineage information present in highly multiplexed datasets. We used this enhanced version to quantify cell morphology changes during human gestation. All underlying code and models are released with permissive licenses as a community resource.
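Mesmer is distributed through the DeepCell library. A minimal usage sketch follows, assuming the deepcell package is installed and the input is a two-channel (nuclear + membrane) image; the array shape and image_mpp value are placeholders for a real tissue image:

```python
import numpy as np
from deepcell.applications import Mesmer

# Input: batch of multiplexed images, channels = (nuclear, membrane/cytoplasm).
# Random noise stands in here for a real tissue image.
image = np.random.rand(1, 512, 512, 2).astype(np.float32)

app = Mesmer()  # downloads pretrained weights on first use

# image_mpp: microns per pixel of the input; compartment selects
# "whole-cell" or "nuclear" segmentation output.
labels = app.predict(image, image_mpp=0.5, compartment="whole-cell")
print(labels.shape)  # (1, 512, 512, 1): integer mask, one label per cell
```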
Building essential biodiversity variables (EBVs) of species distribution and abundance at a global scale
Much biodiversity data is collected worldwide, but it remains challenging to assemble the scattered knowledge for assessing biodiversity status and trends. The concept of Essential Biodiversity Variables (EBVs) was introduced to structure biodiversity monitoring globally, and to harmonize and standardize biodiversity data from disparate sources to capture a minimum set of critical variables required to study, report and manage biodiversity change. Here, we assess the challenges of a ‘Big Data’ approach to building global EBV data products across taxa and spatiotemporal scales, focusing on species distribution and abundance. The majority of currently available data on species distributions derives from incidentally reported observations or from surveys where presence-only or presence–absence data are sampled repeatedly with standardized protocols. Most abundance data come from opportunistic population counts or from population time series using standardized protocols (e.g. repeated surveys of the same population from single or multiple sites). Enormous complexity exists in integrating these heterogeneous, multi-source data sets across space, time, taxa and different sampling methods. Integration of such data into global EBV data products requires correcting biases introduced by imperfect detection and varying sampling effort, dealing with different spatial resolution and extents, harmonizing measurement units from different data sources or sampling methods, applying statistical tools and models for spatial inter- or extrapolation, and quantifying sources of uncertainty and errors in data and models. To support the development of EBVs by the Group on Earth Observations Biodiversity Observation Network (GEO BON), we identify 11 key workflow steps that will operationalize the process of building EBV data products within and across research infrastructures worldwide. These workflow steps take multiple sequential activities into account, including identification and aggregation of various raw data sources, data quality control, taxonomic name matching and statistical modelling of integrated data. We illustrate these steps with concrete examples from existing citizen science and professional monitoring projects, including eBird, the Tropical Ecology Assessment and Monitoring network, the Living Planet Index and the Baltic Sea zooplankton monitoring. The identified workflow steps are applicable to both terrestrial and aquatic systems and a broad range of spatial, temporal and taxonomic scales. They depend on clear, findable and accessible metadata, and we provide an overview of current data and metadata standards. Several challenges remain to be solved for building global EBV data products: (i) developing tools and models for combining heterogeneous, multi-source data sets and filling data gaps in geographic, temporal and taxonomic coverage, (ii) integrating emerging methods and technologies for data collection such as citizen science, sensor networks, DNA-based techniques and satellite remote sensing, (iii) solving major technical issues related to data product structure, data storage, execution of workflows and the production process/cycle as well as approaching technical interoperability among research infrastructures, (iv) allowing semantic interoperability by developing and adopting standards and tools for capturing consistent data and metadata, and (v) ensuring legal interoperability by endorsing open data or data that are free from restrictions on use, modification and sharing. 
Addressing these challenges is critical for biodiversity research and for assessing progress towards conservation policy targets and sustainable development goals.
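Several of the workflow steps named above (aggregation of raw data sources, quality control, taxonomic name matching, deduplication) are mechanical enough to sketch. A toy harmonization pass in Python, assuming records arrive as dictionaries and a synonym table is given; all names, fields, and thresholds are illustrative, not part of the GEO BON workflow specification:

```python
# Toy harmonization of multi-source occurrence records: taxonomic name
# matching against a synonym table, basic quality control, deduplication.
SYNONYMS = {  # maps reported names to an accepted name (illustrative)
    "Parus caeruleus": "Cyanistes caeruleus",
    "Cyanistes caeruleus": "Cyanistes caeruleus",
}

def harmonize(records):
    seen, clean = set(), []
    for r in records:
        name = SYNONYMS.get(r["species"])
        if name is None:
            continue  # unmatched name: route to manual taxonomic review
        if not (-90 <= r["lat"] <= 90 and -180 <= r["lon"] <= 180):
            continue  # quality control: impossible coordinates
        key = (name, round(r["lat"], 4), round(r["lon"], 4), r["date"])
        if key in seen:
            continue  # duplicate record reported by another source
        seen.add(key)
        clean.append({**r, "species": name})
    return clean

records = [
    {"species": "Parus caeruleus", "lat": 52.1, "lon": 5.3, "date": "2020-05-01"},
    {"species": "Cyanistes caeruleus", "lat": 52.1, "lon": 5.3, "date": "2020-05-01"},
]
print(harmonize(records))  # the two reports collapse into one accepted record
```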