18 research outputs found

    Dynamic quantile causal inference and forecasting

    Standard impulse response functions measure the average effect of a shock on a response variable. However, different parts of the distribution of the response variable may react to the shock differently. The first chapter, “Quantile Structural Vector Autoregression”, introduces a framework to measure the dynamic causal effects of shocks on the entire distribution of response variables, not just on the mean. Various identification schemes are considered: short-run and long-run restrictions, external instruments, and their combinations. The asymptotic distribution of the estimators is established. Simulations show that our method is robust to heavy tails. Empirical applications reveal causal effects that cannot be captured by the standard approach. For example, the effect of an oil price shock on GDP growth is statistically significant only in the left part of the GDP growth distribution, so a spike in the oil price may cause a recession, but there is no evidence that a drop in the oil price may cause an expansion. Another application reveals that real activity shocks reduce stock market volatility. The second chapter, “Quantile Local Projections: Identification, Smooth Estimation, and Inference”, is devoted to an increasingly popular method for capturing heterogeneity of impulse response functions, namely local projections estimated by quantile regression. We study their identification by short-run restrictions, long-run restrictions, and external instruments. To overcome their excessive volatility, we introduce two novel estimators: Smooth Quantile Projections (SQP) and Smooth Quantile Projections with Instruments (SQPI). The SQPI inference is valid under weak instruments. We propose information criteria for optimal smoothing and apply the estimators to shocks in financial conditions and monetary policy. We demonstrate that financial conditions affect the entire distribution of future GDP growth and not just its lower part as previously thought.
The third chapter, “Smooth Quantile Projections in a Data-Rich Environment”, modifies the estimator from the second chapter to construct distribution forecasts in a setting with potentially many variables. To this end we introduce a novel estimator, Smooth Quantile Projections with Lasso. The estimator involves two penalties: one controls the roughness of the forecasts over forecast horizons, while the other selects the most informative set of predictors. We also introduce information criteria to guide the optimal choice of the two penalties and represent the problem as a linear program in standard form. I gratefully acknowledge funding from the Ministerio de Educación, Cultura y Deporte through its grant Formación de Profesorado Universitario (FPU). Doctoral Programme in Economics, Universidad Carlos III de Madrid. Chair: José Olmo Badenas. Secretary: Victor Emilio Troster. Member: Mario Alloza Fruto.
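The estimators in this abstract are all built on quantile regression, which replaces squared error with the check (pinball) loss. The following is a minimal sketch of that loss only, not of the thesis's SQP, SQPI, or Lasso estimators. It shows that minimizing the loss over a constant recovers a sample quantile. The function names and the growth figures are made up for illustration.

```python
def pinball_loss(residual, tau):
    """Check loss: tau * u when u >= 0, (tau - 1) * u when u < 0."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def total_loss(data, q, tau):
    """Sum of check losses when every observation is predicted by the constant q."""
    return sum(pinball_loss(y - q, tau) for y in data)


def best_constant(data, tau):
    """Search the observed values for a minimizer (one always lies at a data point)."""
    return min(sorted(data), key=lambda q: total_loss(data, q, tau))


# Hypothetical growth rates, invented for this example.
growth = [-3.0, -1.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

median = best_constant(growth, 0.5)      # tau = 0.5 targets the median
lower_tail = best_constant(growth, 0.1)  # tau = 0.1 targets the left tail
print(median, lower_tail)
```

Changing `tau` moves the fitted value through the distribution, which is what allows a quantile-based impulse response to differ between the left and right tails.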

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining a large number of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which implements smart statistics for individual genomic positions using compact bit counters.
Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken's. We provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
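The idea behind spaced seeds can be illustrated in a few lines. A spaced seed is a mask of "match" and "don't care" positions, so two sequences that differ only at a "don't care" position still produce the same seed. This is a minimal sketch under stated assumptions: the mask `1101` and the function name are chosen for illustration and are not taken from Seed-Kraken.

```python
def spaced_kmers(read, mask):
    """Return the spaced k-mers of `read` under `mask`.

    `mask` is a string of '1' (keep the base) and '0' (ignore the base).
    Each window has the length of the mask; only '1' positions are kept.
    """
    span = len(mask)
    keep = [i for i, m in enumerate(mask) if m == "1"]
    return [
        "".join(read[start + i] for i in keep)
        for start in range(len(read) - span + 1)
    ]


# Illustrative mask: positions 0, 1 and 3 are compared, position 2 is ignored.
mask = "1101"
print(spaced_kmers("ACGTACGT", mask))  # ['ACT', 'CGA', 'GTC', 'TAG', 'ACT']

# A substitution at the ignored position leaves the seed unchanged.
print(spaced_kmers("ACGT", mask) == spaced_kmers("ACTT", mask))  # True
```

A contiguous k-mer of the same span would miss the second match, which is the intuition for why spaced seeds can tolerate sequencing errors and variants.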

    Development of a minimally invasive molecular biomarker for early detection of lung cancer

    The diagnostic evaluation of ever smokers with pulmonary nodules represents a growing clinical challenge due to the implementation of lung cancer screening. The high false-positive rate of screening frequently results in the use of unnecessary invasive procedures in patients who are ultimately diagnosed with benign disease, clearly highlighting the need for additional diagnostic approaches. We previously derived and validated a bronchial epithelial gene-expression biomarker to detect lung cancer in ever smokers. However, bronchoscopy is not always chosen as a diagnostic modality. Given that bronchial and nasal epithelial gene-expression are similarly altered by cigarette smoke exposure, we sought to determine if cancer-associated gene-expression might also be detectable in the more readily accessible nasal epithelium. Nasal epithelial brushings were prospectively collected from ever smokers undergoing diagnostic evaluation for lung cancer in the AEGIS-1 (n=375) and AEGIS-2 (n=130) clinical trials and gene-expression profiled using microarrays. The computational framework used to discover biomarkers in these data was formalized and implemented in an open-source R-package. We identified 535 genes in the nasal epithelium of AEGIS-1 patients whose expression was associated with lung cancer status. Using matched bronchial gene-expression data from a subset of these patients, we found significantly concordant cancer-associated gene-expression alterations between the two airway sites. A nasal lung cancer classifier derived in the AEGIS-1 cohort that combined clinical factors and nasal gene-expression had significantly higher AUC (0.81) and sensitivity (0.91) than the clinical-factor model alone in independent samples from the AEGIS-2 cohort. These results support the hypothesis that the airway epithelial field of lung cancer-associated injury extends to the nose, and they demonstrate the potential of using nasal gene-expression as a non-invasive biomarker for lung cancer detection.
The framework for deriving this biomarker was generalized and implemented in an open-source R-package. The package provides a computational pipeline to compare biomarker development strategies using microarray data. The results from this pipeline can be used to highlight the optimal model development parameters for a given dataset, leading to more robust and accurate models. This package provides the community with a novel and powerful tool to facilitate biomarker discovery in microarray data.
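The abstract compares classifiers by AUC and sensitivity. For readers unfamiliar with these metrics, here is a minimal sketch of how each is computed from classifier scores. The scores, labels, and threshold are invented for illustration and have no connection to the AEGIS data or to the R-package described above.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def sensitivity(scores, labels, threshold):
    """Fraction of true positives scored at or above the decision threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)


# Hypothetical classifier outputs: 1 = cancer, 0 = benign.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

print(round(auc(scores, labels), 3))               # 0.889
print(round(sensitivity(scores, labels, 0.5), 3))  # 0.667
```

AUC is independent of the decision threshold, while sensitivity depends on it, so the two numbers answer different questions about a biomarker.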

    Bacterial strain-tracking across the human skin landscape in health and disease

    Metagenomics, or genomic sequencing of the community of microbiota (bacteria, fungi, viruses), enables an investigation of the full complement of genetic material, including virulence, antibiotic resistance, and strain-differentiating markers. The granularity to distinguish between closely related strains is important because, within one species, these strains possess distinct functions and relationships to a host. To analyze metagenomic samples, I developed a reference-based approach that utilizes both single nucleotide variants and genetic content to assign species and strain-level designations. After refining this approach with complex simulated communities, I utilized it to analyze the microbial communities present in skin samples from healthy and diseased individuals. First, to investigate strain-level heterogeneity in healthy adults, I focused on the common skin commensals Propionibacterium acnes and Staphylococcus epidermidis, which have well-documented sequence variation. Results indicated that an individual’s strains of P. acnes are shared across multiple sites of his or her body, and that those strains are more similar within than between individuals. For S. epidermidis, in addition to individual site similarities, there were also site-specific strains. Overall, these results emphasize that both individuality and site specificity shape our bodies’ microbial communities. Based on longitudinal data, an individual’s strain signatures remain stable for up to a year despite external, environmental perturbations. I then used metagenomic data to explore microbial temporal dynamics in atopic dermatitis (AD; eczema), an inflammatory skin disease commonly associated with staphylococcal species. Species-level investigation of AD flares demonstrated a microbial dichotomy in which S. aureus predominated on more severely affected patients while S. epidermidis predominated on less severely affected patients.
Strain-level analysis determined that S. aureus-predominant patients were monocolonized with distinct S. aureus strains, while all patients had heterogeneous S. epidermidis strain communities. To assess the host immunologic effects of these species, I topically applied patient-derived strains to mice. AD strains of S. aureus were sufficient to elicit a skin immune response characteristic of AD patients. This suggests a model whereby staphylococcal strains contribute to AD progression through activation of the host immune system. Overall, this strain-level analysis of healthy and diseased communities provides previously unexplored resolution of the human skin microbiome.
2018-03-24