24 research outputs found

    Improved statistical methodology for high-throughput omics data analysis

    Over the last two decades, the advent of high-throughput omics technology has revolutionized biological and biomedical research. A large volume of omics data has been produced with the rapid development of sequencing techniques. Meanwhile, researchers have developed a wide range of computational tools to manage and analyze the omics data. Although these tools have enabled significant discoveries, processing and interpreting omics data efficiently and accurately remains a major challenge. In this thesis, we aim to develop novel statistical methodologies and algorithms for omics data analysis. We apply the methods to both simulated and real data from different types of cancers. Based on the evaluation and comparison with existing tools, we find that our methods achieve higher accuracy and better performance in analyzing different types of omics data. In Study I, we build an analysis pipeline to integrate multiple levels of omics data and identify potential driver genes in neuroblastoma. The pipeline employs gene expression profiles, microarray-based comparative genomic hybridization data, and a functional gene interaction network to detect cancer-related driver genes. We identify a total of 66 patient-specific and four common driver genes. The genes are summarized into a driver-gene score (DGscore) for each patient. We find that patients with a low DGscore have better survival than those with a high DGscore (p-value = 0.006). In Study II, we develop a novel method named XAEM to quantify isoform-level expression using RNA sequencing data. There are two major components in this method. First, we construct a design matrix X as the starting parameter in the quantification model. Second, we utilize an alternating Expectation Maximization algorithm to estimate the design matrix X and isoform expression b iteratively. We compare XAEM with several quantification methods using both simulated and real data. The results show that XAEM achieves higher accuracy in multiple-isoform genes and obtains substantially better rediscovery rates in the differential-expression analysis. In Study III, we extend the algorithm from Study II and develop an approach named MAX to quantify mutant-allele expression at the isoform level. For a given gene and a list of mutations, we first generate the mutant reference by incorporating all possible mutant isoforms from the wild-type isoform. The alternating Expectation Maximization algorithm is then applied to estimate the isoform abundance. We apply MAX to a real dataset of acute myeloid leukemia. Using the mutant-allele expression, we discover a subgroup of NPM1-mutated patients that has a better drug response to a kinase inhibitor. In Study IV, we build a pipeline to detect fusion genes at the DNA level using whole-exome sequencing data. The pipeline is applied to three comprehensive datasets of acute myeloid leukemia and prostate cancer patients. Compared with the detection results from RNA sequencing data, we find that several major fusion events in these two cancer types are validated in some of the patients. However, the overall results indicate that it is challenging to identify chimeric genes using exome sequencing data due to its inherent limitations. Altogether, we have developed several statistical and bioinformatics tools to analyze different types of omics data, which demonstrate higher accuracy and better performance than other competing approaches. The results in this thesis will provide novel insights into omics data analysis and facilitate significant discoveries in cancer research.

    Fusion Gene Detection Using Whole-Exome Sequencing Data in Cancer Patients

    Several fusion genes are directly involved in the initiation and progression of cancers. Numerous bioinformatics tools have been developed to detect fusion events, but they are mainly based on RNA-seq data. Whole-exome sequencing (WES) is a powerful technology that is widely used for disease-related DNA variant detection. In this study, we build a novel analysis pipeline called Fuseq-WES to detect fusion genes at the DNA level based on WES data. The same method also applies to targeted panel sequencing data. We apply the method to real datasets of acute myeloid leukemia (AML) and prostate cancer patients. The results show that two of the main AML fusion genes discovered in RNA-seq data, PML-RARA and CBFB-MYH11, are detected in the WES data in 36% and 63% of the available samples, respectively. For the targeted deep sequencing of prostate cancer patients, detection of the TMPRSS2-ERG fusion, which is the most frequent chimeric alteration in prostate cancer, is 91% concordant with a manually curated procedure based on four other methods. In summary, the overall results indicate that it is challenging to detect fusion genes in WES data with a standard coverage of ∼15–30x, where fusion candidates discovered in the RNA-seq data are often not detected in the WES data and vice versa. A subsampling study of the prostate data suggests that a coverage of at least 75x is necessary to achieve high accuracy.

    Conservation and implications of eukaryote transcriptional regulatory regions across multiple species

    Abstract Background Increasing evidence shows that whole genomes of eukaryotes are almost entirely transcribed into both protein-coding genes and an enormous number of non-protein-coding RNAs (ncRNAs). Therefore, revealing the underlying regulatory mechanisms of transcripts becomes imperative. However, for a complete understanding of transcriptional regulatory mechanisms, we need to identify the regions in which they are found. We will call these transcriptional regulation regions, or TRRs, which can be considered functional regions containing a cluster of regulatory elements that cooperatively recruit transcription factors for binding and then regulating the expression of transcripts. Results We constructed a hierarchical stochastic language (HSL) model for the identification of core TRRs in yeast based on regulatory cooperation among TRR elements. The HSL model trained on yeast achieved comparable accuracy in predicting TRRs in other species, e.g., fruit fly, human, and rice, thus demonstrating the conservation of TRRs across species. The HSL model was also used to identify the TRRs of genes, such as p53 or OsALYL1, as well as microRNAs. In addition, the ENCODE regions were examined by HSL, and TRRs were found to be pervasively located in the genomes. Conclusion Our findings indicate that 1) the HSL model can be used to accurately predict core TRRs of transcripts across species and 2) core TRRs identified by HSL are proper candidates for the further scrutiny of specific regulatory elements and mechanisms. Meanwhile, the regulatory activity taking place in the abundant ncRNAs might account for the ubiquitous presence of TRRs across the genome. In addition, we found that the TRRs of protein-coding genes and ncRNAs are similar in structure, with the latter being more conserved than the former.

    Thermoelectric and mechanical performances of ionic liquid-modulated PEDOT:PSS/SWCNT composites at high temperatures

    Significant progress has been achieved for flexible polymer thermoelectric (TE) composites in the last decade due to their potential application in wearable devices and sensors. In sharp contrast to the exceptional increase in TE studies at room temperature, the mechanical performance of polymer TE composites has received relatively less attention despite the significance of the application of TE composites in high-temperature environments. The TE and mechanical performances of flexible poly(3,4-ethylenedioxythiophene):poly(styrene sulfonate)/single-walled carbon nanotube (PEDOT:PSS/SWCNT) composite films with an ionic liquid (IL) (referred to as “PEDOT:PSS/SWCNT-IL”) at high temperatures are studied in the present work. The fabricated composite films show increasing TE performance with increasing temperature and SWCNT content. The maximum value of the power factor reaches 301.35 μW m⁻¹ K⁻² at 470 K for the PEDOT:PSS/SWCNT-IL composite. Furthermore, the addition of the IL improves the elongation at break of the composites compared to the IL-free composites. This work promotes the advancement of flexible polymer TE composites and widens their potential applications at different temperature ranges.

    AMO- Advanced Modeling and Optimization

    Abstract: We investigate the modeling of commodity prices that exhibit "fat tails" in the empirical marginal distributions. Using electricity price data, we explore the goodness-of-fit of different classes of distributions with an emphasis on capturing the fat tails in the data. Specifically, we fit empirical marginal distributions of time series data to distributions with either quantile functions or probability density functions in closed forms. The theoretical distributions under consideration all have rich tail behaviors that enable us to model the heavy tails in the commodity prices caused by jumps and stochastic volatility. The fact that the theoretical distributions are easy to simulate makes the models appealing since the tasks of parameter estimation and derivative pricing can be directly implemented based on observed market data.

    A Widely Linear MMSE Anti-Collision Method for Multi-Antenna RFID Readers

    Accumulation of potential driver genes with genomic alterations predicts survival of high-risk neuroblastoma patients

    Abstract Background Neuroblastoma is the most common pediatric malignancy with heterogeneous clinical behaviors, ranging from spontaneous regression to aggressive progression. Many studies have identified aberrations related to the pathogenesis and prognosis, broadly classifying neuroblastoma patients into high- and low-risk groups, but predicting tumor progression and clinical management of high-risk patients remains a big challenge. Results We integrate gene-level expression, array-based comparative genomic hybridization and functional gene-interaction network data of 145 neuroblastoma patients to detect potential driver genes. The drivers are summarized into a driver-gene score (DGscore) for each patient, and we then validate its clinical relevance in terms of association with patient survival. Focusing on a subset of 48 clinically defined high-risk patients, we identify 193 recurrent regions of copy number alterations (CNAs), resulting in 274 altered genes whose copy-number gain or loss has a parallel impact on the gene expression. Using a network enrichment analysis, we detect four common driver genes, ERCC6, HECTD2, KIAA1279 and EMX2, and 66 patient-specific driver genes. Patients with a high DGscore, thus carrying more copy-number-altered genes with correspondingly up- or down-regulated expression and functional implications, have worse survival than those with a low DGscore (P = 0.006). Furthermore, Cox proportional-hazards regression analysis shows that, adjusted for age, tumor stage and MYCN amplification, DGscore is the only significant prognostic factor for high-risk neuroblastoma patients (P = 0.008). Conclusions Integration of genomic copy number alteration, expression and functional interaction-network data reveals clinically relevant and prognostic putative driver genes in high-risk neuroblastoma patients. The identified putative drivers are potential drug targets for individualized therapy. Reviewers This article was reviewed by Armand Valsesia, Susmita Datta and Aleksandra Gruca.

    Research on Transmission Line Voltage Measurement Method of D-Dot Sensor Based on Gaussian Integral

    D-dot sensors meet the development trend towards the downsizing, automation and digitalization of voltage sensors and are currently one of the research hotspots for new voltage sensors. The traditional voltage measurement system of D-dot sensors measures electric field values beneath the transmission line and then solves in reverse for the wire potentials according to the computational principles of the electric field inverse problem. However, because it is limited by the method used to solve the electric field inverse problem, the D-dot sensor voltage measurement system is difficult to solve and has poor accuracy. To address these problems, this paper proposes introducing a Gaussian integral into the D-dot sensor voltage measurement system to accurately measure the voltage of transmission lines. Based on studies of D-dot sensors, a transmission line voltage measurement method based on Gaussian integrals is proposed and used for the simulation of the electric field of a 220 kV and a 20 kV transmission line. The simulation results verify the feasibility of using the Gaussian integral to solve for the transmission line voltage. Finally, the performance of the Gaussian integral was verified by an experiment using the transmission line voltage measurement platform. The experimental results demonstrate that the D-dot sensor measurement system based on a Gaussian integral achieves high accuracy, with a relative error lower than 0.5%.

    A fast detection of fusion genes from paired-end RNA-seq data

    Background Fusion genes are known to be drivers of many common cancers, so they are potential markers for diagnosis, prognosis or therapy response. The advent of paired-end RNA sequencing enhances our ability to discover fusion genes. While there are available methods, routine analyses of large numbers of samples are still limited due to high computational demands. Results We develop FuSeq, a fast and accurate method to discover fusion genes. It uses quasi-mapping to quickly map the reads, extracts initial candidates from split reads and fusion equivalence classes of mapped reads, and finally applies multiple filters and statistical tests to get the final candidates. We apply FuSeq to four validated datasets: breast cancer, melanoma and glioma datasets, and one spike-in dataset. The results reveal high sensitivity and specificity in all datasets, and compare well against other methods such as FusionMap, TRUP, TopHat-Fusion, SOAPfuse and JAFFA. In terms of computational time, FuSeq is two-fold faster than FusionMap and orders of magnitude faster than the other methods. Conclusions With the advantage of lower computational demands, FuSeq makes it practical to investigate fusion genes in large numbers of samples. FuSeq is implemented in C++ and R, and available at https://github.com/nghiavtr/FuSeq for non-commercial uses. This work is partially supported by funding from the Swedish Cancer Fonden, the Swedish Science Council (VR) and the Swedish Foundation for Strategic Research (SSF).