
    Reduced hyperBF networks : practical optimization, regularization, and applications in bioinformatics.

    A hyper basis function (HyperBF) network is a generalized radial basis function (RBF) network in which the activation function is a radial function of a weighted distance. The local weighting of the distance accounts for the variation in local scaling and discriminative power along each feature. This generalization makes HyperBF networks capable of interpolating decision functions with high accuracy. However, the added complexity makes HyperBF networks susceptible to overfitting. Moreover, training a HyperBF network requires weights, centers and local scaling factors to be optimized simultaneously. For a relatively large dataset with a large network structure, such optimization becomes computationally challenging. In this work, a new regularization method that performs soft local dimension reduction and weight decay is presented. The regularized HyperBF (Reduced HyperBF) network is shown to provide classification accuracy comparable to a support vector machine (SVM) while requiring a significantly smaller network structure. Furthermore, the soft local dimension reduction is shown to be informative for ranking features based on their localized discriminative power. In addition, a practical training approach for constructing HyperBF networks is presented. This approach uses hierarchical clustering to initialize neurons, followed by gradient optimization using a scaled Rprop algorithm with a localized partial backtracking step (iSRprop). Experimental results on a number of datasets show faster and smoother convergence than the regular Rprop algorithm. The proposed Reduced HyperBF network is applied to two problems in bioinformatics. The first is the detection of transcription start sites (TSS) in human DNA. A novel method for improving the accuracy of TSS recognition for recently published methods is proposed. This method incorporates a new metric feature based on oligonucleotide positional frequencies.
The second application is the accurate classification of microarray samples. A new feature selection algorithm based on a Reduced HyperBF network is proposed. The method is applied to two microarray datasets and is shown to select a minimal subset of features with high discriminative information. The algorithm is compared to two widely used methods and is shown to provide competitive results. In both applications, the final Reduced HyperBF network is used for higher-level analysis. Significant neurons can indicate subpopulations, while locally active features provide insight into the characteristics of each subpopulation in particular and of the whole class in general.
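As a rough illustration of the "radial function of a weighted distance" described in the abstract above, the sketch below computes the output of a small HyperBF-style network. It assumes a Gaussian radial function and a diagonal (per-feature) scaling for each neuron; the function name `hyperbf_forward` and its parameters are hypothetical and are not taken from the thesis, which may use a different radial function or weighting.

```python
import numpy as np

def hyperbf_forward(X, centers, scales, weights, bias=0.0):
    """Output of a Gaussian HyperBF-style network (illustrative sketch only).

    X:       (n_samples, n_features) input points
    centers: (n_neurons, n_features) neuron centers
    scales:  (n_neurons, n_features) non-negative per-neuron, per-feature scaling
             (assumed diagonal weighting; a scale of 0 ignores that feature locally)
    weights: (n_neurons,) output weights
    """
    # Difference between every sample and every center: (n_samples, n_neurons, n_features)
    diff = X[:, None, :] - centers[None, :, :]
    # Locally weighted squared distance for each sample/neuron pair
    dist2 = np.sum(scales[None, :, :] * diff ** 2, axis=2)
    # Gaussian radial activation (an assumption, not confirmed by the abstract)
    activations = np.exp(-dist2)
    return activations @ weights + bias
```

Setting a neuron's scale for some feature to zero makes that neuron insensitive to the feature, which is one way to picture the "soft local dimension reduction" the abstract mentions.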

    A Machine Learning Model for Discovery of Protein Isoforms as Biomarkers

    Prostate cancer is the most common cancer in men. One in eight Canadian men will be diagnosed with prostate cancer in their lifetime. The accurate detection of the disease’s subtypes is critical for providing adequate therapy; hence, it is critical for increasing both survival rates and quality of life. Next generation sequencing can be beneficial when studying cancer. This technology generates a large amount of data that can be used to extract information about biomarkers. This thesis proposes a model that discovers protein isoforms for different stages of prostate cancer progression. A tool has been developed that utilizes RNA-Seq data to infer open reading frames (ORFs) corresponding to transcripts. These ORFs are used as features for classification. A quantification measurement, Adaptive Fragments Per Kilobase of transcript per Million mapped reads (AFPKM), is proposed to compute the expression level for ORFs. The new measurement considers the actual length of the ORF and the length of the transcript. Using these ORFs and the new expression measure, several classifiers were built using different machine learning techniques. This enabled the identification of some protein isoforms related to prostate cancer progression. The biomarkers have had a great impact on the discrimination of prostate cancer stages and are worth further investigation.
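For context, the standard FPKM normalization that AFPKM is named after can be written in a few lines. The sketch below shows only the conventional FPKM formula; the abstract does not give the AFPKM definition, so the adaptive variant is deliberately not reproduced here. The function name `fpkm` and its parameter names are illustrative.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Conventional FPKM: fragments per kilobase of feature per million mapped fragments.

    This is the standard measure, not the thesis's AFPKM, whose exact
    formula is not stated in the abstract.
    """
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length_bp and total_mapped_fragments must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (length_bp * total_mapped_fragments)
```

Which length is plugged into `length_bp` (the full transcript or only the ORF) changes the value, which is presumably the issue the adaptive measure addresses.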

    A new regularized least squares support vector regression for gene selection

    BACKGROUND: Selection of influential genes with microarray data often faces the difficulties of a large number of genes and a relatively small group of subjects. In addition to the curse of dimensionality, many gene selection methods weight the contribution from each individual subject equally. This equal-contribution assumption cannot account for the possible dependence among subjects who associate similarly to the disease, and may restrict the selection of influential genes. RESULTS: A novel approach to gene selection is proposed based on kernel similarities and kernel weights. We do not assume uniformity for subject contribution. Weights are calculated via regularized least squares support vector regression (RLS-SVR) of class levels on kernel similarities and are used to weight subject contribution. The cumulative sum of weighted expression levels is then ranked to select responsible genes. These procedures also work for multiclass classification. We demonstrate this algorithm on acute leukemia, colon cancer, small, round blue cell tumors of childhood, breast cancer, and lung cancer studies, using kernel Fisher discriminant analysis and support vector machines as classifiers. Other procedures are compared as well. CONCLUSION: This approach is easy to implement and fast in computation for both binary and multiclass problems. The gene set provided by the RLS-SVR weight-based approach contains fewer genes and achieves a higher accuracy than other procedures.
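The weighting-then-ranking idea in this abstract can be sketched roughly as follows. This is a simplified reading, not the paper's algorithm: it assumes an RBF kernel between subjects, a regularized least squares fit without a bias term, and a gene score equal to the absolute weighted sum of expression levels. The function name `rank_genes` and the parameters `gamma` and `lam` are hypothetical.

```python
import numpy as np

def rank_genes(X, y, gamma=0.1, lam=1.0):
    """Rank genes by subject-weighted expression (illustrative sketch).

    X: (n_subjects, n_genes) expression matrix
    y: (n_subjects,) numeric class labels, e.g. +1 / -1
    Returns (gene indices sorted from highest to lowest score, subject weights).
    """
    # RBF kernel similarities between subjects (assumed kernel choice)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)
    # Regularized least squares regression of labels on kernel similarities:
    # solve (K + lam * I) alpha = y for the subject weights alpha
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    # Score each gene by the absolute weighted sum of its expression levels
    scores = np.abs(alpha @ X)
    return np.argsort(-scores), alpha
```

The point of the sketch is that subjects no longer contribute equally: each subject's expression profile enters the gene score multiplied by its own weight.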

    Discovering semantic features in the literature: a foundation for building functional associations

    BACKGROUND: Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research. RESULTS: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, used in gene/protein classification, or even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes. CONCLUSION: The presented method can create literature profiles from documents relevant to sets of genes.
The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
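A minimal sketch of non-negative matrix factorization may help make the "additive linear combinations of semantic features" concrete. It uses the classic multiplicative update rules for the squared-error objective; the abstract does not say which NMF variant or objective the authors used, so this choice is an assumption, and the name `nmf` and its parameters are illustrative.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as W (n x rank) @ H (rank x m).

    Uses multiplicative updates for the Frobenius-norm objective
    (an assumed variant; the paper's exact algorithm is not stated here).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Updates keep W and H non-negative because every factor is non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

If the rows of `V` were genes and the columns were term weights from their associated documents, each row of `H` would play the role of a semantic feature and each row of `W` would give a gene's non-negative loading on those features.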

    INTEGRATIVE ANALYSIS OF OMICS DATA IN ADULT GLIOMA AND OTHER TCGA CANCERS TO GUIDE PRECISION MEDICINE

    Transcriptomic profiling and gene expression signatures have been widely applied as effective approaches for enhancing the molecular classification, diagnosis, prognosis or prediction of therapeutic response towards personalized therapy for cancer patients. Thanks to modern genome-wide profiling technology, scientists are able to build engines leveraging massive genomic variations and integrating with clinical data to identify “at risk” individuals for the sake of prevention, diagnosis and therapeutic interventions. In my graduate work for my Ph.D. thesis, I have investigated genomic sequencing data mining to comprehensively characterise molecular classifications and aberrant genomic events associated with clinical prognosis and treatment response, through applying high-dimensional omics genomic data to promote the understanding of gene signatures and somatic molecular alterations contributing to cancer progression and clinical outcomes. Following this motivation, my dissertation has been focused on the following three topics in translational genomics. 1) Characterization of transcriptomic plasticity and its association with the tumor microenvironment in glioblastoma (GBM). I have integrated transcriptomic, genomic, protein and clinical data to increase the accuracy of GBM classification, and identified the association between the GBM mesenchymal subtype and reduced tumor purity, accompanied by an increased presence of tumor-associated microglia.
Then I have identified the tumor bulk, but not the corresponding neurosphere cells, as the sole source of microglia through both transcriptional and protein level analysis using a panel of sphere-forming glioma cultures and their parent GBM samples. Furthermore, I have demonstrated my hypothesis through longitudinal analysis of paired primary and recurrent GBM samples that the phenotypic alterations of GBM subtypes are not due to intrinsic proneural-to-mesenchymal transition in tumor cells; rather, they are intertwined with an increased level of microglia upon disease recurrence. Collectively, I have elucidated the critical role of the tumor microenvironment (microglia and macrophages from the central nervous system) in contributing to the intra-tumor heterogeneity and accurate classification of GBM patients based on transcriptomic profiling, which will not only have a significant impact from a clinical perspective but also pave the way for preclinical cancer research. 2) Identification of prognostic gene signatures that stratify adult diffuse glioma patients harboring 1p/19q co-deletions. I have compared multiple statistical methods and derived a gene signature significantly associated with survival by applying a machine learning algorithm. Then I have identified inflammatory response and acetylation activity that are associated with malignant progression of 1p/19q co-deleted glioma. In addition, I showed that this signature translates to other types of adult diffuse glioma, suggesting its universality in the pathobiology of other glioma subsets. My efforts on integrative data analysis of this highly curated data set using optimized statistical models will reflect the pending update to the WHO classification system of tumors in the central nervous system (CNS). 3) Comprehensive characterization of somatic fusion transcripts in Pan-Cancers. I have identified a panel of novel fusion transcripts across all of TCGA cancer types through transcriptomic profiling.
Then I have predicted fusion proteins with kinase activity and hub function of pathway network based on the annotation of genetically mobile domains and functional domain architectures. I have evaluated a panel of in-frame gene fusions as potential driver mutations based on the network fusion centrality hypothesis. I have also characterised the emerging complexity of genetic architecture in fusion transcripts through integrating genomic structure and somatic variants and delineating the distinct genomic patterns of fusion events across different cancer types. Overall, my exploration of the pathogenetic impact and clinical relevance of candidate gene fusions has provided fundamental insights into the management of a subset of cancer patients by predicting the oncogenic signalling and specific drug targets encoded by these fusion genes. Taken together, the translational genomic research I have conducted during my Ph.D. study will shed new light on precision medicine and contribute to the cancer research community. The novel classification concept, gene signature and fusion transcripts I have identified will address several hotly debated issues in translational genomics, such as complex interactions between tumor bulks and their adjacent microenvironments, prognostic markers for clinical diagnostics and personalized therapy, and distinct patterns of genomic structure alterations and oncogenic events in different cancer types, thereby facilitating our understanding of genomic alterations and moving us towards the development of precision medicine.

    Linear discriminant analysis for the small sample size problem: an overview

    Dimensionality reduction is an important aspect of the pattern classification literature, and linear discriminant analysis (LDA) is one of the most widely studied dimensionality reduction techniques. Variants of the LDA technique for solving the small sample size (SSS) problem are applied in many research areas, e.g., face recognition, bioinformatics, and text recognition. Improving the performance of these LDA variants has great potential in various fields of research. In this paper, we present an overview of these methods. We cover the types, characteristics and taxonomy of the methods that can overcome the SSS problem. We also highlight some important datasets and software packages.
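The small sample size problem arises because the within-class scatter matrix becomes singular when there are fewer samples than features, so classical LDA cannot invert it. The sketch below illustrates one common family of remedies, adding a small ridge term before solving (regularized LDA), for the two-class case. The abstract does not single out this variant, and the names `within_class_scatter`, `regularized_lda_direction` and `reg` are hypothetical.

```python
import numpy as np

def within_class_scatter(X, y):
    """Sum of per-class scatter matrices; its rank is at most n_samples - n_classes."""
    d = X.shape[1]
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)
        Sw += diff.T @ diff
    return Sw

def regularized_lda_direction(X, y, reg=1e-3):
    """Two-class discriminant direction with a ridge term (illustrative sketch).

    Solves (Sw + reg * I) w = m1 - m0, which stays well defined even when
    Sw itself is singular because there are fewer samples than features.
    """
    classes = np.unique(y)
    m0 = X[y == classes[0]].mean(axis=0)
    m1 = X[y == classes[1]].mean(axis=0)
    Sw = within_class_scatter(X, y)
    w = np.linalg.solve(Sw + reg * np.eye(X.shape[1]), m1 - m0)
    return w / np.linalg.norm(w)
```

The value of `reg` trades off fidelity to the estimated scatter against numerical stability, and in practice it would be chosen by cross-validation.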

    BIOINFORMATICS ANALYSIS OF OMICS DATA TOWARDS CANCER DIAGNOSIS AND PROGNOSIS

    I would first like to thank my mentor, Dr. Arul M. Chinnaiyan, for his expert guidance, support, encouragement, and inspiration. I would also like to thank Dr. Debashis Ghosh for his continuous statistical support and great advice, and Dr. David G. Beer, Dr. Jill A. Macoska, and Dr. Kerby A. Shedden for serving on my doctoral committee and giving me valuable suggestions on this thesis work. I would like to thank Jindan Yu, Xiaoju Wang, Guoan Chen, Saravana Dhanasekaran, Daniel Rhodes, Scott A. Tomlins, and Sooryanarayana Varambally, who have contributed to most of the work described here. I would like to express my gratitude to all the members of the Chinnaiyan lab for their support. Without them, none of the work described here could have been completed. I would also like to thank William P. Worzel and Arpit A. Almal for their support on the genetic programming project. I would like to express my deepest gratitude to my wife and my love, Yipin, without whom I would be nowhere. Thanks for putting up with my late nights and giving me unconditional love and encouragement through my doctoral study and the writing of this work. I would also like to thank my parents, my sister, and my grandparents for giving constant support and love. And last but not least, I would like to thank my friends and all whose support helped me complete this thesis in time.

    A multiomics disease progression signature of low‑risk ccRCC

    Clear cell renal cell carcinoma (ccRCC) is the most common renal cancer. Identification of ccRCC likely to progress, despite an apparent low risk at the time of surgery, represents a key clinical issue. From a cohort of adult ccRCC patients (n = 443), we selected low-risk tumors progressing within a 5-year average follow-up (progressors: P, n = 8) and non-progressing (NP) tumors (n = 16). Transcriptome sequencing, miRNA sequencing and proteomics were performed on tissues obtained at surgery. We identified 151 proteins, 1167 mRNAs and 63 miRNAs differentially expressed in P compared to NP low-risk tumors. Pathway analysis demonstrated overrepresentation of proteins related to “LXR/RXR and FXR/RXR Activation” and “Acute Phase Response Signaling” in NP compared to P samples. Integrating mRNA, miRNA and proteomic data, we developed a 10-component classifier including two proteins, three genes and five miRNAs, effectively differentiating P and NP ccRCC and capturing underlying biological differences, potentially useful to identify “low-risk” patients requiring closer surveillance and treatment adjustments. Key results were validated by immunohistochemistry, qPCR and data from publicly available databases. Our work suggests that LXR, FXR and macrophage activation pathways could be critically involved in the inhibition of the progression of low-risk ccRCC. Furthermore, a 10-component classifier could support an early identification of apparently low-risk ccRCC patients. Peer reviewed.