Gene set based ensemble methods for cancer classification
Diagnosis of cancer very often depends on conclusions drawn after both clinical and microscopic examinations of tissues to study the manifestation of the disease in order to place tumors in known categories. One factor which determines the categorization of cancer is the tissue from which the tumor originates. Information gathered from clinical exams may be partial or not completely predictive of a specific category of cancer. Further complicating the problem of categorizing various tumors is that the histological classification of the cancer tissue and the description of its course of development may be atypical. Gene expression data gleaned from microarray analysis holds tremendous promise for more accurate cancer diagnosis. One hurdle in the classification of tumors based on gene expression data is that the data space is ultra-high-dimensional with relatively few points; that is, there are a small number of examples with a large number of genes. A second hurdle is expression bias caused by the correlation of genes. Analysis of subsets of genes, known as gene set analysis, provides a mechanism by which groups of differentially expressed genes can be identified. We propose an ensemble of classifiers whose base classifiers are ℓ1-regularized logistic regression models with the feature space restricted to biologically relevant genes. Some researchers have already explored the use of ensemble classifiers to classify cancer, but the effect of the underlying base classifiers in conjunction with biologically derived gene sets on cancer classification has not been explored.
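The idea of an ensemble whose base classifiers are ℓ1-regularized logistic regression models, each restricted to one gene set, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the proximal-gradient solver, the majority-vote combination rule, the hyperparameters, and the gene sets (given here as lists of column indices) are all assumptions made for the example.

```python
import math

def soft_threshold(w, t):
    """Proximal operator of t*|w|: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_l1_logreg(X, y, lam=0.05, lr=0.1, epochs=500):
    """Fit l1-regularized logistic regression by proximal gradient descent.

    lam, lr and epochs are illustrative defaults, not values from the paper.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        # Gradient step followed by soft-thresholding drives small weights to 0.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def predict(model, x):
    """Return the 0/1 class predicted by one base classifier."""
    w, b = model
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0

def fit_gene_set_ensemble(X, y, gene_sets, **kwargs):
    """Train one base classifier per gene set (a list of column indices)."""
    models = []
    for genes in gene_sets:
        X_sub = [[row[g] for g in genes] for row in X]
        models.append((genes, fit_l1_logreg(X_sub, y, **kwargs)))
    return models

def ensemble_predict(models, x):
    """Combine base classifiers by majority vote (an assumed combination rule)."""
    votes = sum(predict(m, [x[g] for g in genes]) for genes, m in models)
    return 1 if 2 * votes > len(models) else 0

# Toy expression matrix: 6 samples x 4 genes; genes 0 and 2 carry the signal.
X = [[2.0, 0.1, 1.8, -0.2], [1.5, -0.3, 2.2, 0.1], [1.8, 0.2, 1.6, 0.0],
     [-2.0, 0.0, -1.7, 0.2], [-1.6, 0.3, -2.1, -0.1], [-1.9, -0.2, -1.5, 0.1]]
y = [1, 1, 1, 0, 0, 0]
gene_sets = [[0, 1], [2, 3], [0, 2]]  # hypothetical gene sets
ensemble = fit_gene_set_ensemble(X, y, gene_sets)
```

Restricting each base model to a gene set keeps the number of fitted coefficients small relative to the number of samples, and the ℓ1 penalty can shrink uninformative genes within a set to exactly zero.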
Feature selection of imbalanced gene expression microarray data
Gene expression data is a very complex data set characterised by abundant numbers of features but a low number of observations. However, only a small number of these features are relevant to an outcome of interest. With this kind of data set, feature selection becomes a real prerequisite. This paper proposes a methodology for feature selection on imbalanced leukaemia gene expression data based on the random forest algorithm. It presents the importance of feature selection in terms of reducing the number of features, enhancing the quality of machine learning and providing better understanding for biologists in diagnosis and prediction. Algorithms are presented to show the methodology and strategy for feature selection, taking care to avoid overfitting. Moreover, experiments are done using imbalanced leukaemia gene expression data, and a special measurement is used to evaluate the quality of feature selection and the performance of classification. © 2011 IEEE
Mycobacterium tuberculosis proteins involved in cell wall lipid biosynthesis improve BCG vaccine efficacy in a murine TB model
OBJECTIVES: Advances in tuberculosis (TB) vaccine development are urgently required to enhance global disease management. We evaluated the potential of Mycobacterium tuberculosis (M. tb)-derived protein antigens Rv0447c, Rv2957 and Rv2958c to boost BCG vaccine efficacy in the presence or absence of glucopyranosyl lipid adjuvant formulated in a stable emulsion (GLA-SE) adjuvant. METHODS: Mice received the BCG vaccine, followed by Rv0447c, Rv2957 and Rv2958c protein boosting with or without GLA-SE adjuvant 3 and 6 weeks later. Immune responses were examined at given time points. Nine weeks post vaccination, mice were aerosol-challenged with M. tb, and sacrificed at 6 and 12 weeks to assess bacterial burden. RESULTS: Vaccination of mice with BCG and M. tb proteins in the presence of GLA-SE adjuvant triggered strong IFN-γ and IL-2 production by splenocytes; more TNF-α was produced without GLA-SE addition. Antibody responses to all three antigens did not differ, with or without GLA-SE adjuvant. Protein boosting without GLA-SE adjuvant resulted in vaccinated animals having better control of pulmonary M. tb load at 6 and 12 weeks post aerosol infection, while animals receiving the protein boost with GLA-SE adjuvant exhibited more bacteria in the lungs. CONCLUSIONS: Our data provide evidence for developing Rv2958c, Rv2957 and Rv0447c in a heterologous prime-boost vaccination strategy with BCG.
Computational Intelligence Based Classifier Fusion Models for Biomedical Classification Applications
The generalization abilities of machine learning algorithms often depend on the algorithms' initialization, parameter settings, training sets, or feature selections. For instance, SVM classifier performance largely relies on whether the selected kernel functions are suitable for real application data. To enhance the performance of individual classifiers, this dissertation proposes classifier fusion models that use computational intelligence to combine different classifiers. The first fusion model, called T1FFSVM, combines multiple SVM classifiers by constructing a fuzzy logic system. T1FFSVM can be improved by tuning the fuzzy membership functions (MFs) of linguistic variables using genetic algorithms; the improved model is called GFFSVM. To better handle uncertainties in the fuzzy MFs and in the classification data, T1FFSVM can also be improved by applying type-2 fuzzy logic to construct a type-2 fuzzy classifier fusion model (T2FFSVM). T1FFSVM, GFFSVM, and T2FFSVM use accuracy as a classifier performance measure. AUC (the area under an ROC curve) has been shown to be a better classifier performance metric. As a comparison study, AUC-based classifier fusion models are also proposed in the dissertation. Experiments on biomedical datasets demonstrate promising performance of the proposed classifier fusion models compared with the individual component classifiers. The proposed classifier fusion models also perform better than many existing classifier fusion methods. The dissertation also studies one interesting phenomenon in the biology domain using machine learning and classifier fusion methods: how protein structures and sequences are related to each other.
The experiments show that protein segments with similar structures also share similar sequences, which adds new insight to the existing knowledge on the relation between protein sequences and structures: similar sequences share high structure similarity, but similar structures may not share high sequence similarity.
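The AUC mentioned above as a classifier performance measure equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative assumptions and are not taken from the dissertation.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels; scores: classifier outputs,
    where a higher score means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```

Unlike accuracy, this quantity does not depend on a decision threshold, which is one reason it is often preferred when classes are imbalanced.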
Machine learning and computational methods to identify molecular and clinical markers for complex diseases – case studies in cancer and obesity
In biomedical research, applied machine learning and bioinformatics are essential disciplines heavily involved in translating data-driven findings into medical practice. This task is accomplished mainly by developing computational tools and algorithms that assist in detecting and clarifying the underlying causes of diseases. The continuous advancements in high-throughput technologies, coupled with recently promoted data sharing policies, have contributed to the presence of a massive wealth of data with remarkable potential to improve human health care. In concordance with this massive boost in data production, innovative data analysis tools and methods are required to meet the growing demand. The data analyzed by bioinformaticians and computational biology experts can be broadly divided into molecular and conventional clinical data categories. The aim of this thesis was to develop novel statistical and machine learning tools and to incorporate existing state-of-the-art methods to analyze bio-clinical data with medical applications. The findings of the studies demonstrate the impact of computational approaches in clinical decision making by improving patient risk stratification and prediction of disease outcomes.
This thesis comprises five studies explaining method development for 1) genomic data, 2) conventional clinical data and 3) integration of genomic and clinical data. With genomic data, the main focus is the detection of differentially expressed genes, the most common task in transcriptome profiling projects. In addition to reviewing available differential expression tools, a data-adaptive statistical method called Reproducibility Optimized Test Statistic (ROTS) is proposed for detecting differential expression in RNA-sequencing studies. To demonstrate the efficacy of ROTS in real biomedical applications, the method is used to identify prognostic markers in clear cell renal cell carcinoma (ccRCC). In addition to previously known markers, novel genes with a potential prognostic and therapeutic role in ccRCC are detected. For conventional clinical data, ensemble-based predictive models are developed to provide clinical decision support in the treatment of patients with metastatic castration-resistant prostate cancer (mCRPC). The proposed predictive models cover treatment and survival stratification tasks for both trial-based and real-world patient cohorts. Finally, genomic and conventional clinical data are integrated to demonstrate the importance of including genomic data for the predictive ability of clinical models. Again utilizing ensemble-based learners, a novel model is proposed to predict adulthood obesity using both genetic and social-environmental factors.
Overall, the ultimate objective of this work is to demonstrate the importance of clinical bioinformatics and machine learning for bio-clinical marker discovery in complex diseases with high heterogeneity. In the case of cancer, the interpretability of clinical models strongly depends on predictive markers with high reproducibility supported by validation data. The discovery of these markers would increase the chance of early detection and improve prognosis assessment and treatment choice.
Algorithms for pre-microRNA classification and a GPU program for whole genome comparison
MicroRNAs (miRNAs) are non-coding RNAs of approximately 22 nucleotides that are derived from precursor molecules. These precursor molecules, or pre-miRNAs, often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpins can be found in genomes. It is a challenge to distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (referred to as pseudo pre-miRNAs). The first part of this dissertation presents a new method, called MirID, for identifying and classifying microRNA precursors. MirID comprises three steps. Initially, a combinatorial feature mining algorithm is developed to identify suitable feature sets. Then, the feature sets are used to train support vector machines to obtain classification models, based on which a classifier ensemble is constructed. Finally, an AdaBoost algorithm is adopted to further enhance the accuracy of the classifier ensemble. Experimental results on a variety of species demonstrate the good performance of the proposed approach and its superiority over existing methods.
In the second part of this dissertation, a GPU (Graphics Processing Unit) program is developed for whole genome comparison. The goal of the research is to identify the commonalities and differences of two genomes from closely related organisms via multiple sequence alignments, using a seed-and-extend technique to choose reliable subsets of exact or near-exact matches, which are called anchors. A rigorous method named Smith-Waterman search is applied for the anchor seeking, but it takes days or months to map millions of bases for mammalian genome sequences. With GPU programming, which is designed to run hundreds of short functions called threads in parallel, up to a 100x speedup is achieved over similar CPU executions.
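For readers unfamiliar with the Smith-Waterman search named above, the following is a minimal single-threaded sketch of the standard local alignment recurrence with a linear gap penalty. It is not the dissertation's GPU program, and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTACGTTT", "GGACGGG"))  # prints (6, 'ACG', 'ACG')
```

The nested loops fill a table with one cell per pair of positions, so the cost grows with the product of the two sequence lengths. That quadratic cost is what makes the method slow on genome-scale inputs and a natural candidate for parallel hardware.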
Expression and Cellular Immunogenicity of a Transgenic Antigen Driven by Endogenous Poxviral Early Promoters at Their Authentic Loci in MVA
CD8+ T cell responses to vaccinia virus are directed almost exclusively against early gene products. The attenuated strain modified vaccinia virus Ankara (MVA) is under evaluation in clinical trials of new vaccines designed to elicit cellular immune responses against pathogens including Plasmodium spp., M. tuberculosis and HIV-1. All of these recombinant MVAs (rMVA) utilize the well-established method of linking the gene of interest to a cloned poxviral promoter prior to insertion into the viral genome at a suitable locus by homologous recombination in infected cells. Using BAC recombineering, we show that potent early promoters that drive expression of non-functional or non-essential MVA open reading frames (ORFs) can be harnessed for immunogenic expression of recombinant antigen. Precise replacement of the MVA orthologs of C11R, F11L, A44L and B8R with a model antigen positioned to use the same translation initiation codon allowed early transgene expression similar to or slightly greater than that achieved by the commonly used p7.5 or short synthetic promoters. The frequencies of antigen-specific CD8+ T cells induced in mice by single-shot or adenovirus-prime, rMVA-boost vaccination were similarly equal or marginally enhanced using endogenous promoters at their authentic genomic loci compared to the traditional constructs. The enhancement in immunogenicity observed using the C11R or F11L promoters compared with p7.5 was similar to that obtained with the mH5 promoter compared with p7.5. Furthermore, the growth rates of the viruses were unimpaired and the insertions were genetically stable. Insertion of a transgenic ORF in place of a viral ORF by BAC recombineering can thus provide not only a potent promoter, but also, concomitantly, a suitable insertion site, potentially facilitating development of MVA vaccines expressing multiple recombinant antigens.