58 research outputs found

    Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

    In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is both important and challenging in many real applications. Dealing with a minority class normally requires new concepts, observations and solutions in order to fully understand the underlying complicated models. We attempt to systematically review and solve this special learning task. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL uses general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem, protein methylation prediction. Extensive computational results show that the DECIDL method performs very well for this imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem.
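
    Under-bagging, one of the baseline methods named in the abstract above, can be sketched in a few lines. This is a minimal illustration of the general technique only, not the DECIDL framework or the dissertation's implementation. The nearest-centroid base learner, the function names, the ensemble size, and the synthetic data are all assumptions made for the example.

```python
import random
from collections import Counter


def train_centroid(samples):
    """Toy base learner (an assumption): store the mean feature vector per class."""
    sums, counts = {}, Counter()
    for features, label in samples:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))


def under_bagging(samples, minority_label, n_members=11, seed=0):
    """Train each member on all minority samples plus an equally sized
    random subset of the majority samples, so every member sees balanced data."""
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    k = min(len(minority), len(majority))
    return [train_centroid(minority + rng.sample(majority, k)) for _ in range(n_members)]


def predict_ensemble(models, features):
    """Combine the members by simple majority vote."""
    votes = Counter(predict_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]


# Synthetic, highly skewed data (200 majority vs 10 minority points), for illustration only.
rng = random.Random(42)
data = [((rng.gauss(0, 0.5), rng.gauss(0, 0.5)), 0) for _ in range(200)]
data += [((rng.gauss(3, 0.5), rng.gauss(3, 0.5)), 1) for _ in range(10)]
models = under_bagging(data, minority_label=1)
```

    Because each member is trained on a different balanced subset, the majority class is covered across the ensemble as a whole even though every individual learner discards most of it.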

    Metabolomic profiling of psoriasis, atopic dermatitis and atherosclerosis (Psoriaasi, atoopilise dermatiidi ja ateroskleroosi metaboloomne profileerimine)

    The electronic version of the dissertation does not include the publications. Metabolomics is a field of science concerned with the measurement and analysis of low-molecular-weight compounds. These include amino acids, biogenic amines, carbohydrates, fatty acids, nucleic acids, and peptides, which may be of either exogenous or endogenous origin. Measuring these substances simultaneously gives a direct reflection of the metabolic pathways, a so-called metabolomic fingerprint. Psoriasis is a widespread chronic inflammatory skin disease that occurs in up to 1% of children and 2%-3% of the general population. The onset of the disease is linked to several causes, including genetic predisposition and susceptibility, and environmental factors together with immune system dysfunction and disruption of the skin barrier. Atopic dermatitis is a widespread and complex skin disease that affects up to 15% of children and adults in the general population. Although most children outgrow the disease, in some cases it also affects adults, impairing patients' well-being and causing a range of comorbidities, including allergies, asthma, attention disorders, and anaemia. Atherosclerosis is an inflammatory disease involving the arterial walls, where inflammatory cells and lipids accumulate. This leads to narrowing of the arteries, which can culminate in thrombus formation and cause an infarction. The most common forms of atherosclerosis are peripheral arterial disease and coronary artery disease, both of which have become major public health problems. The main aim of this doctoral thesis was to analyse the metabolomic profiles of patients with psoriasis, atopic dermatitis, and atherosclerosis and to assess the similarities and differences in the metabolites found.
    Metabolomics is concerned with the measurement and analysis of small-molecule compounds (< 1 kDa, e.g. amino acids, biogenic amines, carbohydrates, fatty acids, nucleic acids, peptides) of both exogenous and endogenous origin. These are the substrates and products of various chemical reactions within metabolic pathways. Psoriasis (PS) is a widespread chronic inflammatory skin disease affecting 2%-3% of the world population. The disease is considered to be multifactorial, with key contributing factors including genetic predisposition and susceptibility, and environmental influences along with immune dysfunction and disruption of the skin barrier. Atopic dermatitis (AD) is a widespread and complex condition that affects up to 15% of adults and children worldwide. Although children have an increased prevalence of atopic dermatitis, many adults remain affected throughout their lives. Atherosclerosis is classified as an inflammatory disease that involves the arterial wall and is characterized by the continuous accumulation of inflammatory cells and lipids within the intima of large arteries. The metabolomic profiles of patients with psoriasis and atopic dermatitis were explored to find possible disease-specific metabolites that could be used to characterise and better understand the underlying mechanisms of disease pathogenesis. The application of the established methods was expanded to peripheral arterial disease and coronary arterial disease to further search for similarities and differences in the metabolomic profiles of the diseases.
    https://www.ester.ee/record=b522842

    Avian muscle development and growth mechanisms: association with muscle myopathies and meat quality Volume II

    Given the significant interest in Volume I, it was decided to launch Volume II of the Research Topic “Avian Muscle Development and Growth Mechanisms: Association With Muscle Myopathies and Meat Quality.” The broiler industry is still facing an unsustainable occurrence of growth-related muscular abnormalities that mainly affect fast-growing genotypes selected for high growth rate and breast yield. Research interest in these issues has continued since their onset, as shown by the temporal trend of published papers during the past decade (Figure 1). Although meat affected by white striping, wooden breast, and spaghetti meat abnormalities is not harmful for human nutrition, these conditions impair quality traits of both raw and processed meat products, causing severe economic losses in the poultry industry worldwide (Petracci et al., 2019; Velleman, 2019). Since the Research Topic of “Avian Muscle Development and Growth Mechanisms: Association With Muscle Myopathies and Meat Quality” is quite diverse, contributions in this second volume reflect the broad scope of areas of investigation related to muscle growth and development, with 11 original research papers and one mini-review from prominent scientists in the sector. We hope that this collection will instigate novel questions in the minds of our readers and will be helpful in facilitating the development of the field.
    Massimiliano Petracci; Sandra G. Velleman

    Information Extraction from Text for Improving Research on Small Molecules and Histone Modifications

    The cumulative number of publications, in particular in the life sciences, requires efficient methods for the automated extraction of information and semantic information retrieval. The recognition and identification of information-carrying units in text (concept denominations and named entities) relevant to a certain domain is a fundamental step. The focus of this thesis lies on the recognition of chemical entities and of a new biological named entity type, histone modifications, both of which are important in the field of drug discovery. As the emergence of new research fields and the discovery and generation of novel entities go along with the coinage of new terms, the perpetual adaptation of named entity recognition approaches to new domains is an important step for information extraction. Two methodologies have been investigated in this regard: the state-of-the-art machine learning method, Conditional Random Fields (CRF), and an approximate string search method based on dictionaries. Recognition methods that rely on dictionaries depend strongly on the availability of entity terminology collections as well as on their quality. In the case of chemical entities, the terminology is distributed over more than 7 publicly available data sources. Joining entries and the accompanying terminology from selected resources enables the generation of a new dictionary comprising chemical named entities. Combined with the automatic processing of the respective terminology (the dictionary curation), the recognition performance reached an F1 measure of 0.54. That is an improvement by 29 % in comparison to the raw dictionary. The highest recall, 0.79, was achieved for the class of TRIVIAL-names. The recognition and identification of chemical named entities provides a prerequisite for the extraction of related pharmacologically relevant information from literature data. Therefore, lexico-syntactic patterns were defined that support the automated extraction of hypernymic phrases comprising pharmacological function terminology related to chemical compounds. It was shown that 29-50 % of the automatically extracted terms can be proposed for novel functional annotation of chemical entities provided by the reference database DrugBank. Furthermore, they are a basis for building up concept hierarchies and ontologies or for extending existing ones. Subsequently, the pharmacological function and biological activity concepts obtained from text were included in a novel descriptor for chemical compounds. Its successful application for the prediction of the pharmacological function of molecules and the extension of chemical classification schemes, such as the Anatomical Therapeutic Chemical (ATC) classification, is demonstrated. In contrast to chemical entities, no comprehensive terminology resource has been available for histone modifications. Thus, histone modification concept terminology was first recognized in text via CRFs with an F1 measure of 0.86. Subsequently, linguistic variants of the extracted histone modification terms were mapped to standard representations that were organized into a newly assembled histone modification hierarchy. The mapping was accomplished by a newly developed term mapping approach described in the thesis. The combination of term recognition and term variant resolution forms a new procedure for the assembly of novel terminology collections. It supports the generation of a term list that is applicable in dictionary-based methods. For the recognition of histone modifications in text, it could be shown that the named entity recognition method based on dictionaries is superior to the machine learning approach used. In conclusion, the present thesis provides techniques that enable an enhanced utilization of textual data, thereby supporting research in epigenomics and drug discovery.

    Interpretability-oriented data-driven modelling of bladder cancer via computational intelligence

    The Pharmacoepigenomics Informatics Pipeline and H-GREEN Hi-C Compiler: Discovering Pharmacogenomic Variants and Pathways with the Epigenome and Spatial Genome

    Over the last decade, biomedical science has been transformed by the epigenome and spatial genome, but the discipline of pharmacogenomics, the study of the genetic underpinnings of pharmacological phenotypes like drug response and adverse events, has not. Scientists have begun to use omics atlases of increasing depth, and inferences relating to the bidirectional causal relationship between the spatial epigenome and gene expression, as a foundational underpinning for genetics research. The epigenome and spatial genome are increasingly used to discover causative regulatory variants in the significance regions of genome-wide association studies, for the discovery of the biological mechanisms underlying these phenotypes and the design of genetic tests to predict them. Such variants often have more predictive power than coding variants, but in the area of pharmacogenomics, such advances have been radically underapplied. The majority of pharmacogenomics tests are designed manually on the basis of mechanistic work with coding variants in candidate genes, and where genome wide approaches are used, they are typically not interpreted with the epigenome. This work describes a series of analyses of pharmacogenomics association studies with the tools and datasets of the epigenome and spatial genome, undertaken with the intent of discovering causative regulatory variants to enable new genetic tests. It describes the potent regulatory variants discovered thereby to have a putative causative and predictive role in a number of medically important phenotypes, including analgesia and the treatment of depression, bipolar disorder, and traumatic brain injury with opiates, anxiolytics, antidepressants, lithium, and valproate, and in particular the tendency for such variants to cluster into spatially interacting, conceptually unified pathways which offer mechanistic insight into these phenotypes. 
It describes the Pharmacoepigenomics Informatics Pipeline (PIP), an integrative multiple omics variant discovery pipeline designed to make this kind of analysis easier and cheaper to perform, more reproducible, and amenable to the addition of advanced features. It describes the successes of the PIP in rediscovering manually discovered gene networks for lithium response, as well as discovering a previously unknown genetic basis for warfarin response in anticoagulation therapy. It describes the H-GREEN Hi-C compiler, which was designed to analyze spatial genome data and discover the distant target genes of such regulatory variants, and its success in discovering spatial contacts not detectable by preceding methods and using them to build spatial contact networks that unite disparate TADs with phenotypic relationships. It describes a potential feature set of a future pipeline, using the latest epigenome research and the lessons of the previous pipeline. It describes my thinking about how to use the output of a multiple omics variant pipeline to design genetic tests that also incorporate clinical data. And it concludes by describing a long-term vision for a comprehensive pharmacophenomic atlas, to be constructed by applying a variant pipeline and machine learning test design system, such as is described, to thousands of phenotypes in parallel. Scientists struggled to assay genotypes for the better part of a century, and in the last twenty years, succeeded. The struggle to predict phenotypes on the basis of the genotypes we assay remains ongoing. The use of multiple omics variant pipelines and machine learning models with omics atlases, genetic association, and medical records data will be an increasingly significant part of that struggle for the foreseeable future.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/145835/1/ariallyn_1.pd

    A Study of Raman Spectroscopy as a Clinical Diagnostic Tool for the Detection of Lynch Syndrome/Hereditary NonPolyposis Colorectal Cancer (HNPCC)

    Lynch syndrome (LS), also known as hereditary non-polyposis colorectal cancer (HNPCC), is a highly penetrant hereditary form of colorectal cancer that accounts for approximately 3% of all cases. It is caused by mutations in DNA mismatch repair, resulting in accelerated adenoma-to-carcinoma progression. The current clinical guidelines used to identify LS are known to be too stringent, resulting in overall underdiagnosis. Raman spectroscopy is a powerful analytical tool used to probe the molecular vibrations of a sample to provide a unique chemical fingerprint. The potential of using Raman as a diagnostic tool for discriminating LS from sporadic adenocarcinoma is explored within this thesis. A number of experimental parameters were initially optimized for use with formalin-fixed paraffin-embedded (FFPE) colonic tissue. This resulted in the development of a novel cost-effective backing substrate shown to be superior to the conventionally used calcium fluoride (CaF2). This substrate is a form of silanized super mirror stainless steel that was found to give a much lower Raman background, enhanced Raman signal and complete paraffin removal from FFPE tissues. Performance of the novel substrate was compared against CaF2 by acquiring large high-resolution Raman maps from FFPE rat and human colonic tissue. All of the major histological features were discerned from steel-mounted tissue, with the benefit of clear lipid signals without paraffin obstruction. Biochemical signals were comparable to those obtained on CaF2, with no detectable irregularities. By using principal component analysis to reduce the dimensionality of the dataset, it was then possible to use linear discriminant analysis to build a classification model for the discrimination of normal colonic tissue (n=10) from two pathological groups: LS (n=10) and sporadic adenocarcinoma (n=10). Leave-one-map-out cross-validation of the classifier showed that LS was predicted with a sensitivity of 63% and a specificity of 89%, values that are competitive with classification techniques applied routinely in clinical practice.
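
    The evaluation described in the abstract above rests on two simple ideas: holding out every spectrum from one map at a time, and reporting sensitivity and specificity for the LS class. The sketch below illustrates only those two pieces in plain Python. It does not reproduce the PCA and linear discriminant analysis steps, and the function names, map identifiers, and labels are hypothetical values chosen for the example, not the thesis data.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Made-up labels for illustration only; these are not the thesis results.
y_true = ["LS"] * 4 + ["other"] * 6
y_pred = ["LS", "LS", "LS", "other", "other", "other", "other", "other", "other", "LS"]
sens, spec = sensitivity_specificity(y_true, y_pred, positive="LS")

# Hypothetical map identifiers: all spectra from one map are held out together.
map_ids = ["m1", "m1", "m2", "m3", "m3", "m3"]
splits = list(leave_one_group_out(map_ids))
```

    Splitting by map rather than by individual spectrum keeps spectra from the same tissue map out of both the training and test sets at once, which would otherwise tend to inflate the reported figures.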

    Hematology

    Hematology encompasses the physiology and pathology of blood and of the blood-forming organs. In common with other areas of medicine, the pace of change in hematology has been breathtaking over recent years. There are now many treatment options available to the modern hematologist and, happily, a greatly improved outlook for the vast majority of patients with blood disorders and malignancies. Improvements in the clinic reflect, and in many respects are driven by, advances in our scientific understanding of hematological processes under both normal and disease conditions. Hematology - Science and Practice consists of a selection of essays that aim to inform both specialist and non-specialist readers about some of the latest advances in hematology, in both the laboratory and the clinic.

    Model based approaches to characterize heterogeneity in gene regulation across cells and disease types

    Access to large genome-wide biological datasets has now enabled computational researchers to tackle long-standing questions in Biomedicine through the lens of Machine Learning (ML) and Artificial Intelligence (AI). The potential benefits of such computational approaches to biological research are immense. For example, efficient yet interpretable machine learning models of disease, drug response, or phenotype can affect our lives at both personal and social levels. However, heterogeneity is found at multiple scales in biology, manifested as the context-specificity of biological processes. This context-specific heterogeneity poses a major challenge to ML models. Even though context-specific models are often trained, this is mostly done without the benefit of mechanistic insights about the biological processes being modeled, and as such they do not help improve our biological understanding. This dissertation addresses these challenges and their limitations by: a) designing appropriate features and ML models motivated by the current biological hypothesis at hand, b) building pipelines to analyze multiple context-specific models together, and c) developing data integration and imputation methods to address the problems of insufficient and missing data. The first project studies loss of methylation, or hypo-methylation, in large blocks causing aberrant gene activity, a well-known phenomenon in cancer. To find the associated markers, I designed a classification model of hypo-methylated block boundaries and non-boundaries in colon cancer. The second project models the binding of transcription factors (TFs) to specific DNA elements in the genome, one of the principal components of gene regulation. Since the condition specificity of TF binding is not yet well understood, this dissertation examines a design of cell type-specific models for TF binding using ChIPSeq data. A meta-analysis pipeline, called TRISECT, is applied to multiple TF binding models to understand the heterogeneity of cell specificity across those models. Next, models for breast cancer metastasis using gene expression data are discussed. In breast cancer metastasis, the affinity towards distant tissues, called secondary tissues, is not yet well understood. Therefore, going beyond mere discriminatory models, I propose another meta-analysis pipeline, MONTAGE, intended to understand the organotropism of breast cancer metastasis across secondary tissues. Building ML models can be hindered by data size, especially for rare diseases. Therefore, by necessity, molecular data have been merged across multiple studies and across multiple technical platforms, which makes them vulnerable to so-called batch effects that dilute the actual biological signal. Existing methods are not capable of removing multi-variate confounding artifacts, leading to inaccurate models. To circumvent this issue, this dissertation examines a deep learning based technique (deepSavior) which ‘translates’ the gene expression profile from samples of one technical platform to another platform. To summarize, this dissertation makes three distinct contributions: a) designing an effective ML model to explore the determinants of cancer-associated hypomethylation, b) designing meta-analysis pipelines to compare multiple related but context-specific ML models to understand heterogeneous relations among biological processes, and c) developing a new method to overcome the data integration and imputation challenges.

    Metabolite profiles in the investigation of childhood brain tumours

    Paediatric brain tumours are the leading cause of cancer mortality in 0-14 year olds. New diagnostic and prognostic tools are required, as are new therapeutic targets to improve survival. Metabolism is a powerful characterising feature of tumours, with potential to aid the management of these diseases. This thesis acquires metabolite profiles from paediatric brain tumour tissue, primarily by High Resolution Magic Angle Spinning NMR (HR-MAS), and analyses them statistically. Firstly, metabolite profiles obtained by HR-MAS were shown to accurately and robustly classify tissue from the three main cerebellar tumours. When compared to current histological intraoperative testing, metabolite profiles correctly diagnosed histologically ambiguous tumours. Secondly, survival analysis identified markers of prognosis. High glutamine concentration was associated with better clinical outcome, both in terms of overall survival in a mixed tumour cohort and progression-free survival in a cohort of pilocytic astrocytomas. Finally, metabolic pathway analysis identified pathways altered by cerebellar tumours relative to each other, supported by mass spectrometry and independent gene expression data. Pathways identified include alanine, aspartate and glutamate metabolism; taurine and hypotaurine metabolism; and glycine, serine and threonine metabolism. Considered together, this work provides strong reasoning for incorporating tissue metabolite profiles into clinical workflows for paediatric brain tumours.