36 research outputs found

    Discovery and annotation of novel microRNAs in the porcine genome by using a semi-supervised transductive learning approach

    Despite the broad variety of available microRNA (miRNA) prediction tools, their application to the discovery and annotation of novel miRNA genes in domestic species is still limited. In this study we designed a comprehensive pipeline (eMIRNA) for miRNA identification in the still poorly annotated porcine genome and demonstrated the usefulness of a motif search positional refinement strategy for accurately determining precursor miRNA boundaries. The small RNA fraction from gluteus medius skeletal muscle of 48 Duroc gilts was sequenced and used for the prediction of novel miRNA loci. Additionally, we used the human miRNA annotation for a homology-based search of porcine miRNAs with orthologous genes in the human genome. A total of 20 novel expressed miRNAs were identified in the porcine muscle transcriptome, and 27 additional novel porcine miRNAs were detected by the homology-based search using the human miRNA annotation. The existence of three selected novel miRNAs (ssc-miR-483, ssc-miR-484 and ssc-miR-200a) was further confirmed by reverse transcription quantitative real-time PCR analyses in the muscle and liver tissues of Göttingen minipigs. In summary, the eMIRNA pipeline presented in the current work allowed us to expand the catalogue of porcine miRNAs and showed better performance than other commonly used miRNA prediction approaches. More importantly, the flexibility of our pipeline makes it applicable to other poorly annotated non-model species.

    Novel Semi-Supervised Learning Models to Balance Data Inclusivity and Usability in Healthcare Applications

    Semi-supervised learning (SSL) is a subfield of statistical machine learning that is useful for problems with only a few labeled instances, which carry both predictor (X) and target (Y) information, and an abundance of unlabeled instances, which carry only predictor (X) information. SSL harnesses the target information available in the limited labeled data, as well as the information in the abundant unlabeled data, to build strong predictive models. However, not all the included information is useful. For example, some features may correspond to noise, and including them will hurt predictive model performance. Additionally, some instances may be less relevant to model building, and their inclusion will increase training time and potentially hurt model performance. The objective of this research is to develop novel SSL models that balance data inclusivity and usability. My dissertation research focuses on applications of SSL in healthcare, driven by problems in brain cancer radiomics, migraine imaging, and Parkinson’s Disease telemonitoring. The first topic introduces an integration of machine learning (ML) and a mechanistic model (PI) to develop an SSL model for predicting the cell density of glioblastoma brain cancer from multi-parametric medical images. The proposed ML-PI hybrid model integrates imaging information from unbiopsied regions of the brain, as well as underlying biological knowledge from the mechanistic model, to predict spatial tumor density in the brain. The second topic develops a multi-modality imaging-based diagnostic decision support system (MMI-DDS). MMI-DDS consists of modality-wise principal components analysis to incorporate imaging features at different aggregation levels (e.g., voxel-wise, connectivity-based), a constrained particle swarm optimization (cPSO) feature selection algorithm, and a clinical utility engine that applies inverse operators to the chosen principal components for white-box classification models. The final topic develops a new SSL regression model with integrated feature and instance selection called s2SSL (with “s2” referring to selection in two different ways: feature and instance). s2SSL integrates cPSO feature selection and graph-based instance selection to simultaneously choose the optimal features and instances and build accurate models for continuous prediction. s2SSL was applied to smartphone-based telemonitoring of Parkinson’s Disease patients. Dissertation/Thesis: Doctoral Dissertation, Industrial Engineering 201
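
The basic SSL setting described in this abstract (a few labeled instances, many unlabeled ones) can be illustrated with self-training, one generic SSL scheme. This is a minimal sketch only, not the ML-PI, MMI-DDS, or s2SSL models from the dissertation; the 1-D nearest-centroid classifier, the `margin` parameter, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def centroids(labeled):
    """Return one centroid per class (0 and 1) from (x, y) pairs."""
    c0 = mean(x for x, y in labeled if y == 0)
    c1 = mean(x for x, y in labeled if y == 1)
    return c0, c1


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Self-training with a 1-D nearest-centroid classifier.

    labeled:   list of (x, y) pairs with y in {0, 1}; both classes must appear
    unlabeled: list of x values without labels
    margin:    hypothetical confidence threshold; a pseudo-label is accepted
               only if the distances to the two centroids differ by at least
               this much
    Returns the final centroids and the instances left unlabeled.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        c0, c1 = centroids(labeled)
        confident, rest = [], []
        for x in pool:
            d0, d1 = abs(x - c0), abs(x - c1)
            if abs(d0 - d1) >= margin:
                # Confident prediction: adopt it as a pseudo-label.
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                # Ambiguous instance: keep it out of the training set.
                rest.append(x)
        if not confident:
            break
        labeled.extend(confident)
        pool = rest
    c0, c1 = centroids(labeled)
    return c0, c1, pool


# Toy data: two labeled points, five unlabeled ones.
c0, c1, leftover = self_train([(0.0, 0), (10.0, 1)], [1.0, 2.0, 8.0, 9.0, 5.2])
print(c0, c1, leftover)
```

In this toy run the ambiguous point near the class boundary is never pseudo-labeled, which loosely mirrors the abstract's remark that not every instance is worth including in model building.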

    Spectral Feature Selection for Data Mining

    This timely introduction to spectral feature selection illustrates the potential of this powerful dimensionality reduction technique in high-dimensional data processing. It presents the theoretical foundations of spectral feature selection, its connections to other algorithms, and its use in handling both large-scale data sets and small sample problems. Readers learn how to use spectral feature selection to solve challenging problems in real-life applications and discover how general feature selection and extraction are connected to spectral feature selection. Source code for the algorithms is available online.

    Contribution to supervised representation learning: algorithms and applications.

    278 p. In this thesis, we focus on supervised learning methods for pattern categorization. In this context, it remains a major challenge to establish efficient relationships between the discriminant properties of the extracted features and the inter-class sparsity structure. Our first attempt to address this problem was to develop a method called "Robust Discriminant Analysis with Feature Selection and Inter-class Sparsity" (RDA_FSIS). This method performs feature selection and extraction simultaneously. The targeted projection transformation focuses on the most discriminative original features while guaranteeing that the extracted (or transformed) features belonging to the same class share a common sparse structure, which contributes to small intra-class distances. In a further study on this approach, some improvements were introduced in terms of the optimization criterion and the applied optimization process. Specifically, we proposed an improved version of the original RDA_FSIS called "Enhanced Discriminant Analysis with Class Sparsity using Gradient Method" (EDA_CS). The basic improvement is twofold: on the one hand, in the alternating optimization, we update the linear transformation and tune it with the gradient descent method, resulting in a more efficient and less complex solution than the closed form adopted in RDA_FSIS. On the other hand, the method could be used as a fine-tuning technique for many feature extraction methods. The main feature of this approach is that it is a gradient descent based refinement applied to a closed form solution. This makes it suitable for combining several extraction methods and can thus improve the performance of the classification process. In accordance with the above methods, we proposed a hybrid linear feature extraction scheme called "feature extraction using gradient descent with hybrid initialization" (FE_GD_HI). This method, based on a unified criterion, was able to take advantage of several powerful linear discriminant methods. The linear transformation is computed using a gradient descent method. The strength of this approach is that it is generic in the sense that it allows fine-tuning of the hybrid solution provided by different methods. Finally, we proposed a new efficient ensemble learning approach that aims to estimate an improved data representation. The proposed method is called "ICS Based Ensemble Learning for Image Classification" (EM_ICS). Instead of using multiple classifiers on the transformed features, we aim to estimate multiple extracted feature subsets. These were obtained by multiple learned linear embeddings. Multiple feature subsets were used to estimate the transformations, which were ranked using multiple feature selection techniques. The derived extracted feature subsets were concatenated into a single data representation vector with strong discriminative properties. Experiments conducted on various benchmark datasets, ranging from face images, handwritten digit images, and object images to text datasets, showed promising results that outperformed the existing state-of-the-art and competing methods.

    A metabolomics approach for targeted isolation and production of bioactive secondary metabolites in microbial isolates from extreme environments

    This thesis was previously held under moratorium from 22/08/2019 to 22/08/2021. Marine microorganisms produce unique secondary metabolites, which are responsible for a variety of biologically active molecules with a wide range of pharmaceutical properties. There is a lack of novel effective drugs for metabolic diseases, cancer and parasite infections, as well as of TNF-alpha inhibitors; hence, underexplored marine bacteria could be an important source of new bioactive molecules. Two strains of Muricauda ruestringensis (SBT531 and SBT587) were isolated from geothermal intertidal pools in Iceland, and two strains of Micromonospora sp., N17 and N74 (SBT687 and SBT692), were isolated from the Mediterranean sponge Phorbas tenacior from the Santorini volcanic complex of Crete. Bioactive metabolites produced by the thermophile strains of M. ruestringensis (SBT531 and SBT587) and Micromonospora sp. (SBT687 and SBT692) were identified and isolated by using a metabolomics approach to analyse the liquid chromatography-high resolution mass spectrometry (LC-HRMS) and nuclear magnetic resonance (NMR) data sets. The LC-HRMS data were processed using a modified version of MZmine 2.10 software and dereplicated with an in-house EXCEL macro coupled to the AntiMarin and Dictionary of Natural Products (DNP) databases, then statistically evaluated by multivariate analysis in SIMCA v15.02. Orthogonal partial least squares discriminant analysis (OPLS-DA) in SIMCA was used to predict and pinpoint the biologically active secondary metabolites. Up-scaling was optimised to increase the production yield of the target metabolites. Additionally, specific bioassay screening determined the activity, if any, of the crude extracts, fractions and isolated compounds. The total ethyl acetate organic extracts of SBT531 and SBT587 were active inhibitors in target-based functional assays against alpha-glucosidase and protein-tyrosine phosphatase 1B (PTP1B), which are therapeutic targets for the treatment of diabetes and other metabolic syndromes. The fractionation of SBT531 afforded a series of bioactive alpha- and beta-hydroxy acid derivatives and allowed the definition of a preliminary structure-activity relationship based on their relative potency. Aseanostatin P6 (13-methyltetradecanoic acid) was the major compound identified, followed by two derivatives, 3-hydroxy-13-methyltetradecanoic acid and 2-hydroxy-14-methylhexadecanoic acid. On the other hand, the initial organic crude extracts of SBT687 and SBT692 showed inhibitory activity against sea lice and inhibition of TNF-alpha, respectively; however, after further fermentation scale-up, the fractions no longer showed this inhibitory activity.

    Segmentation and classification of cervical cell images

    Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010. Thesis (Master's) -- Bilkent University, 2010. Includes bibliographical references leaves 103-105. Cervical cancer can be prevented if it is detected and treated early. The Pap smear test is a manual screening procedure used to detect cervical cancer and precancerous changes in a uterine cervix. However, this procedure is costly, and it may result in inaccurate diagnoses due to human error such as intra- and interobserver variability. Therefore, a computer-assisted screening system will be very beneficial for preventing cervical cancer if it increases the reliability of diagnoses. In this thesis, we propose a computer-assisted diagnosis system which helps cyto-technicians by sorting cells in a Pap smear slide according to their abnormality degree. There are three main components of such a system. Firstly, cells along with their nuclei are located using a segmentation procedure on an image taken using a microscope. Then, features describing these segmented cells are extracted. Finally, the cells are sorted according to their abnormality degree based on the extracted features. Different from the related studies that require images of a single cervical cell, we propose a non-parametric generic segmentation algorithm that can also handle images of overlapping cells. We use thresholding as the first phase to extract background regions for obtaining remaining cell regions. The second phase consists of segmenting the cell regions by a non-parametric hierarchical segmentation algorithm that uses the spectral and shape information as well as the gradient information. The last phase aims to partition the cell region into true structures of each nucleus and the whole cytoplasm area by classifying the final segments as nucleus or cytoplasm region. We evaluate our segmentation method both quantitatively and qualitatively using two data sets. By proposing an unsupervised screening system, we aim to approach the problem in a different way when compared to the related studies that concentrate on classification. In order to rank the cells in a Pap slide, we first perform hierarchical clustering on 14 different cell features. The initial ordering of the cells is determined as the leaf ordering of the constructed hierarchical tree. Then, this initial ordering is improved by applying an optimal leaf ordering algorithm. The experiments with ground truth data show the effectiveness of the proposed approach under different experimental settings. Kale, Aslı M.S
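
The ranking step in this abstract (hierarchical clustering, then reading the cell order off the leaves of the tree) can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's implementation: average linkage, Euclidean distance, and the two made-up features per cell are all hypothetical choices (the thesis uses 14 features), and the optimal leaf ordering refinement is deliberately left out.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance over all cross-cluster pairs of leaves."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def leaf_order(points):
    """Agglomerative clustering that returns leaf indices in tree order.

    Each cluster is stored as the left-to-right list of its leaves, so
    merging two clusters simply concatenates their leaf lists.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


# Hypothetical feature vectors for five cells (two features each).
cells = [(1.0, 0.1), (9.0, 0.9), (1.2, 0.2), (8.5, 0.8), (5.0, 0.5)]
order = leaf_order(cells)
print(order)  # → [0, 2, 4, 1, 3]
```

In this toy output, cell 4 ends up next to cell 1 even though it is closer to cell 3. Flipping the subtree that holds cells 1 and 3 would fix that without changing the tree, which is the kind of improvement an optimal leaf ordering step is meant to provide.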

    Healthcare data mining from multi-source data

    The "big data" challenge is changing the way we acquire, store, analyse, and draw conclusions from data. How we effectively and efficiently "mine" data from possibly multiple sources and extract useful information is a critical question. Increasing research attention has been drawn to healthcare data mining, with the ultimate goal of improving the quality of care. The human body is complex, and so too is the data collected in treating it. Noise, often introduced during data collection, makes building data mining models a challenging task. This thesis focuses on the classification tasks of mining healthcare data, with the goal of improving the effectiveness of health risk prediction. In particular, we developed algorithms to address issues identified from real healthcare data, such as feature extraction, heterogeneity, label uncertainty, and large unlabeled data. The three main contributions of this research are as follows. First, we developed a new health index called Personal Health Index (PHI) that scores a person's health status based on the examination records of a given population. Second, we identified the key characteristics of the real datasets and the issues associated with the data. Third, we developed classification algorithms to cope with those issues, particularly the label uncertainty and large unlabeled data issues. This research takes one step forward towards scoring personal health based on mining increasingly large health records. In particular, it pioneers the mining of GHE data and tackles the associated challenges. We anticipate that in the near future, more robust data-mining-based health scoring systems will be available for healthcare professionals to understand people's health status and thus improve the quality of care.