
    Differential gene expression graphs: A data structure for classification in DNA microarrays

    This paper proposes an innovative data structure to be used as a backbone in designing microarray phenotype sample classifiers. The data structure is based on graphs and is built from a differential analysis of the expression levels of healthy and diseased tissue samples in a microarray dataset. By construction, it has a number of properties that make it well suited to problems such as feature extraction, clustering, and classification

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and makes it possible to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research

    Kidney Ailment Prediction under Data Imbalance

    Chronic Kidney Disease (CKD) is the leading cause of kidney failure. It is a global health problem affecting approximately 10% of the world population and about 15% of US adults. Chronic kidney diseases generally do not show any disease-specific symptoms in their early stages, so they are hard to detect and prevent. Early detection and classification are the key factors in managing chronic kidney diseases. In this thesis, we propose a new machine learning technique for kidney ailment prediction. We focus on two key issues in machine learning, especially in its application to disease prediction. The first is the class imbalance problem, which occurs when at least one of the classes is represented by a significantly smaller number of samples than the others in the training set. With an imbalanced dataset, classifiers tend to classify all samples as the majority class, ignoring the minority class samples. The second issue is the specific type of data to be used for a given problem. Here, we focused on predicting kidney diseases based on patient information extracted from laboratory and questionnaire data. Most recent approaches for predicting kidney diseases or other chronic diseases rely on the usage of prescription drugs. In this study, we focus on biomarker and anthropometry data of patients to analyze and predict kidney-related diseases. We adopted a learning approach that uses repeated random data sub-sampling to tackle the class imbalance problem. This technique divides the samples into multiple sub-samples while keeping each training sub-sample completely balanced. We then trained classification models on the balanced data to predict the risk of kidney failure. Further, we developed an intelligent fusion mechanism to combine information from both the biomarker and anthropometry data sets for improved prediction accuracy and stability. Results are included to demonstrate the performance
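
    The repeated random sub-sampling idea described in this abstract can be sketched roughly as follows. This is only one plausible reading, not the thesis's actual procedure: the function name `balanced_subsamples`, the parameter `n_rounds`, and the choice to draw each class down to the size of the smallest class are all assumptions made for illustration.

```python
import random
from collections import Counter


def balanced_subsamples(labels, n_rounds, seed=0):
    """Return n_rounds lists of indices, each with equal counts per class.

    Assumption (not from the source): every class is randomly
    undersampled to the size of the smallest class in each round.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # The smallest class sets the per-class quota.
    quota = min(len(idxs) for idxs in by_class.values())

    subsamples = []
    for _ in range(n_rounds):
        chosen = []
        for idxs in by_class.values():
            # Sample without replacement within each class.
            chosen.extend(rng.sample(idxs, quota))
        rng.shuffle(chosen)
        subsamples.append(chosen)
    return subsamples


# Toy data: 90 majority-class samples and 10 minority-class samples.
labels = [0] * 90 + [1] * 10
for sub in balanced_subsamples(labels, n_rounds=5):
    print(Counter(labels[i] for i in sub))
```

    In a setup like this, one classifier would typically be trained per balanced sub-sample and the predictions combined afterwards; how the thesis combines them is not stated in the abstract.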

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues