327 research outputs found

    Assessment of Stroke Risk Based on Morphological Ultrasound Image Analysis With Conformal Prediction

    Non-invasive ultrasound imaging of carotid plaques makes it possible to assess stroke risk through plaque image analysis. In our work, we provide reliable confidence measures for the assessment of stroke risk using the Conformal Prediction framework, which assigns valid confidence measures to the predictions of classical machine learning algorithms. We conduct experiments on a dataset of morphological features derived from ultrasound images of atherosclerotic carotid plaques, and we evaluate the results of four different Conformal Predictors (CPs). The four CPs are based on Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Naive Bayes classification (NBC), and k-Nearest Neighbours (k-NN). The results given by all CPs demonstrate the reliability and usefulness of the obtained confidence measures on the problem of stroke risk assessment.
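
As a rough illustration of how a conformal predictor attaches confidence to a k-NN prediction, here is a minimal sketch. It assumes a split (inductive) variant with a 1-NN nonconformity score; the function names, the toy data and the choice of score are illustrative assumptions, not details taken from the paper.

```python
import math

def nn_nonconformity(x, y, train):
    """1-NN nonconformity (assumed score): distance to the nearest training
    point with label y divided by the distance to the nearest training point
    with a different label. Larger values mean the pair (x, y) looks stranger."""
    same = min(math.dist(x, xi) for xi, yi in train if yi == y)
    other = min(math.dist(x, xi) for xi, yi in train if yi != y)
    return same / other if other > 0 else math.inf

def p_values(x, train, calib, labels):
    """Conformal p-value of every candidate label for the test object x."""
    calib_scores = [nn_nonconformity(xc, yc, train) for xc, yc in calib]
    result = {}
    for y in labels:
        score = nn_nonconformity(x, y, train)
        # Count calibration examples at least as strange as the test pair;
        # the +1 terms account for the test example itself.
        at_least = sum(1 for s in calib_scores if s >= score)
        result[y] = (at_least + 1) / (len(calib_scores) + 1)
    return result

def prediction_set(pvals, epsilon):
    """Keep every label whose p-value exceeds the significance level epsilon."""
    return {y for y, p in pvals.items() if p > epsilon}

# Hypothetical two-class toy data: (features, label).
train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((5.0, 5.0), 1), ((5.0, 6.0), 1)]
calib = [((0.5, 0.5), 0), ((1.0, 0.0), 0), ((5.5, 5.5), 1), ((6.0, 5.0), 1)]
pvals = p_values((0.2, 0.3), train, calib, labels=[0, 1])
```

With only four calibration examples the smallest attainable p-value is 1/5, so a realistic use would need a much larger calibration set before low significance levels become meaningful.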

    TCM-RF: Hedging the predictions of Random Forest

    The output of a traditional classifier is a point prediction with no accompanying measure of confidence. In contrast, the Transductive Confidence Machine (TCM) is a novel framework that provides a prediction coupled with an accurate confidence value. The method can also hedge its predictions, so that predictive accuracy is controlled by a predefined confidence level. In the TCM framework, the efficiency of prediction depends on the strangeness function applied to the samples. This paper incorporates Random Forests (RF) into the TCM framework and proposes a new TCM algorithm, TCM-RF, in which the strangeness obtained from the RF is used to implement confidence prediction. Compared with traditional TCM algorithms, our method benefits from a more precise and robust strangeness measure and takes advantage of the random forest. Experiments indicate its effectiveness and robustness. In addition, our study demonstrates that using ensemble strategies to define sample strangeness may be more principled than using a single classifier. It also shows that the paradigm of hedging predictions can be applied to an ensemble classifier.
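
To make the idea of an ensemble-derived strangeness concrete, the sketch below turns the votes of an ensemble into a strangeness score and then into a hedged prediction with confidence and credibility. The vote-based score, the function names and the reference scores are hypothetical stand-ins chosen for illustration; the abstract does not specify the strangeness measure used by TCM-RF, which may well differ.

```python
def vote_strangeness(votes, label):
    """Assumed strangeness: fraction of ensemble members NOT voting for label."""
    return 1.0 - votes.count(label) / len(votes)

def p_value(test_score, reference_scores):
    """Share of examples (including the test one) at least as strange."""
    at_least = sum(1 for s in reference_scores if s >= test_score)
    return (at_least + 1) / (len(reference_scores) + 1)

def hedged_prediction(votes, reference_scores, labels):
    """Return (label, confidence, credibility) for one test example.

    confidence  = 1 - second-largest p-value
    credibility = largest p-value
    """
    pvals = {y: p_value(vote_strangeness(votes, y), reference_scores)
             for y in labels}
    ranked = sorted(pvals, key=pvals.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, 1.0 - pvals[runner_up], pvals[best]

# Hypothetical inputs: ten tree votes for the test example, and strangeness
# scores previously computed for nine reference examples with known labels.
votes = ["a"] * 8 + ["b"] * 2
reference_scores = [0.0, 0.1, 0.1, 0.3, 0.3, 0.4, 0.4, 0.5, 0.6]
label, confidence, credibility = hedged_prediction(votes, reference_scores, ["a", "b"])
```

A prediction would then be accepted only when its confidence reaches the predefined level, which is the hedging step the abstract refers to.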

    Applications Of Machine Learning In Biology And Medicine

    Machine learning as a field is defined to be the set of computational algorithms that improve their performance by assimilating data. As such, the field as a whole has found applications in many diverse disciplines from robotics and communication in engineering to economics and finance, and also biology and medicine. It should not come as a surprise that many popular methods in use today have completely different origins. Despite this heterogeneity, different methods can be divided into standard tasks, such as supervised, unsupervised, semi-supervised and reinforcement learning. Although machine learning as a field can be formalized as methods trying to solve certain standard tasks, applying these methods to datasets from different fields comes with certain caveats, and sometimes is fraught with challenges. In this thesis, we develop general procedures and novel solutions, dealing with practical problems that arise when modeling biological and medical data. Cost-sensitive learning is an important area of research in machine learning which addresses the widespread and practical problem of dealing with different costs during the learning and deployment of classification algorithms. In many applications such as credit fraud detection, network intrusion and specifically medical diagnosis domains, prior class distributions are highly skewed, which makes the training examples highly unbalanced. Combining this with uneven misclassification costs renders standard machine learning approaches useless in learning an acceptable decision function. We experimentally show the benefits and shortcomings of various methods that convert cost-blind learning algorithms to cost-sensitive ones. Using the results and best practices found for cost-sensitive learning, we design and develop a machine learning approach to ontology mapping. Next, we present a novel approach to deal with uncertainty in classification when costs are unknown or otherwise hard to assign.
Support Vector Machines (SVM) are considered to be among the most successful approaches for classification. However, the prediction of instances near the decision boundary depends more on the specific parameter selection or noise in the data than on a clear difference in features. In many applications such as medical diagnosis, these regions should be labeled as uncertain rather than assigned to any particular class. Furthermore, instances may belong to novel disease subtypes that are not from any previously known class. In such applications, declining to make a prediction could be beneficial when more powerful but expensive tests are available. We develop a novel approach for optimal selection of the threshold and show its successful application on three biological and medical datasets. The last part of this thesis provides novel solutions for handling high-dimensional data. Although high-dimensional data is ubiquitously found in many disciplines, current life science research almost always involves high-dimensional genomics/proteomics data. The "omics" data provide a wealth of information and have changed the research landscape in biology and medicine. However, these data are plagued with noise, redundancy and collinearity, which makes the discovery process very difficult and costly. Any method that can accurately detect irrelevant and noisy variables in omics data would be highly valuable. We present Robust Feature Selection (RFS), a randomized feature selection approach dedicated to low-sample high-dimensional data. RFS combines an embedded feature selection method with a randomization procedure for stability. Recent advances in sparse recovery and estimation methods have provided efficient and asymptotically consistent feature selection algorithms. However, these methods lack finite sample error control due to instability. Furthermore, the chances of correct recovery diminish with more collinearity among features.
To overcome these difficulties, RFS uses a randomization procedure to provide an accurate and stable feature selection method. We thoroughly evaluate RFS by comparing it to a number of popular univariate and multivariate feature selection methods and show marked prediction accuracy improvement of a diagnostic signature, while preserving good stability.
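
The reject option mentioned in this abstract, where instances near the SVM decision boundary are labelled as uncertain, can be sketched as follows. This is a minimal illustration that assumes a signed decision value per instance and a hand-picked threshold; it does not reproduce the thesis's method for selecting the threshold, and all names and numbers are hypothetical.

```python
def predict_with_reject(decision_value, threshold):
    """Return +1 or -1, or None when the instance is too close to the boundary."""
    if abs(decision_value) < threshold:
        return None  # decline to predict; defer to a more expensive test
    return 1 if decision_value > 0 else -1

def coverage_and_accuracy(decision_values, labels, threshold):
    """Fraction of instances that receive a prediction, and accuracy on those."""
    preds = [predict_with_reject(d, threshold) for d in decision_values]
    accepted = [(p, y) for p, y in zip(preds, labels) if p is not None]
    coverage = len(accepted) / len(labels)
    if not accepted:
        return coverage, float("nan")
    accuracy = sum(p == y for p, y in accepted) / len(accepted)
    return coverage, accuracy

# Hypothetical held-out decision values and true labels (+1 / -1).
decision_values = [-2.1, -0.3, 0.2, 1.5, 0.9, -1.2]
labels = [-1, 1, 1, 1, -1, -1]
narrow = coverage_and_accuracy(decision_values, labels, threshold=0.5)
wide = coverage_and_accuracy(decision_values, labels, threshold=1.0)
```

Widening the rejection band trades coverage for accuracy on the instances that are still classified, which is why the choice of threshold matters.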

    Confidence and Venn Machines and Their Applications to Proteomics

    When a prediction is made in a classification or regression problem, it is useful to have additional information on how reliable this individual prediction is. Such predictions complemented with the additional information are also expected to be valid, i.e., to have a guarantee on the outcome. Recently developed frameworks of confidence machines, category-based confidence machines and Venn machines allow us to address these problems: confidence machines complement each prediction with its confidence and output region predictions with a guaranteed asymptotic error rate; Venn machines output multiprobability predictions which are valid with respect to observed frequencies. Another advantage of these frameworks is that they are based only on the i.i.d. assumption and do not depend on the probability distribution of examples. This thesis is devoted to further development of these frameworks. Firstly, novel designs and implementations of confidence machines and Venn machines are proposed. These implementations are based on random forest and support vector machine classifiers and inherit their ability to predict with high accuracy on a certain type of data. Experimental testing is carried out. Secondly, several algorithms with online validity are designed for proteomic data analysis. These algorithms take into account the nature of mass spectrometry experiments and special features of the data analysed. They also allow us to address medical problems: to make early diagnosis of diseases and to identify potential biomarkers. An extensive experimental study is performed on the UK Collaborative Trial of Ovarian Cancer Screening data sets. Finally, in theoretical research we extend the class of algorithms which output valid predictions in the online mode: we develop a new method of constructing valid prediction intervals for a statistical model different from the standard i.i.d. assumption used in confidence and Venn machines.
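
The multiprobability output of a Venn machine can be illustrated with a small sketch. It assumes a 1-nearest-neighbour taxonomy (an example's category is the label of its nearest other example), which is a common textbook choice rather than the random forest or SVM designs proposed in the thesis; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def nearest_label(i, data):
    """Label of the example nearest to data[i], excluding data[i] itself."""
    xi = data[i][0]
    j = min((k for k in range(len(data)) if k != i),
            key=lambda k: math.dist(xi, data[k][0]))
    return data[j][1]

def venn_predict(x, train, labels):
    """One label distribution per hypothetical label assigned to x."""
    result = {}
    for y in labels:
        data = train + [(x, y)]  # tentatively give x the label y
        cats = [nearest_label(i, data) for i in range(len(data))]
        test_cat = cats[-1]
        # Empirical label frequencies inside the test example's category.
        members = [data[i][1] for i in range(len(data)) if cats[i] == test_cat]
        counts = Counter(members)
        result[y] = {lab: counts[lab] / len(members) for lab in labels}
    return result

def probability_interval(result, label):
    """Lower and upper probability of label across all hypotheses."""
    probs = [dist[label] for dist in result.values()]
    return min(probs), max(probs)

# Hypothetical toy data: (features, label).
train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
         ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1)]
result = venn_predict((1.5, 1.5), train, labels=[0, 1])
interval_for_0 = probability_interval(result, 0)
```

The spread between the lower and upper probability reflects how much the tentative label of the test example can shift the frequencies, so it narrows as the categories contain more examples.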

    Systems Biology Knowledgebase for a New Era in Biology: A Genomics:GTL Report from the May 2008 Workshop


    Burkitt lymphoma classification and MYC-associated non-Burkitt lymphoma investigation based on gene expression

    Burkitt lymphoma and diffuse large B-cell lymphoma are two closely related types of lymphoma that are managed differently in clinical practice, so accurate diagnosis is key to treatment decisions. However, under current criteria that combine morphological, immunophenotypic and genetic characteristics, a significant number of cases exhibit overlapping features, making diagnosis and treatment decisions difficult. In particular, the prognosis has been reported to be significantly unfavourable in a subset of cases that are initially diagnosed as diffuse large B-cell lymphoma but bear a MYC gene translocation, which is a defining feature of Burkitt lymphoma but can also be found in other lymphomas. Despite the adverse effect of MYC in aggressive lymphomas other than Burkitt lymphoma, the underlying mechanism and effective treatment are still unclear. Recent technological advances have made it possible to simultaneously investigate an enormous number of bio-molecules, and the scientific fields associated with measuring molecular data in such a high-throughput way are usually called “omics”. For example, genomics assesses thousands of DNA sequences and transcriptomics assays large numbers of transcripts in a single experiment. These techniques, together with the rapidly emerging analytical methods in bioinformatics, have brought cancer research into a new era. The growing amount of omics data has significantly influenced the understanding of lymphomas and holds great promise for classifying subtypes and predicting treatment responses, which will eventually lead to personalized therapy. In this study, we investigate the discrimination of Burkitt lymphoma and diffuse large B-cell lymphoma based on DNA microarray gene expression data, which has contributed most to the molecular classification of lymphoma subtypes in the last decade.
On the basis of two previous research-level gene expression profiling classifiers, we developed a robust classifier that works effectively on different platforms and on the formalin-fixed paraffin-embedded samples commonly used in routine clinical practice. Validation of the classifier on samples from clinical patients achieves high agreement with the diagnosis made in a central haematopathology laboratory, and provides a potential indication of outcome in patients presenting intermediate features. In addition, we explore the role of MYC in the above lymphomas. Our investigation emphasizes the adverse impact of high-level MYC mRNA expression on patients’ outcome, and the functional analysis of genes associated with high MYC expression shows significantly enriched molecular mechanisms of proliferation and metabolic process. Moreover, the gene PRMT5 is found to be highly correlated with MYC expression, which suggests a possible therapeutic target for treatment.