4 research outputs found

    Generating ‘Omic Knowledge’: The Role of Informatics in High Content Screening

    High Content Screening (HCS) and High Content Analysis (HCA) have emerged over the past 10 years as powerful technologies for both drug discovery and systems biology. Founded on the automated, quantitative image analysis of fluorescently labeled cells or engineered cell lines, HCS provides unparalleled levels of multi-parameter data on cellular events. It is being widely adopted, with great benefits, in many areas of life science, from gaining a better understanding of disease processes, through building better models of toxicity, to generating systems views of cellular processes. This paper looks at the role of informatics and bioinformatics in both enabling and driving HCS to further our understanding of the genome and the cellome, and looks into the future to see where such deep knowledge could take us.

    Data mining of many-attribute data: investigating the interaction between feature selection strategy and statistical features of datasets

    Many datasets have a very large number of attributes (e.g. many thousands). Such datasets can cause many problems for machine learning methods. Various feature selection (FS) strategies have been developed to address these problems. The idea of an FS strategy is to reduce the number of features in a dataset (e.g. from many thousands to a few hundred) so that machine learning and/or statistical analysis can be done much more quickly and effectively. FS strategies attempt to select the features that are most important for the machine learning task at hand. The work presented in this dissertation concerns the comparison between several popular feature selection strategies and, in particular, the interaction between feature selection strategy and simple statistical features of the dataset. The basic hypothesis, not investigated before, is that the correct choice of FS strategy for a particular dataset should be based on at least a simple statistical analysis of the dataset. First, we examined the performance of several strategies on a selection of datasets. The strategies examined were: four widely used FS strategies (Correlation, ReliefF, Evolutionary Algorithm, no-feature-selection), several feature bias (FB) strategies (in which the machine learning method considers all features, but makes use of bias values suggested by the FB strategy), and combinations of FS and FB strategies. The results showed that FB methods displayed strong capability on some datasets and that combined strategies were also often successful. Examining these results, we noted that patterns of performance were not immediately understandable. This led to the above hypothesis (one of the main contributions of the thesis) that statistical features of the dataset are an important consideration when choosing an FS strategy. We then investigated this hypothesis with several further experiments.
    Analysis of the results revealed that a simple statistical feature of a dataset, one that can be easily pre-calculated, has a clear relationship with the performance of certain FS methods, and a similar relationship with differences in performance between certain pairs of FS strategies. In particular, Correlation-based FS (CFS) is a very widely used FS technique based on the hypothesis that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. From an analysis of the outcome of several FS strategies on different artificial datasets, the experiments suggest that CFS is never the best choice for poorly correlated data. Finally, considering several methods, we suggest tentative guidelines for choosing an FS strategy based on simply calculated measures of the dataset. (Silang Luo, PHD-06-2009)
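
    The CFS hypothesis stated in the abstract above (features highly correlated with the class, yet uncorrelated with each other) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes numeric features and a numeric class label, uses absolute Pearson correlation as the correlation measure, scores a subset with the commonly cited CFS merit heuristic k·r_cf / sqrt(k + k(k−1)·r_ff), and searches greedily by forward selection. The function names `cfs_merit` and `greedy_cfs` are hypothetical.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """Merit of a feature subset under the CFS heuristic.

    Assumption: correlation is measured as absolute Pearson r.
    Constant columns would produce NaN and are not handled here.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    Xs = X[:, subset]
    # Mean absolute feature-class correlation (r_cf).
    r_cf = np.mean([abs(np.corrcoef(Xs[:, i], y)[0, 1]) for i in range(k)])
    if k == 1:
        return float(r_cf)
    # Mean absolute feature-feature correlation (r_ff), diagonal excluded.
    corr = np.abs(np.corrcoef(Xs, rowvar=False))
    r_ff = (corr.sum() - np.trace(corr)) / (k * (k - 1))
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


def greedy_cfs(X, y, max_features=None):
    """Forward selection: add the feature that most improves the merit,
    and stop as soon as no candidate improves it."""
    n_features = X.shape[1]
    limit = max_features or n_features
    selected, best = [], 0.0
    while len(selected) < limit:
        candidates = [j for j in range(n_features) if j not in selected]
        scores = [cfs_merit(X, y, selected + [j]) for j in candidates]
        j_best = int(np.argmax(scores))
        if scores[j_best] <= best:
            break
        selected.append(candidates[j_best])
        best = scores[j_best]
    return selected, best
```

    Under these assumptions, a subset of two informative but mutually independent features scores higher than a subset of two near-duplicate features, because the redundancy term r_ff in the denominator penalises the duplicates. When all feature-class correlations are weak, r_cf is small for every subset and the merit gives little signal to rank them, which is consistent with the abstract's observation about poorly correlated data.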