
    Essential guidelines for computational method benchmarking

    In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.

    Prediction of ‘Nules Clementine’ mandarin susceptibility to rind breakdown disorder using Vis/NIR spectroscopy

    The use of diffuse reflectance visible and near infrared (Vis/NIR) spectroscopy was explored as a non-destructive technique to predict ‘Nules Clementine’ mandarin fruit susceptibility to rind breakdown (RBD) disorder by detecting rind physico-chemical properties of 80 intact fruit harvested from different canopy positions. Vis/NIR spectra were obtained using a LabSpec® spectrophotometer. Reference physico-chemical data of the fruit were obtained after 8 weeks of storage at 8 °C using conventional methods and included RBD, hue angle, colour index, mass loss, rind dry matter, as well as carbohydrates (sucrose, glucose, fructose, total carbohydrates), and total phenolic acid concentrations. Principal component analysis (PCA) was applied to analyse spectral data to identify clusters in the PCA score plots and outliers. Partial least squares (PLS) regression was applied to spectral data after PCA to develop prediction models for each quality attribute. The spectra were subjected to a test set validation by dividing the data into calibration (n = 48) and test validation (n = 32) sets. An extra set of 40 fruit harvested from a different part of the orchard was used for external validation. PLS-discriminant analysis (PLS-DA) models were developed to sort fruit based on canopy position and RBD susceptibility. Fruit position within the canopy had a significant influence on rind biochemical properties. Outside fruit had higher rind carbohydrates, phenolic acids and dry matter content and lower RBD index than inside fruit. The data distribution in the PCA and PLS-DA models displayed four clusters that could easily be identified. These clusters allowed distinction between fruit from different preharvest treatments. NIR calibration and validation results demonstrated that colour index, dry matter, total carbohydrates and mass loss were predicted with significant accuracy, with residual predictive deviation (RPD) for prediction of 3.83, 3.58, 3.15 and 2.61, respectively. 
The good correlation between spectral information and carbohydrate content demonstrated the potential of Vis/NIR as a non-destructive tool to predict fruit susceptibility to RBD.
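    The residual predictive deviation (RPD) figures quoted above are the ratio of the standard deviation of the reference values to the prediction error. A minimal sketch of the metric (function and variable names are illustrative, not from the paper):

```python
import math

def rpd(reference, predicted):
    """Residual predictive deviation: standard deviation of the reference
    values divided by the root mean squared error of prediction (RMSEP)."""
    n = len(reference)
    mean_ref = sum(reference) / n
    # Sample standard deviation of the reference (laboratory) values.
    sd = math.sqrt(sum((y - mean_ref) ** 2 for y in reference) / (n - 1))
    # RMSEP of the model predictions against those reference values.
    rmsep = math.sqrt(sum((y - p) ** 2 for y, p in zip(reference, predicted)) / n)
    return sd / rmsep
```

    In the NIR literature an RPD above roughly 3 is conventionally read as a model fit for quantitative prediction, which is why the colour index and dry matter models (RPD 3.83 and 3.58) stand out.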

    Distributed incremental fingerprint identification with reduced database penetration rate using a hierarchical classification based on feature fusion and selection

    Fingerprint recognition has been a hot research topic over the past few decades, with many applications and ever-growing populations to identify. The need for flexible, fast identification systems is therefore evident in such situations. In this context, fingerprint classification is commonly used to improve the speed of identification. This paper proposes a complete identification system with a hierarchical classification framework that fuses the information of multiple feature extractors. Feature selection is applied to improve the classification accuracy. Finally, the distributed identification is carried out with an incremental search, exploring the classes according to the probability order given by the classifier. A single parameter tunes the trade-off between identification time and accuracy. The proposal is evaluated over two NIST databases and a large synthetic database, yielding penetration rates close to the optimal values that can be reached with classification, leading to low identification times with small or no accuracy loss.
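    The incremental, probability-ordered search described above can be sketched as a toy model, assuming a per-class template database and a matcher function; all names and the threshold parameter are illustrative, not from the paper:

```python
def incremental_identification(probe_class_probs, db_by_class, match_score,
                               threshold=0.8):
    """Explore fingerprint classes in descending probability order and stop
    as soon as a sufficiently good match is found. `threshold` is the single
    parameter trading identification time (database penetration) for accuracy."""
    order = sorted(probe_class_probs, key=probe_class_probs.get, reverse=True)
    visited = 0
    best = (None, 0.0)  # (template id, best score so far)
    for cls in order:
        for template_id, template in db_by_class.get(cls, []):
            visited += 1
            score = match_score(template)
            if score > best[1]:
                best = (template_id, score)
        if best[1] >= threshold:
            break  # early exit keeps the penetration rate low
    return best, visited
```

    Lowering the threshold visits fewer classes (faster, lower penetration rate) at the risk of missing the true mate in a later, less probable class.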

    Finding class C GPCR subtype-discriminating n-grams through feature selection

    G protein-coupled receptors (GPCRs) are a large and heterogeneous superfamily of receptors that are key cell players for their role as extracellular signal transmitters. Class C GPCRs, in particular, are of great interest in pharmacology. The lack of knowledge about their full 3-D structure prompts the use of their primary amino acid sequences for the construction of robust classifiers, capable of discriminating their different subtypes. In this paper, we describe the use of feature selection techniques to build Support Vector Machine (SVM)-based classification models from selected receptor subsequences described as n-grams. We show that this approach to classification is useful for finding class C GPCR subtype-specific motifs.
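    Describing a primary sequence as n-gram counts, and ranking n-grams by how differently they occur in two subtypes, can be sketched as follows. The frequency-difference ranking here is a simplified stand-in for the paper's actual feature selection; the helper names are illustrative:

```python
from collections import Counter

def ngram_features(sequence, n=3):
    """Count amino-acid n-grams in a primary sequence; these counts are the
    features an SVM-based subtype classifier would be trained on."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def discriminative_ngrams(class_a_seqs, class_b_seqs, n=3, top=5):
    """Rank n-grams by absolute frequency difference between two subtypes,
    a crude proxy for subtype-discriminating motifs."""
    ca, cb = Counter(), Counter()
    for s in class_a_seqs:
        ca.update(ngram_features(s, n))
    for s in class_b_seqs:
        cb.update(ngram_features(s, n))
    keys = sorted(set(ca) | set(cb))  # sorted for deterministic tie-breaking
    return sorted(keys, key=lambda k: abs(ca[k] - cb[k]), reverse=True)[:top]
```

    The top-ranked n-grams would then be kept as the reduced feature set fed to the classifier.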

    Digging into acceptor splice site prediction : an iterative feature selection approach

    Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction. We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights in the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature. The results described in this paper contribute both to the domain of gene prediction, and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
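    One plausible reading of such an AG-scanning feature, counting AG dinucleotides in the intronic sequence upstream of a candidate acceptor site (the spliceosome is thought to scan for the first AG downstream of the branch point), could be sketched as; this is an interpretation for illustration, not the paper's definition:

```python
def ag_scanning(upstream):
    """Count AG dinucleotides in the sequence upstream of a candidate
    acceptor site; more upstream AGs suggests the candidate is less likely
    to be the first AG reached by a scanning mechanism."""
    return sum(1 for i in range(len(upstream) - 1) if upstream[i:i + 2] == "AG")
```

    Such a count is cheap to compute per candidate site and can be appended to positional nucleotide features before feature selection.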

    Determining appropriate approaches for using data in feature selection

    Feature selection is increasingly important in data analysis and machine learning in the big data era. However, how to use the data in feature selection, i.e. using either ALL or PART of a dataset, has become a serious and tricky issue. Whilst the conventional practice of using all the data in feature selection may lead to selection bias, using part of the data may, on the other hand, lead to underestimating the relevant features under some conditions. This paper investigates these two strategies systematically in terms of reliability and effectiveness, and then determines their suitability for datasets with different characteristics. The reliability is measured by the Average Tanimoto Index and the Inter-method Average Tanimoto Index, and the effectiveness is measured by the mean generalisation accuracy of classification. The computational experiments are carried out on ten real-world benchmark datasets and fourteen synthetic datasets. The synthetic datasets are generated with a pre-set number of relevant features and varied numbers of irrelevant features and instances, and added with different levels of noise. The results indicate that the PART approach is more effective in reducing the bias when the size of a dataset is small but starts to lose its advantage as the dataset size increases.
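    The reliability measure can be sketched as the mean pairwise Tanimoto (Jaccard) similarity between the feature subsets selected across repeated runs; higher values mean more stable selection. A minimal sketch, with illustrative names:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature subsets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def average_tanimoto_index(subsets):
    """Mean pairwise Tanimoto similarity over the feature subsets selected
    in repeated feature-selection runs (e.g. across resamples)."""
    pairs = [(i, j) for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    return sum(tanimoto(subsets[i], subsets[j]) for i, j in pairs) / len(pairs)
```

    The inter-method variant applies the same averaging to subsets produced by different selectors rather than different runs of one selector.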

    Machine learning for automatic prediction of the quality of electrophysiological recordings

    The quality of electrophysiological recordings varies considerably due to technical and biological variability, and neuroscientists inevitably have to select “good” recordings for further analyses. This procedure is time-consuming and prone to selection biases. Here, we investigate replacing human decisions by a machine learning approach. We define 16 features, such as spike height and width, select the most informative ones using a wrapper method and train a classifier to reproduce the judgement of one of our expert electrophysiologists. Generalisation performance is then assessed on unseen data, classified by the same or by another expert. We observe that the learning machine can be as consistent in its judgements as individual experts are with one another, if not more so. Best performance is achieved for a limited number of informative features; the optimal feature set being different from one data set to another. With 80–90% of correct judgements, the performance of the system is very promising within the data sets of each expert, but judgements are less reliable when it is used across sets of recordings from different experts. We conclude that the proposed approach is relevant to the selection of electrophysiological recordings, provided parameters are adjusted to different types of experiments and to individual experimenters.
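    A wrapper method of the kind described, greedy forward selection scored by agreement with the expert's labels, might look like the sketch below. The scoring callback is an assumption standing in for cross-validated classifier accuracy; all names are illustrative:

```python
def forward_wrapper_select(features, score):
    """Greedy forward wrapper selection: repeatedly add the single feature
    that most improves `score` (e.g. cross-validated agreement with expert
    judgements), stopping when no feature improves it."""
    selected, best = [], score([])
    remaining = list(features)
    improved = True
    while improved and remaining:
        improved = False
        # Try each remaining feature appended to the current subset.
        top_score, top_f = max((score(selected + [f]), f) for f in remaining)
        if top_score > best:
            best = top_score
            selected.append(top_f)
            remaining.remove(top_f)
            improved = True
    return selected, best
```

    Because the score is re-evaluated per candidate subset, the selected set adapts to each data set, consistent with the observation that the optimal feature set differs between data sets.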

    Nondestructive measurement of fruit and vegetable quality

    We review nondestructive techniques for measuring internal and external quality attributes of fruit and vegetables, such as color, size and shape, flavor, texture, and absence of defects. The different techniques are organized according to their physical measurement principle. We first describe each technique and then list some examples. As many of these techniques rely on mathematical models and particular data processing methods, we discuss these where needed. We pay particular attention to techniques that can be implemented online in grading lines.

    Computational flow cytometry as a diagnostic tool in suspected-myelodysplastic syndromes

    The diagnostic work-up of patients suspected of having myelodysplastic syndromes (MDS) is challenging and relies mainly on bone marrow morphology and cytogenetics. In this study, we developed and prospectively validated a fully computational tool for flow cytometry (FC) diagnostics in suspected MDS. The computational diagnostic workflow consists of methods for pre-processing flow cytometry data, followed by a cell population detection method (FlowSOM) and a machine learning classifier (Random Forest). Based on a six-tube FC panel, the workflow obtained 90% sensitivity and 93% specificity in an independent validation cohort. For practical advantages (e.g., reduced processing time and costs), a second computational diagnostic workflow was trained, based solely on the best-performing single tube of the training cohort. This workflow obtained 97% sensitivity and 95% specificity in the prospective validation cohort. Both workflows outperformed the conventional, expert-analyzed flow cytometry scores for diagnosis with respect to accuracy, objectivity and time investment (less than 2 min per patient).
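    The two-stage idea, cluster the cell events, then classify each patient by their per-cluster abundances, can be caricatured without the actual FlowSOM self-organising map: here simple nearest-centroid assignment stands in for the SOM, and the resulting abundance vector is what a downstream Random Forest would consume. A deliberately simplified sketch:

```python
def cluster_abundance_vector(events, centroids):
    """Assign each cell event (a tuple of marker intensities) to its nearest
    cluster centroid and return per-cluster abundance fractions: the
    patient-level feature vector for a downstream classifier."""
    counts = [0] * len(centroids)
    for event in events:
        dists = [sum((x - c) ** 2 for x, c in zip(event, cen)) for cen in centroids]
        counts[dists.index(min(dists))] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

    In the real workflow the centroids come from clustering the training cohort, so validation patients are mapped onto the same populations before classification.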