377 research outputs found

    Determining appropriate approaches for using data in feature selection

    Feature selection is increasingly important in data analysis and machine learning in the big data era. However, how to use the data in feature selection, i.e. using either ALL or PART of a dataset, has become a serious and tricky issue. Whilst the conventional practice of using all the data in feature selection may lead to selection bias, using part of the data may, on the other hand, lead to underestimating the relevant features under some conditions. This paper investigates these two strategies systematically in terms of reliability and effectiveness, and then determines their suitability for datasets with different characteristics. The reliability is measured by the Average Tanimoto Index and the Inter-method Average Tanimoto Index, and the effectiveness is measured by the mean generalisation accuracy of classification. The computational experiments are carried out on ten real-world benchmark datasets and fourteen synthetic datasets. The synthetic datasets are generated with a pre-set number of relevant features and varied numbers of irrelevant features and instances, with different levels of noise added. The results indicate that the PART approach is more effective in reducing the bias when the size of a dataset is small but starts to lose its advantage as the dataset size increases.
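As a rough illustration of the reliability measure named in this abstract, the sketch below computes a Tanimoto (Jaccard) similarity between feature subsets and averages it over all pairs of selection runs. The abstract does not give the exact definition of the Average Tanimoto Index, so the pairwise-mean form, the function names, and the example subsets are all assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention (assumed): two empty subsets count as identical
    return len(a & b) / len(a | b)


def average_tanimoto(subsets):
    """Mean Tanimoto similarity over all unordered pairs of subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature subsets selected on three resamples of one dataset.
runs = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f2", "f3"}]
ati = average_tanimoto(runs)
print(f"average Tanimoto index: {ati:.3f}")
```

A value near 1 would mean the selected subsets barely change between runs; a value near 0 would mean they share almost no features.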

    Machine learning for automatic prediction of the quality of electrophysiological recordings

    The quality of electrophysiological recordings varies considerably due to technical and biological variability, and neuroscientists inevitably have to select “good” recordings for further analyses. This procedure is time-consuming and prone to selection biases. Here, we investigate replacing human decisions by a machine learning approach. We define 16 features, such as spike height and width, select the most informative ones using a wrapper method and train a classifier to reproduce the judgement of one of our expert electrophysiologists. Generalisation performance is then assessed on unseen data, classified by the same or by another expert. We observe that the learning machine can be equally, if not more, consistent in its judgements as individual experts amongst each other. Best performance is achieved for a limited number of informative features, the optimal feature set being different from one data set to another. With 80–90% of correct judgements, the performance of the system is very promising within the data sets of each expert, but judgements are less reliable when it is used across sets of recordings from different experts. We conclude that the proposed approach is relevant to the selection of electrophysiological recordings, provided parameters are adjusted to different types of experiments and to individual experimenters.

    Asynchronous processing for latent fingerprint identification on heterogeneous CPU-GPU systems

    Latent fingerprint identification is one of the most essential identification procedures in criminal investigations. Addressing this task is challenging as (i) it requires analyzing massive databases in reasonable periods and (ii) it is commonly solved by combining different methods with very complex data dependencies, which make fully exploiting heterogeneous CPU-GPU systems very complex. Most efforts in this context focus on improving the accuracy of the approaches and neglect reducing the processing time. Indeed, the most accurate approach was designed for a single thread. This work introduces Asynchronous processing for Latent Fingerprint Identification (ALFI), the fastest methodology for latent fingerprint identification that maintains high accuracy. ALFI fully exploits all the resources of CPU-GPU systems using asynchronous processing and fine-coarse parallelism for analyzing massive databases. Our approach reduces idle times in processing and exploits the inherent parallelism of comparing latent fingerprints to fingerprint impressions. We analyzed the performance of ALFI on Linux and Windows operating systems using the well-known NIST/FVC databases. Experimental results reveal that ALFI is on average 22x faster than the state-of-the-art algorithm, reaching a value of 44.7x for the best-studied case.

    Computational flow cytometry as a diagnostic tool in suspected-myelodysplastic syndromes

    The diagnostic work-up of patients with suspected myelodysplastic syndromes (MDS) is challenging and mainly relies on bone marrow morphology and cytogenetics. In this study, we developed and prospectively validated a fully computational tool for flow cytometry diagnostics in suspected MDS. The computational diagnostic workflow consists of methods for pre-processing flow cytometry data, followed by a cell population detection method (FlowSOM) and a machine learning classifier (Random Forest). Based on a six-tube FC panel, the workflow obtained 90% sensitivity and 93% specificity in an independent validation cohort. For practical advantages (e.g., reduced processing time and costs), a second computational diagnostic workflow was trained, based solely on the best-performing single tube of the training cohort. This workflow obtained 97% sensitivity and 95% specificity in the prospective validation cohort. Both workflows outperformed the conventional, expert-analyzed flow cytometry scores for diagnosis with respect to accuracy, objectivity and time investment (less than 2 min per patient).
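For readers less familiar with the two figures reported in this abstract, the following sketch shows how sensitivity and specificity are derived from binary predictions. It is not the study's code; the function name and the label vectors are hypothetical and serve only to show the arithmetic.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = disease, 0 = no disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)  # fraction of true cases that are detected
    specificity = tn / (tn + fp)  # fraction of non-cases that are correctly ruled out
    return sensitivity, specificity


# Hypothetical validation labels and classifier outputs (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```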

    Classification of motor imagery tasks for BCI with multiresolution analysis and multiobjective feature selection

    Background: Brain-computer interfacing (BCI) applications based on the classification of electroencephalographic (EEG) signals require solving high-dimensional pattern classification problems with such a relatively small number of training patterns that curse of dimensionality problems usually arise. Multiresolution analysis (MRA) has useful properties for signal analysis in both temporal and spectral analysis, and has been broadly used in the BCI field. However, MRA usually increases the dimensionality of the input data. Therefore, some approaches to feature selection or feature dimensionality reduction should be considered for improving the performance of MRA-based BCI. Methods: This paper investigates feature selection in MRA-based frameworks for BCI. Several wrapper approaches to evolutionary multiobjective feature selection are proposed with different structures of classifiers. They are evaluated by comparison with baseline methods that use a sparse representation of features or no feature selection. Results and conclusion: The statistical analysis, by applying the Kolmogorov-Smirnov and Kruskal-Wallis tests to the means of the Kappa values evaluated by using the test patterns in each approach, has demonstrated some advantages of the proposed approaches. In comparison with the baseline MRA approach used in previous studies, the proposed evolutionary multiobjective feature selection approaches provide similar or even better classification performance, with a significant reduction in the number of features that need to be computed.
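The "Kappa values" mentioned in this abstract are not defined there. Assuming they refer to Cohen's kappa, a common chance-corrected agreement score in BCI evaluation, a minimal sketch of the computation could look as follows. The function name and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of patterns labelled identically.
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    # Expected agreement if the two labelings were independent.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case (assumed convention): a single class everywhere
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical test-pattern labels for a two-class motor imagery task.
kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])
print(f"kappa = {kappa:.2f}")
```

Unlike raw accuracy, kappa is 0 for a classifier that agrees with the labels only as often as chance would predict, which makes it convenient for comparing approaches across class distributions.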

    Regularized logistic regression and multi-objective variable selection for classifying MEG data

    This paper addresses the question of maximizing classifier accuracy for classifying task-related mental activity from magnetoencephalography (MEG) data. We propose the use of different sources of information and introduce an automatic channel selection procedure. To determine an informative set of channels, our approach combines a variety of machine learning algorithms: feature subset selection methods, classifiers based on regularized logistic regression, information fusion, and multiobjective optimization based on probabilistic modeling of the search space. The experimental results show that our proposal is able to improve classification accuracy compared to approaches whose classifiers use only one type of MEG information or for which the set of channels is fixed a priori.

    The role of chloroplast movement in C4 photosynthesis: a theoretical analysis using a three-dimensional reaction-diffusion model for maize

    18 pages. Chloroplast movement within mesophyll cells in C4 plants is hypothesized to enhance the CO2 concentrating mechanism, but this is difficult to verify experimentally. A three-dimensional (3D) leaf model can help analyse how chloroplast movement influences the operation of the CO2 concentrating mechanism. The first volumetric reaction-diffusion model of C4 photosynthesis that incorporates detailed 3D leaf anatomy, light propagation, ATP and NADPH production, and CO2, O2 and bicarbonate concentration driven by diffusional and assimilation/emission processes was developed. It was implemented for maize leaves to simulate various chloroplast movement scenarios within mesophyll cells: the movement of all mesophyll chloroplasts towards bundle sheath cells (aggregative movement) and movement of only those of interveinal mesophyll cells towards bundle sheath cells (avoidance movement). Light absorbed by bundle sheath chloroplasts relative to mesophyll chloroplasts increased in both cases. Avoidance movement decreased light absorption by mesophyll chloroplasts considerably. Consequently, total ATP and NADPH production and net photosynthetic rate increased for aggregative movement and decreased for avoidance movement compared with the default case of no chloroplast movement at high light intensities. Leakiness increased in both chloroplast movement scenarios due to the imbalance in energy production and demand in mesophyll and bundle sheath cells. These results suggest the need to design strategies for coordinated increases in electron transport and Rubisco activities for an efficient CO2 concentrating mechanism at very high light intensities. The work is supported by the Research Council of KU Leuven (project C1/16/002) and the Research Fund Flanders (project G.0645.13). Wageningen-based authors have contributed to this work within the program BioSolar Cells. FJC was funded through the Spanish fellowship Ramon y Cajal (RYC2021-035064-I). Peer reviewed.

    Feature selection in the reconstruction of complex network representations of spectral data

    Complex networks have been extensively used in the last decade to characterize and analyze complex systems, and they have recently been proposed as a novel instrument for the analysis of spectra extracted from biological samples. Yet, the high number of measurements composing spectra, and the consequent high computational cost, make a direct network analysis unfeasible. We here present a comparative analysis of three customary feature selection algorithms, including the binning of spectral data and the use of information theory metrics. Such algorithms are compared by assessing the score obtained in a classification task, where healthy subjects and people suffering from different types of cancers should be discriminated. Results indicate that a feature selection strategy based on Mutual Information outperforms the more classical data binning, while allowing a reduction of the dimensionality of the data set by two orders of magnitude.
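To make the idea of Mutual Information based feature selection more concrete, here is a small sketch that scores discretised features by their mutual information with the class label and ranks them. It is a generic textbook estimator, not the procedure used in the paper; the function name, feature names, and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum of p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Hypothetical discretised spectral features and class labels (0 = healthy, 1 = cancer).
labels = [0, 0, 1, 1]
features = {
    "informative_bin": [0, 0, 1, 1],    # perfectly tracks the label
    "uninformative_bin": [0, 1, 0, 1],  # independent of the label
}
scores = {name: mutual_information(vals, labels) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Keeping only the top-ranked features is one simple way to obtain the kind of dimensionality reduction the abstract describes, although real spectra would first need a discretisation step that this sketch takes for granted.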

    GC content of early metazoan genes and its impact on gene expression levels in mammalian cell lines

    With genomes available for many animal clades, including the early-branching metazoans, one can readily study the functional conservation of genes across a diversity of animal lineages. Ectopic expression of an animal protein in, for instance, a mammalian cell line is a commonly used strategy in structure–function analysis. However, this might turn out to be problematic in the case of distantly related species. Here we analyzed the GC content of the coding sequences of basal animals and show its impact on gene expression levels in human cell lines, and, importantly, how this expression efficiency can be improved. Optimization of the GC3 content in the coding sequences of cadherin, alpha-catenin, and paracaspase of Trichoplax adhaerens dramatically increased the expression of these basal animal genes in human cell lines.
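GC3 content is conventionally the fraction of codons whose third position is G or C. The sketch below computes it for an in-frame coding sequence. Whether the study included the stop codon or handled ambiguous bases differently is not stated in the abstract, so those choices, the function name, and the toy sequence are assumptions.

```python
def gc3_content(cds):
    """Fraction of codons in an in-frame coding sequence whose third base is G or C."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be non-empty with length divisible by 3")
    third_positions = cds[2::3]  # bases at index 2, 5, 8, ... (third codon positions)
    gc = sum(1 for base in third_positions if base in "GC")
    return gc / len(third_positions)


# Toy sequence (not from the study): codons ATG GCC AAA TAA, stop codon included.
gc3 = gc3_content("ATGGCCAAATAA")
print(f"GC3 = {gc3:.2f}")
```

Because synonymous codons mostly differ at the third position, GC3 can be raised by swapping codons without changing the encoded protein, which is the general idea behind the optimization the abstract reports.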