
    Wisdom of the crowd from unsupervised dimension reduction

    Wisdom of the crowd, the collective intelligence derived from the responses of multiple human or machine individuals to the same questions, can be more accurate than any single individual and can improve social decision-making and prediction accuracy. The same principle can integrate multiple programs or datasets, each treated as an individual, for the same predictive questions. Crowd wisdom methods estimate each individual's independent error level, arising from their limited knowledge, and find the crowd consensus that minimizes the overall error. However, previous studies have built isolated, problem-specific models with limited generalizability, mainly for binary (yes/no) responses. Here we show with simulation and real-world data that the crowd wisdom problem is analogous to one-dimensional unsupervised dimension reduction in machine learning. This analogy provides a natural class of crowd wisdom solutions, such as principal component analysis and Isomap, which can handle binary as well as continuous responses (such as confidence levels) and consequently can be more accurate than existing solutions. They can even outperform supervised-learning-based collective intelligence calibrated on the historical performance of individuals, e.g. penalized linear regression and random forest. This study unifies crowd wisdom and unsupervised dimension reduction, and thereupon introduces a broad range of high-performing and widely applicable crowd wisdom methods. As the costs of data acquisition and processing rapidly decrease, this study will promote and guide crowd wisdom applications in the social and natural sciences, including data fusion, meta-analysis, crowd-sourcing, and committee decision making.
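
    A minimal sketch of the paper's core idea, assuming a synthetic setting: individuals answer the same continuous questions with different noise levels, and the one-dimensional embedding from principal component analysis is taken as the crowd consensus. The matrix sizes and noise model below are illustrative, not taken from the paper.

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        truth = rng.normal(size=50)                  # hidden ground truth, one value per question
        noise = rng.normal(size=(10, 50)) * rng.uniform(0.2, 2.0, size=(10, 1))
        responses = truth + noise                    # 10 individuals answer 50 questions

        # Treat each question's vector of 10 answers as a point and reduce it
        # to one dimension; the 1-D scores serve as the consensus estimate.
        pca = PCA(n_components=1)
        consensus = pca.fit_transform(responses.T).ravel()

        # PCA is sign-indeterminate, so align the consensus with the mean answer.
        if np.corrcoef(consensus, responses.mean(axis=0))[0, 1] < 0:
            consensus = -consensus
        print(np.corrcoef(consensus, truth)[0, 1])   # compare against the plain mean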

    Four simple recommendations to encourage best practices in research software [version 1; referees: awaiting peer review]

    Scientific research relies on computer software, yet software is not always developed following practices that ensure its quality and sustainability. This manuscript does not aim to propose new software development best practices, but rather to provide simple recommendations that encourage the adoption of existing best practices. Software development best practices promote better quality software, and better quality software improves the reproducibility and reusability of research. These recommendations are designed around Open Source values and provide practical suggestions that contribute to making research software and its source code more discoverable, reusable and transparent. This manuscript is aimed at developers, but also at organisations, projects, journals and funders that can increase the quality and sustainability of research software by encouraging the adoption of these recommendations.

    Deep Randomized Neural Networks

    Randomized Neural Networks explore the behavior of neural systems where the majority of connections are fixed, either in a stochastic or a deterministic fashion. Typical examples of such systems are multi-layered neural network architectures where the connections to the hidden layer(s) are left untrained after initialization. Limiting the training algorithms to operate on a reduced set of weights endows the class of Randomized Neural Networks with a number of intriguing features. Among them, the extreme efficiency of the resulting learning processes is a striking advantage over fully trained architectures. Moreover, despite the involved simplifications, randomized neural systems possess remarkable properties both in practice, achieving state-of-the-art results in multiple domains, and in theory, allowing one to analyze intrinsic properties of neural architectures (e.g. before training of the hidden layers' connections). In recent years, the study of Randomized Neural Networks has been extended towards deep architectures, opening new research directions for the design of effective yet extremely efficient deep learning models in vectorial as well as in more complex data domains. This chapter surveys the major aspects of the design and analysis of Randomized Neural Networks, together with some of the key results on their approximation capabilities. In particular, we first introduce the fundamentals of randomized neural models in the context of feed-forward networks (i.e., Random Vector Functional Link and equivalent models) and convolutional filters, before moving to the case of recurrent systems (i.e., Reservoir Computing networks). For both, we focus specifically on recent results in the domain of deep randomized systems and, for recurrent models, their application to structured domains.
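
    A minimal sketch of a randomized feed-forward model in the Random Vector Functional Link spirit (direct input-output links omitted for brevity): the input-to-hidden weights are drawn once and frozen, and only the linear readout is trained, here by ridge regression in closed form. All sizes and data are toy stand-ins.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))                # toy inputs
        y = np.sin(X).sum(axis=1)                    # toy regression target

        n_hidden, reg = 300, 1e-3
        W = rng.normal(size=(5, n_hidden))           # fixed random projection, never trained
        b = rng.normal(size=n_hidden)
        H = np.tanh(X @ W + b)                       # untrained hidden representation

        # Only the readout is learned: beta = (H^T H + reg*I)^(-1) H^T y
        beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
        print(np.mean((H @ beta - y) ** 2))          # training mean squared error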

    Easy and efficient ensemble gene set testing with EGSEA.

    Gene set enrichment analysis is a popular approach for prioritising the biological processes perturbed in genomic datasets. The Bioconductor project hosts over 80 software packages capable of gene set analysis. Most of these packages search for enriched signatures amongst differentially regulated genes to reveal higher-level biological themes that may be missed when focusing only on evidence from individual genes. With so many different methods on offer, choosing the best algorithm and visualization approach can be challenging. The EGSEA package solves this problem by combining results from up to 12 prominent gene set testing algorithms to obtain a consensus ranking of biologically relevant results. This workflow demonstrates how EGSEA can extend limma-based differential expression analyses for RNA-seq and microarray data, using experiments that profile three distinct cell populations important for studying the origins of breast cancer. Following data normalization and set-up of an appropriate linear model for differential expression analysis, EGSEA builds gene-signature-specific indexes that link a wide range of mouse or human gene set collections obtained from MSigDB, GeneSetDB and KEGG to the gene expression data being investigated. EGSEA is then configured and the ensemble enrichment analysis run, returning an object that can be queried using several S4 methods for ranking gene sets and visualizing results via heatmaps, KEGG pathway views, GO graphs, scatter plots and bar plots. Finally, an HTML report that combines these displays can fast-track the sharing of results with collaborators and thus expedite downstream biological validation. EGSEA is simple to use and can be easily integrated with existing gene expression analysis pipelines for both human and mouse data.
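
    EGSEA itself is an R/Bioconductor package; the following is a language-agnostic sketch of the ensemble idea it implements, combining the rankings produced by several gene set testing methods into one consensus by averaging ranks. The method names, gene set names and ranks below are hypothetical.

        import numpy as np

        gene_sets = ["apoptosis", "cell_cycle", "wnt_signaling", "dna_repair"]
        rankings = {                                 # rank of each set under each method
            "camera": [2, 1, 4, 3],
            "roast":  [1, 2, 3, 4],
            "gsva":   [2, 1, 3, 4],
        }
        ranks = np.array(list(rankings.values()), dtype=float)
        consensus = ranks.mean(axis=0)               # lower average rank = stronger evidence
        for i in np.argsort(consensus):
            print(gene_sets[i], consensus[i])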

    A guide to creating design matrices for gene expression experiments.

    Differential expression analysis of genomic data types, such as RNA-sequencing experiments, uses linear models to determine the size and direction of changes in gene expression. For RNA-sequencing, there are several established software packages for this purpose, accompanied by well-described analysis pipelines. However, two crucial steps in the analysis process can be a stumbling block for many: setting up an appropriate model via design matrices and setting up the comparisons of interest via contrast matrices. These steps are particularly troublesome because an extensive catalogue of design and contrast matrices does not currently exist; one must usually search for example case studies across different platforms and mix and match the advice from those sources to suit the dataset at hand. This article guides the reader through the basics of how to set up design and contrast matrices. We take a practical approach by providing code and a graphical representation for each case study, starting with simpler examples (e.g. models with a single explanatory variable) and moving on to more complex ones (e.g. interaction models, mixed effects models, higher-order time series and cyclical models). Although our work has been written specifically with a limma-style pipeline in mind, most of it is also applicable to other software packages for differential expression analysis, and the ideas covered can be adapted to data analysis of other high-throughput technologies. Where appropriate, we explain the interpretation of, and differences between, models to aid readers in their own model choices. Unnecessary jargon and theory are omitted where possible so that our work is accessible to a wide audience of readers, from beginners to those with experience in genomics data analysis.
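
    A minimal numpy sketch of the two steps the article covers, using a single explanatory variable: a means-model design matrix with one indicator column per cell group, and a contrast vector comparing two group means. The group labels and expression values are illustrative; a limma pipeline would build the equivalents with model.matrix() and makeContrasts() in R.

        import numpy as np

        groups = np.array(["basal", "luminal", "basal", "ML", "luminal", "ML"])
        levels = ["basal", "luminal", "ML"]

        # Means-model design: one 0/1 indicator column per group, no intercept.
        design = np.column_stack([(groups == g).astype(float) for g in levels])

        # Contrast extracting the luminal - basal difference of fitted means.
        contrast = np.array([-1.0, 1.0, 0.0])

        y = np.array([1.0, 2.5, 1.2, 3.0, 2.7, 3.1])  # toy expression, one gene
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        print(contrast @ coef)                        # estimated luminal - basal change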

    Disulfide connectivity prediction with extreme learning machines

    Our paper emphasizes the relevance of the Extreme Learning Machine (ELM) in bioinformatics applications by addressing the problem of predicting disulfide connectivity from protein sequences. We test different activation functions for the hidden neurons and show that, for the task at hand, radial basis functions perform best. We also show that the ELM approach outperforms the back-propagation learning algorithm both in generalization accuracy and in running time. Moreover, we find that for the prediction of disulfide connectivity it is possible to increase predictive performance by initializing the radial basis function kernels with a k-means clustering algorithm. Finally, the ELM procedure is not only very fast, but the final predictive networks achieve per-bond and per-pattern accuracies of 0.51 and 0.45, respectively. Our ELM results are in line with the state-of-the-art predictors addressing the same problem.
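
    A minimal sketch of the ELM variant the paper describes: a radial basis function hidden layer whose centres are initialised by k-means, with only the linear output weights solved for via the pseudoinverse. The data below are toy stand-ins, not the protein-sequence features used in the paper.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        X = rng.normal(size=(150, 8))
        y = (X[:, 0] * X[:, 1] > 0).astype(float)    # toy binary target

        n_hidden, gamma = 20, 0.5
        centres = KMeans(n_clusters=n_hidden, n_init=10, random_state=0).fit(X).cluster_centers_

        # RBF activations: exp(-gamma * ||x - c||^2) for each k-means centre c.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        H = np.exp(-gamma * d2)

        beta = np.linalg.pinv(H) @ y                 # output weights in closed form
        pred = (H @ beta > 0.5).astype(float)
        print((pred == y).mean())                    # training accuracy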

    Enhanced pedotransfer functions with support vector machines to predict water retention of calcareous soil

    Knowledge of soil hydraulic properties is of major importance for land management in dry-land areas. The most important properties are the soil–water retention curve (SWRC) and hydraulic conductivity characteristics. Direct measurement of the SWRC is prohibitive in time and cost. Pedotransfer functions (PTFs) use data mining tools to predict the SWRC, and modern data mining techniques enable accurate predictions and good generalization of SWRC data. In this research we explore whether the use of support vector machines (SVMs) can improve the accuracy of SWRC prediction. The novelty of our work lies in the application of SVM data mining techniques, which are seldom used in soil research, to a limited dataset from Syria. The soil studied is calcareous and the climate is arid, and no PTFs have previously been developed for these conditions. Seventy-two undisturbed soil samples were taken from four different agro-climatic zones of Syria. The soil water contents at eight matric potentials were determined and selected as output variables. The data were split into two subsets: a training set with 54 samples for model calibration (PTF development) and a test set with 18 samples for PTF validation. An overview of the theoretical foundation of this approach and the use of specific kernel functions is given. The model parameters were then optimized with ninefold cross-validation and a grid search method. The predictions of the SVM-based PTFs were analysed with the coefficient of determination (R2) and root mean square error (RMSE). Our results showed that the accuracy of SVM was better, in terms of RMSE and R2, than multiple linear regression (MLR) and an artificial neural network (ANN), supporting previous findings that the SVM approach outperforms MLR and ANNs. Furthermore, predictions of the SWRC with all three data mining techniques improved when the more conventional organic matter input to the PTF was replaced with the plastic limit (PL). Therefore, SVM and the PL markedly improved the accuracy of SWRC prediction for calcareous soil.
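
    A minimal sketch of the tuning procedure described above: an RBF-kernel support vector regressor whose parameters are selected by a grid search under ninefold cross-validation, scored by RMSE. The predictor and target arrays stand in for the soil properties and a water content at one matric potential; the grid values are illustrative.

        import numpy as np
        from sklearn.svm import SVR
        from sklearn.preprocessing import StandardScaler
        from sklearn.pipeline import make_pipeline
        from sklearn.model_selection import GridSearchCV

        rng = np.random.default_rng(0)
        X = rng.normal(size=(54, 6))                 # e.g. texture fractions, bulk density, PL
        y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=54)

        grid = {"svr__C": [1, 10, 100],
                "svr__gamma": [0.01, 0.1, 1.0],
                "svr__epsilon": [0.01, 0.1]}
        model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
        search = GridSearchCV(model, grid, cv=9, scoring="neg_root_mean_squared_error")
        search.fit(X, y)
        print(search.best_params_, -search.best_score_)  # best grid point and its RMSE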