Wisdom of the crowd from unsupervised dimension reduction
Wisdom of the crowd, the collective intelligence derived from the responses
of multiple human or machine individuals to the same questions, can be more
accurate than any single individual and can improve social decision-making
and prediction accuracy. The same framework can also integrate multiple
programs or datasets, each treated as an individual, on the same predictive
questions. Crowd wisdom estimates each individual's independent error level,
arising from their limited knowledge, and finds the crowd consensus that
minimizes the overall error. However, previous studies have largely built
isolated, problem-specific models with limited generalizability, and mainly
for binary (yes/no) responses. Here
we show with simulation and real-world data that the crowd wisdom problem is
analogous to one-dimensional unsupervised dimension reduction in machine
learning. This provides a natural class of crowd wisdom solutions, such as
principal component analysis and Isomap, which can handle binary as well as
continuous responses (e.g. confidence levels), and can consequently be more
accurate than existing solutions. They can even outperform
supervised-learning-based collective intelligence calibrated on the
historical performance of individuals, e.g. penalized linear regression and
random forest. This study unifies crowd wisdom and unsupervised dimension
reduction, and thereupon introduces a broad range of highly-performing and
widely-applicable crowd wisdom methods. As the costs for data acquisition and
processing rapidly decrease, this study will promote and guide crowd wisdom
applications in the social and natural sciences, including data fusion,
meta-analysis, crowd-sourcing, and committee decision making.Comment: 12 pages, 4 figures. Supplementary in sup folder of source files. 5
sup figures, 2 sup table
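As a minimal sketch of the central idea (simulated data only, not the authors' implementation), the leading principal component of a question-by-individual response matrix can serve as a noise-weighted crowd consensus, to be compared against a simple majority vote:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated crowd: 50 questions with a hidden ground truth, 8 individuals
# whose responses are the truth corrupted by individual-specific noise.
truth = rng.choice([-1.0, 1.0], size=50)
noise_levels = rng.uniform(0.2, 1.5, size=8)
responses = truth[:, None] + rng.normal(0, noise_levels, size=(50, 8))

# Crowd wisdom as one-dimensional dimension reduction (PCA): the leading
# principal component of the centered question-by-individual matrix is a
# consensus that implicitly down-weights noisy individuals.
X = responses - responses.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
consensus = X @ vt[0]                      # projection onto the first PC
# PCA has a sign ambiguity; align the consensus with the mean response.
consensus *= np.sign(consensus @ responses.mean(axis=1))

majority = np.sign(responses.sum(axis=1))  # majority-vote baseline
acc_pca = np.mean(np.sign(consensus) == truth)
acc_vote = np.mean(majority == truth)
print(acc_pca, acc_vote)
```

The same scaffold extends directly to continuous responses such as confidence levels, which a plain majority vote cannot exploit.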
Four simple recommendations to encourage best practices in research software [version 1; referees: awaiting peer review]
Scientific research relies on computer software, yet software is not always developed following practices that ensure its quality and sustainability. This manuscript does not aim to propose new software development best practices, but rather to provide simple recommendations that encourage the adoption of existing best practices. Software development best practices promote better quality software, and better quality software improves the reproducibility and reusability of research. These recommendations are designed around Open Source values, and provide practical suggestions that contribute to making research software and its source code more discoverable, reusable and transparent. This manuscript is aimed at developers, but also at organisations, projects, journals and funders that can increase the quality and sustainability of research software by encouraging the adoption of these recommendations.
Deep Randomized Neural Networks
Randomized Neural Networks explore the behavior of neural systems where the majority of connections are fixed, either in a stochastic or a deterministic fashion. Typical examples of such systems consist of multi-layered neural network architectures where the connections to the hidden layer(s) are left untrained after initialization. Limiting the training algorithms to operate on a reduced set of weights inherently characterizes the class of Randomized Neural Networks with a number of intriguing features. Among them, the extreme efficiency of the resulting learning processes is undoubtedly a striking advantage with respect to fully trained architectures. Moreover, despite the involved simplifications, randomized neural systems possess remarkable properties both in practice, achieving state-of-the-art results in multiple domains, and theoretically, allowing analysis of intrinsic properties of neural architectures (e.g. before training of the hidden layers' connections). In recent years, the study of Randomized Neural Networks has been extended towards deep architectures, opening new research directions for the design of effective yet extremely efficient deep learning models in vectorial as well as in more complex data domains. This chapter surveys the major aspects regarding the design and analysis of Randomized Neural Networks, and some of the key results with respect to their approximation capabilities. In particular, we first introduce the fundamentals of randomized neural models in the context of feed-forward networks (i.e., Random Vector Functional Link and equivalent models) and convolutional filters, before moving to the case of recurrent systems (i.e., Reservoir Computing networks). For both, we focus specifically on recent results in the domain of deep randomized systems and, for recurrent models, their application to structured domains.
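A minimal sketch of the core randomized-network idea on a toy regression task (sizes and the ridge parameter are illustrative, not from the chapter): the hidden weights are drawn once and frozen, and only the linear readout is fit, in closed form.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy regression task: learn y = sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# Randomized feed-forward network: input-to-hidden weights and biases
# are sampled once and left untrained, as in RVFL/ELM-style models.
n_hidden = 100
W = rng.normal(0, 1, size=(1, n_hidden))   # fixed random input weights
b = rng.uniform(-1, 1, size=n_hidden)      # fixed random biases
H = np.tanh(X @ W + b)                     # random hidden representation

# Only the output layer is trained: ridge-regularized least squares.
lam = 1e-6
beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

pred = H @ beta
mse = np.mean((pred - y) ** 2)
print(f"training MSE: {mse:.2e}")
```

Because training reduces to one linear solve, the "extreme efficiency" noted above comes essentially for free; reservoir computing applies the same readout-only principle to a fixed random recurrent state.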
Easy and efficient ensemble gene set testing with EGSEA.
Gene set enrichment analysis is a popular approach for prioritising the biological processes perturbed in genomic datasets. The Bioconductor project hosts over 80 software packages capable of gene set analysis. Most of these packages search for enriched signatures amongst differentially regulated genes to reveal higher-level biological themes that may be missed when focusing only on evidence from individual genes. With so many different methods on offer, choosing the best algorithm and visualization approach can be challenging. The EGSEA package solves this problem by combining results from up to 12 prominent gene set testing algorithms to obtain a consensus ranking of biologically relevant results. This workflow demonstrates how EGSEA can extend limma-based differential expression analyses for RNA-seq and microarray data using experiments that profile three distinct cell populations important for studying the origins of breast cancer. Following data normalization and set-up of an appropriate linear model for differential expression analysis, EGSEA builds gene signature specific indexes that link a wide range of mouse or human gene set collections obtained from MSigDB, GeneSetDB and KEGG to the gene expression data being investigated. EGSEA is then configured and the ensemble enrichment analysis run, returning an object that can be queried using several S4 methods for ranking gene sets and visualizing results via heatmaps, KEGG pathway views, GO graphs, scatter plots and bar plots. Finally, an HTML report that combines these displays can fast-track the sharing of results with collaborators, and thus expedite downstream biological validation. EGSEA is simple to use and can be easily integrated with existing gene expression analysis pipelines for both human and mouse data.
A guide to creating design matrices for gene expression experiments.
Differential expression analysis of genomic data types, such as RNA-sequencing experiments, uses linear models to determine the size and direction of the changes in gene expression. For RNA-sequencing, there are several established software packages for this purpose, accompanied by analysis pipelines that are well described. However, there are two crucial steps in the analysis process that can be a stumbling block for many: the set-up of an appropriate model via design matrices and the set-up of comparisons of interest via contrast matrices. These steps are particularly troublesome because an extensive catalogue of design and contrast matrices does not currently exist. One would usually search for example case studies across different platforms and mix and match the advice from those sources to suit the dataset at hand. This article guides the reader through the basics of how to set up design and contrast matrices. We take a practical approach by providing code and graphical representations for each case study, starting with simpler examples (e.g. models with a single explanatory variable) and moving on to more complex ones (e.g. interaction models, mixed effects models, higher order time series and cyclical models). Although our work has been written specifically with a limma-style pipeline in mind, most of it is also applicable to other software packages for differential expression analysis, and the ideas covered can be adapted to data analysis of other high-throughput technologies. Where appropriate, we explain the interpretation of and differences between models to aid readers in their own model choices. Unnecessary jargon and theory are omitted where possible so that our work is accessible to a wide audience of readers, from beginners to those with experience in genomics data analysis.
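The design/contrast machinery described above can be sketched generically (toy numbers and plain least squares standing in for a limma pipeline; the group labels and expression values are invented for illustration):

```python
import numpy as np

# Hypothetical two-group experiment: 3 control and 3 treated samples.
group = np.array(["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"])

# Means model (no intercept): one indicator column per group, so each
# coefficient is that group's mean.
design = np.column_stack([(group == g).astype(float)
                          for g in ["ctrl", "trt"]])

# Contrast vector encoding the comparison of interest: trt - ctrl.
contrast = np.array([-1.0, 1.0])

# Fit by least squares for one toy "gene"; applying the contrast to
# the fitted coefficients recovers the group difference.
expr = np.array([5.0, 5.2, 4.8, 7.0, 7.4, 6.6])
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
log_fc = contrast @ coef
print(log_fc)
```

The same two objects, a design matrix describing the model and a contrast describing the question, carry over unchanged to the more complex interaction and time-series models the article covers.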
Disulfide connectivity prediction with extreme learning machines
Our paper emphasizes the relevance of the Extreme Learning Machine (ELM) in Bioinformatics applications by addressing the problem of predicting disulfide connectivity from protein sequences. We test different activation functions for the hidden neurons and show that, for the task at hand, Radial Basis Functions are the best performing. We also show that the ELM approach performs better than the Back Propagation learning algorithm both in terms of generalization accuracy and running time. Moreover, we find that for the problem of predicting disulfide connectivity it is possible to increase the prediction performance by initializing the Radial Basis Function kernels with a k-means clustering algorithm. Finally, the ELM procedure is not only very fast, but the final predicting networks can achieve an accuracy of 0.51 and 0.45, per-bond and per-pattern, respectively. Our ELM results are in line with the state-of-the-art predictors addressing the same problem.
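A hedged sketch of the ELM-with-RBF setup described above, on synthetic 2-D data (the features, number of centres k, and width gamma are illustrative stand-ins, not the paper's protein-sequence descriptors):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification stand-in for bond/no-bond prediction:
# two Gaussian clouds in 2-D.
X = np.vstack([rng.normal(-1, 0.5, (100, 2)), rng.normal(1, 0.5, (100, 2))])
y = np.array([0.0] * 100 + [1.0] * 100)

# k-means (a few Lloyd iterations) to place the RBF centres, mirroring
# the paper's k-means initialization of the hidden kernels.
k = 10
centres = X[rng.choice(len(X), k, replace=False)]
for _ in range(10):
    d = np.linalg.norm(X[:, None] - centres[None], axis=2)
    labels = d.argmin(axis=1)
    for j in range(k):
        if np.any(labels == j):
            centres[j] = X[labels == j].mean(axis=0)

# ELM with RBF activations: the hidden layer stays fixed and only the
# readout is solved, in closed form (regularized least squares).
gamma = 2.0
H = np.exp(-gamma * np.linalg.norm(X[:, None] - centres[None], axis=2) ** 2)
beta = np.linalg.solve(H.T @ H + 1e-6 * np.eye(k), H.T @ y)

acc = np.mean((H @ beta > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

The speed advantage over back-propagation comes from the single linear solve: no iterative gradient descent over the hidden weights is needed.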
Enhanced pedotransfer functions with support vector machines to predict water retention of calcareous soil
Knowledge of soil hydraulic properties is of major importance for land management in dry-land areas. The most important properties are the soil–water retention curve (SWRC) and hydraulic conductivity characteristics. Direct measurement of the SWRC is prohibitive in both time and cost. Pedotransfer functions (PTFs) use data mining tools to predict the SWRC. Modern data mining techniques enable accurate predictions and good generalization of SWRC data. In this research we explore whether the use of support vector machines (SVMs) could improve the accuracy of prediction of the SWRC. The novelty of our work is in the application of SVM data mining techniques, which are seldom used in soil research, to a limited dataset from Syria. The soil studied is calcareous and the climate is arid, for which no PTFs have been developed. Seventy-two undisturbed soil samples were taken from four different agro-climatic zones of Syria. The soil water contents at eight matric potentials were determined and selected as output variables. The data were split into two subsets: a training set with 54 samples for model calibration or PTF development and a test set with 18 samples for PTF validation. An overview of the theoretical foundation of this new approach and the use of specific kernel functions is given. The model parameters were then optimized with ninefold cross-validation and a grid search method. The predictions of the SVM-based PTFs were analysed with the coefficient of determination (R2) and root mean square error (RMSE). Our results showed that the accuracy of SVM was better in terms of RMSE and R2 than multiple linear regression (MLR) and the artificial neural network (ANN). The results support previous findings that the SVM approach performs better than MLR and the ANN. Furthermore, improvements in predictions of the SWRC with the three data mining techniques were obtained by replacing the more conventional organic matter in the PTF with the plastic limit (PL). Therefore, SVM and PL markedly improved the accuracy of prediction of the SWRC for calcareous soil.
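The modelling pipeline described here (RBF-kernel SVM, grid search with ninefold cross-validation, a 54/18 calibration/validation split, evaluation by RMSE and R2) can be sketched with synthetic data standing in for the Syrian measurements; the predictor columns, coefficients, and parameter grid are all invented for illustration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the PTF inputs: 72 samples with a few soil
# predictors (imagine texture fractions, bulk density, plastic limit)
# and one water-content target.
X = rng.uniform(0, 1, size=(72, 4))
y = 0.1 + 0.3 * X[:, 1] + 0.2 * X[:, 3] + rng.normal(0, 0.02, 72)

# 54-sample training set and 18-sample validation set, mirroring the
# paper's split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=18,
                                          random_state=0)

# RBF-kernel support vector regression with hyperparameters tuned by
# grid search under ninefold cross-validation.
grid = GridSearchCV(SVR(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.1, 1, 10],
                     "epsilon": [0.001, 0.01]},
                    cv=9)
grid.fit(X_tr, y_tr)

r2 = grid.score(X_te, y_te)               # R^2 on the validation set
rmse = np.sqrt(np.mean((grid.predict(X_te) - y_te) ** 2))
print(f"R2 = {r2:.2f}, RMSE = {rmse:.3f}")
```

Comparing such a tuned SVR against MLR and ANN baselines on the same split is exactly the evaluation the abstract reports.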