Determining appropriate approaches for using data in feature selection
Feature selection is increasingly important in data analysis and machine learning in the big data era. However, how to use the data in feature selection, i.e. whether to use ALL or only PART of a dataset, has become a serious and tricky issue. While the conventional practice of using all the data in feature selection may lead to selection bias, using only part of the data may underestimate the relevant features under some conditions. This paper investigates the two strategies systematically in terms of reliability and effectiveness, and then determines their suitability for datasets with different characteristics. Reliability is measured by the Average Tanimoto Index and the Inter-method Average Tanimoto Index, and effectiveness is measured by the mean generalisation accuracy of classification. Computational experiments are carried out on ten real-world benchmark datasets and fourteen synthetic datasets. The synthetic datasets are generated with a pre-set number of relevant features, varied numbers of irrelevant features and instances, and different levels of added noise. The results indicate that the PART approach is more effective at reducing selection bias when a dataset is small, but starts to lose its advantage as the dataset size increases.
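The Average Tanimoto Index mentioned above measures how stable a selector is across repeated runs. A minimal sketch of that computation, assuming feature selection has already been repeated over resamples of the data (the subsets below are invented for illustration, not taken from the paper):

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) index between two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def average_tanimoto_index(subsets):
    """Mean pairwise Tanimoto index over feature subsets selected in
    repeated runs; values near 1 indicate stable selection."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two selection runs")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Example: subsets selected on three resamples of the same dataset.
runs = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f3", "f4"}]
print(average_tanimoto_index(runs))  # 0.5
```

Going by its name, the Inter-method variant would apply the same pairwise comparison to subsets chosen by different selection methods rather than different resamples.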
Infrastructure expansion challenges sustainable development in Papua New Guinea
The island of New Guinea hosts the third largest expanse of tropical rainforest on the planet. Papua New Guinea, comprising the eastern half of the island, plans to nearly double its national road network (from 8,700 to 15,000 km) over the next three years to spur economic growth. We assessed these plans using fine-scale biophysical and environmental data. We identified numerous environmental and socioeconomic risks associated with these projects, including the dissection of 54 critical biodiversity habitats and diminished forest connectivity across large expanses of the island. Key habitats of globally endangered species, including Goodfellow's tree-kangaroo (Dendrolagus goodfellowi), Matschie's tree-kangaroo (D. matschiei), and several birds of paradise, would also be bisected by roads and opened up to logging, hunting, and habitat conversion. Many planned roads would traverse rainforests and carbon-rich peatlands, contradicting Papua New Guinea's international commitments to promote low-carbon development and forest conservation for climate-change mitigation. Planned roads would also create new deforestation hotspots via rapid expansion of logging, mining, and oil-palm plantations. Our study suggests that several planned road segments in steep, high-rainfall terrain would be extremely expensive to construct and maintain, creating unanticipated economic challenges and public debt. The net environmental, social, and economic risks of several planned projects, such as the Epo-Kikori, Madang-Baiyer, and Wau-Malalaua links and other planned projects in the Western and East Sepik Provinces, could easily outstrip their overall benefits. Such projects should be reconsidered on broader environmental, economic, and social grounds, rather than on short-term economic considerations alone.
The Dexi-SH* model for a multivariate assessment of agro-ecological sustainability of dairy grazing systems
Dexi-SH* is an ex ante multivariate model for assessing the sustainability of dairy cow grazing systems. The model is composed of three sub-models that evaluate the impact of such systems on (i) biotic resources, (ii) abiotic resources, and (iii) pollution risks. The structure of the hierarchical tree was inspired by that of the Masc model. The choice of criteria and their aggregation modalities were discussed within a multi-disciplinary group of scientists. For each cluster, a utility function was established to determine the weighting and priority of its criteria. The model can take local and regional conditions and standards into account by adjusting criterion categories to the agro-ecological context, and the specific views of decision makers by changing the weighting of criteria.
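Dexi-SH* itself aggregates qualitative criteria through DEXi-style decision rules rather than numeric sums; the weighted-sum sketch below only illustrates the general idea of hierarchical aggregation over clusters of criteria. The tree, scores, and weights are invented, not the model's actual structure or utility functions:

```python
# Illustrative hierarchical aggregation: each internal node combines its
# children's scores (0-1) with per-cluster weights, recursively.

def aggregate(node, scores, weights):
    """Aggregate leaf scores up the criterion tree."""
    if isinstance(node, str):          # leaf criterion
        return scores[node]
    total = 0.0
    for child, w in zip(node["children"], weights[node["name"]]):
        total += w * aggregate(child, scores, weights)
    return total

tree = {"name": "sustainability",
        "children": [
            {"name": "biotic", "children": ["grassland_diversity", "hedgerows"]},
            "abiotic_resource_use",
            "pollution_risk",
        ]}
scores = {"grassland_diversity": 0.7, "hedgerows": 0.5,
          "abiotic_resource_use": 0.6, "pollution_risk": 0.8}
weights = {"sustainability": [0.4, 0.3, 0.3], "biotic": [0.6, 0.4]}
print(round(aggregate(tree, scores, weights), 3))  # 0.668
```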
Identification of disease-causing genes using microarray data mining and gene ontology
Background: One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. A shortcoming of microarray data is that they provide few samples relative to the number of genes. This reduces classification accuracy, so gene selection is essential to improve predictive accuracy and to identify potential marker genes for a disease. Among the numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods, but its performance can be degraded by the small sample size and noisy data, and it does not remove redundant genes.
Methods: We propose a novel framework for gene selection that uses the advantageous features of conventional methods while addressing their weaknesses. Specifically, we combine the Fisher method and SVM-RFE to exploit the advantages of a filter method as well as an embedded method, and we add a redundancy-reduction stage to address the weakness of the Fisher method and SVM-RFE. In addition to gene expression values, the proposed method uses Gene Ontology, a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as the small number of samples and erroneous measurement results.
Results: The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth.
Conclusions: The proposed method addresses the weaknesses of conventional methods by adding a redundancy-reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL, and prostate cancer with high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for these cancers.
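A minimal sketch of the filter-plus-embedded core of such a pipeline (Fisher scoring followed by SVM-RFE), using scikit-learn on synthetic data; the redundancy-reduction and Gene Ontology stages described above are omitted, and all dimensions and cut-offs are illustrative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def fisher_scores(X, y):
    """Two-class Fisher score per gene: (mu1 - mu2)^2 / (s1^2 + s2^2)."""
    X1, X2 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0) + 1e-12
    return num / den

# Toy expression matrix: 40 samples x 500 genes, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :10] += 1.5                     # make the first 10 genes informative

# Stage 1 (filter): keep the 100 genes with the highest Fisher score.
keep = np.argsort(fisher_scores(X, y))[::-1][:100]

# Stage 2 (embedded): SVM-RFE down to a 10-gene signature.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10).fit(X[:, keep], y)
signature = keep[rfe.support_]
print(sorted(signature))                  # should recover mostly genes 0-9
```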
The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures
Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, yet surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of the predictive performance, stability, and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability, and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection generally has no positive effect. Overall, a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
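A t-test filter of the kind the study found most effective can be expressed in a few lines. This is a generic sketch on synthetic data, not the authors' published code (which is at the URL above):

```python
import numpy as np
from scipy.stats import ttest_ind

def t_test_signature(X, y, size=30):
    """Rank features by absolute two-sample t statistic; keep the top `size`."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    return np.argsort(np.abs(t))[::-1][:size]

# Toy expression data: 60 samples x 1000 genes, binary outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 1000))
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :5] += 2.0                        # 5 truly prognostic genes
print(t_test_signature(X, y, size=10)[:5])  # should surface genes 0-4
```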
Overview of Current Directions in Boredom Research
The concluding chapter of this book represents a collaborative effort between the editors and all contributing authors, resulting in a comprehensive overview of the current directions in boredom research. Summaries of each chapter not only underscore the multitude of perspectives on boredom but also elucidate the diverse approaches employed in its study. Furthermore, the chapter directs attention to both known and unknown aspects of boredom, providing a foundation for future research in the field
Gene selection with multiple ordering criteria
BACKGROUND: A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, fold-change and p-value are two commonly used criteria for selecting differentially expressed genes under two experimental conditions, and they often result in incompatible selected gene sets. Also, in a two-factor (say, treatment-by-time) experiment, the investigator may be interested in one gene list that responds to both treatment and time effects. RESULTS: We propose three layer-ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon dataset as an example, the layer-ranking algorithms are applied to three univariate ranking criteria: fold-change, p-value, and frequency of selection by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (fewer than 20 per group) and a 4-fold change or less to detect, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first illustrates a use of the layer rankings to potentially improve predictive accuracy. The second illustrates an application to a two-factor experiment involving two dose levels and two time points, with the layer rankings applied to selecting differentially expressed genes relating to the dose and time effects. In the third, the layer rankings are applied to a benchmark dataset consisting of three dilution concentrations, to provide a ranking system for the long list of differentially expressed genes generated from the three concentrations. CONCLUSION: The layer-ranking algorithms are useful for helping investigators select the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
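Of the three algorithms, the Pareto layer ranking is the simplest to sketch: genes are peeled off in successive non-dominated fronts with respect to the two criteria (here, larger fold-change and smaller p-value are both preferred). The gene names and values below are invented for illustration; the point-admissible and convex variants are not shown:

```python
def pareto_layers(items):
    """Partition (name, fold_change, p_value) tuples into Pareto layers.
    Gene a dominates gene b if a's fold-change is >= and p-value is <=,
    with at least one strict inequality."""
    def dominates(a, b):
        return (a[1] >= b[1] and a[2] <= b[2]) and (a[1] > b[1] or a[2] < b[2])
    remaining, layers = list(items), []
    while remaining:
        front = [g for g in remaining
                 if not any(dominates(h, g) for h in remaining if h is not g)]
        layers.append(front)
        remaining = [g for g in remaining if g not in front]
    return layers

genes = [("g1", 4.0, 0.001), ("g2", 2.5, 0.0005),
         ("g3", 3.0, 0.01), ("g4", 1.2, 0.04)]
for i, layer in enumerate(pareto_layers(genes), 1):
    print(i, [g[0] for g in layer])   # layer 1: g1, g2; layer 2: g3; layer 3: g4
```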
Optimally splitting cases for training and testing high dimensional classifiers
Background: We consider the problem of designing a study to develop a predictive classifier from high-dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion impacts the mean squared error (MSE) of the prediction accuracy estimate. Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions: By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the full dataset size (n) and the degree of differential expression between the classes: higher classification accuracy and smaller n result in a larger proportion assigned to the training set. The commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full-dataset accuracy). In general, we recommend our non-parametric resampling approach for determining the optimal split; it can be applied to any dataset, with any predictor development method, to determine the best split.
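The paper's non-parametric algorithm is not reproduced here, but the underlying idea, repeatedly splitting at each candidate proportion and scoring the squared deviation of the hold-out accuracy estimate from a reference accuracy, can be sketched as follows. The 10-fold cross-validation reference is a stand-in for the unknown true accuracy, and the classifier choice is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

def split_mse(X, y, fractions=(0.5, 2/3, 0.8), reps=200):
    """Approximate, per training fraction, the MSE of the hold-out
    accuracy estimate around a reference 'full data' accuracy."""
    clf = LogisticRegression(max_iter=1000)
    reference = cross_val_score(clf, X, y, cv=10).mean()  # proxy for truth
    results = {}
    for f in fractions:
        accs = []
        for seed in range(reps):
            Xtr, Xte, ytr, yte = train_test_split(
                X, y, train_size=f, stratify=y, random_state=seed)
            accs.append(clf.fit(Xtr, ytr).score(Xte, yte))
        accs = np.asarray(accs)
        results[f] = ((accs - reference) ** 2).mean()  # ~ bias^2 + variance
    return results

# Example on toy data: report the fraction with the smallest estimated MSE.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)
print(min(split_mse(X, y, reps=50).items(), key=lambda kv: kv[1]))
```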
- …