Identifying a Minimal Class of Models for High-dimensional Data
Model selection consistency in the high-dimensional regression setting can be achieved only if strong assumptions are fulfilled. We therefore suggest pursuing a different goal, which we call a minimal class of models. The minimal class of models includes models that are similar in their prediction accuracy but not necessarily in their elements. We suggest a random search algorithm to reveal candidate models. The algorithm implements simulated annealing, using a score for each predictor that we derive from a combination of the lasso and the elastic net. The utility of a minimal class of models is demonstrated in the analysis of two data sets.
A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, and so purify the data for processing. The original outlier detection methods were arbitrary, but now principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
What are the Best Hierarchical Descriptors for Complex Networks?
This work reviews several hierarchical measurements of the topology of
complex networks and then applies feature selection concepts and methods in
order to quantify the relative importance of each measurement with respect to
the discrimination between four representative theoretical network models,
namely Erdős-Rényi, Barabási-Albert, Watts-Strogatz as well as a
geographical type of network. The obtained results confirmed that the four
models can be well-separated by using a combination of measurements. In
addition, the relative contribution of each considered feature for the overall
discrimination of the models was quantified in terms of the respective weights
in the canonical projection into two dimensions, with the traditional
clustering coefficient, hierarchical clustering coefficient and neighborhood
clustering coefficient proving particularly effective. Interestingly, the
average shortest path length and hierarchical node degrees contributed little
to the separation of the four network models.
Comment: 9 pages, 4 figures