202,663 research outputs found

    Identifying a Minimal Class of Models for High-dimensional Data

    Get PDF
    Abstract Model selection consistency in the high-dimensional regression setting can be achieved only if strong assumptions are fulfilled. We therefore suggest to pursue a different goal, which we call a minimal class of models. The minimal class of models includes models that are similar in their prediction accuracy but not necessarily in their elements. We suggest a random search algorithm to reveal candidate models. The algorithm implements simulated annealing while using a score for each predictor that we suggest to derive using a combination of the lasso and the elastic net. The utility of using a minimal class of models is demonstrated in the analysis of two data sets

    A survey of outlier detection methodologies

    Get PDF
    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such to purify the data for processing. The original outlier detection methods were arbitrary but now, principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    What are the Best Hierarchical Descriptors for Complex Networks?

    Full text link
    This work reviews several hierarchical measurements of the topology of complex networks and then applies feature selection concepts and methods in order to quantify the relative importance of each measurement with respect to the discrimination between four representative theoretical network models, namely Erd\"{o}s-R\'enyi, Barab\'asi-Albert, Watts-Strogatz as well as a geographical type of network. The obtained results confirmed that the four models can be well-separated by using a combination of measurements. In addition, the relative contribution of each considered feature for the overall discrimination of the models was quantified in terms of the respective weights in the canonical projection into two dimensions, with the traditional clustering coefficient, hierarchical clustering coefficient and neighborhood clustering coefficient resulting particularly effective. Interestingly, the average shortest path length and hierarchical node degrees contributed little for the separation of the four network models.Comment: 9 pages, 4 figure
    • …
    corecore