204 research outputs found

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered, and only recently has the issue received growing attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic for convenient reference; (2) categorizing existing methods under an expandable framework for future research and development.

    The Probability of Intransitivity in Dice and Close Elections

    We study the phenomenon of intransitivity in models of dice and voting. First, we follow a recent thread of research for $n$-sided dice with pairwise ordering induced by the probability, relative to $1/2$, that a throw from one die is higher than the other. We build on a recent result of Polymath showing that three dice with i.i.d. faces drawn from the uniform distribution on $\{1,\ldots,n\}$ and conditioned on the average of faces equal to $(n+1)/2$ are intransitive with asymptotic probability $1/4$. We show that if dice faces are drawn from a non-uniform continuous mean zero distribution conditioned on the average of faces equal to $0$, then three dice are transitive with high probability. We also extend our results to stationary Gaussian dice, whose faces, for example, can be the fractional Brownian increments with Hurst index $H\in(0,1)$. Second, we pose an analogous model in the context of Condorcet voting. We consider $n$ voters who rank $k$ alternatives independently and uniformly at random. The winner between each two alternatives is decided by a majority vote based on the preferences. We show that in this model, if all pairwise elections are close to tied, then the asymptotic probability of obtaining any tournament on the $k$ alternatives is equal to $2^{-k(k-1)/2}$, which markedly differs from known results in the model without conditioning. We also explore the Condorcet voting model where methods other than simple majority are used for pairwise elections. We investigate some natural definitions of "close to tied" for general functions and exhibit an example where the distribution over tournaments is not uniform under those definitions. Comment: Ver3: 45 pages, additional details and clarifications; Ver2: 43 pages, additional co-author, major revision; Ver1: 23 pages
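
    The pairwise ordering described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names and the three example dice (a well-known intransitive set) are assumptions chosen for illustration, and ties are handled by comparing the number of winning face pairs against the number of losing ones, which is one possible convention.

```python
from fractions import Fraction
from itertools import product


def win_probability(a, b):
    """Exact probability that a throw of die `a` is strictly higher than a throw of die `b`."""
    wins = sum(1 for x, y in product(a, b) if x > y)
    return Fraction(wins, len(a) * len(b))


def beats(a, b):
    """True if `a` wins more face pairs against `b` than it loses (assumed tie convention)."""
    wins = sum(1 for x, y in product(a, b) if x > y)
    losses = sum(1 for x, y in product(a, b) if x < y)
    return wins > losses


def is_intransitive(a, b, c):
    """True if the three dice form a cycle in either direction."""
    forward = beats(a, b) and beats(b, c) and beats(c, a)
    backward = beats(b, a) and beats(c, b) and beats(a, c)
    return forward or backward


# Hypothetical example dice, not from the paper.
A = (2, 2, 4, 4, 9, 9)
B = (1, 1, 6, 6, 8, 8)
C = (3, 3, 5, 5, 7, 7)

print(win_probability(A, B), win_probability(B, C), win_probability(C, A))
print(is_intransitive(A, B, C))
```

    With these dice each one beats the next in the cycle with probability 5/9, so the triple is intransitive. Exact fractions are used so that comparisons against 1/2 are not affected by floating-point rounding.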

    Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods

    This study uses stacked generalization, which is a two-step process of combining machine learning methods, called meta or super learners, for improving the performance of algorithms in step one (by minimizing the error rate of each individual algorithm to reduce its bias in the learning set) and then in step two inputting the results into the meta learner with its stacked blended output (demonstrating improved performance with the weakest algorithms learning better). The method is essentially an enhanced cross-validation strategy. Although the process uses great computational resources, the resulting performance metrics on resampled fraud data show that increased system cost can be justified. A fundamental key to fraud data is that it is inherently not systematic and, as of yet, the optimal resampling methodology has not been identified. Building a test harness that accounts for all permutations of algorithm sample set pairs demonstrates that the complex, intrinsic data structures are all thoroughly tested. Using a comparative analysis on fraud data that applies stacked generalizations provides useful insight needed to find the optimal mathematical formula to be used for imbalanced fraud data sets. Comment: 19 pages, 3 figures, 8 tables

    Current Challenges in the Application of Algorithms in Multi-institutional Clinical Settings

    The Coronavirus disease pandemic has highlighted the importance of artificial intelligence in multi-institutional clinical settings. Particularly in situations where the healthcare system is overloaded, and a lot of data is generated, artificial intelligence has great potential to provide automated solutions and to unlock the untapped potential of acquired data. This includes the areas of care, logistics, and diagnosis. For example, automated decision support applications could tremendously help physicians in their daily clinical routine. Especially in radiology and oncology, the exponential growth of imaging data, triggered by a rising number of patients, leads to a permanent overload of the healthcare system, making the use of artificial intelligence inevitable. However, the efficient and advantageous application of artificial intelligence in multi-institutional clinical settings faces several challenges, such as accountability and regulation hurdles, implementation challenges, and fairness considerations. This work focuses on the implementation challenges, which include the following questions: How to ensure well-curated and standardized data, how do algorithms from other domains perform on multi-institutional medical datasets, and how to train more robust and generalizable models? Also, questions of how to interpret results and whether there exist correlations between the performance of the models and the characteristics of the underlying data are part of the work. Therefore, besides presenting a technical solution for manual data annotation and tagging for medical images, a real-world federated learning implementation for image segmentation is introduced. Experiments on a multi-institutional prostate magnetic resonance imaging dataset showcase that models trained by federated learning can achieve similar performance to training on pooled data. 
    Furthermore, Natural Language Processing algorithms with the tasks of semantic textual similarity, text classification, and text summarization are applied to multi-institutional, structured and free-text, oncology reports. The results show that performance gains are achieved by customizing state-of-the-art algorithms to the peculiarities of the medical datasets, such as the occurrence of medications, numbers, or dates. In addition, performance influences are observed depending on the characteristics of the data, such as lexical complexity. The generated results, human baselines, and retrospective human evaluations demonstrate that artificial intelligence algorithms have great potential for use in clinical settings. However, due to the difficulty of processing domain-specific data, there still exists a performance gap between the algorithms and the medical experts. In the future, it is therefore essential to improve the interoperability and standardization of data, as well as to continue working on algorithms that perform well on medical, possibly domain-shifted, data from multiple clinical centers.

    v. 76, issue 19, April 24, 2009
