150,610 research outputs found

    Elliptical k-means algorithm and hyperparameter selection strategy for conformal-prediction-based prediction and clustering on the torus

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์ž์—ฐ๊ณผํ•™๋Œ€ํ•™ ํ†ต๊ณ„ํ•™๊ณผ, 2022. 8. ์ •์„ฑ๊ทœ.Protein structure data consist of several dihedral angles, lying on a multidimensional torus. Analyzing such data has been and continues to be key in understanding functional properties of proteins. However, most of the existing statistical methods assume that data are on Euclidean spaces, and thus they are improper to deal with angular data. In this paper, we introduce a novel approach specialized to analyzing multivariate angular data, based on elliptical k-means algorithm. Our approach enables the construction of conformal prediction sets and predictive clustering based on mixture model estimates. Moreover, we also introduce a novel hyperparameter selection strategy for predictive clustering, with improved stability and computational efficiency. We demonstrate our achievements with the package ClusTorus, one of our implementations, in clustering protein dihedral angles from two real data sets.๋‹จ๋ฐฑ์งˆ ๊ตฌ์กฐ ๋ฐ์ดํ„ฐ๋Š” ๋‹ค์ฐจ์› ํ† ๋Ÿฌ์Šค ์ƒ์˜ ๊ฐ๋„๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ํŠน์„ฑ์„ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋Š” ๋‹จ๋ฐฑ์งˆ์˜ ๊ธฐ๋Šฅ์  ํŠน์„ฑ์„ ํŒŒ์•…ํ•˜๋Š” ๋ฐ์— ์ค‘์š”ํ•œ ์—ด์‡ ๊ฐ€ ๋˜์–ด์™”๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๋ถ€๋ถ„์˜ ํ†ต๊ณ„์  ๋ฐฉ๋ฒ•๋ก ๋“ค์€ ์œ ํด๋ฆฌ๋“œ ๊ณต๊ฐ„์„ ๊ฐ€์ •ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์ฐจ์› ๊ฐ๋„ ๋ฐ์ดํ„ฐ์— ๋ถ€์ ํ•ฉํ•˜๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํƒ€์›ํ˜• k-ํ‰๊ท  ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ ํ™œ์šฉํ•˜์—ฌ ๋‹ค์ฐจ์› ๊ฐ๋„ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๋Š” ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. ํŠนํžˆ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ ํ•ฉ์˜ˆ์ธก์ง‘ํ•ฉ์„ ๊ตฌ์„ฑํ•˜๊ณ  ํ˜ผํ•ฉ ๋ชจํ˜• ์ถ”์ •์„ ํ†ตํ•œ ์˜ˆ์ธก ํด๋Ÿฌ์Šคํ„ฐ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ์†Œ๊ฐœ ํ•œ๋‹ค. ๋˜ํ•œ ์•ˆ์ •์„ฑ๊ณผ ๊ณ„์‚ฐ ํšจ์œจ์„ฑ์„ ํ™•๋ณดํ•œ ์ƒˆ๋กœ์šด ์ดˆ๋ชจ์ˆ˜ ์„ ํƒ ์ „๋žต์„ ์ œ์‹œํ•œ๋‹ค. 
๋งˆ์ง€๋ง‰์œผ๋กœ, ๋ณธ ๋…ผ๋ฌธ์˜ ๋ฐฉ๋ฒ•๋ก ์„ ๊ตฌํ˜„ํ•œ R ํŒจํ‚ค์ง€ ClusTorus๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์‹ค์ œ ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉํ•œ ์˜ˆ์‹œ๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค.1 Introduction 1 2 Conformal prediction 5 2.1 Conformal prediction framework 5 2.2 Inductive conformal prediction 6 2.3 Conformity scores from mixtures of multivariate von Mises 7 3 Parameter estimation for multivariate von Mises 12 3.1 Elliptical k-means algorithm 12 3.2 Constraints for mixture models 14 4 Clustering by conformal prediction 15 5 Hyperparameter selection 18 6 Clustering data on T^4 22 7 Summary and discussion 28 ์ฐธ๊ณ ๋ฌธํ—Œ 29 Abstract 34์„
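The elliptical k-means method and conformal machinery above live in the R package ClusTorus; purely as an illustration of why clustering angles needs a torus metric at all, here is a minimal Python sketch of a plain (spherical, not elliptical) angular k-means on synthetic T^2 data. Everything in it (the function names, the deterministic farthest-point seeding, and the toy data) is an assumption of this sketch, not ClusTorus code.

```python
import numpy as np

def torus_dist2(x, mu):
    # squared distance on the torus: per-coordinate angular difference with wraparound
    d = np.abs(x - mu)
    d = np.minimum(d, 2 * np.pi - d)
    return np.sum(d ** 2, axis=-1)

def init_centers(X, k):
    # deterministic farthest-point seeding, so the sketch needs no random restarts
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min(np.stack([torus_dist2(X, c) for c in centers], axis=1), axis=1)
        centers.append(X[np.argmax(d2)])
    return np.array(centers)

def angular_kmeans(X, k, n_iter=50):
    centers = init_centers(X, k)
    for _ in range(n_iter):
        # assignment step under the torus metric
        d2 = np.stack([torus_dist2(X, c) for c in centers], axis=1)
        labels = d2.argmin(axis=1)
        # update step: coordinatewise circular mean of each cluster
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = np.arctan2(np.sin(pts).mean(0), np.cos(pts).mean(0)) % (2 * np.pi)
    return labels, centers

# toy dihedral-angle data on T^2: one cluster straddles the 0/2pi seam
rng = np.random.default_rng(1)
a = rng.normal(0.0, 0.2, (50, 2)) % (2 * np.pi)
b = rng.normal(np.pi, 0.2, (50, 2))
X = np.vstack([a, b])
labels, centers = angular_kmeans(X, k=2)
```

The point of the toy data is that one cluster wraps around the 0/2pi seam: Euclidean k-means would cut it in half, while the wraparound metric keeps it intact.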

    Evaluation of clustering algorithms for gene expression data

    Get PDF
    BACKGROUND: Cluster analysis is an integral part of high-dimensional data analysis. In the context of large-scale gene expression data, a filtered set of genes is grouped according to expression profiles using one of the numerous clustering algorithms in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from the rather impressive list of clustering algorithms that currently exist. RESULTS: In this paper, we propose two validation measures, each with two parts: one measuring the statistical consistency (stability) of the clusters produced, and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data from breast cancer patients and the other a time-course cDNA microarray data set on yeast. Six well-known clustering algorithms were evaluated: UPGMA, K-Means, Diana, Fanny, Model-Based and SOM. CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.
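The paper's validation measure pairs a stability part with a biological-congruence part; the stability part can be sketched generically as a subsample-and-compare scheme, not the authors' exact index. The helper name `instability`, the 1 - ARI disagreement score, and the toy data below are assumptions of this sketch (it uses scikit-learn's `KMeans` and `adjusted_rand_score`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def instability(cluster_fn, X, n_rep=20, frac=0.8, seed=0):
    """Mean disagreement (1 - adjusted Rand index) between clusterings of two
    random subsamples, compared on the observations the subsamples share."""
    rng = np.random.default_rng(seed)
    n, m = len(X), int(frac * len(X))
    scores = []
    for _ in range(n_rep):
        i1 = rng.choice(n, m, replace=False)
        i2 = rng.choice(n, m, replace=False)
        lab1 = dict(zip(i1, cluster_fn(X[i1])))
        lab2 = dict(zip(i2, cluster_fn(X[i2])))
        shared = np.intersect1d(i1, i2)
        scores.append(1 - adjusted_rand_score([lab1[i] for i in shared],
                                              [lab2[i] for i in shared]))
    return float(np.mean(scores))

# toy "expression" data: three well-separated groups of 40 genes in 5 conditions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 5)) for c in (0.0, 3.0, 6.0)])

kmeans3 = lambda Z: KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
score = instability(kmeans3, X)  # small value: k-means is stable on this data
```

Competing algorithms would be compared by passing each one as `cluster_fn` and preferring the smaller instability, with the biological-congruence part supplied separately.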

    Fast Recognition of birds in offshore wind farms based on an improved deep learning model

    Full text link
    The safety of wind turbines is a prerequisite for the stable operation of offshore wind farms, yet bird strikes pose a direct threat to wind turbines and their blades, and millions of birds are killed by wind turbines every year. To protect the ecological environment, maintain the safe operation of offshore wind turbines, and address the low detection capability of current target-detection algorithms in low-light environments such as at night, this paper proposes a method that improves network performance by integrating the CBAM attention mechanism and the RetinexNet network into YOLOv5. First, the training-set images are fed into the YOLOv5 network with the integrated CBAM attention module for training, and the optimal weight model is stored. Then, low-light images are enhanced and denoised using Decom-Net and Enhance-Net, and accuracy is tested on the optimal weight model. In addition, the k-means++ clustering algorithm is used to optimise anchor box selection, which solves the problem of unstable initial centroids and achieves better clustering results. Experimental results show that the accuracy of this model on bird detection tasks reaches 87.40%, an increase of 21.25%. The model can detect birds near wind turbines in real time and shows strong stability at night and in rainy or shaky conditions, indicating that it can help ensure the safe and stable operation of wind turbines.
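The k-means++ seeding step that the paper credits with fixing unstable initial centroids can be sketched in a few lines: the first anchor is drawn uniformly, and each later anchor is drawn with probability proportional to its squared distance from the nearest anchor chosen so far. This is a generic illustration on synthetic (width, height) box data, not the paper's pipeline; real YOLO anchor clustering usually refines the seeds with Lloyd iterations under a 1 - IoU distance rather than the Euclidean distance used here.

```python
import numpy as np

def kmeans_pp_init(boxes, k, rng):
    # k-means++ seeding: first center uniform; each later center sampled with
    # probability proportional to squared distance from its nearest chosen center
    centers = [boxes[rng.integers(len(boxes))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((boxes - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(boxes[rng.choice(len(boxes), p=d2 / d2.sum())])
    return np.array(centers)

# synthetic (width, height) bounding boxes from two object scales
rng = np.random.default_rng(0)
small = rng.normal([20.0, 30.0], 2.0, (100, 2))
large = rng.normal([120.0, 90.0], 5.0, (100, 2))
boxes = np.vstack([small, large])
anchors = kmeans_pp_init(boxes, k=2, rng=rng)
# a real anchor pipeline would now refine these seeds with Lloyd iterations,
# typically under a 1 - IoU distance instead of the Euclidean distance above
```

Because the second seed is sampled in proportion to squared distance, it lands in the other scale cluster with overwhelming probability, which is exactly the "stable initial centroids" property the abstract refers to.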

    Clustering Stability: An Overview

    Get PDF
    A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview of the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.
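One concrete instantiation of the stability argument surveyed above is to cluster two disjoint halves of the data, transfer the first clustering to the second half by nearest center, and score the disagreement with the second half's own clustering. A minimal sketch, with the function name, the 1 - ARI score, and the toy data all assumed rather than taken from the overview:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def instability(X, k, n_rep=10, seed=0):
    """Cluster two disjoint halves; transfer the first clustering to the
    second half by nearest center; score disagreement as 1 - ARI."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_rep):
        idx = rng.permutation(len(X))
        A, B = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
        ka = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A)
        kb = KMeans(n_clusters=k, n_init=10, random_state=0).fit(B)
        out.append(1 - adjusted_rand_score(ka.predict(B), kb.labels_))
    return float(np.mean(out))

# three well-separated blobs at the corners of an equilateral triangle,
# so that k = 2 faces an ambiguous "which two blobs merge" choice
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in ((0, 0), (4, 0), (2, 3.46))])
scores = {k: instability(X, k) for k in (2, 3, 4, 5)}
best_k = min(scores, key=scores.get)
```

As the overview discusses, this heuristic has caveats: stability can also favor a wrong k whenever that k happens to have a unique well-identified optimum.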

    How Many Topics? Stability Analysis for Topic Models

    Full text link
    Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
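A term-centric stability check of the kind described can be sketched with scikit-learn's NMF: fit a reference model, refit on document subsamples, and score the overlap of each topic's top terms. The helper names, the greedy Jaccard matching, and the synthetic two-topic corpus are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np
from sklearn.decomposition import NMF

def top_terms(H, n=5):
    # the n highest-weight term indices in each topic (each row of H)
    return [set(np.argsort(row)[::-1][:n]) for row in H]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def term_stability(X, k, n_rep=5, frac=0.8, seed=0):
    """Mean best-match Jaccard overlap between the top-term sets of a
    reference model and models refit on random document subsamples."""
    rng = np.random.default_rng(seed)
    fit = lambda M: NMF(n_components=k, init="nndsvd", max_iter=500).fit(M).components_
    ref = top_terms(fit(X))
    reps = []
    for _ in range(n_rep):
        idx = rng.choice(len(X), int(frac * len(X)), replace=False)
        sub = top_terms(fit(X[idx]))
        # match each reference topic to its most similar subsampled topic
        reps.append(np.mean([max(jaccard(r, s) for s in sub) for r in ref]))
    return float(np.mean(reps))

# synthetic corpus: two disjoint topics over a 12-term vocabulary
rng = np.random.default_rng(0)
A = np.hstack([rng.poisson(5, (40, 6)), rng.poisson(0.1, (40, 6))])
B = np.hstack([rng.poisson(0.1, (40, 6)), rng.poisson(5, (40, 6))])
X = np.vstack([A, B]).astype(float)
stability = term_stability(X, k=2)  # high when k matches the true structure
```

Selecting the number of topics then amounts to computing this score over a range of k and preferring values where the top terms stay put under perturbation.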

    Stable Feature Selection for Biomarker Discovery

    Full text link
    Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; only recently has this issue received more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic for convenient reference; (2) to categorize existing methods under an expandable framework for future research and development.

    Identifying hidden contexts

    Get PDF
    In this study we investigate how to identify hidden contexts from data in classification tasks. Contexts are artifacts in the data that do not predict the class label directly. For instance, in a speech recognition task, speakers might have different accents, which do not directly discriminate between the spoken words. We treat identifying hidden contexts as a data preprocessing task that can help to build more accurate classifiers tailored to particular contexts and give insight into the data structure. We present three techniques for identifying hidden contexts, which hide class label information from the input data and partition the data using clustering techniques. We form a collection of performance measures to ensure that the resulting contexts are valid. We evaluate the performance of the proposed techniques on thirty real datasets. We present a case study illustrating how the identified contexts can be used to build specialized, more accurate classifiers.
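One way to realize the hide-the-label idea (only a guess at the paper's actual three techniques) is to center every class at the global mean so that class-predictive directions collapse, then cluster what remains. A minimal sketch on synthetic data, with all names assumed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def hide_labels(X, y):
    # remove class information by centering every class at the global mean,
    # so clustering can no longer split the data along class-predictive directions
    Xh = X.copy()
    for c in np.unique(y):
        Xh[y == c] -= X[y == c].mean(axis=0) - X.mean(axis=0)
    return Xh

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                # observed class (e.g. the spoken word)
ctx = rng.integers(0, 2, n)              # hidden context (e.g. the accent)
X = np.column_stack([
    y * 4.0 + rng.normal(0, 0.5, n),     # class-predictive feature
    ctx * 4.0 + rng.normal(0, 0.5, n),   # context feature, independent of class
])

found = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(hide_labels(X, y))
ari_ctx = adjusted_rand_score(found, ctx)  # alignment with the hidden context
ari_y = adjusted_rand_score(found, y)      # alignment with the class label
```

After hiding, the clusters recover the accent-like context rather than re-discovering the class label, which is the validity criterion the paper's performance measures formalize.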
    • โ€ฆ
    corecore