2 research outputs found

    Learning from a Class Imbalanced Public Health Dataset: a Cost-based Comparison of Classifier Performance

    Public health care systems routinely collect health-related data from the population. This data can be analyzed using data mining techniques to find novel, interesting patterns that could help formulate effective public health policies and interventions. Chronic illness is rare in the population, and the effect of this class imbalance on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class-imbalanced health datasets through a cost-based comparison of classifier performance. The popular open-source data mining tool WEKA was used to build a variety of core classifiers as well as classifier ensembles and to evaluate the classifiers' performance. The unequal misclassification costs were represented in a cost matrix, and a cost-benefit analysis was also performed. In another experiment, various sampling methods such as under-sampling, over-sampling, and SMOTE were applied to balance the class distribution in the dataset, and the costs were compared. The Bayesian classifiers performed well, with high recall and a low number of false negatives, and were not affected by the class imbalance. Results confirm that the total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the randomly under-sampled dataset showed a dramatic drop in costs and high classification accuracy.
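
    The cost-based comparison described in this abstract can be illustrated with a minimal sketch. The cost values and confusion matrices below are hypothetical and are not taken from the paper; the helper names total_cost and accuracy are likewise assumptions made for illustration. The sketch only shows how a cost matrix can rank two classifiers differently than accuracy does when false negatives are expensive.

```python
def total_cost(confusion, cost):
    """Sum over all (actual, predicted) cells of count times misclassification cost."""
    return float(sum(
        n * c
        for conf_row, cost_row in zip(confusion, cost)
        for n, c in zip(conf_row, cost_row)
    ))


def accuracy(confusion):
    """Fraction of records on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Rows = actual class, columns = predicted class, order: [negative, positive].
# Hypothetical costs: a false negative (missed illness) costs 10, a false positive costs 1.
cost_matrix = [[0, 1],
               [10, 0]]

# Hypothetical confusion matrices on 1000 records, 50 of them positive.
clf_a = [[940, 10],    # 10 false positives
         [30, 20]]     # 30 false negatives
clf_b = [[850, 100],   # 100 false positives
         [5, 45]]      # 5 false negatives

cost_a = total_cost(clf_a, cost_matrix)   # 10*1 + 30*10
cost_b = total_cost(clf_b, cost_matrix)   # 100*1 + 5*10
accuracy_a = accuracy(clf_a)
accuracy_b = accuracy(clf_b)

print(f"A: accuracy={accuracy_a:.3f}, total cost={cost_a}")
print(f"B: accuracy={accuracy_b:.3f}, total cost={cost_b}")
```

    In this made-up example, classifier A is more accurate but classifier B has the lower total cost, because B misses far fewer positive cases. This is the kind of trade-off a cost matrix makes visible on an imbalanced dataset.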

    A Deep Learning-Based Method for Simultaneously Modeling the Hierarchical Structure of GPCR Protein Families in a Single Distance Space

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2019. 김선 (Kim Sun).

    G protein-coupled receptors (GPCRs) belong to diverse families of proteins that can be defined at multiple levels. Computational modeling of GPCR families from their sequences has been performed separately at each level: family, sub-family, and sub-subfamily. However, these approaches ignore relationships between classes because they process the information in the sequences with a group of disconnected models. In this work, we propose a deep learning network that simultaneously learns representations in the GPCR hierarchy with a unified model, together with a loss term that expresses hierarchical relations as distances in a single embedding space. The model introduces a method to learn and construct shared representations across the hierarchies of the protein family. In extensive experiments, we showed that hierarchical relations between sequences are successfully captured by our model in both technical and biological aspects. First, we showed that phylogenetic information in the sequences can be inferred from the embedding vectors by constructing a phylogenetic tree with a hierarchical clustering algorithm and by quantitatively comparing the clustering results with the real label information. Second, we showed that inspecting the embedding vectors is an effective first step toward an analysis of GPCR proteins: biologically significant sequence features can be revealed by applying multiple sequence alignment to the clustering results on the embedding vectors. Our work showed that simultaneous modeling of protein families with multiple hierarchies is possible.

    Contents:
    Abstract
    Chapter Ⅰ. Introduction: 1.1 Background; 1.2 Motivation
    Chapter Ⅱ. Methods: 2.1 Data Preparation (2.1.1 Dataset; 2.1.2 Data representation); 2.2 Model architecture (2.2.1 Feature extractor with CNN; 2.2.2 Embedding layer; 2.2.3 Output layer); 2.3 Loss function (2.3.1 Softmax loss; 2.3.2 Center loss; 2.3.3 Overall loss); 2.4 Training procedure; 2.5 Evaluation metric (2.5.1 Silhouette score; 2.5.2 Adjusted mutual information score)
    Chapter Ⅲ. Results: 3.1 Evaluation on hierarchical structure (3.1.1 Preservation of distances; 3.1.2 Phylogenetic tree reconstruction; 3.1.3 Quantitative evaluation on clustering results); 3.2 Sequence analysis with embedding vectors (3.2.1 Technical analysis; 3.2.2 Biological analysis); 3.3 Classification accuracy
    Chapter Ⅳ. Conclusion
    References
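
    The idea of expressing hierarchical relations as distances in one embedding space can be sketched with a center loss applied at each hierarchy level. This is a minimal illustration that assumes the commonly used center-loss form (half the squared distance to a class center, averaged here over the batch); it is not the thesis's exact loss, whose overall formulation is not given in this listing. All names, the toy embeddings, the fixed centers, and the level weights are hypothetical. In real training the centers would be learned and the term would be combined with a softmax classification loss.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def center_loss(embeddings, labels, centers):
    """Half the mean squared distance between each embedding and its class center."""
    total = sum(squared_distance(x, centers[y]) for x, y in zip(embeddings, labels))
    return 0.5 * total / len(embeddings)


def hierarchical_center_loss(embeddings, labels_per_level, centers_per_level, weights):
    """Weighted sum of center losses, one per hierarchy level, over a shared embedding space."""
    return sum(
        w * center_loss(embeddings, labels, centers)
        for w, labels, centers in zip(weights, labels_per_level, centers_per_level)
    )


# Toy 2-D embeddings for four sequences (hypothetical values).
embeddings = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two hierarchy levels: two families, each split into two sub-families.
family_labels = [0, 0, 1, 1]
subfamily_labels = [0, 1, 2, 3]

# Centers fixed to the class means for illustration (they would be trainable in practice).
family_centers = [(0.0, 1.0), (10.0, 1.0)]
subfamily_centers = list(embeddings)

loss = hierarchical_center_loss(
    embeddings,
    [family_labels, subfamily_labels],
    [family_centers, subfamily_centers],
    weights=[1.0, 1.0],  # hypothetical per-level weights
)
print(loss)
```

    Because every level shares the same embedding vectors, pulling a sequence toward its family center and its sub-family center at once encourages sub-family clusters to nest inside family clusters, which is what makes hierarchical clustering on the embeddings meaningful.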