A sequential approach with multiple trees in large chemical databases
Sequential design has been shown to be efficient for large chemical databases in which the number of compounds to be tested keeps growing. It is also known that predictions from multiple trees are more accurate than those from a single greedy tree. Combining the two, we propose a sequential approach with multiple trees. We also compare the accuracy of two similarity measures, BNNN and Tanimoto. The methods are applied to a data set containing the potency and chemical structure of 53203 compounds.
CHAPTER 1 INTRODUCTION
CHAPTER 2 DATASET
CHAPTER 3 BACKGROUND
3.1 Recursive partitioning
3.2 Multiple Trees and Combining
CHAPTER 4 SEQUENTIAL ALGORITHM
4.1 Designing a sequential screening scheme-Multiple Trees
Ⅰ) The initial sample size N1=2500
Ⅱ) Beam size b=5, pool size p=1000
Ⅲ) Additional sample size N2=2500, Design of additional stages D2=50/50
Ⅳ) Generating multiple trees (b=5, p=1000)
4.2 Designing a sequential screening scheme-Greedy Tree
Ⅰ) The initial sample size N1=2500
Ⅱ) Beam size b=5, pool size p=1000
Ⅲ) Additional sample size N2=2500, Design of additional stages D2=50/50
Ⅳ) Generating multiple trees (b=1, p=1)
4.3 BNNN and Tanimoto coefficient
CHAPTER 5 COMPARISON
5.1 Hit ratio and Gain ratio
5.2 Comparison
5.3 Relative hit ratio and Comparison
CHAPTER 6 DISCUSSION
REFERENCES
Abstract (in Korean)
Acknowledgements
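The Tanimoto coefficient named in section 4.3 of the outline above is, in its usual form for binary structural fingerprints, the number of shared "on" bits divided by the number of bits set in either fingerprint. The sketch below illustrates that standard definition only; the fingerprint representation (sets of bit positions), the function name and the example data are assumptions made for illustration and are not taken from the thesis.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


# Hypothetical fingerprints: 2 shared bits out of 5 distinct bits.
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8}
similarity = tanimoto(fp_x, fp_y)
```

A dissimilarity, if one is needed, is commonly taken as one minus this value.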
Detecting significant effects based on a half-normal probability plot
In analyzing data from unreplicated factorial designs, the half-normal probability plot is commonly used to screen for the few vital effects. Numerous methods have been proposed to overcome the subjective interpretation of this plot. We review three: Lenth’s method (1989), the step-down Lenth method proposed by Ye et al. (2001), and the LGB method proposed by Lawson et al. (1998). We compare their performance in identifying active effects through a simulation study. The performance turns out to depend on the number of active effects.
For a small number of active effects, the LGB method is more effective at identifying the active effects than the others. On the other hand, it performs poorly when the number of active effects is large. The LGB method fits a simple least-squares line without intercept to the inliers, which are determined by Lenth’s method. Effects that fall outside the prediction interval based on the fitted line are judged to be significant. When the number of active effects is large, the classification into inliers and outliers may be unreliable.
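Lenth’s method, on which the inlier classification above relies, is usually stated in terms of a pseudo standard error (PSE): take s0 = 1.5 × median|c_j|, then PSE = 1.5 × the median of those |c_j| below 2.5 × s0, and flag effects whose magnitude exceeds t × PSE, where t is a Student-t quantile with m/3 degrees of freedom for m effects. The following is a minimal sketch of that textbook form, not the thesis code. The function names are hypothetical, the effect estimates are invented for illustration, and the t quantile is passed in by the caller (2.571 is the 0.975 quantile for 5 degrees of freedom, matching the 15 effects of a 2^4 design).

```python
from statistics import median


def lenth_pse(effects):
    """Pseudo standard error of a list of effect estimates (Lenth, 1989)."""
    abs_eff = [abs(c) for c in effects]
    s0 = 1.5 * median(abs_eff)
    # Drop effects that look active before re-estimating the scale.
    trimmed = [a for a in abs_eff if a < 2.5 * s0]
    return 1.5 * median(trimmed)


def lenth_active(effects, t_quantile):
    """Indices of effects whose magnitude exceeds the margin of error."""
    margin = t_quantile * lenth_pse(effects)
    return [i for i, c in enumerate(effects) if abs(c) > margin]


# Hypothetical effect estimates from an unreplicated 2^4 design (15 effects).
effects = [10.8, 0.2, -0.4, 0.1, 4.9, -0.3, 0.5, 0.6,
           -0.2, 0.3, -0.1, 0.4, -0.5, 0.2, 0.1]
active = lenth_active(effects, t_quantile=2.571)
```

With these invented numbers the two large effects (positions 0 and 4) are flagged; the remaining effects would be the inliers to which a line through the origin is then fitted.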
To improve the accuracy of classifying the effects into inliers and outliers, we propose a modified method that can classify more effects as outliers by combining two methods: Carling’s (2000) method for the adjusted boxplot and Lenth’s method. If Lenth’s method finds no outliers, or if the inliers it determines span a wide range, Carling’s method can find more outliers.
We also propose an integrated method that uses all three of the methods mentioned above. A conservative approach declares significant the intersection of the effects found active by each method. An aggressive approach declares significant their union. The significant effects can be categorized into four color stages: Green, Blue, Orange and Gray. Which approach to use depends on whether the experiment-wise error rate is to be controlled.
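The conservative and aggressive rules amount to a set intersection and a set union over the effects each method declares active. A minimal sketch follows, assuming each method's result is available as a set of effect labels; the function name and the example results are hypothetical, and the four color stages are not reproduced because the abstract does not define them.

```python
def integrate(declared_by_method):
    """Combine per-method sets of declared-active effects.

    Returns (conservative, aggressive): the intersection and the union
    of the sets, in that order.
    """
    sets = [set(s) for s in declared_by_method.values()]
    conservative = set.intersection(*sets)
    aggressive = set.union(*sets)
    return conservative, aggressive


# Hypothetical results from the three reviewed methods.
declared = {
    "Lenth": {"A", "B"},
    "SD_Lenth": {"A", "B", "AC"},
    "LGB": {"A"},
}
conservative, aggressive = integrate(declared)
```

Here only effect A is declared significant under the conservative rule, while A, B and AC are all declared significant under the aggressive rule.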
We conduct a simulation study based on 10,000 sets of experimental data from an unreplicated 2^4 design, with the number of active effects being 1, 2, 3, 4, 5 and 6. We consider two cases: (1) all active effects have the same magnitude, ranging from 0.5 to 4 in increments of 0.5, and (2) the active effects all have different magnitudes.
For comparison, we use three measures of power: (1) Power, the expected fraction of active effects that are declared active; (2) Power I, the proportion of runs in which all active effects are detected, allowing inactive effects to be misidentified as active; and (3) Power II, the proportion of runs in which exactly the active effects, and no others, are detected. We compare the three existing methods and our proposed methods on these measures by simulation. The results suggest that the proposed methods perform better than the existing methods in some respects.
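The three power measures can be computed directly from the set of truly active effects and the set declared active in each simulated run. The sketch below follows the definitions as worded above; the function name, the effect labels and the four example runs are assumptions made for illustration.

```python
def power_measures(active, declared_runs):
    """Return (Power, Power I, Power II) over a list of simulated runs.

    active        -- labels of the truly active effects (non-empty)
    declared_runs -- one set of declared-active labels per run
    """
    active = set(active)
    n = len(declared_runs)
    runs = [set(d) for d in declared_runs]
    # Power: average fraction of active effects that were declared active.
    power = sum(len(active & d) / len(active) for d in runs) / n
    # Power I: all active effects found; false positives are allowed.
    power_1 = sum(active <= d for d in runs) / n
    # Power II: exactly the active effects found, nothing else.
    power_2 = sum(active == d for d in runs) / n
    return power, power_1, power_2


# Hypothetical outcome of four simulated runs with active effects A and B.
runs = [{"A", "B"}, {"A", "B", "C"}, {"A"}, set()]
power, power_1, power_2 = power_measures({"A", "B"}, runs)
```

By construction Power II can never exceed Power I, since a run that detects exactly the active effects also detects all of them.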
When analyzing unreplicated factorial designs, the half-normal probability plot is commonly used to screen for significant effects. Effects that lie far from the straight line through the origin on the plot are judged significant, but a judgment made by eye alone can be subjective, so many recent studies have sought objective, statistic-based criteria. This study likewise proposes two methods that select significant effects using objective statistics.
The first proposed method is a modification of the LGB method proposed by Lawson et al. (1998). Because the LGB method has limitations when the number of significant effects is large, we identify the problem and propose a modified LGB method.
The second combines the results of the methods proposed by Lenth (1989), Ye et al. (2001) and Lawson et al. (1998). Because finding significant effects with a single method has its limits, we use the information from all of the methods to reach the best conclusion.
To show that the two proposed methods are superior to the existing ones, we compared their Power values through simulation.
I. Introduction
II. Review of Lenth, SD_Lenth and LGB methods
A. Review
B. Examples
C. Simulation study
D. Conclusion
III. Proposed Method 1: The Modified LGB
A. Modified LGB
B. Simulated critical values
C. Examples
D. Simulation study
E. Conclusion
IV. Proposed Method 2: The Integrated Method
A. Integrated method
B. Examples
C. Simulation study
D. Conclusion
V. Bibliography
Appendix
Abstract (in Korean)
