    Efficient Two-Stage Group Testing Algorithms for Genetic Screening

    This paper investigates efficient two-stage group testing algorithms that are particularly suited to rapid, less expensive DNA library screening and other large-scale biological group testing efforts. The main focus is on novel combinatorial constructions that minimize the number of individual tests at the second stage of a two-stage disjunctive testing procedure. Building on recent work by Levenshtein (2003) and Tonchev (2008), several new infinite classes of such combinatorial designs are presented. Comment: 14 pages; to appear in "Algorithmica". Part of this work has been presented at the ICALP 2011 Group Testing Workshop; arXiv:1106.368

    Group Testing with Random Pools: optimal two-stage algorithms

    We study probabilistic group testing of a set of $N$ items, each of which is defective with probability $p$. We focus on the double limit of small defect probability, $p \ll 1$, and large number of variables, $N \gg 1$, taking either $p\to 0$ after $N\to\infty$ or $p=1/N^{\beta}$ with $\beta\in(0,1/2)$. In both settings the optimal number of tests required to identify the defectives with certainty via a two-stage procedure, $\bar T(N,p)$, is known to scale as $Np|\log p|$. Here we determine the sharp asymptotic value of $\bar T(N,p)/(Np|\log p|)$ and construct a class of two-stage algorithms over which this optimal value is attained. This is done by choosing a proper bipartite regular graph (of tests and variable nodes) for the first stage of the detection. Furthermore, we prove that this optimal value is also attained on average over a random bipartite graph where all variables have the same degree, while the tests have Poisson-distributed degrees. Finally, we improve the existing upper and lower bounds for the optimal number of tests in the case $p=1/N^{\beta}$ with $\beta\in[1/2,1)$. Comment: 12 pages
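
    The two-stage procedure described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' optimal construction: the helper names `random_pools` and `two_stage_group_test`, the pool count, and the item degree are assumptions chosen for illustration. Each item joins a fixed number of randomly chosen pools, and the second stage simply tests every item that was not cleared by a negative pool.

```python
import random


def random_pools(n, num_pools, degree, rng):
    """Assign each of n items to `degree` distinct pools chosen at random.

    Every item has the same degree; pool sizes vary randomly.
    (Illustrative assumption, not the construction analysed in the paper.)
    """
    pools = [[] for _ in range(num_pools)]
    for item in range(n):
        for t in rng.sample(range(num_pools), degree):
            pools[t].append(item)
    return pools


def two_stage_group_test(defective, pools):
    """Return (set of identified defectives, total number of tests used)."""
    n = len(defective)
    # Stage 1: a pool is positive iff it contains at least one defective item.
    pool_positive = [any(defective[i] for i in pool) for pool in pools]
    # Any item that appears in a negative pool is certainly non-defective.
    cleared = set()
    for pool, positive in zip(pools, pool_positive):
        if not positive:
            cleared.update(pool)
    # Stage 2: test every remaining candidate individually.
    candidates = [i for i in range(n) if i not in cleared]
    found = {i for i in candidates if defective[i]}
    return found, len(pools) + len(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    n, p = 1000, 0.01
    defective = [rng.random() < p for _ in range(n)]
    pools = random_pools(n, num_pools=120, degree=3, rng=rng)
    found, tests = two_stage_group_test(defective, pools)
    print(f"defectives found: {len(found)}, tests used: {tests} (vs {n} individual tests)")
```

    Because every uncleared item is tested individually in the second stage, the defectives are always identified with certainty; only the number of tests depends on the choice of pools.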

    Multiple testing problems in classical clinical trial and adaptive designs

    Multiplicity issues arise in a variety of situations in clinical trials, and statistical methods for multiple testing have gradually gained importance with the increasing number of complex clinical trial designs. In general, two types of multiple testing can be performed (Dmitrienko et al., 2009): union-intersection testing (UIT) and intersection-union testing (IUT). UIT is of interest in this dissertation. Thus, the familywise error rate (FWER) is required to be controlled in the strong sense. A number of methods have been developed for controlling the FWER, including single-step and stepwise procedures. In single-step approaches, such as the simple Bonferroni method, the rejection decision for a hypothesis does not depend on the decision for any other hypothesis. Single-step approaches can be improved in terms of power through stepwise approaches, while still controlling the desired error rate. It is also possible to improve those procedures by a parametric approach. In the first project, we developed a new and powerful single-step progressive parametric multiple (SPPM) testing procedure for correlated normal test statistics. Through simulation studies, we demonstrate that SPPM improves power substantially when the correlation is moderate and/or the magnitudes of the effect sizes are similar. Group sequential designs (GSD) are clinical trials allowing interim looks with the possibility of early termination due to efficacy, harm or futility, which can reduce the overall costs and timelines for the development of a new drug. However, repeated looks at the data also raise multiplicity issues and could inflate the type I error rate. The proper treatment of this error inflation has been discussed widely (Pocock, 1977; O'Brien and Fleming, 1979; Wang and Tsiatis, 1987; Lan and DeMets, 1983). Most literature about GSD focuses on a single endpoint. GSD with multiple endpoints, however, has also received considerable attention.
The main focus of our second project is a GSD with multiple primary endpoints, in which the trial is to evaluate whether at least one of the endpoints is statistically significant. In this study design, multiplicity issues arise from repeated interims and multiple endpoints. Therefore, appropriate adjustments must be made to control the type I error rate. Our second purpose here is to show that the combination of multiple endpoints and repeated interim analyses can lead to a more powerful design. Via the multivariate normal distribution, a method that allows for simultaneous consideration of interim analyses and all clinical endpoints was proposed. The new approach is derived from the closure principle, so it controls the type I error rate strongly. We evaluate the power under different scenarios and show that it compares favorably to other methods when the correlation among endpoints is non-zero. In the group sequential design framework, another interesting topic is the multiple arm multiple stage (MAMS) design, where multiple arms are involved in the trial at the beginning, with flexibility in treatment selection or stopping decisions during the interim analyses. One of the major hurdles of MAMS is the computational cost, which grows with the number of arms and interim looks. Various designs were implemented to overcome this difficulty (Thall et al., 1988; Schaid et al., 1990; Follmann et al., 1994; Stallard and Todd, 2003; Stallard and Friede, 2008; Magirr et al., 2012; Wason et al., 2017) while also controlling the FWER against the potential inflation from the multiple arm comparisons and multiple interim tests. Here, we consider a more flexible drop-the-loser design that allows safety information to be used in the treatment selection without a pre-specified arm-dropping mechanism, and it still retains reasonably high power. Two different types of stopping boundaries are proposed for such a design.
The sample size is also adjustable if the winning arm is dropped due to safety considerations.
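
    The contrast drawn in this abstract between single-step and stepwise FWER control can be sketched in Python. The sketch shows the standard Bonferroni (single-step) and Holm (step-down) procedures only; it is not the SPPM procedure developed in the dissertation, and the p-values are made-up illustrative numbers.

```python
def bonferroni(p_values, alpha=0.05):
    """Single-step: each hypothesis is compared with alpha/m on its own."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down: thresholds relax as smaller p-values are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha/(m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


if __name__ == "__main__":
    p_values = [0.010, 0.015, 0.020, 0.300]  # illustrative values only
    print("Bonferroni:", bonferroni(p_values))
    print("Holm:      ", holm(p_values))
```

    With these illustrative inputs Holm rejects three hypotheses where Bonferroni rejects one, which is the kind of power gain from stepwise testing that the abstract refers to; both procedures control the FWER in the strong sense.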

    Using Bayesian Statistics in Confirmatory Clinical Trials in the Regulatory Setting

    Bayesian statistics plays a pivotal role in advancing medical science by enabling healthcare companies, regulators, and stakeholders to assess the safety and efficacy of new treatments, interventions, and medical procedures. The Bayesian framework offers a unique advantage over the classical framework, especially when incorporating prior information into a new trial with quality external data, such as historical data or another source of co-data. In recent years, there has been a significant increase in regulatory submissions using Bayesian statistics due to its flexibility and ability to provide valuable insights for decision-making, addressing the modern complexity of clinical trials where frequentist trials are inadequate. For regulatory submissions, companies often need to consider the frequentist operating characteristics of the Bayesian analysis strategy, regardless of the design complexity. In particular, the focus is on the frequentist type I error rate and power for all realistic alternatives. This tutorial review aims to provide a comprehensive overview of the use of Bayesian statistics in sample size determination in the regulatory environment of clinical trials. Fundamental concepts of Bayesian sample size determination and illustrative examples are provided to serve as a valuable resource for researchers, clinicians, and statisticians seeking to develop more complex and innovative designs.

    Two-step estimation of simultaneous equation panel data models with censored endogenous variables

    This paper presents some two-step estimators for a wide range of parametric panel data models with censored endogenous variables and sample selection bias. Our approach is to derive estimates of the unobserved heterogeneity responsible for the endogeneity/selection bias to include as additional explanatory variables in the primary equation. These are obtained through a decomposition of the reduced form residuals. The panel nature of the data allows adjustment, and testing, for two forms of endogeneity and/or sample selection bias. Furthermore, it incorporates roles for dynamics and state dependence in the reduced form. Finally, we provide an empirical illustration which features our procedure and highlights the ability to test several of the underlying assumptions. Keywords: Estimation; Panel Data; statistics

    Forgetting Exceptions is Harmful in Language Learning

    We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms. Comment: 31 pages, 7 figures, 10 tables. Uses 11pt, fullname, a4wide tex styles. Pre-print version of article to appear in Machine Learning 11:1-3, Special Issue on Natural Language Learning. Figures on page 22 slightly compressed to avoid page overload

    CoCalc as a Learning Tool for Neural Network Simulation in the Special Course "Foundations of Mathematic Informatics"

    The role of neural network modeling in the learning content of the special course "Foundations of Mathematic Informatics" was discussed. The course was developed for students of technical universities, future IT specialists, and is aimed at bridging the gap between theoretical computer science and its applications: software, system and computing engineering. CoCalc was justified as a learning tool for mathematical informatics in general and neural network modeling in particular. Elements of the technique of using CoCalc in studying the topic "Neural network and pattern recognition" of the special course "Foundations of Mathematic Informatics" are shown. The program code was presented in CoffeeScript and implements the basic components of an artificial neural network: neurons, synaptic connections, activation functions (tangential, sigmoid, stepped) and their derivatives, methods of calculating the network's weights, etc. The features of applying the Kolmogorov-Arnold representation theorem to determine the architecture of multilayer neural networks were discussed. The implementation of the disjunctive logical element and the approximation of an arbitrary function using a three-layer neural network were given as examples. Based on the simulation results, a conclusion was drawn about the limits within which the constructed networks retain their adequacy. Framework topics for individual research on artificial neural networks are proposed. Comment: 16 pages, 3 figures, Proceedings of the 13th International Conference on ICT in Education, Research and Industrial Applications. Integration, Harmonization and Knowledge Transfer (ICTERI, 2018)
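
    The "disjunctive logical element" mentioned in this abstract can be illustrated by a single neuron with a stepped activation that computes logical OR. The sketch below is in Python rather than the CoffeeScript used in the paper, and the function names, weights, and bias are hand-picked assumptions for illustration, not values taken from the course material.

```python
def step(x):
    """Stepped (threshold) activation: fires when the weighted input is non-negative."""
    return 1 if x >= 0 else 0


def or_neuron(a, b, weights=(1.0, 1.0), bias=-0.5):
    """Single neuron computing a OR b for binary inputs.

    The weights and bias are illustrative assumptions: any input equal to 1
    pushes the weighted sum above zero, while (0, 0) leaves it at -0.5.
    """
    return step(weights[0] * a + weights[1] * b + bias)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", or_neuron(a, b))
```

    Disjunction is linearly separable, so one neuron suffices here; a trained network would learn comparable weights instead of having them fixed by hand.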