23 research outputs found

    New algorithms and methods for protein and DNA sequence comparison

    Understanding and visualising the variation in HLA

    The HLA region is a segment of the human genome containing immune system genes, which orchestrate the defence against infection. A key aspect distinguishing the HLA genes from other genes in the human genome is their extensive levels of variation. This variation increases the depth and breadth of the weaponry used against pathogen infections, and helps impede the spread of infection within families and communities. Analysing and understanding these high levels of variation gives an insight into how the human immune system has developed and the breadth of variation seen in the human population. This is possible using data generated over the last twenty years from the sequencing of millions of individuals across the world, whose sequences have been deposited in public databases. This gives an unprecedented opportunity to compare more than 15,000 sequences and distinguish aspects of the variation that are important for immune functions from those that are not. This thesis examines methods used to assess variation and develops new methods to both catalogue and visualise the data. Initial analysis focusses on the antigen recognition domain of the HLA class I genes and is expanded to analyse the full sequence of both the HLA class I and II genes. The analysis of non-human primate orthologs to HLA is also investigated. The analysis reveals the extensive levels of variation at both the nucleotide and protein levels and identifies mechanisms responsible for generating this variation. This allows evolutionary lineages to be identified, along with a minimum set of 42 core HLA class I alleles and 47 HLA class II alleles. It also allows estimates of the total numbers of HLA alleles in the worldwide population, with the potential for 2-3 million alleles of a single gene in HLA class I and up to 1.7 million in HLA class II.
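    As a minimal sketch of the per-site tallying this kind of variation analysis rests on (assuming pre-aligned, equal-length sequences; the function and data are illustrative, not taken from the thesis):

```python
from collections import Counter

def per_site_variation(aligned_seqs):
    """Count the distinct residues observed at each column of an alignment.

    aligned_seqs: list of equal-length strings (e.g. aligned HLA exon
    sequences); gap characters ('-') are ignored when tallying variants.
    """
    n_sites = len(aligned_seqs[0])
    variation = []
    for i in range(n_sites):
        column = Counter(s[i] for s in aligned_seqs if s[i] != '-')
        variation.append(len(column))  # number of distinct residues at site i
    return variation

# Toy example with three short aligned fragments.
seqs = ["ATGGCT", "ATGACT", "ATGGTT"]
print(per_site_variation(seqs))  # -> [1, 1, 1, 2, 2, 1]
```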

    Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report 2016

    In 2005 the American Statistical Association (ASA) endorsed the Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report. This report has had a profound impact on the teaching of introductory statistics in two- and four-year institutions, and the six recommendations put forward in the report have stood the test of time. Much has happened within the statistics education community and beyond in the intervening 10 years, making it critical to re-evaluate and update this important report. For readers who are unfamiliar with the original GAISE College Report or who are new to the statistics education community, the full version of the 2005 report can be found at http://www.amstat.org/education/gaise/GaiseCollege_full.pdf and a brief history of statistics education can be found in Appendix A of this new report. The revised GAISE College Report takes into account the many changes in the world of statistics education and statistical practice since 2005 and suggests a direction for the future of introductory statistics courses. Our work has been informed by outreach to the statistics education community and by reference to the statistics education literature.

    Revealing Perceptual Proxies in Comparative Data Visualization

    Data visualization has long been shaped by empirical evidence of the efficacies of different encodings, such as length, position, or area, in conveying quantities. Less is known, however, about what may affect comparison of multiple data series, which generally involves extraction of higher-order values such as means, ranges, and correlations. In this work, we investigate such factors and the underlying visual processes that may account for them. We begin with a case study motivating the research, in which we modify Krona, a bioinformatics visualization system, to support several types of comparison. Next, we empirically examine the influence of “arrangement”—that is, whether charts are shown side-by-side, stacked vertically, overlaid, etc.—on comparative tasks, in a series of psychophysical experiments. The results suggest a complex interaction of factors, with different comparative arrangements providing benefits for different combinations of tasks and encodings. For example, overlaid charts make detecting differences easier but comparing means or ranges more difficult. While these results offer some guidance to designers, the number of interactions makes it infeasible to provide broad rankings of arrangements, as has been done previously for encodings. Our subsequent efforts thus work toward understanding the visual processes that underlie the extraction of statistical summaries needed for comparison. It has recently been proposed that simpler shortcuts, called Perceptual Proxies, are used by the visual system to estimate these values. We investigate proxies for bar charts in experiments using an “adversarial” framework, in which the ranking of two charts along a task metric (e.g. mean) is opposite their ranking along a proxy metric (e.g. convex hull area). The strongest evidence we find is for use of a “centroid” proxy to estimate means in bar charts. Finally, we attempt to use human-guided optimization to construct charts de novo, without assuming specific proxies. This work contributes both to perceptual psychology, by offering evidence for underlying visual processes that may be involved in the interpretation of comparative visualizations, and to data visualization, by providing new research methods and straightforward design guidance on how best to lay out charts to support certain tasks.
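    The adversarial framework lends itself to a small worked example. The sketch below assumes an ink-centroid definition of the proxy (the vertical centroid of the bars' combined area, sum(h^2) / (2 * sum(h))); the paper's exact proxy formulations may differ. Two charts are constructed so that their ranking by true mean is the opposite of their ranking by the proxy.

```python
def true_mean(heights):
    """Arithmetic mean of bar heights: the task metric."""
    return sum(heights) / len(heights)

def centroid_proxy(heights):
    """Vertical centroid of the bars' combined 'ink'.

    A bar of height h has area proportional to h with its own centroid at
    h/2, so the overall centroid is sum(h^2) / (2 * sum(h)). Tall bars are
    over-weighted relative to the true mean.
    """
    return sum(h * h for h in heights) / (2 * sum(heights))

# Chart A has the larger mean, but chart B has the higher ink centroid,
# so a viewer relying on the proxy would rank the charts incorrectly.
chart_a = [5, 5, 5, 5]    # mean 5.00, centroid proxy 2.50
chart_b = [1, 1, 1, 12]   # mean 3.75, centroid proxy 4.90
print(true_mean(chart_a) > true_mean(chart_b))            # True
print(centroid_proxy(chart_a) > centroid_proxy(chart_b))  # False
```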

    A systematic exploration of uncertainty in interactive systems

    Uncertainty is an inherent part of our everyday life. Humans have to deal with uncertainty every time they make a decision. The importance of uncertainty increases further in the digital world. Machine learning and predictive algorithms introduce statistical uncertainty to digital information. In addition, the rising number of sensors in our surroundings increases the amount of statistically uncertain data, as sensor data is prone to measurement errors. Hence, there is an emerging need for practitioners and researchers in Human-Computer Interaction to explore new concepts and develop interactive systems able to handle uncertainty. Such systems should not only support users in entering uncertainty in their input, but additionally present uncertainty in a comprehensible way. The main contribution of this thesis is the exploration of the role of uncertainty in interactive systems, and of how novel input and output methods can support researchers and designers in efficiently and clearly communicating uncertainty. Using empirical methods of Human-Computer Interaction and a systematic approach, we present novel input and output methods that support the comprehensive communication of uncertainty in interactive systems. We further integrate our results in a simulation tool for end-users. Based on related work, we create a systematic overview of sources of uncertainty in interactive systems to support the quantification of uncertainty and identify relevant research areas. The overview can help practitioners and researchers to identify uncertainty in interactive systems and either reduce or communicate it. We then introduce new concepts for the input of uncertain data: we enhance standard input controls, develop specific slider controls and tangible input controls, and collect physiological measurements. We also compare different representations for the output of uncertainty and make recommendations for their usage. Furthermore, we analyze how humans interpret uncertain data and make suggestions on how to avoid misinterpretations and statistically incorrect judgements. We embed the insights gained from this thesis in an end-user simulation tool to make them available for future research. The tool is intended to be a starting point for future research on uncertainty in interactive systems and to foster the communication of uncertainty and the building of trust in such systems. Overall, our work shows that user interfaces can be enhanced to effectively support users with the input and output of statistically uncertain information.
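    As a hypothetical illustration of an enhanced input control of this kind (the syntax and names below are assumptions for the sketch, not the thesis's actual controls), a plain text field could be extended to accept an explicit uncertainty margin alongside the value:

```python
import re

def parse_uncertain_input(text):
    """Parse free-text input such as '42', '42 +- 3', or '40..45' into a
    (value, margin) pair, so a form field can capture the user's stated
    uncertainty together with the value itself."""
    text = text.strip()
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*(?:\+-|±)\s*(\d+(?:\.\d+)?)", text)
    if m:  # explicit 'value +- margin'
        return float(m.group(1)), float(m.group(2))
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*\.\.\s*(-?\d+(?:\.\d+)?)", text)
    if m:  # a 'low..high' range: midpoint and half-width
        lo, hi = sorted(map(float, m.groups()))
        return (lo + hi) / 2, (hi - lo) / 2
    return float(text), 0.0  # plain value: no stated uncertainty

print(parse_uncertain_input("42 +- 3"))  # (42.0, 3.0)
print(parse_uncertain_input("40..45"))   # (42.5, 2.5)
```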

    Sequence analysis of enzymes of the shikimate pathway: Development of a novel multiple sequence alignment algorithm

    The possibility of homology modelling the shikimate pathway enzymes, 3-dehydroquinate synthase (e1), 3-dehydroquinase (e2), shikimate dehydrogenase (e3), shikimate kinase (e4) and 5-enolpyruvylshikimate 3-phosphate (EPSP) synthase (e5), is investigated. The sequences of these enzymes are analysed and the results indicate that for four of these proteins, e1, e2, e3, and e5, no structural homologues exist. Developing a model structure by homology modelling is therefore not possible. For shikimate kinase, statistically significant alignments are found to two proteins with known structures, adenylate kinase and the H-ras p21 protein. These are also judged to be biologically significant alignments. However, the alignments obtained show too little sequence identity to permit homology modelling based on primary sequence data alone. An ab initio based methodology is next applied, with the initial step being careful evaluation of multiple sequence alignments of the shikimate pathway enzymes. Altering the parameters of the available multiple sequence alignment algorithms produces a large range of differing alignments, with no objective way to choose a single alignment or construct a composite from the many produced for each shikimate pathway enzyme. This problem of obtaining a reliable alignment will occur in other low sequence identity protein families, and is addressed by the development of a novel multiple sequence alignment method, Mix'n'Match. Mix'n'Match is based on finding alternating Strongly Conserved Regions (SCRs) and Loosely Conserved Regions (LCRs) in the protein sequences. The SCRs are used as 'anchors' in the alignment and are calculated from analysis of several different multiple alignments, made using varying criteria. After dividing the sequences into SCRs and LCRs, the 'best' alignment for each LCR is chosen, independently of the other LCRs, from a selection of possibilities in the multiple alignments. To help make this choice for each LCR, the secondary structure is predicted and shown alongside each possible alignment. One advantage of this method over automatic, non-interactive methods is that the final alignment is not dependent on the choice of a single set of scoring parameters. Another is that, by allowing interactive choice and by taking account of secondary structural information, the final alignment is based more on biological than mathematical factors. This method can produce better alignments than any of the initial automatic multiple alignment methods used. The SCRs identified by Mix'n'Match are found to show good correlation with the actual secondary structural elements present in the enzyme families used to test the method. Analysis of the Mix'n'Match alignment and consensus secondary structure predictions for shikimate kinase suggests a closer match with the actual secondary structure of adenylate kinase than is found between their amino acid sequences. These proteins appear to share functional, sequence and secondary structural homology. The proposal is made that a model structure of shikimate kinase, based on the structure of adenylate kinase, could be constructed using homology modelling techniques.
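    Mix'n'Match derives its SCRs from several alternative alignments; the sketch below illustrates the anchoring idea on a single alignment (the threshold, minimum length, and names are illustrative assumptions, not the published method):

```python
from collections import Counter

def column_conservation(aligned_seqs, i):
    """Fraction of sequences sharing the most common residue at column i."""
    col = Counter(s[i] for s in aligned_seqs)
    return col.most_common(1)[0][1] / len(aligned_seqs)

def find_scrs(aligned_seqs, threshold=0.8, min_len=3):
    """Return (start, end) column ranges of Strongly Conserved Regions:
    maximal runs of at least min_len columns whose conservation meets the
    threshold. The columns between SCRs form the Loosely Conserved Regions."""
    n = len(aligned_seqs[0])
    conserved = [column_conservation(aligned_seqs, i) >= threshold
                 for i in range(n)]
    scrs, start = [], None
    for i, ok in enumerate(conserved):
        if ok and start is None:
            start = i                    # a conserved run begins
        elif not ok and start is not None:
            if i - start >= min_len:
                scrs.append((start, i))  # run long enough to be an SCR
            start = None
    if start is not None and n - start >= min_len:
        scrs.append((start, n))
    return scrs

# Toy alignment: the first three columns form one SCR.
seqs = ["MKTAYIA", "MKTGYIA", "MKTAFIA", "MKTAYLA"]
print(find_scrs(seqs))  # -> [(0, 3)]
```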

    Evolutionary genomics: statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences, genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.