4,499 research outputs found

    A double classification tree search algorithm for index SNP selection

    BACKGROUND: In population-based studies, it is generally recognized that single nucleotide polymorphism (SNP) markers are not independent. Rather, they are carried by haplotypes, groups of SNPs that tend to be coinherited. It is thus possible to choose a much smaller number of SNPs to use as indices for identifying haplotypes or haplotype blocks in genetic association studies. We refer to these characteristic SNPs as index SNPs. In order to reduce costs and work, a minimum number of index SNPs that can distinguish all SNP and haplotype patterns should be chosen. Unfortunately, this is an NP-complete problem, requiring brute force algorithms that are not feasible for large data sets. RESULTS: We have developed a double classification tree search algorithm to generate index SNPs that can distinguish all SNP and haplotype patterns. This algorithm runs very rapidly and generates very good, though not necessarily minimum, sets of index SNPs, as is to be expected for such NP-complete problems. CONCLUSIONS: A new algorithm for index SNP selection has been developed. A webserver for index SNP selection is available a
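The selection problem described in this abstract can be illustrated with a small greedy heuristic. This is a minimal sketch, not the double classification tree search the authors developed: it assumes haplotypes are given as equal-length strings of allele codes, and the function name `greedy_index_snps` is hypothetical. Like the published algorithm, it yields a good but not necessarily minimum set.

```python
from itertools import combinations


def greedy_index_snps(haplotypes):
    """Pick SNP positions that keep all distinct haplotypes distinguishable.

    Greedy heuristic (an assumption for illustration, not the paper's method):
    repeatedly choose the SNP that separates the most haplotype pairs that
    the SNPs chosen so far cannot tell apart.
    """
    distinct = sorted(set(haplotypes))  # duplicate patterns need no separation
    n_snps = len(distinct[0])
    # Pairs of haplotypes not yet distinguished by the chosen SNPs.
    unresolved = set(combinations(range(len(distinct)), 2))
    chosen = []
    while unresolved:
        best = max(
            (s for s in range(n_snps) if s not in chosen),
            key=lambda s: sum(
                distinct[a][s] != distinct[b][s] for a, b in unresolved
            ),
        )
        chosen.append(best)
        # Keep only the pairs that still look identical at the new SNP.
        unresolved = {
            (a, b) for a, b in unresolved
            if distinct[a][best] == distinct[b][best]
        }
    return sorted(chosen)


haps = ["000", "011", "101", "110", "011"]
index_snps = greedy_index_snps(haps)
projected = {"".join(h[s] for s in index_snps) for h in set(haps)}
```

Here `projected` holds each distinct haplotype restricted to the chosen positions; if the selection is valid, it contains as many entries as there are distinct haplotypes.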

    Discovery of a large set of SNP and SSR genetic markers by high-throughput sequencing of pepper (Capsicum annuum)

    Genetic markers based on single nucleotide polymorphisms (SNPs) are in increasing demand for genome mapping and fingerprinting of breeding populations in crop plants. Recent advances in high-throughput sequencing provide the opportunity for whole-genome resequencing and identification of allelic variants by mapping the reads to a reference genome. However, for many species, such as pepper (Capsicum annuum), a reference genome sequence is not yet available. To this end, we sequenced the C. annuum cv. "Yolo Wonder" transcriptome using Roche 454 pyrosequencing and assembled de novo 23,748 isotigs and 60,370 singletons. Mapping of 10,886,425 reads obtained by the Illumina GA II sequencing of C. annuum cv. "Criollo de Morelos 334" to the "Yolo Wonder" transcriptome allowed for SNP identification. By setting a threshold value that allows selecting reliable SNPs with minimal loss of information, 11,849 reliable SNPs spread across 5,919 isotigs were identified. In addition, 853 simple sequence repeats (SSRs) were obtained. This information has been made available online

    Inventory drivers in a pharmaceutical supply chain

    In recent years, inventory reduction has been a key objective of pharmaceutical companies, especially within cost optimization initiatives. Pharmaceutical supply chains are characterized by volatile and unpredictable demand (especially in emerging markets), high service levels, and complex, perishable finished-goods portfolios, which make keeping reasonable amounts of stock a true challenge. However, a one-way strategy towards zero inventory is inapplicable in practice, due to the strategic nature and importance of the products being commercialised. Therefore, pharmaceutical supply chains are in need of new inventory strategies in order to remain competitive. Finished-goods inventory management in the pharmaceutical industry is closely related to the manufacturing systems and supply chain configurations that companies adopt. The factors considered in inventory management policies, however, do not always cover the full supply chain spectrum in which companies operate. This paper works under the assumption that there is a complex relationship between the inventory configurations that companies adopt and the factors behind them. The intention of this paper is to understand the factors driving high finished-goods inventory levels in pharmaceutical supply chains and to assist supply chain managers in determining which of them can be influenced in order to reduce inventories to an optimal degree. Reasons for reducing inventory levels include high inventory-holding and scrap-related costs, in addition to sales lost when customers' shelf-life requirements cannot be met. The thesis conducts single-case-study research in a multi-national pharmaceutical company, which is used to examine typical inventory configurations and the factors affecting these configurations. This paper presents a framework that can assist supply chain managers in determining the most important inventory drivers in pharmaceutical supply chains.
The findings in this study suggest that while external and downstream supply chain factors are recognized as critical to pursuing inventory optimization initiatives, pharmaceutical companies are oriented towards optimizing production processes and meeting regulatory requirements while still complying with high service levels, so internal factors prevail when inventory management decisions are made. Furthermore, this paper investigates, through predictive modelling techniques, how various intrinsic and extrinsic factors influence the inventory configurations of the case study company. The study shows that inventory configurations are relatively unstable over time, especially configurations with high safety stock levels, and that production features and product characteristics are important explanatory factors behind high inventory levels. Regulatory requirements also play an important role in explaining the high strategic inventory levels that pharmaceutical companies hold

    Features Ranking Techniques for Single Nucleotide Polymorphism Data

    Identifying biomarkers like single nucleotide polymorphisms (SNPs) is an important topic in biomedical applications. Such SNPs can be associated with an individual’s metabolism of drugs, which makes these SNPs targets for drug therapy and useful in personalized medicine applications. Another important application is that SNPs can be associated with an individual’s genetic predisposition to develop a disease. Identifying these associations allows proactive steps to be taken to hinder, delay or eliminate the disease. However, the problem is challenging: data are high dimensional and incomplete, and features (SNPs) are correlated. The goal of this thesis is to propose feature ranking methods that reduce the number of selected features and the computational cost required to select these features in a binary classification task. The main hypothesis is that specific values within a feature might be useful in predicting specific classes, while other values are not. In this context, three heuristic methods are applied to select the best features. The methods are applied to the Wellcome Trust Case Control Consortium (WTCCC1) dataset, and evaluated on Texas A&M University Qatar’s High Performance Computing platform. The results show that the classification accuracy achieved by the proposed methods is comparable to the baseline. However, one of the proposed methods reduced the execution time of the feature selection and the number of features required to achieve accuracy similar to the baseline by 40% and 47% respectively

    Feature subset selection using support-vector machines by averaging over probabilistic genotype data

    Despite the grand promises of the postgenomic era, such as personalized prevention, diagnosis, drugs, and treatments, the landscape of biomedicine looks more and more complex. The fulfillment of these promises for diseases significant in public health requires new approaches to induction for statistical and causal inferences from observations and interventions. Within the biomedical world an important response to this challenge is the mapping and relatively cheap measuring of genetic variations, such as single nucleotide polymorphisms (SNPs). The recent mapping of genetic variations has opened a new dimension in postgenomic research at all phenotypic levels, such as genomic, proteomic, and clinical, and it has sparked a series of Genetic Association Studies (GAS), based on the application of machine learning and data mining techniques. To overcome such problems, different strategies are being investigated within the research community. The aim of this thesis work is to contribute to progress in this field by taking a step towards the solution. I have investigated the suitable machine learning and data mining algorithms for this task and the state of the art of their currently available implementations intended for biomedical research applications. As a result I have proposed a solution strategy, and chosen and extended the functionality of the Java-ML library, an open source machine learning library written in Java, implementing some missing algorithms and functionality that are necessary for the proposed approach. This thesis work is structured into three main blocks. Section 3 “An approach to the use of machine learning techniques with genotype data” addresses the problem faced and the proposed solution.
It begins with the definition of some introductory GAS concepts and the description of the solution strategy, and elaborates in subsequent subsections on the theoretical underpinnings of the algorithms that make up the solution. Specifically, the first subsection, “The feature selection problem in the bioinformatics domain”, justifies the necessity of reducing the dimensionality of data sets in order to allow for acceptable performance when machine learning techniques are applied in the broader field of bioinformatics, and establishes a comparative taxonomy of the currently available techniques. In the second subsection, entitled “Feature selection using support-vector machines”, the idea behind support-vector machine classifiers and their application to feature subset selection is defined, while the third subsection, “Ranking fusion as averaging technique: Markov chain based algorithms”, describes the ranking fusion algorithms whose implementation has been chosen for combining the feature subsets obtained from different data sets. Section 4 “Analysis of available tools for experimental design” analyses the available tools suitable for experimental design in GAS based on machine learning techniques. The first subsection, “Advantages of high level languages for machine learning algorithms”, discusses the convenience of using high level languages for the kind of applications we are working on. In the second subsection, “Machine learning algorithms implementations in Java”, the choice of the Java language is justified, followed by an analysis of the currently available implementations of machine learning algorithms in this language that are worth considering for our purposes, namely WEKA, RapidMiner and Java-ML.
In Section 5 “Implemented extensions to the Java-ML library”, the functionality that has been added to provide a framework suitable for designing GAS experiments to test the proposed approach is described. The “Missing values imputation: the dataset.tools package” subsection focuses on data set handling functionality, while the “Averaging through ranking fusion: rankingfusion and rankingfusion.scoring package” subsection details the implementations of the ranking fusion algorithms. Finally, the “How to use the code” subsection is a tutorial on how to use both the library and its extension for the development of applications. In addition to these main blocks, a final section called “Future Work” reflects on how the developed work can be used by GAS domain experts to evaluate the usefulness of the proposed technique. Ingeniería de Telecomunicació
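The Markov chain based ranking fusion mentioned in this abstract can be sketched as follows. This is an illustration only, written in Python rather than as part of the Java-ML extension, and it assumes a chain in the style of MC4 from the rank aggregation literature; the abstract does not say which chain the thesis implements. The function name `mc4_fusion`, the `damping` teleport factor, and the requirement that every ranking orders the same items are all assumptions.

```python
def mc4_fusion(rankings, damping=0.95, iters=200):
    """Fuse full rankings of the same items with an MC4-style Markov chain.

    From item i the chain proposes a uniformly random item j and moves there
    if a strict majority of rankings place j above i; otherwise it stays.
    Items are ordered by their stationary probability (highest first).
    """
    items = sorted(set(rankings[0]))
    n = len(items)
    positions = [{item: r.index(item) for item in items} for r in rankings]

    def majority_prefers(j, i):
        votes = sum(p[j] < p[i] for p in positions)
        return votes > len(positions) / 2

    # Row-stochastic transition matrix.
    P = [[0.0] * n for _ in range(n)]
    for a, i in enumerate(items):
        for b, j in enumerate(items):
            if a != b and majority_prefers(j, i):
                P[a][b] = 1.0 / n
        P[a][a] = 1.0 - sum(P[a])  # remaining mass stays on the current item

    # Power iteration; the small teleport term (an assumption) guarantees
    # a unique stationary distribution.
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for a in range(n):
            for b in range(n):
                new[b] += damping * pi[a] * P[a][b]
        pi = new

    score = dict(zip(items, pi))
    return sorted(items, key=lambda item: -score[item])


fused = mc4_fusion([
    ["snp1", "snp2", "snp3"],
    ["snp1", "snp3", "snp2"],
    ["snp2", "snp1", "snp3"],
])
```

In a feature selection setting, each input list would be the feature ranking obtained from one data set, and the fused list would be cut at the desired subset size.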

    NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION

    Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed for analyzing the large data sets that arise in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomics sequence classification. A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Based on MapSplice3, we further extend a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. The huge amount of RNA-seq data shared through public datasets provides invaluable resources for researchers to test hypotheses by reusing existing datasets. To meet the need for efficiently querying large-scale sequencing data, a novel method, called SeqOthello, has been developed. It is able to efficiently query sequence k-mers against large-scale datasets and ultimately determine whether the given sequence is present. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms. Thus efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures
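The idea of querying a sequence through its k-mers can be shown with a toy index. This sketch uses an ordinary Python dictionary in place of the Othello hashing structure that SeqOthello and MetaOthello are built on, so it illustrates the query logic only, not their memory or speed characteristics. The function names and the example datasets are hypothetical.

```python
def kmers(seq, k):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(datasets, k):
    """Map each k-mer to the names of the datasets whose reads contain it."""
    index = {}
    for name, reads in datasets.items():
        for read in reads:
            for kmer in kmers(read, k):
                index.setdefault(kmer, set()).add(name)
    return index


def query(index, seq, k):
    """Return, per dataset, the fraction of the query's k-mers found in it."""
    query_kmers = kmers(seq, k)
    counts = {}
    for kmer in query_kmers:
        for name in index.get(kmer, ()):
            counts[name] = counts.get(name, 0) + 1
    return {name: c / len(query_kmers) for name, c in counts.items()}


datasets = {"sampleA": ["ACGTAC"], "sampleB": ["TTTTGG"]}
index = build_index(datasets, k=3)
hits = query(index, "ACGTA", k=3)
```

A caller would then treat the sequence as present in a dataset when its fraction exceeds a chosen threshold; the threshold value is left open here because the abstract does not specify one.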

    Likelihood-free model choice

    Fan, and Beaumont (2017). Beyond exposing the potential pitfalls of ABC approximations to posterior probabilities, the review emphasizes mostly the solution proposed by [25] on the use of random forests for aggregating summary statistics and for estimating the posterior probability of the most likely model via a secondary random forest

    Discrete Algorithms for Analysis of Genotype Data

    The accessibility of high-throughput genotyping technology makes genome-wide association studies possible for common complex diseases. When dealing with common diseases, it is necessary to search for and analyze multiple independent causes resulting from interactions of multiple genes scattered over the entire genome. Optimization formulations have been introduced for searching for disease-associated risk/resistance factors and for predicting disease susceptibility in a given case-control study. Several discrete methods for disease association search that exploit a greedy strategy and topological properties of case-control studies have been developed. New disease susceptibility prediction methods based on the developed search methods have been validated on datasets from case-control studies for several common diseases. In our experiments, the proposed algorithms compare favorably with the existing association search and susceptibility prediction methods