7 research outputs found

    Pivot Selection for Median String Problem

    Full text link
    The Median String Problem is W[1]-hard under the Levenshtein distance, so approximation heuristics are used. Perturbation-based heuristics have proved very competitive in the trade-off between approximation accuracy and convergence speed. However, their computational burden increases with the size of the set. In this paper, we explore the idea of reducing the size of the problem by selecting a subset of representative elements, i.e. pivots, which are used to compute the approximate median instead of the whole set. We aim to reduce the computation time through a reduction of the problem size while achieving similar approximation accuracy. We explain how we find those pivots and how to compute the median string from them. Results on commonly used test data suggest that our approach can reduce the computational requirements (measured in computed edit distances) by 88% while matching the approximation accuracy of the state-of-the-art heuristic. This work has been supported in part by CONICYT-PCHA/Doctorado Nacional/2014-63140074 through a Ph.D. scholarship; Universidad Católica de la Santísima Concepción through the research project DIN-01/2016; the European Union's Horizon 2020 programme under Marie Skłodowska-Curie grant agreement 690941; the Millennium Institute for Foundational Research on Data (IMFD); FONDECYT-CONICYT grant number 1170497; and, for O. Pedreira, Xunta de Galicia/FEDER-UE refs. CSI ED431G/01 and GRC: ED431C 2017/58.
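
    The pivot idea admits a compact sketch: estimate the median against a small set of pivots rather than the whole input. The greedy farthest-point pivot selection and the single-edit hill climbing below, as well as the parameters k and alphabet, are illustrative stand-ins and not the paper's exact procedure.

```python
# Sketch of a pivot-based approximate median string; pivot selection and
# refinement here are illustrative choices, not the paper's heuristic.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def select_pivots(strings, k):
    """Pick k mutually distant strings (greedy farthest-point selection)."""
    pivots = [strings[0]]
    while len(pivots) < k:
        pivots.append(max(strings,
                          key=lambda s: min(levenshtein(s, p) for p in pivots)))
    return pivots

def approximate_median(strings, k=5, alphabet="ACGT"):
    """Estimate the median string using only distances to the pivots."""
    pivots = select_pivots(strings, min(k, len(strings)))
    cost = lambda cand: sum(levenshtein(cand, p) for p in pivots)
    median = min(pivots, key=cost)                       # set median over pivots
    improved = True
    while improved:                                      # single-edit hill climbing
        improved = False
        candidates = []
        for i in range(len(median) + 1):
            candidates.append(median[:i] + median[i + 1:])          # deletion
            for c in alphabet:
                candidates.append(median[:i] + c + median[i:])      # insertion
                candidates.append(median[:i] + c + median[i + 1:])  # substitution
        best = min(candidates, key=cost)
        if cost(best) < cost(median):
            median, improved = best, True
    return median

reads = ["ACGTT", "AGGTT", "ACGAT", "ACCTT", "TCGTT"]
print(approximate_median(reads, k=3))
```

    Only distances to the pivots are ever computed, which is where the reduction in edit-distance evaluations comes from.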

    Learning Differentiable Programs with Admissible Neural Heuristics

    Get PDF
    We study the problem of learning differentiable functions expressed as programs in a domain-specific language. Such programmatic models can offer benefits such as composability and interpretability; however, learning them requires optimizing over a combinatorial space of program "architectures". We frame this optimization problem as a search in a weighted graph whose paths encode top-down derivations of program syntax. Our key innovation is to view various classes of neural networks as continuous relaxations over the space of programs, which can then be used to complete any partial program. This relaxed program is differentiable and can be trained end-to-end, and the resulting training loss is an approximately admissible heuristic that can guide the combinatorial search. We instantiate our approach on top of the A* algorithm and an iteratively deepened branch-and-bound search, and use these algorithms to learn programmatic classifiers in three sequence classification tasks. Our experiments show that the algorithms outperform state-of-the-art methods for program learning, and that they discover programmatic classifiers that yield natural interpretations and achieve competitive accuracy.
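
    The search formulation can be illustrated structurally: A* over top-down derivations of a toy grammar, with the neural relaxation reduced to a stub. The grammar, edge costs, and heuristic below are invented for illustration and are not the paper's DSL or trained relaxation.

```python
import heapq

# Structural sketch: A* over top-down derivations of a toy grammar, where the
# heuristic on a partial program would be the training loss of a neural
# relaxation filling its remaining holes.

GRAMMAR = {  # nonterminal -> list of (production, structural edge cost)
    "E": [(("add", "E", "E"), 1.0), (("map", "F", "x"), 1.0), (("x",), 0.5)],
    "F": [(("relu",), 0.5), (("linear",), 0.5)],
}

def expand(program):
    """Yield (child, edge_cost) by rewriting the leftmost nonterminal (hole)."""
    for idx, sym in enumerate(program):
        if sym in GRAMMAR:
            for production, cost in GRAMMAR[sym]:
                yield program[:idx] + production + program[idx + 1:], cost
            return                      # expand only the leftmost hole

def neural_heuristic(program):
    """Stub: stands in for the loss of a trained neural relaxation that
    completes the remaining holes; here it simply counts them."""
    return 0.1 * sum(sym in GRAMMAR for sym in program)

def a_star(start=("E",)):
    frontier = [(neural_heuristic(start), 0.0, start)]
    while frontier:
        f, g, program = heapq.heappop(frontier)
        if not any(sym in GRAMMAR for sym in program):
            return program, g           # complete (hole-free) program found
        for child, edge_cost in expand(program):
            g_child = g + edge_cost
            heapq.heappush(frontier,
                           (g_child + neural_heuristic(child), g_child, child))

print(a_star())
```

    An admissible (never overestimating) heuristic keeps this search optimal; the paper's point is that the relaxation's training loss approximately plays that role.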

    Machine Learning Approaches for Cancer Analysis

    Get PDF
    In addition, we propose several machine learning models as contributions to solving biological problems. First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its number of unique k-mers as its score, while also taking into account other factors such as ambiguous nucleotides or a high GC-content percentage in k-mers. Based on a z-score threshold, Zseq then sweeps through the sequences again and filters out those with a z-score below the user-defined threshold (a minimal sketch of this filtering step follows this abstract). Zseq provides a better mapping rate and reduces the number of ambiguous bases significantly in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples than another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that a suitable combination of clustering algorithm, distance function, and internal cluster-validation index can identify outlier transcripts, i.e., transcripts whose abundance trends across the stages of prostate cancer differ from those of the majority. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance. Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species, termed outlier transcripts, that exhibit unique trending patterns compared to most other transcripts during disease progression. Unlike hierarchical clustering based on Euclidean distance, this method isolates those outliers rather than merely finding patterns among the trending transcripts. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the results of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a large share of cancer cases and deaths worldwide. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that predicts the breast cancer subtype. Suitable filter feature-selection methods and new hybrid feature-selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among patients who received various treatments may help to understand the relationship between survivability and treatment therapy based on gene expression.
In the fourth contribution, we have built a classifier that predicts whether a given breast cancer patient who underwent some form of treatment (hormone therapy, radiotherapy, or surgery) will survive beyond five years after the treatment. Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many of these biomarkers through the literature. Studying gene expression across various time intervals of breast cancer survival may provide insights into the recovery of patients, and the discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from the other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease.
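
    The Zseq filtering step described above admits a short sketch: score each read by its number of unique k-mers, apply simple penalties for ambiguous bases and extreme GC content, convert the scores to z-scores, and keep reads at or above a threshold. The penalty weights, k, and the threshold below are placeholders, not Zseq's published parameters.

```python
import math

# Sketch of a Zseq-style filter: unique-k-mer complexity score with simple
# penalties (the exact weighting is a guess), then z-score thresholding.

def complexity_score(read, k=6):
    kmers = {read[i:i + k] for i in range(max(len(read) - k + 1, 0))}
    score = float(len(kmers))
    score -= read.count("N")                             # penalize ambiguous bases
    gc = (read.count("G") + read.count("C")) / max(len(read), 1)
    if gc < 0.2 or gc > 0.8:                             # penalize extreme GC content
        score -= 0.1 * len(kmers)
    return score

def zseq_filter(reads, k=6, z_threshold=-1.0):
    scores = [complexity_score(r, k) for r in reads]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
    return [r for r, s in zip(reads, scores) if (s - mean) / std >= z_threshold]

reads = ["ACGTACGTGGCTAAGC",      # complex read, kept
         "AAAAAAAAAAAAAAAA",      # low-complexity read, dropped
         "ACGTNNNNACGTNNNN",      # ambiguity-heavy read, dropped
         "GCGTACCTGGATAAGC"]      # complex read, kept
print(zseq_filter(reads, z_threshold=-0.5))
```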

    Flow time series clustering for demand pattern recognition in drinking water distribution systems: New insights about the most adequate methods

    Get PDF
    This study presents clustering methodologies for demand pattern recognition using network flow data collected from a large set of drinking water distribution networks in Portugal. Most existing studies on clustering flow time series rely on hierarchical or k-Means clustering algorithms with inelastic distance measures. This study explores alternative clustering algorithms, distance measures, comparison time windows, internal index metrics, and clustering prototypes. The performance of the alternative clustering methodologies was assessed in terms of multiple internal index metrics and the characterization of the cluster centroids. The best-performing methods were the partitional algorithm with DTW distance, PAM prototypes, and a 15-minute time window, and the partitional algorithm with GAK distance, PAM prototypes, and a 15-minute time window, because they allow a clear partition of the flow time series into three clusters. The first method identifies a night consumption pattern, a typical weekend pattern, and a typical working-day pattern, whereas the second identifies a pattern with small variability between night and daytime consumption. To improve knowledge extraction, in terms of typical and anomalous patterns, additional clustering operations were performed on the flow data belonging to the cluster with small variability between night and daytime consumption. New clusters were identified and characterized with respect to weekday, geographical location, and dry and wet months, showing that patterns associated with garden irrigation are independent of the period of the day and season of the year, which indicates inefficient water use.
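
    The best-performing setup named above (DTW distance with PAM prototypes over fixed-length windows) can be sketched with a plain DTW recursion and a naive k-medoids update. The window length, number of clusters, and synthetic profiles below are placeholders, not the study's data or tooling.

```python
import numpy as np

# Sketch: DTW distances between flow time series plus a PAM-style (k-medoids)
# partition; both the recursion and the medoid update are generic.

def dtw(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def pam_dtw(series, k=3, iters=10, seed=0):
    """Naive PAM: assign to nearest medoid, then re-pick each cluster's medoid."""
    n = len(series)
    dist = np.array([[dtw(series[i], series[j]) for j in range(n)] for i in range(n)])
    medoids = list(np.random.default_rng(seed).choice(n, size=k, replace=False))
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = [members[int(np.argmin(dist[np.ix_(members, members)].sum(axis=1)))]
                       for members in (np.where(labels == c)[0] for c in range(k))]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids

# Synthetic daily profiles at 15-minute resolution (96 samples per day).
rng = np.random.default_rng(1)
t = np.arange(96)
base = [np.sin(2 * np.pi * t / 96) + 1.2,          # working-day-like profile
        0.3 * np.ones(96),                         # flat, night-dominated profile
        np.sin(2 * np.pi * (t - 8) / 96) + 1.2]    # shifted, weekend-like profile
flows = [b + 0.05 * rng.standard_normal(96) for b in base for _ in range(2)]
print(pam_dtw(flows, k=3)[0])                      # cluster label per series
```

    Using a precomputed distance matrix with actual series as prototypes (medoids) mirrors the PAM choice in the study; an elastic measure such as DTW or GAK is what allows similar daily shapes with small time shifts to fall in the same cluster.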

    Solving hard subgraph problems in parallel

    Get PDF
    This thesis improves the state of the art in exact, practical algorithms for finding subgraphs. We study maximum clique, subgraph isomorphism, and maximum common subgraph problems. These are widely applicable: within computing science, subgraph problems arise in document clustering, computer vision, the design of communication protocols, model checking, compiler code generation, malware detection, cryptography, and robotics; beyond, applications occur in biochemistry, electrical engineering, mathematics, law enforcement, fraud detection, fault diagnosis, manufacturing, and sociology. We therefore consider both the "pure" forms of these problems, and variants with labels and other domain-specific constraints. Although subgraph-finding should theoretically be hard, the constraint-based search algorithms we discuss can easily solve real-world instances involving graphs with thousands of vertices and millions of edges. We therefore ask: is it possible to generate "really hard" instances for these problems, and if so, what can we learn? By extending research into combinatorial phase transition phenomena, we develop a better understanding of branching heuristics, as well as highlighting a serious flaw in the design of graph database systems. This thesis also demonstrates how to exploit two of the kinds of parallelism offered by current computer hardware. Bit parallelism allows us to carry out operations on whole sets of vertices in a single instruction; this is largely routine. Thread parallelism, to make use of the multiple cores offered by all modern processors, is more complex. We suggest three desirable performance characteristics that we would like when introducing thread parallelism: lack of risk (parallel cannot be exponentially slower than sequential), scalability (adding more processing cores cannot make runtimes worse), and reproducibility (the same instance on the same hardware will take roughly the same time every time it is run). We then detail the difficulties in guaranteeing these characteristics when using modern algorithmic techniques. Besides ensuring that parallelism cannot make things worse, we also increase the likelihood of it making things better. We compare randomised work stealing to new tailored strategies, and perform experiments to identify the factors contributing to good speedups. We show that whilst load balancing is difficult, the primary factor influencing the results is the interaction between branching heuristics and parallelism. By using parallelism to explicitly offset the commitment made to weak early branching choices, we obtain parallel subgraph solvers which are substantially and consistently better than the best sequential algorithms.
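
    The bit-parallelism point can be illustrated by representing vertex sets as machine words, so that restricting a candidate set to a vertex's neighbourhood is a single AND. The branch-and-bound skeleton below is a generic maximum-clique search on a toy graph, not the thesis's algorithm, bound, or branching heuristic.

```python
# Vertex sets as integer bitmasks: candidate-set intersection is one AND.

def max_clique(adjacency):
    """adjacency[v] is a bitmask of v's neighbours; returns the best clique bitmask."""
    n = len(adjacency)
    best = [0]

    def expand(clique, candidates):
        if not candidates:
            if bin(clique).count("1") > bin(best[0]).count("1"):
                best[0] = clique
            return
        # Simple bound: even taking every candidate cannot beat the incumbent.
        if bin(clique).count("1") + bin(candidates).count("1") <= bin(best[0]).count("1"):
            return
        v = candidates.bit_length() - 1          # branch on the highest-numbered candidate
        # Include v, keeping only candidates adjacent to v (a single AND).
        expand(clique | (1 << v), candidates & adjacency[v])
        # Exclude v.
        expand(clique, candidates & ~(1 << v))

    expand(0, (1 << n) - 1)
    return best[0]

# Toy graph: a triangle {0, 1, 2} plus a pendant vertex 3 attached to 2.
adj = [0b0110, 0b0101, 0b1011, 0b0100]
clique = max_clique(adj)
print([v for v in range(4) if clique >> v & 1])   # -> [0, 1, 2]
```

    In an optimised solver the same idea is applied with fixed-width words and hardware popcount; the thread-parallel variants discussed in the thesis layer work splitting on top of this kind of search tree.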