454 research outputs found

    Even faster elastic-degenerate string matching via fast matrix multiplication

    An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N, which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an $O(nm^{1.5}\sqrt{\log m} + N)$-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that N is substantially larger than both n and m, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, as was done for the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction we show that a combinatorial algorithm solving the EDSM problem in $O(nm^{1.5-\epsilon} + N)$ time, for any $\epsilon > 0$, refutes this conjecture. Of course, the notion of combinatorial algorithms is not clearly defined, so our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. Two standard tools used in algorithms on strings are string periodicity and fast Fourier transform. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a non-combinatorial $O(nm^{1.381} + N)$-time algorithm for EDSM. To the best of our knowledge, we are the first to do so.
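To make the EDSM problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the abstract (it uses neither periodicity, FFT, nor fast matrix multiplication) and makes no attempt at the stated time bounds. The function name `edsm_naive`, the representation of an ED text as a list of sets of strings, and the choice to report the indices of segments in which an occurrence ends are all assumptions made for illustration.

```python
def edsm_naive(ed_text, pattern):
    """Return indices of segments in which some occurrence of `pattern` ends.

    ed_text: list of sets of strings (a set may contain the empty string).
    pattern: non-empty string.
    Illustrative brute force only; not the algorithm described in the abstract.
    """
    m = len(pattern)
    # active: lengths j (0 < j < m) such that pattern[:j] equals a suffix of
    # some string spelled by choosing one string from each earlier segment.
    active = set()
    ends = []
    for i, segment in enumerate(ed_text):
        new_active = set()
        found = False
        for s in segment:
            # Case 1: the whole pattern lies inside one string of the segment.
            if pattern in s:
                found = True
            # Case 2: extend a partial match carried over from earlier segments.
            for j in active:
                rest = m - j
                if len(s) >= rest:
                    if s.startswith(pattern[j:]):
                        found = True
                elif pattern.startswith(s, j):
                    new_active.add(j + len(s))
            # Case 3: a new partial match starts inside s and runs to its end.
            for length in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:length]):
                    new_active.add(length)
        if found:
            ends.append(i)
        active = new_active
    return ends
```

For example, with the ED text `[{"A"}, {"C", "GT"}, {"AC", "T"}]` the pattern `"AGTA"` is spelled by choosing `"A"`, `"GT"`, `"AC"`, so the occurrence ends in segment 2.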

    MALVA: Genotyping by Mapping-free ALlele Detection of Known VAriants

    The amount of genetic variation discovered in human populations is growing rapidly, leading to challenging computational tasks, such as variant calling. Standard methods for addressing this problem include read mapping, a computationally expensive procedure; thus, mapping-free tools have been proposed in recent years. These tools focus on isolated, biallelic SNPs, providing limited support for multi-allelic SNPs and short insertions and deletions of nucleotides (indels). Here we introduce MALVA, a mapping-free method to genotype an individual from a sample of reads. MALVA is the first mapping-free tool able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variants. MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools. Biological Sciences; Genetics; Genomics; Bioinformatics

    Constructing strings avoiding forbidden substrings

    We consider the problem of constructing strings over an alphabet Σ that start with a given prefix u, end with a given suffix v, and avoid occurrences of a given set of forbidden substrings. In the decision version of the problem, given a set $S_k$ of forbidden substrings, each of length k, over Σ, we are asked to decide whether there exists a string x over Σ such that u is a prefix of x, v is a suffix of x, and no $s \in S_k$ occurs in x. Our first result is an $O(|u| + |v| + k|S_k|)$-time algorithm to decide this problem. In the more general optimization version of the problem, given a set S of forbidden arbitrary-length substrings over Σ, we are asked to construct a shortest string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S occurs in x. Our second result is an O(|u| + |v| + ||S|| · |Σ|)-time algorithm to solve this problem, where ||S|| denotes the total length of the elements of S. Interestingly, our results can be directly applied to solve the reachability and shortest path problems in complete de Bruijn graphs in the presence of forbidden edges or of forbidden paths. Our algorithms are motivated by data privacy, and in particular, by the data sanitization process. In the context of strings, sanitization consists in hiding forbidden substrings from a given string by introducing the least amount of spurious information. We consider the following problem. Given a string w of length n over Σ, an integer k, and a set $S_k$ of forbidden substrings, each of length k, over Σ, construct a shortest string y over Σ such that no $s \in S_k$ occurs in y and the sequence of all other length-k fragments occurring in w is a subsequence of the sequence of the length-k fragments occurring in y. Our third result is an $O(nk|S_k| \cdot |\Sigma|)$-time algorithm to solve this problem.
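The connection to shortest paths in de Bruijn graphs can be illustrated with a small breadth-first search. The sketch below is not the algorithm from the abstract and does not achieve its time bounds: it explicitly explores states formed by the last few characters of the string built so far, which can be exponential in k. The function name `shortest_avoiding` and its parameters are hypothetical, and it covers only the case where all forbidden substrings have the same length k.

```python
from collections import deque


def shortest_avoiding(u, v, forbidden, k, alphabet):
    """Return a shortest string with prefix u and suffix v that contains no
    length-k string from `forbidden`, or None if no such string exists.

    Illustrative BFS only; not the algorithm described in the abstract.
    """
    # The prefix u is fixed, so it must itself be free of forbidden substrings.
    if any(u[i:i + k] in forbidden for i in range(len(u) - k + 1)):
        return None
    # A state keeps enough trailing characters to test both the next
    # length-k window and whether the string currently ends with v.
    keep = max(k - 1, len(v), 1)
    start = u[-keep:]
    queue = deque([(start, "")])  # (state, characters appended after u)
    seen = {start}
    while queue:
        state, added = queue.popleft()
        if state.endswith(v):
            return u + added
        for c in alphabet:
            ext = state + c
            # Skip the move if it would create a forbidden length-k window.
            if len(ext) >= k and ext[-k:] in forbidden:
                continue
            nxt = ext[-keep:]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, added + c))
    return None
```

Because every step appends exactly one character, the first accepting state reached by the search corresponds to a shortest valid string. For instance, with `u = "ab"`, `v = "ba"`, `k = 2` and `forbidden = {"bb"}` over the alphabet `"ab"`, the search returns `"aba"`, where the prefix and suffix overlap.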

    Hide and mine in strings: Hardness, algorithms, and experiments

    Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.

    Triplet-based similarity score for fully multilabeled trees with poly-occurring labels

    Motivation: The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. Results: To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data.

    A universal error measure for input predictions applied to online graph problems

    We introduce a novel measure for quantifying the error in input predictions. The error is based on a minimum-cost hyperedge cover in a suitably defined hypergraph and provides a general template which we apply to online graph problems. The measure captures errors due to absent predicted requests as well as unpredicted actual requests; hence, predicted and actual inputs can be of arbitrary size. We achieve refined performance guarantees for previously studied network design problems in the online-list model, such as Steiner tree and facility location. Further, we initiate the study of learning-augmented algorithms for online routing problems, such as the online traveling salesperson problem and the online dial-a-ride problem, where (transportation) requests arrive over time (online-time model). We provide a general algorithmic framework and we give error-dependent performance bounds that improve upon known worst-case barriers, when given accurate predictions, at the cost of slightly increased worst-case bounds when given predictions of arbitrary quality.

    Comparing Degenerate Strings

    Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length $k_i$ but this length can vary between different sets. We denote by W the sum of these lengths $k_0, k_1, \ldots, k_{n-1}$. Our main result is an $O(N + M)$-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in $O(\min\{W, n^2\}N)$ time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.

    Rapid Efficacy of riSankizumab in pretibial psoriasis invOLVEment: RESOLVE

    Background: Despite extraordinary improvements in the management of psoriasis in recent times, some areas of the body, such as the pretibial area, still show an unsatisfactory response and a more significant impact on patient quality of life. This multicentre study focuses on psoriasis affecting sensitive areas (particularly the pretibial area), its impact on quality of life and the therapeutic response to risankizumab. Methods: This multicentre prospective observational study recruited patients with moderate-to-severe psoriasis with pretibial area involvement. All patients underwent treatment with risankizumab (150 mg every 3 weeks), and efficacy was assessed after 24 weeks. Results: The study included 128 patients with a mean age of 51 years, suffering from moderate-to-severe psoriasis with involvement of the pretibial area with median total Psoriasis Area Severity Index score of 17.05 and Dermatology Life Quality Index of 16.27. The group was further divided into two sub-groups: the 'mother patch' group, in whom the very first psoriatic plaque appeared in the pretibial region (45 patients), and the 'non-mother patch' group, in whom the psoriatic lesion in the pretibial region was present but not as the first manifestation (83 patients). In order to better assess the involvement of psoriasis in the pretibial area, the pretibial plaque lesion severity index was also calculated at baseline in all patients: extent 2.75, erythema 2.64, infiltration 2.45 and desquamation 2.38. All participants in this study showed a good therapeutic response, with a reduction in all scores. Conclusions: The pretibial area is becoming an object of therapeutic interest due to some resistance to clearance and the consequent impairment of patient quality of life. This study showed that risankizumab can give favourable therapeutic results not only in patients with moderate-to-severe psoriasis with involvement of the difficult-to-treat areas but particularly in patients with recalcitrant plaques in the pretibial area.

    Different Factors Affecting Human ANP Amyloid Aggregation and Their Implications in Congestive Heart Failure

    Atrial Natriuretic Peptide (ANP)-containing amyloid is frequently found in the elderly heart. No data exist regarding the ANP aggregation process and its link to pathologies. Our aims were: i) to experimentally prove the presumptive association of Congestive Heart Failure (CHF) and Isolated Atrial Amyloidosis (IAA); ii) to characterize ANP aggregation, thereby elucidating IAA implication in the CHF pathogenesis. A significant prevalence (85%) of IAA was immunohistochemically proven ex vivo in biopsies from CHF patients. We investigated in vitro (using Congo Red, Thioflavin T, SDS-PAGE, transmission electron microscopy, infrared spectroscopy) ANP fibrillogenesis, starting from α-ANP, as well as the ability of dimeric β-ANP to promote amyloid formation. Different conditions were adopted, including those reproducing β-ANP prevalence in CHF. Our results defined the uncommon rapidity of α-ANP self-assembly at acidic pH, supporting the hypothesis that such aggregates constitute the onset of a fibrillization process subsequently proceeding at physiological pH. Interestingly, CHF-like conditions induced the production of the most stable and time-resistant ANP fibrils, suggesting that CHF-affected people may be prone to develop IAA. We established a link between IAA and CHF by ex vivo examination and assessed that β-ANP is, in vitro, the seed of ANP fibrils. Our results indicate that β-ANP plays a crucial role in ANP amyloid deposition under physiopathological CHF conditions. Overall, our findings indicate that early IAA-related ANP deposition may occur in CHF and suggest that these latter patients should be monitored for the development of cardiac amyloidosis.