74 research outputs found
Motif Discovery with Compact Approaches - Design and Applications
In the post-genomic era, the ability to predict the behavior, the function, or the structure of biological entities, as well as interactions among them, plays a fundamental role in the discovery of information to help biologists to explain biological mechanisms.
In this context, appropriate characterization of the structures under analysis, and the exploitation of combinatorial properties of sequences, are crucial steps towards the development of efficient algorithms and data structures to be able to perform the analysis of biological sequences.
Similarity is a fundamental concept in Biology. Several functional and structural properties, and evolutionary mechanisms, can be predicted by comparing new elements with already classified elements, or by comparing elements with a similar structure or function to infer the common mechanism that is at the basis of the observed similar behavior. Such elements are commonly called motifs.
Comparison-based methods for sequence analysis find their application in several biological contexts, such as identification of transcription factor binding sites, finding structural and functional similarities in proteins, and phylogeny. Therefore the development of adequate methodologies for motif discovery is of paramount interest for several fields in computational biology.
In motif discovery in biosequences, it is common to assume that statistically significant candidates are those that are likely to hide some biologically significant property. For this purpose all the possible candidates are ranked according to some statistics on words (frequency, over/under representation, etc.). Then they are presented in output for further inspection by a biologist, who identifies the most promising subsequences, and tests them in laboratory to confirm their biological significance.
Therefore, when designing algorithms for motif discovery, besides obviously aiming at time and space efficiency, particular attention should be devoted to the output representation.
In fact, even considering fixed-length strings, the size of the candidate set becomes exponential if exhaustive enumeration is applied. This is already true when only exact matches are considered as candidate occurrences, and it worsens if some kind of variability is allowed (for example, a fixed number of mismatches). Alternatively, heuristics could be used, but without the guarantee of finding the optimal solution.
The computational power of today's computers can partially reduce these effects, in particular for short candidates. However, if the output is too large to be analyzed by human inspection, the risk is to provide biologists with very fast but useless tools.
A possible solution relies on compact approaches. Compact approaches are based on the partition of the search space into classes. The classes must be designed in such a way that the score used to rank the candidates has a monotone behavior within each class. This allows the identification of a representative of each class, which is the element with the highest score. Consequently, it suffices to compute, and report in output, the score only for the representatives. In fact, we are guaranteed that for each element that has not been ranked there is another one (the representative of the class it belongs to) that is at least equally significant. The final user can then be presented with an output that has the size of the partition, rather than the size of the candidate space, with obvious advantages for the human-based analysis that follows the computer-based filtering of the pattern discovery algorithm.
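As an illustration of the partition idea described above, the following is a minimal Python sketch, not the chapter's own algorithm. It assumes one particular choice of classes (substrings that share exactly the same set of start positions) and assumes that the score does not decrease with length when the number of occurrences is fixed, so the longest word of each class can serve as its representative. The function names `compact_classes` and `representatives` are hypothetical.

```python
from collections import defaultdict


def compact_classes(text, max_len):
    """Group the substrings of text (length <= max_len) by their start positions.

    Two substrings fall in the same class when they start at exactly the same
    positions, so every member of a class has the same occurrence count.
    This choice of classes is an assumption made for illustration.
    """
    starts = defaultdict(list)
    for i in range(len(text)):
        for length in range(1, min(max_len, len(text) - i) + 1):
            starts[text[i:i + length]].append(i)

    classes = defaultdict(list)
    for word, positions in starts.items():
        classes[tuple(positions)].append(word)
    return classes


def representatives(text, max_len):
    """Return {representative word: occurrence count}, one entry per class.

    The longest word is taken as the representative, assuming a score that is
    monotone in length within a class (hypothetical scoring model).
    """
    reps = {}
    for positions, words in compact_classes(text, max_len).items():
        reps[max(words, key=len)] = len(positions)
    return reps


if __name__ == "__main__":
    for word, count in sorted(representatives("abcabc", 3).items()):
        print(word, count)
```

On the toy input "abcabc" with words of length at most 3, nine distinct substrings collapse into five classes, which is the kind of output reduction the text refers to.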
Compact approaches find applications both in searching and discovery frameworks. They can also be applied to several motif models: exact patterns, patterns with given mismatch distribution, patterns with unknown mismatch distribution, profiles (i.e. matrices), and under both i.i.d. and Markov distributions.
The purpose of this chapter is to describe the basis of compact approaches, and to provide readers with the conceptual tools for applying compact approaches to the design of their own algorithms for biosequence analysis. Moreover, examples of compact approaches that have been successfully developed for several motif models (e.g. exact words, co-occurrences, words with mismatches, etc.) will be explained, and experimental results that illustrate their power will be presented.
MissMax: Alignment-free sequence comparison with mismatches through filtering and heuristics
BACKGROUND: Measuring sequence similarity is central for many problems in bioinformatics. In several contexts alignment-free techniques based on exact occurrences of substrings are faster, but also less accurate, than alignment-based approaches. Recently, several studies attempted to bridge the accuracy gap with the introduction of approximate matches in the definition of composition-based similarity measures. RESULTS: In this work we present MissMax, an exact algorithm for the computation of the longest common substring with mismatches between each suffix of a sequence x and a sequence y. This collection of statistics is useful for the computation of two similarity measures: the longest and the average common substring with k mismatches. As a further contribution we provide a “relaxed” version of MissMax that does not guarantee the exact solution, but is faster in practice and still very precise.
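The statistic computed by MissMax can be made concrete with a brute-force reference, sketched below in Python. This is not the MissMax algorithm, which relies on filtering and heuristics not detailed in the abstract; it is a quadratic-time baseline that only pins down the quantity being computed. The function names are hypothetical, and the plain mean used in `mean_lcs_k` is an assumption, since the exact normalisation of the average measure is not given here.

```python
def lcs_k_per_suffix(x, y, k):
    """For each position i of x, return the length of the longest prefix of
    x[i:] that matches a substring of y with at most k mismatches.

    Brute-force reference: every start position j in y is tried and the match
    is extended until the (k+1)-th mismatch or the end of either sequence.
    """
    result = []
    for i in range(len(x)):
        best = 0
        for j in range(len(y)):
            mismatches = 0
            length = 0
            while i + length < len(x) and j + length < len(y):
                if x[i + length] != y[j + length]:
                    if mismatches == k:
                        break  # one more mismatch would exceed the budget
                    mismatches += 1
                length += 1
            best = max(best, length)
        result.append(best)
    return result


def mean_lcs_k(x, y, k):
    """Plain mean of the per-suffix values (assumed normalisation)."""
    values = lcs_k_per_suffix(x, y, k)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    print(lcs_k_per_suffix("abcd", "abxd", 1))
    print(mean_lcs_k("abcd", "abxd", 1))
```

A baseline of this kind is mainly useful for checking the output of a faster implementation on small inputs.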
Fast Spaced Seed Hashing
Hashing k-mers is a common function across many bioinformatics applications and it is widely used for indexing, querying and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, have come into routine use in place of k-mers. Spaced seeds improve sensitivity with respect to k-mers in many applications; however, hashing spaced seeds substantially increases the computational time. Hence, the ability to speed up hashing operations of spaced seeds would have a major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient.
In this paper we address the problem of efficient spaced seed hashing. The proposed algorithm exploits the similarity of adjacent spaced seed hash values in an input sequence in order to efficiently compute the next hash. We report a series of experiments on NGS reads hashing using several spaced seeds. In the experiments, our algorithm computes the hash values of spaced seeds with a speedup, with respect to the traditional approach, of between 1.6x and 5.3x, depending on the structure of the spaced seed.
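For reference, the position-by-position baseline that such a speedup is measured against might look like the Python sketch below. It is not the paper's algorithm: it recomputes every hash from scratch and does not reuse adjacent hash values. The 2-bit nucleotide encoding, the "1"/"0" notation for care and don't-care positions, and the name `spaced_seed_hashes` are assumptions made for illustration.

```python
# Assumed 2-bit encoding of nucleotides.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(sequence, seed):
    """Hash every window of `sequence` under the spaced seed `seed`.

    `seed` is a string over {"1", "0"}: "1" marks a position that contributes
    to the hash, "0" a position that is ignored. Each hash concatenates the
    2-bit codes of the contributing symbols, recomputed for every window.
    """
    care = [offset for offset, symbol in enumerate(seed) if symbol == "1"]
    hashes = []
    for start in range(len(sequence) - len(seed) + 1):
        value = 0
        for offset in care:
            value = (value << 2) | ENCODING[sequence[start + offset]]
        hashes.append(value)
    return hashes


if __name__ == "__main__":
    print(spaced_seed_hashes("ACGTA", "101"))
```

With a contiguous seed such as "111" the hash of the next window can be obtained from the previous one by a shift and a mask; with don't-care positions that shortcut no longer applies directly, which is the cost the abstract refers to.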
Efficient algorithms for the discovery of gapped factors
Background: The discovery of surprisingly frequent patterns is of paramount interest in bioinformatics and computational biology. Among the patterns considered, those consisting of pairs of solid words that co-occur within a prescribed maximum distance, or gapped factors, emerge in a variety of contexts of DNA and protein sequence analysis. A few algorithms and tools have been developed in connection with specific formulations of the problem; however, none can comprehensively handle each of the multiple ways in which the distance between the two terms in a pair may be defined. Results: This paper presents efficient algorithms and tools for the extraction of all pairs of words up to an arbitrarily large length that co-occur surprisingly often in close proximity within a sequence. Whereas the number of such pairs in a sequence of n characters can be Θ(n^4), it is shown that an exhaustive discovery process can be carried out in O(n^2) or O(n^3), depending on the way distance is measured. This is made possible by a prudent combination of properties of pattern maximality and monotonicity of scores, which reduce the number of word pairs to be weighed explicitly while still producing the scores attained by any of the pairs not explicitly considered. We applied our approach to the discovery of spaced dyads in DNA sequences. Conclusions: Experiments on biological datasets prove that the method is effective and much faster than exhaustive enumeration of candidate patterns. Software is freely available to academic users via the web interface.
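To make the notion of a gapped factor concrete, the Python sketch below counts the co-occurrences of one given pair of words under a single distance definition: the number of characters between the end of the first word and the start of the second. This is only one of the several distance definitions the abstract alludes to, and the code is a direct counting baseline for a fixed pair, not the paper's discovery algorithm. The function names are hypothetical.

```python
def find_all(text, word):
    """Return every start position of `word` in `text` (overlaps included)."""
    return [i for i in range(len(text) - len(word) + 1)
            if text.startswith(word, i)]


def count_cooccurrences(text, first, second, max_gap):
    """Count pairs (occurrence of `first`, occurrence of `second`) in which
    `second` starts between 0 and max_gap characters after `first` ends.

    The gap definition is an assumption made for illustration.
    """
    second_starts = find_all(text, second)
    count = 0
    for start in find_all(text, first):
        end = start + len(first)
        count += sum(1 for s in second_starts if 0 <= s - end <= max_gap)
    return count


if __name__ == "__main__":
    print(count_cooccurrences("ACTTGGTACGG", "AC", "GG", 2))
```

Enumerating every candidate pair with a routine like this is what leads to the Θ(n^4) candidate space mentioned in the abstract.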
MetaProb: Accurate metagenomic reads binning based on probabilistic sequence signatures
MOTIVATION: Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Taxonomic analysis of microbial communities, a process referred to as binning, is one of the most challenging tasks when analyzing metagenomic reads data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species and the limitations due to short read lengths and sequencing errors. RESULTS: MetaProb is a novel assembly-assisted tool for unsupervised metagenomic binning. The novelty of MetaProb derives from solving a few important problems: how to divide reads into groups of independent reads, so that k-mer frequencies are not overestimated; how to convert k-mer counts into probabilistic sequence signatures, that will correct for variable distribution of k-mers, and for unbalanced groups of reads, in order to produce better estimates of the underlying genome statistic; how to estimate the number of species in a dataset. We show that MetaProb is more accurate and efficient than other state-of-the-art tools in binning both short reads datasets (F-measure 0.87) and long reads datasets (F-measure 0.97) for various abundance ratios. Also, the estimation of the number of species is more accurate than MetaCluster. On a real human stool dataset MetaProb identifies the most predominant species, in line with previous human gut studies. AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/samu661/metaprob SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
A multistep bioinformatic approach detects putative regulatory elements in gene promoters
BACKGROUND: Searching for approximate patterns in large promoter sequences frequently produces an exceedingly high number of results. Our aim was to exploit biological knowledge for definition of a sheltered search space and of appropriate search parameters, in order to develop a method for identification of a tractable number of sequence motifs. RESULTS: Novel software (COOP) was developed for extraction of sequence motifs, based on clustering of exact or approximate patterns according to the frequency of their overlapping occurrences. Genomic sequences of 1 Kb upstream of 91 genes differentially expressed and/or encoding proteins with relevant function in adult human retina were analyzed. Methodology and results were tested by analysing 1,000 groups of putatively unrelated sequences, randomly selected among 17,156 human gene promoters. When applied to a sample of human promoters, the method identified 279 putative motifs frequently occurring in retina promoter sequences. Most of them are localized in the proximal portion of promoters, are less variable in the central region than in the lateral regions, and are similar to known regulatory sequences. COOP software and reference manual are freely available upon request to the Authors. CONCLUSION: The approach described in this paper seems effective for identifying a tractable number of sequence motifs with putative regulatory role.
Plasma Total Cysteine and Cardiovascular Risk Burden: Action and Interaction
We hypothesized that redox analysis could provide sensitive markers of the oxidative pathway associated with the presence of an increasing number of cardiovascular risk factors (RFs), independently of type. We classified 304 subjects without cardiovascular disease into 4 groups according to the total number of RFs (smoking, hypertension, hypercholesterolaemia, hyperhomocysteinaemia, diabetes, obesity, and their combination). Oxidative stress was evaluated by measuring plasma total and reduced homocysteine, cysteine (Cys), glutathione, cysteinylglycine, blood reduced glutathione, and malondialdehyde. Twenty-seven percent of subjects were in group 0 RF, 26% in 1 RF, 31% in 2 RF, and 16% in ≥3 RF. By multivariable ordinal regression analysis, plasma total Cys was associated with a higher number of RFs (OR = 1.068; 95% CI = 1.027–1.110, P = 0.002). Total RF burden is associated with increased total Cys levels. These findings support a prooxidant effect of Cys in conjunction with RF burden, and shed light on the pathophysiologic role of redox state imbalance in preclinical atherosclerosis.
Surgical Antimicrobial Prophylaxis in Patients of Neonatal and Pediatric Age Undergoing Orthopedic and Hand Surgery: A RAND/UCLA Appropriateness Method Consensus Study
Surgical site infections (SSIs) represent a potential complication in any type of surgery and can occur up to one year after the procedure in the case of implant placement. In the field of orthopedic and hand surgery, the rate of SSIs is a relevant issue, considering the need for the placement of synthesis devices and the type of some interventions (e.g., exposed fractures). This work aims to provide guidance on the management of peri-operative antibiotic prophylaxis for the pediatric and neonatal population undergoing orthopedic and hand surgery in order to standardize the management of patients and to reduce, on the one hand, the risk of SSI and, on the other, the development of antimicrobial resistance. The following scenarios were considered: (1) bloodless fracture reduction; (2) reduction of unexposed fracture and grade I and II exposed fracture; (3) reduction of grade III exposed fracture or traumatic amputation; (4) cruel fracture reduction with percutaneous synthesis; (5) non-traumatic amputation; (6) emergency intact skin trauma surgery and elective surgery without synthetic media placement; (7) elective orthopedic surgery with prosthetic and/or synthetic media placement and spinal surgery; (8) clean elective hand surgery with and without bone involvement, without use of synthetic means; (9) surgery of the hand on an elective basis with bone involvement and/or with use of synthetic means. This manuscript has been made possible by the multidisciplinary contribution of experts belonging to the most important Italian scientific societies and represents, in our opinion, the most complete and up-to-date collection of recommendations regarding the behavior to be adopted in the peri-operative setting in neonatal and pediatric orthopedic and hand surgery. The specific scenarios developed are aimed at guiding the healthcare professional in practice to ensure better, standardized management of neonatal and pediatric patients, and are designed for easy consultation.