521 research outputs found

    Automated linear motif discovery from protein interaction network

    Master's, Master of Science

    Beyond Hypergraph Dualization

    This problem concerns hypergraph dualization and its generalization to poset dualization. A hypergraph H = (V, E) consists of a finite collection E of sets over a finite set V, i.e. E ⊆ P(V) (the powerset of V). The elements of E are called hyperedges, or simply edges. A hypergraph is said to be simple if none of its edges is contained within another. A transversal (or hitting set) of H is a set T ⊆ V that intersects every edge of E. A transversal is minimal if it does not contain any other transversal as a subset. The set of all minimal transversals of H is denoted by Tr(H). The hypergraph (V, Tr(H)) is called the transversal hypergraph of H. Given a simple hypergraph H, the hypergraph dualization problem (Trans-Enum for short) concerns the enumeration without repetitions of Tr(H). The Trans-Enum problem can also be formulated as a dualization problem in posets. Let (P, ≤) be a poset (i.e. ≤ is a reflexive, antisymmetric, and transitive relation on the set P). For A ⊆ P, ↓A (resp. ↑A) is the downward (resp. upward) closure of A under the relation ≤ (i.e. ↓A is an ideal and ↑A a filter of (P, ≤)). Two antichains (B⁺, B⁻) of P are said to be dual if ↓B⁺ ∪ ↑B⁻ = P and ↓B⁺ ∩ ↑B⁻ = ∅. Given an implicit description of a poset P and an antichain B⁺ (resp. B⁻) of P, the poset dualization problem (Dual-Enum for short) enumerates the set B⁻ (resp. B⁺), denoted by Dual(B⁺) = B⁻ (resp. Dual(B⁻) = B⁺). Notice that the function Dual is an involution, i.e. Dual(Dual(B)) = B.
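The definitions of transversal and Tr(H) in the abstract above can be illustrated with a small brute-force sketch. It is not the method of the cited work and is exponential in |V|; the function name `minimal_transversals` and the example hypergraph are assumptions made only for illustration.

```python
from itertools import combinations


def minimal_transversals(vertices, edges):
    """Enumerate Tr(H) for H = (vertices, edges) by brute force.

    Hypothetical helper for illustration only: it tries every subset of
    the vertex set, smallest first, so it is only usable on tiny inputs.
    """
    vertices = sorted(vertices)
    edges = [frozenset(e) for e in edges]
    found = []
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            t = frozenset(candidate)
            # A transversal must intersect every edge.
            if not all(t & e for e in edges):
                continue
            # Candidates are visited by increasing size, so a transversal
            # is minimal exactly when it contains no transversal found earlier.
            if any(m <= t for m in found):
                continue
            found.append(t)
    return found


# Example: a path-like hypergraph on four vertices.
H = [{1, 2}, {2, 3}, {3, 4}]
tr = minimal_transversals({1, 2, 3, 4}, H)
print([sorted(t) for t in tr])  # [[1, 3], [2, 3], [2, 4]]
```

Applying the same function to the result returns the edges of H again for this simple hypergraph, which matches the statement Dual(Dual(B)) = B.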

    Teaching Google search techniques in an L2 academic writing context

    This mixed-method study examines the effectiveness of teaching Google search techniques (GSTs) to Korean EFL college students in an intermediate-level academic English writing course. Eighteen students participated in a 4-day GST workshop consisting of an overview session on the web as corpus and Google as a concordancer, and three training sessions targeting the use of quotation marks (“”) and a wildcard (*). Each session contained a pre-test, a 30-minute training, and a post-test, and each training session focused on one of the three key writing points: articles, collocations, and paraphrasing. Two questionnaires, on demographic information and on GST learning experiences, were administered. The results showed a statistically significant effect for the overall gain score. In particular, participants’ use of articles greatly improved after the training, in contrast to their use of collocations and paraphrasing. Lack of grammar and vocabulary knowledge seemed to hinder their data-driven learning, especially for collocation use and paraphrasing. The questionnaire data showed that all students found the GSTs beneficial, mostly because they were easy to use for confirmation and correction. Overall, both quantitative and qualitative data suggest that teachers’ meticulous guidance and vigilant individualized feedback are necessary to facilitate L2 self-directed Google-informed writing.

    Pattern Discovery from Biosequences

    In this thesis we have developed novel methods for analyzing biological data: the primary sequences of DNA and proteins, microarray-based gene expression data, and other functional genomics data. The main contribution is the development of the pattern discovery algorithm SPEXS, accompanied by several practical applications to real biological problems. To perform these studies, which integrate different types of biological data, we have developed a comprehensive web-based biological data analysis environment, Expression Profiler (http://ep.ebi.ac.uk/).

    Recognition of short functional motifs in protein sequences

    The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would provide advantages over existing solutions for the users. The users are typically biological laboratory researchers who want to elucidate a function of a protein that is possibly mediated by a short motif. Such a function can involve subcellular localization, secretion, post-translational modification or degradation of proteins. Conducting such studies only with experimental techniques is often associated with high costs and risks of uncertainty. Preliminary prediction of putative motifs with computational methods, which are fast and much less expensive, makes it possible to generate hypotheses and therefore to plan experiments in a more directed and efficient way. To meet this goal, I have developed HH-MOTiF, a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I have identified conceptual gaps in theoretical studies, as well as in existing practical solutions, for comparing two sets of positional data annotating the same set of biological sequences. To close these gaps and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool.

    A Novel Tree Structure for Pattern Matching in Biological Sequences

    This dissertation proposes a novel tree structure, Error Tree (ET), to more efficiently solve the Approximate Pattern Matching problem, a fundamental problem in bioinformatics and information retrieval. The problem involves different matching measures such as the Hamming distance, edit distance, and wildcard matching. The input is usually a text of length n over a fixed alphabet Σ, a pattern P of length m, and an integer k. The output is all substrings of the text that are at a distance ≤ k from P by Hamming distance, edit distance, or wildcard matching. An immediate application of approximate pattern matching is the Planted Motif Search, an important problem in many biological applications such as finding promoters, enhancers, locus control regions, transcription factors, etc. The (l, d)-Planted Motif Search is defined as follows: given n sequences over an alphabet Σ, each of length m, and two integers l and d, find a motif M of length l such that each sequence contains at least one l-mer (substring of length l) at a Hamming distance of ≤ d from M. Based on the ET structure, our algorithm ET-Motif solves this problem efficiently in time and space. The thesis also discusses how the ET structure may add efficiency when it comes to Genome Assembly and DNA Sequence Compression. Current high-throughput sequencing technologies generate millions or billions of short reads (100-1000 bases) that are sequenced from a genome that is millions or billions of bases long. The De novo Genome Assembly problem is to reconstruct the original genome in pieces that are as long and accurate as possible. Although high quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter are costly to generate. Moreover, the recent GAGE-B study showed that a remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with a very high coverage.
This thesis introduces a novel Hierarchical Genome Assembly (HGA) method that takes further advantage of such high coverage by independently assembling disjoint subsets of reads, combining assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads. We empirically evaluate this methodology for eight leading assemblers using seven GAGE-B bacterial datasets consisting of 100bp Illumina HiSeq and 250bp Illumina MiSeq reads with coverage ranging from 100x to ∼200x. The results show that HGA leads to a significant improvement in the quality of the assembly for all evaluated assemblers and datasets. Still, assembly involves a major step: overlapping the ends of the reads while allowing a few mismatches (i.e. the approximate matching problem). This requires computing overlaps between the ends of all pairs of reads, which is computationally intensive when mismatches are allowed. The ET structure may further speed up this step. Lastly, due to the significant amount of DNA data generated by next-generation sequencing machines, there is an increasing need to compress such data to reduce the storage space and transmission time. Huffman encoding that incorporates DNA sequence characteristics compresses DNA data better. Different implementations of Huffman trees, centering on the selection of frequent repeats, are introduced in this thesis. Experimental results demonstrate improvement in the compression ratios for five genomes with lengths ranging from 5Mbp to 50Mbp, compared with the use of a standard Huffman tree algorithm. Hence, the thesis suggests an improvement on all DNA sequence compression algorithms that employ the conventional Huffman encoding. Moreover, approximate repeats can be compressed by encoding the Hamming or edit distance between them, which further improves the results. However, computing such distances incurs additional costs in both time and space.
These costs can be reduced by using the ET structure.
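The (l, d)-Planted Motif Search definition in the abstract above can be made concrete with a naive exhaustive sketch. This is not the ET-Motif algorithm and does not use the Error Tree structure; it simply tries every candidate l-mer, so it costs on the order of |Σ|^l checks. The function names, the default DNA alphabet, and the example sequences are assumptions for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, seq, d):
    """True if some l-mer of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(
        hamming(motif, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(seqs, l, d, alphabet="ACGT"):
    """Return every l-mer M that has a d-neighbour in each sequence.

    Hypothetical brute-force baseline: enumerates all |alphabet|**l
    candidates, so it is only practical for very small l.
    """
    result = []
    for letters in product(alphabet, repeat=l):
        motif = "".join(letters)
        if all(occurs_within(motif, s, d) for s in seqs):
            result.append(motif)
    return result


# Toy input (assumed): "ACG" is the only 3-mer present exactly in all three.
seqs = ["ACGTAC", "TTACGA", "GGACGT"]
print(planted_motifs(seqs, 3, 0))  # ['ACG']
```

Raising d enlarges the set of reported motifs, since every exact motif is also a valid motif for any larger mismatch budget.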