    The Longest Common Subsequence via Generalized Suffix Trees

    Given two strings S1 and S2, finding the longest common subsequence (LCS) is a classical problem in computer science. Many algorithms have been proposed to find the longest common subsequence between two strings. The most common and widely used method is the dynamic programming approach, which runs in quadratic time and takes quadratic space. Other algorithms have since been introduced to solve the LCS problem in less time and space. In this work, we present a new algorithm to find the longest common subsequence using the generalized suffix tree and a directed acyclic graph. The generalized suffix tree (GST) is the combined suffix tree for a set of strings {S1, S2, ..., Sn}. Both the suffix tree and the generalized suffix tree can be computed in linear time and linear space. One application of the generalized suffix tree is finding the longest common substring between two strings, but finding the longest common subsequence is not straightforward using the generalized suffix tree. Here we describe how the GST can be used to find the common substrings between two strings, and introduce a new approach to calculate the longest common subsequence (LCS) from those common substrings. This method takes a different view of the LCS problem, shedding more light on novel applications of the LCS. We also show how this method can motivate the development of new compression techniques for genome resequencing data.
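    The abstract's core idea is to turn common substrings into nodes of a directed acyclic graph and read off the LCS as a longest path. As a rough illustration only (not the authors' algorithm, which extracts maximal common substrings with a generalized suffix tree), the following Python sketch uses the degenerate case of length-1 common substrings, i.e. single matching characters, where the longest path in the DAG already recovers an LCS:

        # Sketch: LCS as the longest path in a DAG of match points.
        # Nodes are pairs (i, j) with s1[i] == s2[j]; an edge goes from
        # (i, j) to (i2, j2) whenever i2 > i and j2 > j, so every path
        # spells a common subsequence. The paper's DAG uses maximal
        # common substrings from the GST instead of single characters.
        from functools import lru_cache

        def lcs_via_dag(s1: str, s2: str) -> str:
            # Match points, generated in increasing (i, j) order,
            # which is already a topological order of the DAG.
            nodes = [(i, j) for i, a in enumerate(s1)
                            for j, b in enumerate(s2) if a == b]

            @lru_cache(maxsize=None)
            def longest_from(k: int) -> str:
                i, j = nodes[k]
                best = ""
                for k2 in range(k + 1, len(nodes)):
                    i2, j2 = nodes[k2]
                    if i2 > i and j2 > j:  # edge of the DAG
                        cand = longest_from(k2)
                        if len(cand) > len(best):
                            best = cand
                return s1[i] + best

            return max((longest_from(k) for k in range(len(nodes))),
                       key=len, default="")

        print(lcs_via_dag("AGCAT", "GAC"))  # prints "AC", one valid LCS

    Using maximal common substrings as nodes, as the paper does, shrinks the graph dramatically, since a long exact match collapses into a single node rather than one node per character.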

    Learning from Auxiliary Sources in Argumentative Revision Classification

    We develop models to classify desirable reasoning revisions in argumentative writing. We explore two approaches -- multi-task learning and transfer learning -- to take advantage of auxiliary sources of revision data for similar tasks. Results of intrinsic and extrinsic evaluations show that both approaches can indeed improve classifier performance over baselines. While multi-task learning shows that training on different sources of data at the same time may improve performance, transfer learning better captures the relationship between the datasets.
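    To make the two setups concrete, here is a minimal PyTorch sketch of the multi-task variant: a shared encoder with one classification head per revision-data source. All module names and dimensions are illustrative assumptions, not the authors' architecture; in the transfer-learning variant one would instead pre-train the encoder on the auxiliary source and fine-tune on the target task.

        import torch
        import torch.nn as nn

        class MultiTaskRevisionClassifier(nn.Module):
            """Shared encoder with one head per revision-data source."""

            def __init__(self, input_dim=768, hidden_dim=256,
                         num_tasks=2, num_classes=2):
                super().__init__()
                # Encoder shared across all tasks: this is where the
                # auxiliary sources can help the target task.
                self.encoder = nn.Sequential(
                    nn.Linear(input_dim, hidden_dim),
                    nn.ReLU(),
                )
                # One lightweight classification head per task.
                self.heads = nn.ModuleList(
                    nn.Linear(hidden_dim, num_classes)
                    for _ in range(num_tasks)
                )

            def forward(self, x, task_id):
                return self.heads[task_id](self.encoder(x))

        model = MultiTaskRevisionClassifier()
        x = torch.randn(4, 768)        # stand-in for sentence embeddings
        logits = model(x, task_id=0)   # batch of predictions for task 0

    Training alternates batches from the different sources so the shared encoder sees all of them, while each head specializes to its own task.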

    A New Algorithm for “the LCS problem” with Application in Compressing Genome Resequencing Data

    Background: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. Methods: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. Results: Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, an overall compression ratio of 360. This compares favorably to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011). Conclusion: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
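    The abstract does not spell out the encoding, but the flavor of reference-based compression it builds on can be sketched simply: stretches of the target genome that match the reference are stored as cheap (position, length) copy operations, and only the differences are stored verbatim. The greedy Python toy below is an illustrative stand-in, not the paper's LCS-component scheme:

        def encode(reference: str, target: str, min_match: int = 4):
            """Encode target as copy/literal ops against a reference."""
            ops, i = [], 0
            while i < len(target):
                # Greedily grow the longest reference match that starts
                # at position i of the target.
                best_pos, best_len = -1, 0
                length = min_match
                pos = reference.find(target[i:i + length])
                while pos != -1 and i + length <= len(target):
                    best_pos, best_len = pos, length
                    length += 1
                    pos = reference.find(target[i:i + length])
                if best_len >= min_match:
                    ops.append(("copy", best_pos, best_len))  # two ints
                    i += best_len
                else:
                    ops.append(("literal", target[i]))        # raw base
                    i += 1
            return ops

        ref = "ACGTACGTGGAACGT"
        tgt = "ACGTACGTTTAACGT"
        print(encode(ref, tgt))
        # [('copy', 0, 8), ('literal', 'T'), ('literal', 'T'), ('copy', 10, 5)]

    Because two human genomes share the overwhelming majority of their bases, almost everything collapses into copy operations, which is what makes compression ratios in the hundreds attainable.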

    Desirable revisions of evidence and reasoning for argumentative writing

    Successful essay writing by students typically involves multiple rounds of revision and assistance from teachers, peers, or automated writing evaluation (AWE) systems. Natural language processing (NLP) has become a key component of AWE systems, with NLP being used to assess the content and structure of student writing. Typically, students are involved in cycles of essay drafting and revising, with or without AWE systems. After drafting an essay, students often receive formative feedback, either generated automatically by a system or provided by humans such as teachers or student peers. During the revision process, students then produce texts in line with the feedback to improve the quality of the essay. Hence, analyzing student revisions in terms of their desirability for improving the essay is important. Current intelligent writing assistant tools typically provide instant feedback by locating problems in the text (e.g., a spelling mistake) and suggesting possible solutions, but they fail to tell whether the user successfully implemented the feedback, especially feedback that involves higher-level semantic analysis (e.g., asking for a better example). In this thesis, we take a step towards advancing automated revision analysis capabilities. First, we propose a framework for analyzing the nature of students' revisions of evidence use and reasoning in text-based argumentative essay writing tasks. Using statistical analysis, we evaluate the reliability of the proposed framework and establish the relationship of the scheme to essay improvement. Then we propose computational models for the automatic classification of desirable revisions. We explore two ways to improve the prediction of revision desirability: the context of the revision, and the feedback students received before the revision. To the best of our knowledge, this is the first study to explore using feedback messages for a revision classification task. Finally, we explore how auxiliary knowledge from a different writing task might help identify desirable revisions, using a multi-task model and transfer learning.
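    As a purely illustrative rendering of the feedback idea, the sketch below pairs each revision with the feedback message that preceded it and feeds both to a simple classifier; the toy data, features, and model are assumptions, not the thesis's actual pipeline:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Each example pairs a revision with the feedback that prompted
        # it; the label marks a desirable (1) or undesirable (0) revision.
        revisions = ["added a quote from the article to support the claim",
                     "fixed a typo in the second sentence"]
        feedback  = ["use evidence from the text to support your argument",
                     "use evidence from the text to support your argument"]
        labels    = [1, 0]

        # Simple fusion: concatenate revision and feedback into one input.
        inputs = [r + " [SEP] " + f for r, f in zip(revisions, feedback)]

        clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
        clf.fit(inputs, labels)
        print(clf.predict([revisions[0] + " [SEP] " + feedback[0]]))

    The same revision can be desirable under one feedback message and not under another, which is why conditioning the classifier on the feedback can help.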
