156,154 research outputs found

    Dependency parsing of Turkish

    Get PDF
    The suitability of different parsing methods for different languages is an important topic in syntactic parsing. Especially lesser-studied languages, typologically different from the languages for which methods have originally been developed, poses interesting challenges in this respect. This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative free constituent order language that can be seen as the representative of a wider class of languages of similar type. Our investigations show that morphological structure plays an essential role in finding syntactic relations in such a language. In particular, we show that employing sublexical representations called inflectional groups, rather than word forms, as the basic parsing units improves parsing accuracy. We compare two different parsing methods, one based on a probabilistic model with beam search, the other based on discriminative classifiers and a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless of parsing method.We examine the impact of morphological and lexical information in detail and show that, properly used, this kind of information can improve parsing accuracy substantially. Applying the techniques presented in this article, we achieve the highest reported accuracy for parsing the Turkish Treebank

    An intuitionistic approach to scoring DNA sequences against transcription factor binding site motifs

    Get PDF
    Background: Transcription factors (TFs) control transcription by binding to specific regions of DNA called transcription factor binding sites (TFBSs). The identification of TFBSs is a crucial problem in computational biology and includes the subtask of predicting the location of known TFBS motifs in a given DNA sequence. It has previously been shown that, when scoring matches to known TFBS motifs, interdependencies between positions within a motif should be taken into account. However, this remains a challenging task owing to the fact that sequences similar to those of known TFBSs can occur by chance with a relatively high frequency. Here we present a new method for matching sequences to TFBS motifs based on intuitionistic fuzzy sets (IFS) theory, an approach that has been shown to be particularly appropriate for tackling problems that embody a high degree of uncertainty. Results: We propose SCintuit, a new scoring method for measuring sequence-motif affinity based on IFS theory. Unlike existing methods that consider dependencies between positions, SCintuit is designed to prevent overestimation of less conserved positions of TFBSs. For a given pair of bases, SCintuit is computed not only as a function of their combined probability of occurrence, but also taking into account the individual importance of each single base at its corresponding position. We used SCintuit to identify known TFBSs in DNA sequences. Our method provides excellent results when dealing with both synthetic and real data, outperforming the sensitivity and the specificity of two existing methods in all the experiments we performed. Conclusions: The results show that SCintuit improves the prediction quality for TFs of the existing approaches without compromising sensitivity. In addition, we show how SCintuit can be successfully applied to real research problems. In this study the reliability of the IFS theory for motif discovery tasks is proven

    An Exploratory Application of Rhetorical Structure Theory to Detect Coherence Errors in L2 English Writing: Possible Implications for Automated Writing Evaluation Software

    Get PDF
    This paper presents an initial attempt to examine whether Rhetorical Structure Theory (RST) (Mann & Thompson, 1988) can be fruitfully applied to the detection of the coherence errors made by Taiwanese low-intermediate learners of English. This investigation is considered warranted for three reasons. First, other methods for bottom-up coherence analysis have proved ineffective (e.g., Watson Todd et al., 2007). Second, this research provides a preliminary categorization of the coherence errors made by first language (L1) Chinese learners of English. Third, second language discourse errors in general have received little attention in applied linguistic research. The data are 45 written samples from the LTTC English Learner Corpus, a Taiwanese learner corpus of English currently under construction. The rationale of this study is that diagrams which violate some of the rules of RST diagram formation will point to coherence errors. No reliability test has been conducted since this work is at an initial stage. Therefore, this study is exploratory and results are preliminary. Results are discussed in terms of the practicality of using this method to detect coherence errors, their possible consequences about claims for a typical inductive content order in the writing of L1 Chinese learners of English, and their potential implications for Automated Writing Evaluation (AWE) software, since discourse organization is one of the essay characteristics assessed by this software. In particular, the extent to which the kinds of errors detected through the RST analysis match those located by Criterion (Burstein, Chodorow, & Leachock, 2004), a well-known AWE software by Educational Testing Service (ETS), is discussed
    • …
    corecore