10 research outputs found

    ParaPhraser: Russian paraphrase corpus and shared task

    The paper describes the results of the First Russian Paraphrase Detection Shared Task, held in St. Petersburg, Russia, in October 2016. Research on paraphrase extraction, detection and generation has a long and successful history, whereas interest in the problem within the Russian computational linguistics community has surged only recently. We try to close this gap by introducing the ParaPhraser.ru project, dedicated to collecting a Russian paraphrase corpus, and by organizing a Paraphrase Detection Shared Task that uses the corpus as training data. The participants applied a wide variety of techniques to paraphrase detection, from rule-based approaches to deep learning. The results reflect the following tendencies: the best scores are obtained by combining traditional classifiers with fine-grained linguistic features, but complex neural networks, shallow methods and purely technical methods also demonstrate competitive results.

    Description of Turkish paraphrase corpus structure and generation method

    17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016 -- 3 April 2016 through 9 April 2016 -- 212219
    Because developing a corpus requires a long time and a lot of human effort, it is desirable to make it as resourceful as possible: rich in coverage, flexible, multipurpose and expandable. Here we describe the steps we took in developing the Turkish paraphrase corpus, the factors we considered, the problems we faced and how we dealt with them. Currently our corpus contains nearly 4000 sentences, with a ratio of 60% paraphrase to 40% non-paraphrase sentence pairs. The sentence pairs are annotated on a five-point scale: paraphrase, encapsulating, encapsulated, non-paraphrase and opposite. The corpus is organized as a database integrated with a Turkish dictionary. The sources used so far are news texts from the Bilcon 2005 corpus, a set of professionally translated sentence pairs from the MSRP corpus, multiple Turkish translations from different languages contained in the Tatoeba corpus, and user-generated paraphrases.
    © Springer International Publishing AG, part of Springer Nature 2018.
    Acknowledgement. This work was carried out under a grant from TÜBİTAK (The Scientific and Technological Research Council of Turkey), Project No: 114E126, Using Certainty Factor Approach and Creating Paraphrase Corpus for Measuring Similarity of Short Turkish Texts, and Ege University Scientific Research Council Project No 2015/BİL/034, Developing a Paraphrase Corpus for Turkish Short Text Similarity Studies.

    Description of Turkish Paraphrase Corpus Structure and Generation Method

    17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing) -- APR 03-09, 2016 -- Mevlana Univ, Konya, TURKEY
    KISLA, TARIK/0000-0001-9007-7455; KARAOGLAN, BAHAR/0000-0001-9338-7491
    WOS:000540380100013
    Because developing a corpus requires a long time and a lot of human effort, it is desirable to make it as resourceful as possible: rich in coverage, flexible, multipurpose and expandable. Here we describe the steps we took in developing the Turkish paraphrase corpus, the factors we considered, the problems we faced and how we dealt with them. Currently our corpus contains nearly 4000 sentences, with a ratio of 60% paraphrase to 40% non-paraphrase sentence pairs. The sentence pairs are annotated on a five-point scale: paraphrase, encapsulating, encapsulated, non-paraphrase and opposite. The corpus is organized as a database integrated with a Turkish dictionary. The sources used so far are news texts from the Bilcon 2005 corpus, a set of professionally translated sentence pairs from the MSRP corpus, multiple Turkish translations from different languages contained in the Tatoeba corpus, and user-generated paraphrases.
    TUBITAK - The Scientific and Technological Research Council of Turkey; Turkiye Bilimsel ve Teknolojik Arastirma Kurumu (TUBITAK) [114E126]; Ege University Scientific Research Council; Ege University [2015/BIL/034]
    This work was carried out under a grant from TUBITAK (The Scientific and Technological Research Council of Turkey), Project No: 114E126, Using Certainty Factor Approach and Creating Paraphrase Corpus for Measuring Similarity of Short Turkish Texts, and Ege University Scientific Research Council Project No 2015/BIL/034, Developing a Paraphrase Corpus for Turkish Short Text Similarity Studies.

    Combining machine translation and text similarity metrics to identify paraphrases in Turkish

    Aselsan; et al.; Huawei; IEEE Signal Processing Society; IEEE Turkey Section; Netas
    26th IEEE Signal Processing and Communications Applications Conference, SIU 2018 -- 2 May 2018 through 5 May 2018 -- 137780
    Paraphrase identification (PI) is the task of recognizing whether two given sentences are restatements of each other. In our study we propose an approach that uses machine translation and text similarity metrics as features for PI. Machine learning algorithms, namely Support Vector Machine (SVM) with three different kernels, C4.5 decision tree and Multinomial Naïve Bayes (NB), are trained with these features. We evaluated our system on Parder, a Turkish paraphrase corpus. The experimental results show that the proposed approach offers state-of-the-art results.
    © 2018 IEEE

    Contribution of syntactic and semantic attributes in paraphrase identification [Esanlatim Tespitinde Sözdizimsel ve Anlamsal Özniteliklerin Katkisi]

    Aselsan; et al.; Huawei; IEEE Signal Processing Society; IEEE Turkey Section; Netas
    26th IEEE Signal Processing and Communications Applications Conference, SIU 2018 -- 2 May 2018 through 5 May 2018 -- 137780
    Automatic paraphrase identification is a natural language understanding problem in which a decision must be made as to whether the given sentence pairs bear similar meanings to a certain extent. Syntactic and semantic features are used to classify the sentences as paraphrase or non-paraphrase. Word overlap and word ordering are among the syntactic features widely used in the literature, whereas similarity of word meanings and named entity (NE) overlap are among the semantic features. Unfortunately, Turkish does not have a useful tool like WordNet for drawing the semantic relations between words, as is done for English. Here we exploit tense and polarity differences as semantic features and assess the improvement in classification brought by these features. We performed experiments with several different combinations of features on the Turkish paraphrase corpus built by the researchers and report the results.
    © 2018 IEEE
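
The abstract above names word overlap and word ordering as syntactic features but does not define them. The following is a minimal sketch of one possible reading, not the authors' implementation: the function names, whitespace tokenization, Jaccard overlap and pairwise order agreement are all assumptions made for illustration.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercased token sets (assumed definition)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


def word_order_similarity(a: str, b: str) -> float:
    """Fraction of shared-word pairs kept in the same relative order (assumed definition)."""
    wa, wb = a.lower().split(), b.lower().split()
    # Shared words, deduplicated, in order of first appearance in sentence a.
    shared = [w for w in dict.fromkeys(wa) if w in wb]
    if len(shared) < 2:
        return 1.0  # no pair of shared words to compare
    pos_a = {w: wa.index(w) for w in shared}
    pos_b = {w: wb.index(w) for w in shared}
    pairs = [(x, y) for i, x in enumerate(shared) for y in shared[i + 1:]]
    same = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return same / len(pairs)
```

Scores like these would typically be concatenated into a feature vector and passed to a classifier. Whitespace tokenization is a simplification, and an agglutinative language such as Turkish would likely need morphological analysis first.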

    Contribution of Syntactic and Semantic Attributes in Paraphrase Identification

    26th IEEE Signal Processing and Communications Applications Conference (SIU) -- MAY 02-05, 2018 -- Izmir, TURKEY
    WOS:000511448500057
    Automatic paraphrase identification is a natural language understanding problem in which a decision must be made as to whether the given sentence pairs bear similar meanings to a certain extent. Syntactic and semantic features are used to classify the sentences as paraphrase or non-paraphrase. Word overlap and word ordering are among the syntactic features widely used in the literature, whereas similarity of word meanings and named entity (NE) overlap are among the semantic features. Unfortunately, Turkish does not have a useful tool like WordNet for drawing the semantic relations between words, as is done for English. Here we exploit tense and polarity differences as semantic features and assess the improvement in classification brought by these features. We performed experiments with several different combinations of features on the Turkish paraphrase corpus built by the researchers and report the results.
    IEEE, Huawei, Aselsan, NETAS, IEEE Turkey Sect, IEEE Signal Proc Soc, IEEE Commun Soc, ViSRATEK, Adresgezgini, Rohde & Schwarz, Integrated Syst & Syst Design, Atilim Univ, Havelsan, Izmir Katip Celebi Uni