ParaPhraser: Russian paraphrase corpus and shared task
The paper describes the results of the First Russian Paraphrase Detection Shared Task, held in St. Petersburg, Russia, in October 2016. Research on paraphrase extraction, detection and generation has a long history, but the Russian computational linguistics community has only recently taken an interest in the problem. We try to close this gap by introducing ParaPhraser.ru, a project dedicated to collecting a Russian paraphrase corpus, and by organizing a Paraphrase Detection Shared Task that uses the corpus as training data. Participants applied a wide variety of techniques to paraphrase detection, from rule-based approaches to deep learning. The results reflect the following tendencies: the best scores come from traditional classifiers combined with fine-grained linguistic features, but complex neural networks, shallow methods and purely technical methods are also competitive. Peer reviewed.
Paraphrase identification using knowledge-lean techniques
This research addresses the identification of sentential paraphrases; that is, the ability of an estimator to predict whether two sentential text fragments are paraphrases. The paraphrase identification task has practical importance in the Natural Language Processing (NLP) community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding, including key applications such as machine translation, information retrieval and question answering. Over the last decade, a growing body of research has been conducted on paraphrase identification, and it has become a distinct research area within NLP.
Our objective is to investigate whether techniques for automated text understanding that require fewer resources can achieve results comparable to methods that employ more sophisticated NLP tools and other resources. These techniques, which we call “knowledge-lean”, range from simple, shallow overlap methods based on lexical items or n-grams through to more sophisticated methods that employ automatically generated distributional thesauri.
The work begins with techniques that exploit lexical overlap and text-based statistics, which depend much less on NLP tools. We investigate the question “To what extent can these methods be used for the purpose of a paraphrase identification task?” On two gold-standard datasets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and reached state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs).
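The n-gram overlap features mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in the research: the function names `ngrams` and `overlap_features`, the whitespace tokenisation, the choice of n up to 3, and the precision/recall/F1 formulation of overlap are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_features(s1, s2, max_n=3):
    """Build a feature vector of n-gram overlap scores for a sentence pair.

    For each n in 1..max_n the vector holds three values: the share of
    s1's n-grams found in s2, the share of s2's n-grams found in s1,
    and their harmonic mean. Tokenisation is a plain lowercase
    whitespace split (an assumption, chosen to avoid language-specific tools).
    """
    t1, t2 = s1.lower().split(), s2.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        c1, c2 = Counter(ngrams(t1, n)), Counter(ngrams(t2, n))
        shared = sum((c1 & c2).values())  # multiset intersection size
        total1, total2 = sum(c1.values()), sum(c2.values())
        p = shared / total1 if total1 else 0.0
        r = shared / total2 if total2 else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        feats.extend([p, r, f1])
    return feats


if __name__ == "__main__":
    print(overlap_features("the cat sat on the mat", "a cat sat on a mat"))
```

Feature vectors of this kind need no parser, tagger or lexicon, which is what makes the approach knowledge-lean; they could then be passed to any standard classifier, such as an SVM.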
These techniques do not require any language-specific tools or external resources and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research and experiment on another, resource-poor language. The scarcity of available paraphrase data led us to construct our own paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection, including a variety of texts. While there is still debate as to whether binary or fine-grained judgement better suits a paraphrase corpus, we chose to provide data for a sentential textual similarity task by agreeing on fine-grained scoring, knowing that this could be converted to binary scoring, but not the other way around. The correlation between the results from different corpora is promising. Therefore, it can be surmised that resource-poor languages can benefit from knowledge-lean techniques.
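The remark that fine-grained scores can be converted to binary labels, but not the reverse, can be shown with a short Python sketch. The score range, the threshold of 3.0 and the function name `to_binary` are hypothetical; the source does not state how the conversion was done.

```python
def to_binary(score, threshold=3.0):
    """Map a fine-grained similarity score to a binary paraphrase label.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return 1 if score >= threshold else 0


# Hypothetical fine-grained similarity scores for four sentence pairs.
scores = [0.5, 2.9, 3.0, 4.7]
labels = [to_binary(s) for s in scores]
print(labels)  # [0, 0, 1, 1]
```

The mapping is many-to-one: scores of 3.0 and 4.7 both become 1, so the original scores cannot be recovered from the binary labels.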
Having established the strengths of knowledge-lean techniques, we extend the work to techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research applies word2vec to larger fragments of text, such as phrases, sentences and even paragraphs, we present a new approach that introduces vectors of character n-grams carrying the same attributes as word vectors. The proposed method can capture syntactic as well as semantic relations without semantic knowledge. It is shown to be competitive on Twitter compared to more sophisticated methods.
Description of Turkish paraphrase corpus structure and generation method
17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016 -- 3 April 2016 through 9 April 2016 -- 212219
Because developing a corpus requires a long time and a lot of human effort, it is desirable to make it as resourceful as possible: rich in coverage, flexible, multipurpose and expandable. Here we describe the steps we took in developing a Turkish paraphrase corpus, the factors we considered, the problems we faced and how we dealt with them. Currently our corpus contains nearly 4000 sentences, with a ratio of 60% paraphrase to 40% non-paraphrase sentence pairs. The sentence pairs are annotated on a five-level scale: paraphrase, encapsulating, encapsulated, non-paraphrase and opposite. The corpus is organized in a database structure integrated with a Turkish dictionary. The sources used so far are news texts from the Bilcon 2005 corpus, a set of professionally translated sentence pairs from the MSRP corpus, multiple Turkish translations from different languages included in the Tatoeba corpus, and user-generated paraphrases. © Springer International Publishing AG, part of Springer Nature 2018.
Acknowledgement. This work was carried out under grants from TÜBİTAK (The Scientific and Technological Research Council of Turkey), Project No. 114E126, Using Certainty Factor Approach and Creating Paraphrase Corpus for Measuring Similarity of Short Turkish Texts, and from the Ege University Scientific Research Council, Project No. 2015/BİL/034, Developing a Paraphrase Corpus for Turkish Short Text Similarity Studies.
Description of Turkish Paraphrase Corpus Structure and Generation Method
17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing) -- APR 03-09, 2016 -- Mevlana Univ, Konya, TURKEY
KISLA, TARIK/0000-0001-9007-7455; KARAOGLAN, BAHAR/0000-0001-9338-7491
WOS:000540380100013
Because developing a corpus requires a long time and a lot of human effort, it is desirable to make it as resourceful as possible: rich in coverage, flexible, multipurpose and expandable. Here we describe the steps we took in developing a Turkish paraphrase corpus, the factors we considered, the problems we faced and how we dealt with them. Currently our corpus contains nearly 4000 sentences, with a ratio of 60% paraphrase to 40% non-paraphrase sentence pairs. The sentence pairs are annotated on a five-level scale: paraphrase, encapsulating, encapsulated, non-paraphrase and opposite. The corpus is organized in a database structure integrated with a Turkish dictionary. The sources used so far are news texts from the Bilcon 2005 corpus, a set of professionally translated sentence pairs from the MSRP corpus, multiple Turkish translations from different languages included in the Tatoeba corpus, and user-generated paraphrases.
Turkiye Bilimsel ve Teknolojik Arastirma Kurumu (TUBITAK) - The Scientific and Technological Research Council of Turkey [114E126]; Ege University Scientific Research Council [2015/BIL/034]
This work was carried out under grants from TUBITAK (The Scientific and Technological Research Council of Turkey), Project No. 114E126, Using Certainty Factor Approach and Creating Paraphrase Corpus for Measuring Similarity of Short Turkish Texts, and from the Ege University Scientific Research Council, Project No. 2015/BIL/034, Developing a Paraphrase Corpus for Turkish Short Text Similarity Studies.
Combining machine translation and text similarity metrics to identify paraphrases in Turkish
26th IEEE Signal Processing and Communications Applications Conference, SIU 2018 -- 2 May 2018 through 5 May 2018 -- 137780
Aselsan; et al.; Huawei; IEEE Signal Processing Society; IEEE Turkey Section; Netas
Paraphrase identification (PI) is the task of recognizing whether two given sentences are restatements of each other. In our study we propose an approach that uses machine translation and text similarity metrics as features for PI. Machine learning algorithms such as Support Vector Machines (SVM) with three different kernels, the C4.5 decision tree and Multinomial Naïve Bayes (NB) are trained with these features. We evaluated our system on Parder, a Turkish paraphrase corpus. The experimental results show that the proposed approach offers state-of-the-art results. © 2018 IEEE
Contribution of syntactic and semantic attributes in paraphrase identification [Esanlatim Tespitinde Sözdizimsel ve Anlamsal Özniteliklerin Katkisi]
26th IEEE Signal Processing and Communications Applications Conference, SIU 2018 -- 2 May 2018 through 5 May 2018 -- 137780
Aselsan; et al.; Huawei; IEEE Signal Processing Society; IEEE Turkey Section; Netas
Automatic paraphrase identification is a natural language understanding problem in which a decision must be made as to whether the given sentence pairs bear similar meanings to a certain extent. Syntactic and semantic features are used to classify the sentences as paraphrase or non-paraphrase. Word overlap and word order are among the syntactic features widely used in the literature, while similarity of word meaning and named entity (NE) overlap are among the semantic features. Unfortunately, Turkish lacks a useful tool like WordNet for drawing semantic relations between words, as is done for English. Here we exploit tense and polarity differences as semantic features and assess the improvement in classification brought by these features. We performed experiments with several different combinations of features on the Turkish paraphrase corpus built by the researchers and report the results. © 2018 IEEE
Contribution of Syntactic and Semantic Attributes in Paraphrase Identification
26th IEEE Signal Processing and Communications Applications Conference (SIU) -- MAY 02-05, 2018 -- Izmir, TURKEY
WOS:000511448500057
Automatic paraphrase identification is a natural language understanding problem in which a decision must be made as to whether the given sentence pairs bear similar meanings to a certain extent. Syntactic and semantic features are used to classify the sentences as paraphrase or non-paraphrase. Word overlap and word order are among the syntactic features widely used in the literature, while similarity of word meaning and named entity (NE) overlap are among the semantic features. Unfortunately, Turkish lacks a useful tool like WordNet for drawing semantic relations between words, as is done for English. Here we exploit tense and polarity differences as semantic features and assess the improvement in classification brought by these features. We performed experiments with several different combinations of features on the Turkish paraphrase corpus built by the researchers and report the results.
IEEE, Huawei, Aselsan, NETAS, IEEE Turkey Sect, IEEE Signal Proc Soc, IEEE Commun Soc, ViSRATEK, Adresgezgini, Rohde & Schwarz, Integrated Syst & Syst Design, Atilim Univ, Havelsan, Izmir Katip Celebi Uni