
The logic and linguistic model for automatic extraction of collocation similarity

Author: Gautam Ajit Pratap Singh
Khairova N. F.
Petrasova S. V.
Publication venue: Національний університет "Львівська політехніка"
Publication date: 01/01/2015

The article discusses the automatic identification of collocation similarity. Semantic analysis is one of the most advanced and most difficult NLP tasks. The main problem of semantic processing is determining the polysemy and synonymy of linguistic units, and the task becomes more complicated for word collocations. The paper proposes a logical and linguistic model for automatically determining semantic similarity between collocations in Ukrainian and English. The proposed model formalizes the semantic equivalence of collocations by means of the semantic and grammatical characteristics of the collocates. The basic idea of the approach is that the morphological, syntactic and semantic characteristics of lexical units must be taken into account to identify collocation similarity. The basic mathematical means of the model are logical-algebraic equations of the algebra of finite predicates. The model examines verb-noun and noun-adjective collocations in Ukrainian and English, which consist of words belonging to the main parts of speech. The model allows semantically equivalent collocations to be extracted from semi-structured and unstructured texts. Implementations of the model will make it possible to recognize semantically equivalent collocations automatically. Using the model can increase the effectiveness of natural language processing tasks such as information extraction, ontology generation, sentiment analysis and others.

Biblioteka Nauki - repozytorium artykułów

Electronic National Technical University "Kharkiv Polytechnic Institute" Institutional Repository (eNTUKhPIIR)

From corpus-based collocation frequencies to readability measure

Author: Anagnostou N.K.
Weir G.R.S.
Publication date: 01/06/2006

This paper provides a broad overview of three separate but related areas of research. Firstly, corpus linguistics is a growing discipline that applies analytical results from large language corpora to a wide variety of problems in linguistics and related disciplines. Secondly, readability research, as the name suggests, seeks to understand what makes texts more or less comprehensible to readers, and aims to apply this understanding to issues such as text rating and matching of texts to readers. Thirdly, collocation is a language feature that occurs when particular words are used frequently together for other than purely grammatical reasons. The intersection of these three aspects provides the basis for on-going research within the Department of Computer and Information Sciences at the University of Strathclyde and is the motivation for this overview. Specifically, we aim, through analysis of collocation frequencies in major corpora, to afford valuable insight into the content of texts, which we believe will, in turn, provide a novel basis for estimating text readability.

University of Strathclyde Institutional Repository

Method for Automatic Collocation Extraction from Ukrainian Corpora

Author: Kuzmina M.
Petrasova S. V.
Publication date: 01/01/2018

The article deals with methods for automatic collocation extraction from Ukrainian corpora. The task of collocation extraction is considered in terms of a corpus-oriented approach [1] based on statistical measures. The term «collocation» is defined as a non-random combination of two words that regularly occur together.

Electronic National Technical University "Kharkiv Polytechnic Institute" Institutional Repository (eNTUKhPIIR)
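
The abstract above does not say which statistical measures the article uses. As one common choice, the following is a minimal sketch that scores adjacent word pairs by pointwise mutual information (PMI). The function name `pmi_scores`, the `min_count` threshold, and the toy English token list are illustrative assumptions, not taken from the article.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI is an assumed measure here; the abstract only says
    "statistical measures" without naming one.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs get unreliable PMI values, so skip them.
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus, already tokenized.
tokens = "strong tea and strong tea and weak tea with strong coffee".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On a real corpus the pairs would typically also be filtered by part of speech, and the frequency threshold raised, before treating high-scoring pairs as collocation candidates.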

Refining the use of the web (and web search) as a language teaching and learning resource

Author: Franken Margaret
Witten Ian H.
Wu Shaoqun
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2009

The web is a potentially useful corpus for language study because it provides examples of language that are contextualized and authentic, and is large and easily searchable. However, web contents are heterogeneous in the extreme, uncontrolled and hence 'dirty,' and exhibit features different from the written and spoken texts in other linguistic corpora. This article explores the use of the web and web search as a resource for language teaching and learning. We describe how a particular derived corpus containing a trillion word tokens in the form of n-grams has been filtered by word lists and syntactic constraints and used to create three digital library collections, linked with other corpora and the live web, that exploit the affordances of web text and mitigate some of its constraints.

Research Commons@Waikato