    Arabic nested noun compound extraction based on linguistic features and statistical measures

    The extraction of Arabic nested noun compounds is significant for several research areas such as sentiment analysis, text summarization, word categorization, grammar checking, and machine translation. Much research has studied the extraction of Arabic noun compounds using linguistic approaches, statistical methods, or a hybrid of both. A wide range of existing approaches concentrate on the extraction of bigram or trigram noun compounds. Nonetheless, extracting 4-gram or 5-gram nested noun compounds is a challenging task due to morphological, orthographic, syntactic, and semantic variation. Many features have an important effect on the efficiency of noun compound extraction, such as unithood, contextual information, and termhood. Hence, there is a need to improve the effectiveness of Arabic nested noun compound extraction. This paper therefore proposes a hybrid of a linguistic approach and a statistical method with a view to enhancing the extraction of Arabic nested noun compounds. A number of pre-processing phases are presented, including transformation, tokenization, and normalization. The linguistic approaches used in this study consist of part-of-speech tagging and named-entity patterns, whereas the statistical methods consist of the NC-value, NTC-value, NLC-value, and a combination of these association measures. The proposed methods demonstrate that the combined association measures outperform the NLC-value, NTC-value, and NC-value for nested noun compound extraction, achieving 90%, 88%, 87%, and 81% for bigram, trigram, 4-gram, and 5-gram compounds, respectively.
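
    The NC-value family named above builds on the classic C-value of Frantzi and Ananiadou, which discounts a candidate's frequency by the frequency it owes to the longer candidates nesting it. Below is a minimal Python sketch of that base measure; the function name, input format, and toy counts are illustrative assumptions, not the authors' implementation (NC-value additionally weights context words, which is omitted here).

```python
# Minimal C-value sketch: `candidates` maps a multiword candidate
# (tuple of tokens) to its corpus frequency.
import math
from collections import defaultdict

def c_value(candidates):
    """Return {candidate: C-value}, discounting frequency mass that a
    candidate owes to the longer candidates nesting it."""
    nests = defaultdict(list)  # candidate -> longer candidates containing it
    cands = list(candidates)
    for a in cands:
        for b in cands:
            if len(b) > len(a) and any(b[i:i + len(a)] == a
                                       for i in range(len(b) - len(a) + 1)):
                nests[a].append(b)
    scores = {}
    for a, freq in candidates.items():
        longer = nests[a]
        if not longer:
            scores[a] = math.log2(len(a)) * freq
        else:
            nested_freq = sum(candidates[b] for b in longer)
            scores[a] = math.log2(len(a)) * (freq - nested_freq / len(longer))
    return scores

# Illustrative toy counts: a bigram nested inside a trigram.
compounds = {
    ("information", "retrieval"): 12,
    ("information", "retrieval", "system"): 5,
}
print(c_value(compounds))
```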

    Refining the Methodology for Investigating the Relationship Between Fluency and the Use of Formulaic Language in Learner Speech

    This study is a cross-sectional analysis of the relationship between productive fluency and the use of formulaic sequences in the speech of highly proficient L2 learners. Two samples of learner speech were randomly drawn and analysed. Formulaic sequences were identified using two distinct procedures: a frequency-based, distributional approach, which returned a set of recurrent sequences (n-grams), and an intuition- and criterion-based linguistic procedure, which returned a set of phrasemes. Formulaic material was then removed from the data. Breakdown and speed fluency measures were obtained for the following types of speech: baseline (pre-removal), formulaic, and non-formulaic (post-removal). The results show significant differences between baseline and post-removal fluency scores for both learners. Formulaic speech is also produced more fluently than non-formulaic speech. However, the comparison of the fluency scores of n-grams and phrasemes returned inconsistent results, with significant differences reported for only one of the samples.
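
    As an illustration of the frequency-based, distributional procedure mentioned above, the sketch below pulls recurrent n-grams out of a tokenized transcript. The n-gram lengths and frequency threshold are illustrative assumptions, not the study's settings.

```python
# Extract recurrent n-grams (candidate formulaic sequences) by frequency.
from collections import Counter

def recurrent_ngrams(tokens, n_range=(2, 4), min_freq=3):
    """Return n-grams (as token tuples) occurring at least `min_freq` times."""
    counts = Counter()
    for n in range(n_range[0], n_range[1] + 1):
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g: c for g, c in counts.items() if c >= min_freq}

transcript = "you know I mean I think you know it is you know".split()
print(recurrent_ngrams(transcript, min_freq=2))  # e.g. ('you', 'know'): 3
```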

    Knowledge Sharing from Domain-specific Documents

    Recently, collaborative discussions based on participant-generated documents, e.g., customer questionnaires, aviation reports, and medical records, have become necessary in various fields such as marketing, transport facilities, and medical treatment, in order to share useful knowledge that is crucial for maintaining various kinds of security, e.g., avoiding air-traffic accidents and malpractice. We introduce several natural language processing techniques for extracting information from such text data and verify their validity using aviation documents as an example. From the documents, we automatically and statistically extract related words that have not only taxonomical relations, such as synonymy, but also thematic (non-taxonomical) relations, including causal and entailment relations. These related words are useful for sharing information among participants. Moreover, we acquire domain-specific terms and phrases from the documents in order to pick out and share important topics from such reports.
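
    One common way to extract related word pairs statistically, as described above, is to score document-level co-occurrence with pointwise mutual information (PMI). The sketch below is a hedged illustration under that assumption; the authors' actual method, which also targets causal and entailment relations, is not specified here.

```python
# Rank word pairs by PMI over document-level co-occurrence.
import math
from collections import Counter
from itertools import combinations

def related_pairs(docs, min_count=2):
    """Return (pair, PMI) sorted by score, for pairs co-occurring often enough."""
    word_freq, pair_freq = Counter(), Counter()
    n_docs = len(docs)
    for doc in docs:
        vocab = set(doc)
        word_freq.update(vocab)
        pair_freq.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    scores = {}
    for pair, c in pair_freq.items():
        if c < min_count:
            continue
        a, b = tuple(pair)
        pmi = math.log2((c / n_docs) /
                        ((word_freq[a] / n_docs) * (word_freq[b] / n_docs)))
        scores[(a, b)] = pmi
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Illustrative toy "aviation reports" (already tokenized).
reports = [
    ["turbulence", "altitude", "descent"],
    ["turbulence", "descent", "injury"],
    ["altitude", "clearance", "runway"],
]
print(related_pairs(reports))  # ('descent', 'turbulence') scores highest
```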

    The underpinnings of a composite measure for automatic term extraction: The case of SRC

    The corpus-based identification of the lexical units that describe a given specialized domain is usually a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded in the theoretical principles of salience, relevance, and cohesion, which have been rationally implemented in the three components of the metric.
    Financial support for this research was provided by the DGI, Spanish Ministry of Education and Science, grants FFI2011-29798-C02-01 and FFI2014-53788-C3-1-P.
    Periñán Pascual, J. C. (2015). The underpinnings of a composite measure for automatic term extraction: The case of SRC. Terminology, 21(2), 151-179. doi:10.1075/term.21.2.02per
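
    The article's actual SRC formula is not reproduced here, but a user-adjustable composite metric of the kind described can be sketched as a weighted blend of three normalized component scores. Everything below (names, the linear combination, the weights) is an assumption for illustration only.

```python
# Hedged sketch of a user-adjustable composite term score.
def composite_score(salience, relevance, cohesion, weights=(1/3, 1/3, 1/3)):
    """Blend three [0,1]-normalized component scores with adjustable weights."""
    ws, wr, wc = weights
    assert abs(ws + wr + wc - 1.0) < 1e-9, "weights should sum to 1"
    return ws * salience + wr * relevance + wc * cohesion

# A glossary builder might, for instance, weight cohesion more heavily
# when multiword terms dominate the domain:
print(composite_score(0.7, 0.4, 0.9, weights=(0.2, 0.3, 0.5)))
```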

    TermEval 2020 : shared task on automatic term extraction using the Annotated Corpora for term Extraction Research (ACTER) dataset

    The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with a shared dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final F1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying the strengths and weaknesses of various state-of-the-art methodologies. The results show considerable variation between systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, and how some are exceptionally good at finding rare terms or are less affected by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by the participants.
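
    Scoring in such a setup reduces to comparing each system's extracted term list against the gold annotations of the held-out corpus. The sketch below shows the usual precision/recall/F1 computation under strict string matching; the matching details (e.g., lowercasing) are assumptions, not the official evaluation script.

```python
# Precision/recall/F1 of an extracted term list against a gold set.
def prf1(extracted, gold):
    extracted = {t.lower() for t in extracted}
    gold = {t.lower() for t in gold}
    tp = len(extracted & gold)                       # true positives
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Illustrative term lists from the heart-failure domain:
print(prf1({"ejection fraction", "heart failure", "patient"},
           {"ejection fraction", "heart failure", "beta blocker"}))
```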

    Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods

    Identifying compound nouns is important for a wide spectrum of applications in natural language processing, such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic pre-processing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for Malay <N+N> compound noun extraction. Second, we introduce the possibility of integrating multiple association measures. Third, this work also provides a standard dataset intended to offer a common platform for evaluating research on the identification of compound nouns in the Malay language. The dataset contains 7,235 unique N-N candidates, 2,970 of which are N-N compound noun collocations. The extraction algorithms are evaluated against this reference dataset. The experimental results demonstrate that a group of association measures (t-test, Piatetsky-Shapiro (PS), C-value, FGM, and the rank combination method) performs best and outperforms the other association measures for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. Evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results obtained are quite satisfactory in terms of precision, recall, and F-score.
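
    As one concrete example of the association measures evaluated above, the t-score compares a bigram's observed co-occurrence frequency with the frequency expected if its component nouns were independent (the standard collocation t-test as presented by Manning and Schütze). The counts below are illustrative, not from the paper's corpus.

```python
# t-test association score for a bigram (x, y).
import math

def t_score(f_xy, f_x, f_y, n):
    """Observed vs. expected relative frequency, scaled by the
    (binomially approximated) standard error."""
    observed = f_xy / n
    expected = (f_x / n) * (f_y / n)
    return (observed - expected) / math.sqrt(observed / n)

# e.g. an <N+N> candidate seen 30 times in a 1M-token corpus,
# with component nouns seen 500 and 800 times respectively:
print(t_score(30, 500, 800, 1_000_000))  # ~5.4, well above typical cutoffs
```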

    Terminology extraction: an analysis of linguistic and statistical approaches

    Are linguistic properties and behaviours important for recognizing terms? Are statistical measures effective for extracting terms? Is it possible to capture a sort of termhood with computational linguistic techniques? Or are terms too sensitive to exogenous and pragmatic factors to be confined within computational linguistics? All these questions are still open. This study tries to contribute to the search for an answer, in the belief that one can be found only through a careful experimental analysis of real case studies and a study of their correlation with theoretical insights.
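
    One concrete instance of the statistical measures under discussion is the "weirdness" termhood ratio, which compares a word's relative frequency in a domain corpus against a general reference corpus. The sketch below is only an illustration of that family of measures; the study does not prescribe this particular one.

```python
# "Weirdness" termhood ratio: domain relative frequency vs. reference.
def weirdness(word, domain_freq, domain_size, ref_freq, ref_size):
    """Higher values suggest the word is domain-specific (a term candidate)."""
    domain_rel = domain_freq.get(word, 0) / domain_size
    ref_rel = (ref_freq.get(word, 0) + 1) / ref_size  # +1 smoothing for unseen words
    return domain_rel / ref_rel

domain = {"myocardium": 40, "the": 5000}      # illustrative counts
reference = {"the": 60000}
print(weirdness("myocardium", domain, 100_000, reference, 1_000_000))  # high
print(weirdness("the", domain, 100_000, reference, 1_000_000))         # ~1
```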

    Modelling collocations in OntoLex-FrAC

    Following presentations of frequency and attestations, and of embeddings and distributional similarity, this paper introduces the third cornerstone of the emerging OntoLex module for Frequency, Attestation and Corpus-based Information, OntoLex-FrAC. We provide an RDF vocabulary for collocations, established as a consensus over contributions from five different institutions and numerous datasets, with the goal of eliciting feedback from reviewers, the workshop audience, and the scientific community in preparation for the final consolidation of the OntoLex-FrAC module, whose publication as a W3C community report is foreseen for the end of this year. The novel collocation component of OntoLex-FrAC is described in application to a lexicographic resource and to corpus-based collocation scores available from the web. Finally, we demonstrate the capability and genericity of the model by showing how to retrieve and aggregate collocation information by means of SPARQL and how to export it to a tabular format, so that it can be easily processed in downstream applications.
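
    As a hedged sketch of the retrieval-and-export step described above, the snippet below runs a SPARQL query over a local FrAC export with rdflib and writes the results to CSV. The frac: property names and the input file are illustrative placeholders, not the final OntoLex-FrAC vocabulary.

```python
# Query collocation entries via SPARQL and export them to a tabular format.
import csv
from rdflib import Graph

g = Graph()
g.parse("collocations.ttl", format="turtle")  # assumed local FrAC export

QUERY = """
PREFIX frac: <http://www.w3.org/ns/lemon/frac#>
SELECT ?colloc ?score WHERE {
    ?colloc a frac:Collocation ;
            frac:score ?score .        # placeholder property name
}
"""

with open("collocations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["collocation", "score"])
    for row in g.query(QUERY):
        writer.writerow([str(row.colloc), str(row.score)])
```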