Modeling Global Syntactic Variation in English Using Dialect Classification
This paper evaluates global-scale dialect identification for 14 national
varieties of English as a means for studying syntactic variation. The paper
makes three main contributions: (i) introducing data-driven language mapping as
a method for selecting the inventory of national varieties to include in the
task; (ii) producing a large and dynamic set of syntactic features using
grammar induction rather than focusing on a few hand-selected features such as
function words; and (iii) comparing models across both web corpora and social
media corpora in order to measure the robustness of syntactic variation across
registers.
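The cross-register comparison described above can be sketched in miniature: train a dialect classifier on feature vectors from one register and evaluate it on another. Everything below is invented toy data with a simple nearest-centroid classifier; the paper's actual features come from grammar induction, not from this two-dimensional stand-in.

```python
# Toy sketch: measure cross-register robustness of a dialect classifier.
# Feature vectors, variety labels, and the nearest-centroid model are all
# illustrative stand-ins, not the paper's method.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid(x, centroids):
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Hypothetical syntactic-feature frequencies per (variety, register).
web = {"en-GB": [[0.9, 0.1], [0.8, 0.2]], "en-US": [[0.1, 0.9], [0.2, 0.8]]}
social = {"en-GB": [[0.85, 0.15]], "en-US": [[0.15, 0.85]]}

cents = {lab: centroid(vs) for lab, vs in web.items()}  # train on web corpora
correct = sum(nearest_centroid(x, cents) == lab
              for lab, xs in social.items() for x in xs)
total = sum(len(xs) for xs in social.values())
print(correct / total)  # cross-register accuracy on the toy data
```

The gap between in-register and cross-register accuracy is what operationalizes "robustness of syntactic variation across registers" here.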
Ordinal analysis of lexical patterns
Words are fundamental linguistic units that connect thoughts and things
through meaning. However, words do not appear independently in a text sequence.
The existence of syntactic rules induces correlations among neighboring words.
Using an ordinal pattern approach, we present an analysis of lexical
statistical connections for 11 major languages. We find that the diverse
manners that languages utilize to express word relations give rise to unique
pattern structural distributions. Furthermore, fluctuations of these pattern
distributions for a given language can allow us to determine both the
historical period when the text was written and its author. Taken together, our
results emphasize the relevance of ordinal time series analysis in linguistic
typology, historical linguistics and stylometry.Comment: 9 pages, 12 figures, 2 tables; v2: the section on universality has
been removed because previous results were affected by spurious correlations.
Published versio
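The core of the ordinal pattern approach is mapping each length-d window of a numeric series to the permutation that sorts it, then looking at the distribution of those permutations. The sketch below applies this to word lengths as a stand-in symbolic series; the choice of word lengths and the example sentence are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter

def ordinal_pattern(window):
    # The pattern is the index permutation that sorts the window;
    # ties are broken by position (Python's sort is stable).
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def pattern_distribution(series, d=3):
    # Slide a window of size d and count each ordinal pattern.
    counts = Counter(ordinal_pattern(series[i:i + d])
                     for i in range(len(series) - d + 1))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

# Illustrative series: word lengths of a sentence.
lengths = [len(w) for w in "words do not appear independently in a text".split()]
dist = pattern_distribution(lengths, d=3)
```

For d=3 there are at most 3! = 6 possible patterns, so the distribution is a compact, language-comparable fingerprint of local ordering, which is what makes it usable for typology and stylometry.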
Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French
Language registers are a strongly perceptible characteristic of texts and speeches. However, they are still poorly studied in natural language processing. In this paper, we present a semi-supervised approach which jointly builds a corpus of texts labeled by register and an associated classifier. The approach relies on a small initial seed of expert-annotated data. After massively retrieving web pages, it iteratively alternates between training an intermediate classifier and annotating new texts to augment the labeled corpus. The approach is applied to the casual, neutral, and formal registers, leading to a 750M-word corpus and a final neural classifier with acceptable performance.
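The iterative loop the abstract describes — train on the seed, label new texts, keep confident predictions, retrain — is a standard self-training scheme. The sketch below uses a toy word-count scorer in place of the paper's neural classifier; the seed sentences, threshold, and scoring are all invented for illustration.

```python
from collections import Counter, defaultdict

REGISTERS = ("casual", "neutral", "formal")

def train(labeled):
    # Toy model: per-register word counts (stand-in for the neural classifier).
    model = defaultdict(Counter)
    for text, register in labeled:
        model[register].update(text.lower().split())
    return model

def predict_proba(model, text):
    # Score each register by seen-word overlap; normalize to a pseudo-probability.
    words = text.lower().split()
    scores = {r: sum(model[r][w] for w in words) + 1e-9 for r in REGISTERS}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

def self_train(seed, unlabeled, rounds=3, threshold=0.6):
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)                     # intermediate classifier
        confident, remaining = [], []
        for text in pool:
            register, p = predict_proba(model, text)
            (confident if p >= threshold else remaining).append((text, register))
        if not confident:                          # nothing new to learn from
            break
        labeled += confident                       # augment the labeled corpus
        pool = [t for t, _ in remaining]
    return labeled, train(labeled)

seed = [("hey wanna grab lunch", "casual"),
        ("pursuant to the agreement", "formal"),
        ("the meeting starts at noon", "neutral")]
unlabeled = ["wanna hang out", "pursuant to clause four"]
labeled, model = self_train(seed, unlabeled)
```

The confidence threshold is the key knob: set too low, early misclassifications snowball into the growing corpus; set too high, the pool is never consumed.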
Authorship Attribution: Specifics for Slovene
The paper shows the importance of a quality analysis of the linguistic features that enable authorship attribution or author profiling in forensic, literary, or economic contexts (anonymous threat letters, plagiarism, literary works of unknown authorship, client profiling). It also highlights the lack of such analyses for Slovene and outlines a methodology for detecting syntactic, lexical, semantic, and character features in order to quantify an author's personal style.
Mining User-Generated Repair Instructions from Automotive Web Communities
The objective of this research was to automatically extract user-generated repair instructions from large amounts of web data. An artifact has been created that classifies a web post as containing a repair instruction or not. Methods from natural language processing are used to transform the unstructured textual information of a web post into a set of numerical features that can be further processed by different machine learning algorithms. The main contribution of this research lies in the design and prototypical implementation of these features. The evaluation shows that the created artifact can accurately distinguish posts containing repair instructions from other posts, e.g. those containing problem reports. With such a solution, a company can save much of the time and money previously needed to perform this classification task manually.
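The abstract's central idea — turning unstructured posts into numerical features — can be illustrated with a few hand-picked signals. The feature set below (imperative-verb count, numbered steps, question marks) is an invented example of the kind of features meant, not the paper's actual design, and the keyword list is hypothetical.

```python
import re

# Illustrative feature design: signals that tend to separate repair
# instructions from problem reports. All features here are assumptions.

IMPERATIVES = {"remove", "replace", "install", "tighten", "disconnect", "check"}

def features(post):
    words = re.findall(r"[a-z]+", post.lower())
    return [
        sum(w in IMPERATIVES for w in words),   # imperative verbs suggest instructions
        len(re.findall(r"\b\d+\.", post)),      # numbered steps like "1.", "2."
        int("?" in post),                       # questions hint at problem reports
    ]

def looks_like_instruction(post):
    # Trivial rule standing in for a trained classifier over these features.
    f = features(post)
    return f[0] + f[1] > 0 and f[2] == 0

print(looks_like_instruction("1. Disconnect the battery. 2. Remove the cover."))
print(looks_like_instruction("My engine makes a weird noise, any ideas?"))
```

In practice such feature vectors would feed a trained classifier rather than a hard-coded rule, but the transformation from text to numbers is the part the paper identifies as its contribution.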
Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish
In stylometric investigations, frequencies of the most frequent words (MFWs)
and character n-grams outperform other style-markers, even if their performance
varies significantly across languages. In inflected languages, word endings
play a prominent role, and hence different word forms cannot be recognized
using generic text tokenization. Countless inflected word forms make
frequencies sparse, which complicates most statistical procedures. Presumably,
applying one of the NLP techniques, such as lemmatization and/or parsing, might
increase the performance of classification. The aim of this paper is to examine
the usefulness of grammatical features (as assessed via POS-tag n-grams) and
lemmatized forms in recognizing authorial profiles, in order to address the
underlying issue of the degree of freedom of choice within lexis and grammar.
Using a corpus of Polish novels, we performed a series of supervised authorship
attribution benchmarks, in order to compare the classification accuracy for
different types of lexical and syntactic style-markers. Although the
performance of POS-tags as well as lemmatized forms was consistently worse
than that of lexical markers, the difference was not substantial and never
exceeded ca. 15% - …
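The POS-tag n-gram markers the benchmarks compare can be sketched as frequency profiles matched by distance. The toy sequences and the L1 distance below are illustrative assumptions; the paper's actual experiments use tagged Polish novels and standard attribution classifiers.

```python
from collections import Counter

def pos_ngrams(tags, n=2):
    # Relative frequencies of POS-tag n-grams in a tagged text.
    grams = Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def distance(p, q):
    # Simple L1 distance between n-gram frequency profiles
    # (a crude stand-in for Delta-style measures).
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Hypothetical POS sequences standing in for tagged novels.
author_a = "NOUN VERB ADJ NOUN VERB NOUN ADJ NOUN".split()
author_b = "VERB NOUN NOUN VERB ADV VERB NOUN NOUN".split()
unknown  = "NOUN VERB ADJ NOUN ADJ NOUN VERB NOUN".split()

profiles = {"A": pos_ngrams(author_a), "B": pos_ngrams(author_b)}
guess = min(profiles, key=lambda a: distance(profiles[a], pos_ngrams(unknown)))
print(guess)
```

Because POS-tag inventories are tiny compared with a lexicon, these profiles are far less sparse than word-form frequencies in an inflected language, which is precisely why the paper tests them as an alternative to lexical markers.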