Modeling Global Syntactic Variation in English Using Dialect Classification
This paper evaluates global-scale dialect identification for 14 national
varieties of English as a means for studying syntactic variation. The paper
makes three main contributions: (i) introducing data-driven language mapping as
a method for selecting the inventory of national varieties to include in the
task; (ii) producing a large and dynamic set of syntactic features using
grammar induction rather than focusing on a few hand-selected features such as
function words; and (iii) comparing models across both web corpora and social
media corpora in order to measure the robustness of syntactic variation across
registers.
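The cross-register comparison described above can be sketched in miniature: train a dialect classifier on feature vectors from one register and evaluate it on another. Everything below is invented toy data with a simple nearest-centroid classifier; the paper's actual features come from grammar induction, not from this two-dimensional stand-in.

```python
# Toy sketch: measure cross-register robustness of a dialect classifier.
# Feature vectors, variety labels, and the nearest-centroid model are all
# illustrative stand-ins, not the paper's method.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid(x, centroids):
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Hypothetical syntactic-feature frequencies per (variety, register).
web = {"en-GB": [[0.9, 0.1], [0.8, 0.2]], "en-US": [[0.1, 0.9], [0.2, 0.8]]}
social = {"en-GB": [[0.85, 0.15]], "en-US": [[0.15, 0.85]]}

cents = {lab: centroid(vs) for lab, vs in web.items()}  # train on web corpora
correct = sum(nearest_centroid(x, cents) == lab
              for lab, xs in social.items() for x in xs)
total = sum(len(xs) for xs in social.values())
print(correct / total)  # cross-register accuracy on the toy data
```

The gap between in-register and cross-register accuracy is what operationalizes "robustness of syntactic variation across registers" here.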
Ordinal analysis of lexical patterns
Words are fundamental linguistic units that connect thoughts and things
through meaning. However, words do not appear independently in a text sequence.
The existence of syntactic rules induces correlations among neighboring words.
Using an ordinal pattern approach, we present an analysis of lexical
statistical connections for 11 major languages. We find that the diverse
manners that languages utilize to express word relations give rise to unique
pattern structural distributions. Furthermore, fluctuations of these pattern
distributions for a given language can allow us to determine both the
historical period when the text was written and its author. Taken together, our
results emphasize the relevance of ordinal time series analysis in linguistic
typology, historical linguistics and stylometry.Comment: 9 pages, 12 figures, 2 tables; v2: the section on universality has
been removed because previous results were affected by spurious correlations.
Published versio
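The core of the ordinal pattern approach is mapping each length-d window of a numeric series to the permutation that sorts it, then looking at the distribution of those permutations. The sketch below applies this to word lengths as a stand-in symbolic series; the choice of word lengths and the example sentence are illustrative assumptions, not the paper's exact setup.

```python
from collections import Counter

def ordinal_pattern(window):
    # The pattern is the index permutation that sorts the window;
    # ties are broken by position (Python's sort is stable).
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def pattern_distribution(series, d=3):
    # Slide a window of size d and count each ordinal pattern.
    counts = Counter(ordinal_pattern(series[i:i + d])
                     for i in range(len(series) - d + 1))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

# Illustrative series: word lengths of a sentence.
lengths = [len(w) for w in "words do not appear independently in a text".split()]
dist = pattern_distribution(lengths, d=3)
```

For d=3 there are at most 3! = 6 possible patterns, so the distribution is a compact, language-comparable fingerprint of local ordering, which is what makes it usable for typology and stylometry.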
Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French
Language registers are a strongly perceptible characteristic of texts and speeches. However, they are still poorly studied in natural language processing. In this paper, we present a semi-supervised approach which jointly builds a corpus of texts labeled by register and an associated classifier. The approach relies on a small initial seed of expert-annotated data. After massively retrieving web pages, it iteratively alternates between training an intermediate classifier and annotating new texts to augment the labeled corpus. The approach is applied to the casual, neutral, and formal registers, leading to a 750M-word corpus and a final neural classifier with acceptable performance.
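The iterative loop the abstract describes — train on the seed, label new texts, keep confident predictions, retrain — is a standard self-training scheme. The sketch below uses a toy word-count scorer in place of the paper's neural classifier; the seed sentences, threshold, and scoring are all invented for illustration.

```python
from collections import Counter, defaultdict

REGISTERS = ("casual", "neutral", "formal")

def train(labeled):
    # Toy model: per-register word counts (stand-in for the neural classifier).
    model = defaultdict(Counter)
    for text, register in labeled:
        model[register].update(text.lower().split())
    return model

def predict_proba(model, text):
    # Score each register by seen-word overlap; normalize to a pseudo-probability.
    words = text.lower().split()
    scores = {r: sum(model[r][w] for w in words) + 1e-9 for r in REGISTERS}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

def self_train(seed, unlabeled, rounds=3, threshold=0.6):
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)                     # intermediate classifier
        confident, remaining = [], []
        for text in pool:
            register, p = predict_proba(model, text)
            (confident if p >= threshold else remaining).append((text, register))
        if not confident:                          # nothing new to learn from
            break
        labeled += confident                       # augment the labeled corpus
        pool = [t for t, _ in remaining]
    return labeled, train(labeled)

seed = [("hey wanna grab lunch", "casual"),
        ("pursuant to the agreement", "formal"),
        ("the meeting starts at noon", "neutral")]
unlabeled = ["wanna hang out", "pursuant to clause four"]
labeled, model = self_train(seed, unlabeled)
```

The confidence threshold is the key knob: set too low, early misclassifications snowball into the growing corpus; set too high, the pool is never consumed.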
Authorship Attribution: Specifics for Slovene
The paper shows the importance of a quality analysis of the linguistic features that enable authorship attribution or author profiling in forensic, literary, or economic contexts (anonymous threat letters, plagiarism, literary works of unknown authorship, client profiling). It also highlights the lack of such analyses for Slovene and outlines a methodology for detecting syntactic, lexical, semantic, and character features in order to quantify an author's personal style.
Mining User-Generated Repair Instructions from Automotive Web Communities
The objective of this research was to automatically extract user-generated repair instructions from large amounts of web data. An artifact has been created that classifies a web post as containing a repair instruction or not. Methods from natural language processing are used to transform the unstructured textual information of a web post into a set of numerical features that can be further processed by different machine learning algorithms. The main contribution of this research lies in the design and prototypical implementation of these features. The evaluation shows that the created artifact can accurately distinguish posts containing repair instructions from other posts, e.g. those containing problem reports. With such a solution, a company can save much of the time and money previously needed to perform this classification task manually.
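The abstract's central idea — turning unstructured posts into numerical features — can be illustrated with a few hand-picked signals. The feature set below (imperative-verb count, numbered steps, question marks) is an invented example of the kind of features meant, not the paper's actual design, and the keyword list is hypothetical.

```python
import re

# Illustrative feature design: signals that tend to separate repair
# instructions from problem reports. All features here are assumptions.

IMPERATIVES = {"remove", "replace", "install", "tighten", "disconnect", "check"}

def features(post):
    words = re.findall(r"[a-z]+", post.lower())
    return [
        sum(w in IMPERATIVES for w in words),   # imperative verbs suggest instructions
        len(re.findall(r"\b\d+\.", post)),      # numbered steps like "1.", "2."
        int("?" in post),                       # questions hint at problem reports
    ]

def looks_like_instruction(post):
    # Trivial rule standing in for a trained classifier over these features.
    f = features(post)
    return f[0] + f[1] > 0 and f[2] == 0

print(looks_like_instruction("1. Disconnect the battery. 2. Remove the cover."))
print(looks_like_instruction("My engine makes a weird noise, any ideas?"))
```

In practice such feature vectors would feed a trained classifier rather than a hard-coded rule, but the transformation from text to numbers is the part the paper identifies as its contribution.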
Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish
In stylometric investigations, frequencies of the most frequent words (MFWs)
and character n-grams outperform other style-markers, even if their performance
varies significantly across languages. In inflected languages, word endings
play a prominent role, and hence different word forms cannot be recognized
using generic text tokenization. Countless inflected word forms make
frequencies sparse, which complicates most statistical procedures. Presumably,
applying one of the NLP techniques, such as lemmatization and/or parsing, might
increase the performance of classification. The aim of this paper is to examine
the usefulness of grammatical features (as assessed via POS-tag n-grams) and
lemmatized forms in recognizing authorial profiles, in order to address the
underlying issue of the degree of freedom of choice within lexis and grammar.
Using a corpus of Polish novels, we performed a series of supervised authorship
attribution benchmarks, in order to compare the classification accuracy for
different types of lexical and syntactic style-markers. Although the
performance of POS-tags as well as lemmatized forms was consistently worse
than that of lexical markers, the difference was not substantial and never
exceeded ca. 15% - …
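The POS-tag n-gram markers the benchmarks compare can be sketched as frequency profiles matched by distance. The toy sequences and the L1 distance below are illustrative assumptions; the paper's actual experiments use tagged Polish novels and standard attribution classifiers.

```python
from collections import Counter

def pos_ngrams(tags, n=2):
    # Relative frequencies of POS-tag n-grams in a tagged text.
    grams = Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def distance(p, q):
    # Simple L1 distance between n-gram frequency profiles
    # (a crude stand-in for Delta-style measures).
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Hypothetical POS sequences standing in for tagged novels.
author_a = "NOUN VERB ADJ NOUN VERB NOUN ADJ NOUN".split()
author_b = "VERB NOUN NOUN VERB ADV VERB NOUN NOUN".split()
unknown  = "NOUN VERB ADJ NOUN ADJ NOUN VERB NOUN".split()

profiles = {"A": pos_ngrams(author_a), "B": pos_ngrams(author_b)}
guess = min(profiles, key=lambda a: distance(profiles[a], pos_ngrams(unknown)))
print(guess)
```

Because POS-tag inventories are tiny compared with a lexicon, these profiles are far less sparse than word-form frequencies in an inflected language, which is precisely why the paper tests them as an alternative to lexical markers.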