213 research outputs found

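    Wskaźniki stylu w atrybucji autorskiej : studium porównawcze autorskiego "odcisku palca" w kilku językach [Style markers in authorship attribution: a comparative study of the authorial "fingerprint" in several languages]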

    The present study addresses one of the theoretical problems of computer-assisted authorship attribution, namely the question of which traceable features of language can betray authorial uniqueness (a stylistic fingerprint) in literary texts. A number of recent approaches show that, apart from lexical measures (especially those relying on the frequencies of the most frequent words), some other features of written language are also considerably effective as discriminators of authorial style. However, there have been no attempts to compare the attribution potential of these features. The aim of the present study, then, was to examine the effectiveness of several style-markers in authorship attribution. The style-markers chosen for the empirical investigation are those that can be retrieved from a non-lemmatized corpus of plain text files, such as the most frequent words, word bi-grams, different letter sequences, and markers of different kinds combined in one sample. Equally important, however, was to compare the usefulness of the chosen style-markers across several languages: English, Polish, German, and Latin. The results confirmed the high attribution effectiveness of word-based style-markers in the English corpus, whereas the alternative markers were usually more effective in the other languages.
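
The style-markers named in this abstract (most frequent words, word bi-grams, letter sequences) can all be read off plain, non-lemmatized text. The following is only a minimal sketch of that idea, not the study's actual procedure: the tokenizer, the function names, and the toy sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters only (Unicode-aware); no lemmatization.
    return re.findall(r"[^\W\d_]+", text.lower())


def word_ngrams(tokens, n):
    # Consecutive word sequences, e.g. n=2 gives word bi-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    # Letter sequences taken over the normalized text, spaces included.
    s = " ".join(tokenize(text))
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def relative_frequencies(items, top):
    # Relative frequency of the `top` most frequent items.
    counts = Counter(items)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.most_common(top)}


# Toy input (hypothetical); a real study would use whole novels.
sample = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)
mfw = relative_frequencies(tokens, top=3)
bigrams = relative_frequencies(word_ngrams(tokens, 2), top=3)
trigrams = relative_frequencies(char_ngrams(sample, 3), top=3)
```

Each dictionary is one candidate feature set. Merging several of them into a single feature vector would correspond to the combined markers mentioned in the abstract.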

    Stylometry in a bilingual setup

    The method of stylometry by most frequent words does not allow direct comparison of original texts and their translations, i.e. across languages. For instance, in a bilingual Czech-German text collection containing parallel texts (originals and translations in both directions, along with Czech and German translations from other languages), authors would not cluster across languages, since frequency word lists for any Czech texts are obviously going to be more similar to each other than to a German text, and vice versa. We have tried to come up with an interlingua that would remove the language-specific features and possibly keep the language-independent features of the individual authorial signal, if they exist. We have tagged, lemmatized, and parsed each language counterpart with the corresponding language model in UDPipe, which provides a linguistic markup that is cross-lingual to a significant extent. We stripped the output of language-dependent items, but that alone did not help much. As a next step, we transformed the lemmas of both language counterparts into shared pseudolemmas based on a very crude Czech-German glossary, with a 95.6% success rate. We show that, for stylometric methods based on the most frequent words, we can do without translations.

    The stylometry of film dialogue : pros and pitfalls

    We examine film dialogue with quantitative textual analysis (stylometry, sentiment analysis, distant reading). Working with transcribed dialogue from almost 300 productions, we explore the complex way in which most-frequent-words-based stylometry and lexicon-based sentiment analysis produce patterns of similarity and difference between screenwriters and/or a priori IMDB-defined genres. In fact, some of our results show that counting and comparing very frequent word lists reveals further similarities: of theme, implied audience, and stylistic patterning. The results are encouraging enough to suggest that such a quantitative approach to film dialogue may become a welcome addition to the arsenal of film studies methodology.

    O uniwersaliach tłumaczeniowych w wybranych współczesnych polskich tłumaczeniach literackich [On translation universals in selected contemporary Polish literary translations]

    This pilot study attempts to examine the potential of selected corpus linguistics and computational stylistics methods in the investigation of translation universals in translational literary Polish. More specifically, the study deals with T-universals (after Chesterman 2004), which are also referred to as intralingual translation universals (Grabowski 2011), with emphasis on core patterns of lexical use, as proposed by Laviosa (1998, 2002), and the leveling-out hypothesis, as proposed by Baker (1996). To that end, two custom-designed corpora, with approximately 500,000 tokens each, of contemporary translational and non-translational literary Polish were compiled. The results of the study reveal that, on the whole, translated texts are more varied lexically than non-translated Polish texts, but also have more repetitions and lower lexical variety among top-frequency words. On the other hand, the study shows that non-translational texts have higher lexical variety among bottom-frequency words, where one can usually find author-specific and creative vocabulary. The results of multivariate methods (Principal Components Analysis and Cluster Analysis) confirm the leveling-out hypothesis, according to which translated texts resemble one another more than they resemble originals written in the same language.

    Success rates in most-frequent-word-based authorship attribution : a case study of 1000 Polish novels from Ignacy Krasicki to Jerzy Pilch

    The success rate of authorship attribution by multivariate analysis of most-frequent-word frequencies is studied in a 1000-novel corpus of Polish literary works from the late 18th to the early 21st century. The results are examined for possible influences of the number of authors and/or the number of texts to be attributed. Also, the success rates achieved in this study are compared to those obtained in earlier studies for smaller corpora, which were perhaps too small to produce regular patterns. This study shows that text sets of this size confirm the intuitive predictions as to those influences: 1) the more authors, the less successful the attribution; 2) for the same number of authors, the number of texts to be attributed does not influence the success rate.

    Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish

    In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, which complicates most statistical procedures. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was consistently worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15%.

    Drawing Elena Ferrante's Profile. Workshop Proceedings, Padova, 7 September 2017

    Elena Ferrante is an internationally acclaimed Italian novelist whose real identity has been kept secret by E/O publishing house for more than 25 years. Owing to her popularity, major Italian and foreign newspapers have long tried to discover her real identity. However, only a few attempts have been made to foster a scientific debate on her work. In 2016, Arjuna Tuzzi and Michele Cortelazzo led an Italian research team that conducted a preliminary study and collected a well-founded, large corpus of Italian novels comprising 150 works published in the last 30 years by 40 different authors. Moreover, they shared their data with a select group of international experts on authorship attribution, profiling, and analysis of textual data: Maciej Eder and Jan Rybicki (Poland), Patrick Juola (United States), Vittorio Loreto and his research team, Margherita Lalli and Francesca Tria (Italy), George Mikros (Greece), Pierre Ratinaud (France), and Jacques Savoy (Switzerland). The chapters of this volume report the results of this endeavour that were first presented during the international workshop Drawing Elena Ferrante's Profile in Padua on 7 September 2017 as part of the 3rd IQLA-GIAT Summer School in Quantitative Analysis of Textual Data. The fascinating research findings suggest that Elena Ferrante’s work definitely deserves “many hands” as well as an extensive effort to understand her distinct writing style and the reasons for her worldwide success.

    Quantifying origin and character of long-range correlations in narrative texts

    In natural language, using short sentences is considered efficient for communication. However, a text composed exclusively of such sentences looks technical and is boring to read. A text composed of long ones, on the other hand, demands significantly more effort for comprehension. Studying characteristics of the sentence length variability (SLV) in a large corpus of world-famous literary texts shows that an appealing and aesthetic optimum appears somewhere in between and involves a self-similar, cascade-like alternation of sentences of various lengths. A related quantitative observation is that the power spectra S(f) of the SLV characterized in this way universally develop a convincing 1/f^beta scaling with the average exponent beta ≈ 1/2, close to what has been identified before in musical compositions or in brain waves. An overwhelming majority of the studied texts simply obey such fractal attributes, but especially spectacular in this respect are hypertext-like, "stream of consciousness" novels. In addition, they appear to develop structures characteristic of irreducibly interwoven sets of fractals called multifractals. Scaling of S(f) in the present context implies the existence of long-range correlations in texts, and the appearance of multifractality indicates that they carry a nonlinear component as well. A distinct role of full stops in inducing the long-range correlations in texts is evidenced by the fact that the above quantitative characteristics manifest themselves in the variation of full-stop recurrence times along texts, thus in SLV, but to a much lesser degree in the recurrence times of the most frequent words. In this latter case the nonlinear correlations, and thus multifractality, disappear completely for all the texts considered. Treated as one extra word, however, the full stops appear at the same time to obey the Zipfian rank-frequency distribution. Comment: 28 pages, 8 figures, accepted for publication in Information Science
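
The quantity at the centre of this abstract, the power spectrum S(f) of a sentence-length series, can be illustrated with a short sketch. This is not the authors' pipeline: the sentence splitter, the naive discrete Fourier transform, the normalization, and the toy text are all assumptions made for illustration.

```python
import cmath
import re


def sentence_lengths(text):
    # Split on sentence-final punctuation and count words per sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def power_spectrum(series):
    # Periodogram of the mean-centered series for frequencies k/n,
    # k = 1 .. n//2, computed with a naive O(n^2) Fourier transform.
    n = len(series)
    avg = sum(series) / n
    centered = [x - avg for x in series]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Toy text (hypothetical) whose sentence lengths strictly alternate 2, 4, 2, 4, ...
text = "Go now. We went home early. " * 4
lengths = sentence_lengths(text)
spectrum = power_spectrum(lengths)
```

For this strictly alternating toy series all the power sits at the highest frequency. Under 1/f^beta scaling, the power would instead fall off as frequency rises, and beta could be estimated as the negative slope of a straight line fitted to log S(f) against log f.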

    Challenging stylometry: The authorship of the baroque play La Segunda Celestina

    The aim of this study was to verify the possibility that Sor Juana Inés de la Cruz authored the anonymous part of the baroque play La Segunda Celestina, commissioned from Agustín de Salazar and left unfinished after his death. This is the first systematic stylometric study of this problem and of a baroque Hispano-American text. In our study, we faced the challenge of building a balanced corpus from the few available resources, and took extensive evaluation measures to deal with unclear stylometric signals. We use a variety of established attribution and verification methods, and introduce a novel evaluation procedure for examining historic texts with scarce corpora. The results support Sor Juana’s authorship, and reveal new connections between her and other authors of the time, showing the powerful, still underestimated impact of her works on the epoch. The solutions adopted in solving the methodological problems of such a complex task show how stylometry can overcome similar challenges.

    The translator’s wife’s traces : Alma Cardell Curtin and Jeremiah Curtin

    Jeremiah Curtin translated most works by Poland’s first literary Nobel Prize winner, Henryk Sienkiewicz. He was helped in this life-long task by his wife, Alma Cardell Curtin. It was Alma who, after her husband’s death, produced the lengthy Memoirs she steadfastly ascribed to her husband for his, rather than her, greater glory. This paper investigates the possible textual influences Alma might have had on other works by her husband, including his travelogues, ethnographic and mythological studies, and the translations themselves. Lacking traditional authorial evidence, this study relies on stylometric methods comparing most frequent word usage by means of cluster analysis of z-scores. There is much in this statistics-based authorial attribution to show how Alma Cardell Curtin affected at least two other original works of her husband and, possibly, at least two of his translations as well.
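
The phrase "cluster analysis of z-scores" in this abstract can be made concrete with a small sketch of the step that precedes clustering: standardizing most-frequent-word frequencies across texts and measuring pairwise distances. The abstract does not name a distance measure, so the mean absolute difference of z-scores used here (commonly known as Burrows's Delta) is an assumption, as are the toy texts, the word list, and all function names.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocabulary):
    # Relative frequency of each vocabulary word in one text.
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocabulary]


def z_scores(rows):
    # Standardize each column (one word) across all texts.
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) or 1.0 for c in columns]  # guard against zero spread
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def delta(a, b):
    # Mean absolute difference between two z-score profiles.
    return mean([abs(x - y) for x, y in zip(a, b)])


# Toy texts (hypothetical); labels A1/A2/B1 are illustrative only.
texts = {
    "A1": "the sea and the sky and the sea".split(),
    "A2": "the sea and the sand and the sea".split(),
    "B1": "of a hill of a tree of the hill".split(),
}
vocabulary = ["the", "and", "of", "a"]
names = list(texts)
z = z_scores([relative_freqs(texts[n], vocabulary) for n in names])
dist = {
    (p, q): delta(z[i], z[j])
    for i, p in enumerate(names)
    for j, q in enumerate(names)
    if i < j
}
closest = min(dist, key=dist.get)
```

A hierarchical clustering routine would take the full distance table as input and merge the closest pair first, which here is the two texts sharing a function-word profile.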