Statistical keyword detection in literary corpora
Understanding the complexity of human language requires an appropriate
analysis of the statistical distribution of words in texts. We consider the
information retrieval problem of detecting and ranking the relevant words of a
text by means of statistical information referring to the "spatial" use of the
words. Shannon's entropy of information is used as a tool for automatic keyword
extraction. By using The Origin of Species by Charles Darwin as a
representative text sample, we show the performance of our detector and compare
it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices.
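The detector described above ranks words by how unevenly they spread across the text. A minimal sketch of that idea (not the paper's exact estimator): split the text into equal partitions, compute each word's Shannon entropy over the partitions, and score words by how far they fall below the maximum entropy of a uniformly spread word. The function name and parameters are illustrative, not taken from the paper.

```python
from collections import Counter
import math

def entropy_keywords(words, n_parts=10, min_count=5):
    """Rank words by how unevenly they spread across equal-size text
    partitions: low normalized entropy suggests topical, clustered use.
    A sketch of entropy-based keyword detection; names are hypothetical."""
    part_len = max(1, len(words) // n_parts)
    # trailing tokens beyond n_parts * part_len are ignored in this sketch
    parts = [words[i * part_len:(i + 1) * part_len] for i in range(n_parts)]
    counts = [Counter(p) for p in parts]
    totals = Counter(words)
    scores = {}
    for w, n in totals.items():
        if n < min_count:
            continue  # rare words give unreliable entropy estimates
        probs = [c[w] / n for c in counts if c[w] > 0]
        h = -sum(p * math.log(p) for p in probs)
        # normalize by the maximum entropy log(n_parts); higher score
        # means more clustered, hence more likely a keyword
        scores[w] = 1.0 - h / math.log(n_parts)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A shuffled copy of the same text, as the abstract suggests, gives a baseline score distribution against which real clustering can be calibrated.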
Identifying idiolect in forensic authorship attribution: an n-gram textbite approach
Forensic authorship attribution is concerned with identifying the authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist "approaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [. . .], their own idiolect" (Coulthard, 2004: 31). However, given the difficulty in empirically substantiating a theory of idiolect, there is growing concern in the field that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams: two- to six-word chunks of language, similar to the popular notion of soundbites, small segments of no more than a few seconds of speech that journalists recognise as having news value and which characterise the important moments of talk. The soundbite offers an intriguing parallel for authorship attribution studies, raising the following question: looking at any set of texts by any author, is it possible to identify "n-gram textbites", small textual segments that characterise that author's writing, providing DNA-like chunks of identifying material?
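The textbite idea above can be sketched as a small filter: collect the two- to six-word n-grams an author repeats and drop any that also occur in a reference set of other writers' texts. This is a simplified illustration of the approach, not the study's actual procedure; function names and thresholds are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous word n-grams of length n, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def textbites(author_tokens, reference_tokens, n_range=(2, 6), min_freq=2):
    """Find word n-grams (2-6 words) repeated by the author but unseen
    in the reference tokens: candidate 'n-gram textbites'. A sketch,
    not the study's method; min_freq is an illustrative threshold."""
    ref = set()
    for n in range(n_range[0], n_range[1] + 1):
        ref.update(ngrams(reference_tokens, n))
    hits = Counter()
    for n in range(n_range[0], n_range[1] + 1):
        for g in ngrams(author_tokens, n):
            if g not in ref:
                hits[g] += 1
    # keep only n-grams the author repeats at least min_freq times
    return [(" ".join(g), c) for g, c in hits.most_common() if c >= min_freq]
```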
Towards the quantification of the semantic information encoded in written language
Written language is a complex communication signal capable of conveying
information encoded in the form of ordered sequences of words. Beyond the local
order ruled by grammar, semantic and thematic structures affect long-range
patterns in word usage. Here, we show that a direct application of information
theory quantifies the relationship between the statistical distribution of
words and the semantic content of the text. We show that there is a
characteristic scale, roughly around a few thousand words, which establishes
the typical size of the most informative segments in written language.
Moreover, we find that the words whose contributions to the overall information
are larger are the ones most closely associated with the main subjects and
topics of the text. This scenario can be explained by a model of word usage
that assumes that words are distributed along the text in domains of a
characteristic size where their frequency is higher than elsewhere. Our
conclusions are based on the analysis of a large database of written language,
diverse in subjects and styles, and thus are likely to be applicable to general
language sequences encoding complex information.
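The "information" the abstract refers to can be illustrated with the mutual information between a token's word identity and the segment it falls in, computed at a given segment length; scanning segment lengths would reveal the characteristic scale. This is a sketch of the general measure, not the authors' exact estimator.

```python
from collections import Counter
import math

def word_segment_information(words, seg_len):
    """Mutual information (in nats) between word identity and the segment
    a token falls in, for a given segment length. A sketch; the function
    name and interface are illustrative."""
    segs = [words[i:i + seg_len] for i in range(0, len(words), seg_len)]
    n = len(words)
    p_w = {w: c / n for w, c in Counter(words).items()}  # marginal word probs
    info = 0.0
    for seg in segs:
        p_s = len(seg) / n  # marginal probability of the segment
        for w, c in Counter(seg).items():
            p_ws = c / n  # joint probability of (word, segment)
            info += p_ws * math.log(p_ws / (p_w[w] * p_s))
    return info
```

With a single segment the measure is exactly zero, and it grows as word usage becomes clustered into topical domains, matching the model of domains of characteristic size described in the abstract.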
An Unusual Look at a Historical Monograph: Analysis with Artificial Intelligence (AI) Tools
Goal and theses: The article aims to examine the applicability of methods based on processing large sets of information to research in the social sciences.
Conception/research methods: The dynamic development of new research methods based on the automated processing of large data sets using artificial intelligence (AI) means that they are used in an increasingly wide range of disciplines, going beyond the exact and natural sciences. Text mining was combined with available CLARIN web applications and a keyword extraction and analysis strategy that pairs the YAKE! extractor, written in Python, with the VOSViewer program for the visualisation of bibliometric networks.
Results and conclusions: The study showed how automatic keyword extraction creates opportunities in social science research. The use of CLARIN and Google Pinpoint web tools in the analysis significantly facilitates working with a large body of texts and accelerates its analysis.
Cognitive value/originality: The study indicates new research methods that can contribute to the development of the social sciences. Perspectives for applying these ways of dealing with large data sets to research on society are presented, and conclusions regarding the development of the digital social sciences are formulated.
Analyzing Research Tendencies of ELT Researchers and Trajectory of English Language Teaching and Learning in the Last Five Years
In line with new advances in language teaching methodologies and the integration of high-technology tools and web applications, a great deal of research on English language teaching (ELT) and learning (ELL) has been published in recent years. However, it remains a significant open question exactly which research topics are most studied among researchers from different countries, and which research groups lead the field worldwide. Although there are noteworthy literature reviews clarifying the most studied topics and the trajectory of ELT research, very few studies compare researchers' tendencies using a text/content-mining methodology. Papers reviewing the literature are mostly limited in depicting a broad understanding of the scope of such studies, and a corpus-based detection methodology, which could illuminate these tendencies and trajectories and yield informative descriptive results for the field, is largely missing. The current research therefore aims to identify the most frequent research contexts and topics of the last five years by analyzing papers published in leading academic journals in the field, to compare the tendencies of researchers from different institutions and countries in selecting their research contexts and topics, and to outline a trajectory for future studies. The researchers assume that there are different tendencies among researchers in selecting research contexts and topics, which should be revealed for future research. The study uses a corpus-based detection methodology, in which variable data are stored in .txt files and analyzed with a concordancer.
The corpus-based detection method consists of gathering the textual data referred to in the variables and analyzing them by means of a concordancer, AntConc. The corpus-based data from the variables are then analyzed with statistical software, JASP, to identify potential differences among the researchers. A short analysis of the data indicates that researchers still focus on keywords such as explicit learning and knowledge, implicit learning and knowledge, as well as age and bilingualism. Meta-analysis is also observed to be an important topic in recently conducted studies. Further results of the study could benefit researchers and learners inside and outside the field of ELT and help them focus on less frequently studied contexts and topics.
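The concordancer workflow described above centres on keyword-in-context (KWIC) views of the kind AntConc displays. A minimal sketch of such a view, assuming whitespace-tokenized input (the function and its parameters are illustrative, not AntConc's API):

```python
def kwic(tokens, keyword, window=4):
    """Key Word In Context: return (left, key, right) rows for each
    occurrence of `keyword`, like the display of a concordancer.
    Matching is case-insensitive; `window` counts words per side."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows
```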
Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus
The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, mean that issues of non-standard spellings and spelling errors are likely to become more prominent for compilers of corpora in future. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, with the British National Corpus (BNC) acting as the reference corpus. The ranks of the keywords are shown to be very similar and therefore suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction.
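Keyword generation against a reference corpus of the kind described above is commonly done with Dunning's log-likelihood (G2) statistic; a word's frequency in the study corpus is compared with its frequency in the reference corpus (here the BNC). The abstract does not name the statistic used, so this is a sketch of one standard keyness measure, not necessarily the paper's:

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Dunning's log-likelihood (G2) keyness statistic: how surprising a
    word's frequency in the target corpus is, given a reference corpus.
    Proportional frequencies give 0; larger values mean stronger keyness."""
    total = freq_target + freq_ref
    n = size_target + size_ref
    e1 = size_target * total / n  # expected target frequency
    e2 = size_ref * total / n     # expected reference frequency
    g2 = 0.0
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / e1)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / e2)
    return 2 * g2
```

Running this on the raw and spelling-corrected word counts would yield the two keyword rankings whose similarity the paper reports.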