Statistical keyword detection in literary corpora
Understanding the complexity of human language requires an appropriate
analysis of the statistical distribution of words in texts. We consider the
information retrieval problem of detecting and ranking the relevant words of a
text by means of statistical information referring to the "spatial" use of the
words. Shannon's entropy of information is used as a tool for automatic keyword
extraction. By using The Origin of Species by Charles Darwin as a
representative text sample, we show the performance of our detector and compare
it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices.
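The detector described above ranks words by how unevenly they spread across the text. A minimal sketch of that idea (not the paper's exact estimator): split the text into equal partitions, compute each word's Shannon entropy over the partitions, and score words by how far they fall below the maximum entropy of a uniformly spread word. The function name and parameters are illustrative, not taken from the paper.

```python
from collections import Counter
import math

def entropy_keywords(words, n_parts=10, min_count=5):
    """Rank words by how unevenly they spread across equal-size text
    partitions: low normalized entropy suggests topical, clustered use.
    A sketch of entropy-based keyword detection; names are hypothetical."""
    part_len = max(1, len(words) // n_parts)
    # trailing tokens beyond n_parts * part_len are ignored in this sketch
    parts = [words[i * part_len:(i + 1) * part_len] for i in range(n_parts)]
    counts = [Counter(p) for p in parts]
    totals = Counter(words)
    scores = {}
    for w, n in totals.items():
        if n < min_count:
            continue  # rare words give unreliable entropy estimates
        probs = [c[w] / n for c in counts if c[w] > 0]
        h = -sum(p * math.log(p) for p in probs)
        # normalize by the maximum entropy log(n_parts); higher score
        # means more clustered, hence more likely a keyword
        scores[w] = 1.0 - h / math.log(n_parts)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A shuffled copy of the same text, as the abstract suggests, gives a baseline score distribution against which real clustering can be calibrated.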
Identifying idiolect in forensic authorship attribution: an n-gram textbite approach
Forensic authorship attribution is concerned with identifying the authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist "approaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [. . .], their own idiolect" (Coulthard, 2004: 31). However, given the difficulty in empirically substantiating a theory of idiolect, there is growing concern in the field that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams: two- to six-word chunks of language, similar to the popular notion of soundbites, small segments of no more than a few seconds of speech that journalists recognise as having news value and which characterise the important moments of talk. The soundbite offers an intriguing parallel for authorship attribution studies, raising the following question: looking at any set of texts by any author, is it possible to identify "n-gram textbites", small textual segments that characterise that author's writing, providing DNA-like chunks of identifying material?
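The textbite idea above can be sketched as a small filter: collect the two- to six-word n-grams an author repeats and drop any that also occur in a reference set of other writers' texts. This is a simplified illustration of the approach, not the study's actual procedure; function names and thresholds are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous word n-grams of length n, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def textbites(author_tokens, reference_tokens, n_range=(2, 6), min_freq=2):
    """Find word n-grams (2-6 words) repeated by the author but unseen
    in the reference tokens: candidate 'n-gram textbites'. A sketch,
    not the study's method; min_freq is an illustrative threshold."""
    ref = set()
    for n in range(n_range[0], n_range[1] + 1):
        ref.update(ngrams(reference_tokens, n))
    hits = Counter()
    for n in range(n_range[0], n_range[1] + 1):
        for g in ngrams(author_tokens, n):
            if g not in ref:
                hits[g] += 1
    # keep only n-grams the author repeats at least min_freq times
    return [(" ".join(g), c) for g, c in hits.most_common() if c >= min_freq]
```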
Towards the quantification of the semantic information encoded in written language
Written language is a complex communication signal capable of conveying
information encoded in the form of ordered sequences of words. Beyond the local
order ruled by grammar, semantic and thematic structures affect long-range
patterns in word usage. Here, we show that a direct application of information
theory quantifies the relationship between the statistical distribution of
words and the semantic content of the text. We show that there is a
characteristic scale, roughly around a few thousand words, which establishes
the typical size of the most informative segments in written language.
Moreover, we find that the words whose contributions to the overall information
are larger are the ones most closely associated with the main subjects and
topics of the text. This scenario can be explained by a model of word usage
that assumes that words are distributed along the text in domains of a
characteristic size where their frequency is higher than elsewhere. Our
conclusions are based on the analysis of a large database of written language,
diverse in subjects and styles, and thus are likely to be applicable to general
language sequences encoding complex information.
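The "information" the abstract refers to can be illustrated with the mutual information between a token's word identity and the segment it falls in, computed at a given segment length; scanning segment lengths would reveal the characteristic scale. This is a sketch of the general measure, not the authors' exact estimator.

```python
from collections import Counter
import math

def word_segment_information(words, seg_len):
    """Mutual information (in nats) between word identity and the segment
    a token falls in, for a given segment length. A sketch; the function
    name and interface are illustrative."""
    segs = [words[i:i + seg_len] for i in range(0, len(words), seg_len)]
    n = len(words)
    p_w = {w: c / n for w, c in Counter(words).items()}  # marginal word probs
    info = 0.0
    for seg in segs:
        p_s = len(seg) / n  # marginal probability of the segment
        for w, c in Counter(seg).items():
            p_ws = c / n  # joint probability of (word, segment)
            info += p_ws * math.log(p_ws / (p_w[w] * p_s))
    return info
```

With a single segment the measure is exactly zero, and it grows as word usage becomes clustered into topical domains, matching the model of domains of characteristic size described in the abstract.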
An Unusual Look at a Historical Monograph: Analysis with Artificial Intelligence (AI) Tools
Goal and theses: The article aims to examine the applicability of methods based on processing large sets of information to research in the social sciences.
Conception/research methods: The dynamic development of new research methods based on the automated processing of large data sets using artificial intelligence (AI) means that they are used in an increasingly wide range of disciplines, going beyond the exact and natural sciences. Text mining was combined with available CLARIN web applications and a keyword extraction and analysis strategy that pairs the YAKE! extractor, written in Python, with the VOSViewer program for the visualisation of bibliometric networks.
Results and conclusions: The study showed how automatic keyword extraction creates opportunities in social science research. The use of CLARIN and Google Pinpoint web tools in the analysis significantly facilitates working with a large body of texts and accelerates its analysis.
Cognitive value/originality: The study indicates new research methods that can contribute to the development of the social sciences. Perspectives for applying these ways of dealing with large data sets to research on society are presented, and conclusions regarding the development of the digital social sciences are formulated.
Analyzing Research Tendencies of ELT Researchers and Trajectory of English Language Teaching and Learning in the Last Five Years
In line with new advances in language teaching methodologies and the integration of high-technology tools and web applications, a great deal of research on English language teaching (ELT) and learning (ELL) has been published in recent years. However, it remains a significant open question exactly which research topics are most studied among researchers from different countries, and which research groups lead the field worldwide. Although there are noteworthy literature reviews clarifying the most studied topics and the trajectory of ELT research, very few studies compare researchers' tendencies using a text/content-mining methodology. Papers reviewing the literature are mostly limited in depicting a broad understanding of the scope of such studies, and a corpus-based detection methodology, which could illuminate these tendencies and trajectories and yield informative descriptive results for the field, is largely missing. The current research therefore aims to identify the most frequent research contexts and topics of the last five years by analyzing papers published in leading academic journals in the field, to compare the tendencies of researchers from different institutions and countries in selecting their research contexts and topics, and to outline a trajectory for future studies. The researchers assume that there are different tendencies among researchers in selecting research contexts and topics, which should be revealed for future research. The study uses a corpus-based detection methodology, in which variable data are stored in .txt files and analyzed with a concordancer.
The corpus-based detection method consists of gathering the textual data referred to in the variables and analyzing them by means of a concordancer, AntConc. The corpus-based data from the variables are then analyzed with statistical software, JASP, to identify potential differences among the researchers. A short analysis of the data indicates that researchers still focus on keywords such as explicit learning and knowledge, implicit learning and knowledge, as well as age and bilingualism. Meta-analysis is also observed to be an important topic in recently conducted studies. Further results of the study could benefit researchers and learners inside and outside the field of ELT and help them focus on less frequently studied contexts and topics.
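The concordancer workflow described above centres on keyword-in-context (KWIC) views of the kind AntConc displays. A minimal sketch of such a view, assuming whitespace-tokenized input (the function and its parameters are illustrative, not AntConc's API):

```python
def kwic(tokens, keyword, window=4):
    """Key Word In Context: return (left, key, right) rows for each
    occurrence of `keyword`, like the display of a concordancer.
    Matching is case-insensitive; `window` counts words per side."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows
```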
Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus
The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, mean that issues of non-standard spellings and spelling errors are likely to become more prominent for compilers of corpora in future. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, with the British National Corpus (BNC) acting as the reference corpus. The ranks of the keywords are shown to be very similar and therefore suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction.
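Keyword generation against a reference corpus of the kind described above is commonly done with Dunning's log-likelihood (G2) statistic; a word's frequency in the study corpus is compared with its frequency in the reference corpus (here the BNC). The abstract does not name the statistic used, so this is a sketch of one standard keyness measure, not necessarily the paper's:

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Dunning's log-likelihood (G2) keyness statistic: how surprising a
    word's frequency in the target corpus is, given a reference corpus.
    Proportional frequencies give 0; larger values mean stronger keyness."""
    total = freq_target + freq_ref
    n = size_target + size_ref
    e1 = size_target * total / n  # expected target frequency
    e2 = size_ref * total / n     # expected reference frequency
    g2 = 0.0
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / e1)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / e2)
    return 2 * g2
```

Running this on the raw and spelling-corrected word counts would yield the two keyword rankings whose similarity the paper reports.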