1,354 research outputs found

    Statistical keyword detection in literary corpora

    Understanding the complexity of human language requires an appropriate analysis of the statistical distribution of words in texts. We consider the information retrieval problem of detecting and ranking the relevant words of a text by means of statistical information referring to the "spatial" use of the words. Shannon's entropy of information is used as a tool for automatic keyword extraction. By using The Origin of Species by Charles Darwin as a representative text sample, we show the performance of our detector and compare it with other proposals in the literature. The randomly shuffled text receives special attention as a tool for calibrating the ranking indices. Comment: Published version. 11 pages, 7 figures. SVJour for LaTeX2
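As a rough illustration of the idea in this abstract, the sketch below scores each word by how unevenly it is spread across equal-sized parts of a text, using Shannon's entropy. It is a minimal sketch under stated assumptions, not the authors' estimator: the function name `spatial_entropy_scores`, the parameters `n_parts` and `min_count`, and the normalisation `1 - H / log(n_parts)` are all illustrative choices, and the calibration against a shuffled text that the abstract mentions is not reproduced here.

```python
import math


def spatial_entropy_scores(tokens, n_parts=8, min_count=5):
    """Score words by how clustered they are across n_parts equal slices.

    Returns {word: 1 - H / log(n_parts)}, where H is the Shannon entropy of
    the word's distribution over the slices. A score near 0 means the word is
    spread evenly; a score near 1 means it is concentrated in one slice.
    Hypothetical helper: names and normalisation are assumptions.
    """
    if n_parts < 2:
        raise ValueError("n_parts must be at least 2")
    part_len = math.ceil(len(tokens) / n_parts)

    # Count occurrences of every word in every slice of the text.
    counts = {}
    for i, tok in enumerate(tokens):
        part = i // part_len
        counts.setdefault(tok, [0] * n_parts)[part] += 1

    scores = {}
    for word, per_part in counts.items():
        total = sum(per_part)
        if total < min_count:
            continue  # too rare for a meaningful distribution
        entropy = -sum((c / total) * math.log(c / total) for c in per_part if c)
        scores[word] = 1.0 - entropy / math.log(n_parts)
    return scores


# Toy text: "the" occurs evenly in all 8 slices, "finch" only in the first.
toy = []
for p in range(8):
    toy += ["the"] * 5 + (["finch"] * 5 if p == 0 else ["w%d" % p] * 5)
toy_scores = spatial_entropy_scores(toy)
```

On the toy text, the evenly spread function word scores close to 0 and the word confined to one slice scores 1, which is the contrast an entropy-based ranking relies on.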

    Towards the quantification of the semantic information encoded in written language

    Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, roughly around a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contribution to the overall information is larger are the ones more closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information. Comment: 19 pages, 4 figures

    An Unusual Look at a Historical Monograph: Analysis with Artificial Intelligence (AI) Tools

    Goal and theses: The article tests the applicability of methods based on processing large sets of information to research in the social sciences. Conception/research methods: New research methods based on the automated processing of large data sets using artificial intelligence (AI) are developing rapidly and are being used in an increasingly wide range of disciplines, beyond the exact and natural sciences. Text mining was combined with available CLARIN web applications and a keyword extraction and analysis strategy that pairs YAKE!, written in Python, with the VOSviewer program for visualising bibliometric networks. Results and conclusions: The study showed how automatic keyword extraction creates opportunities in social science research. Using the CLARIN and Google Pinpoint web tools significantly facilitates working with a large body of texts and accelerates its analysis. Cognitive value/originality: The study indicates new research methods that can contribute to the development of the social sciences. It presents perspectives for applying methods of handling large data sets to research on society and formulates conclusions regarding the development of the digital social sciences.

    Analyzing Research Tendencies of ELT Researchers and Trajectory of English Language Teaching and Learning in the last Five Years

    In line with new advances in language teaching methodologies and the integration of high-technology tools and web applications, a large body of research on English language teaching (ELT) and learning (ELL) has been published in recent years. However, it remains an open question exactly which research topics are most studied by researchers from different countries, and which research groups lead worldwide. Although noteworthy literature reviews have sought to clarify the most studied topics and the trajectory of ELT research, very few studies compare researchers' tendencies using text/content mining methodology. In fact, literature review papers are mostly limited in how broadly they depict the scope of such studies. Moreover, a corpus-based detection methodology, which could illuminate those research tendencies and trajectories and yield informative descriptive results in the field, is missing. In sum, the current research aims to find the most frequent research contexts and topics of the last five years by analyzing research papers published in leading academic journals in the field, to compare the tendencies of researchers from different institutions and countries in selecting their research contexts and topics, and to map out the trajectory for future studies. The researchers hypothesize that researchers differ in their tendencies when selecting research contexts and topics, and that these differences should be revealed for future research. The study uses a corpus-based detection methodology, which consists of storing variable data in .txt files and analyzing the variables with a concordancer. The corpus-based detection method is the process of gathering the textual data named in the variables and analyzing them with the concordancer AntConc. The corpus-based data from the variables are then analyzed with the statistical software JASP in order to clarify potential differences among the researchers. A short analysis of the data indicates that researchers still focus on keywords such as explicit learning and knowledge, implicit learning and knowledge, age, and bilingualism. Meta-analysis is also observed to be an important topic in recently conducted studies. Further results of the study could benefit all followers, including researchers and learners inside and outside the field of ELT, and help people focus on less frequently studied contexts and topics.

    Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus

    The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, means that issues of non-standard spellings and spelling errors are, in future, likely to become more prominent for compilers of corpora. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, and the British National Corpus (BNC) acts as the reference corpus. The ranks of the keywords are shown to be very similar and, therefore, suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction
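To make the notion of keywords generated against a reference corpus more concrete, here is a minimal sketch of one common keyness statistic, the log-likelihood ratio. The abstract does not say which statistic or tool the study used, so the choice of log-likelihood, the function names `log_likelihood_keyness` and `ranked`, and the toy token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood_keyness(study_tokens, reference_tokens):
    """Return {word: log-likelihood score} for words overused in the study corpus.

    Assumed statistic: the two-corpus log-likelihood ratio. Words that are
    relatively more frequent in the reference corpus are left out.
    """
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    n1, n2 = sum(study.values()), sum(ref.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both corpora must be non-empty")

    scores = {}
    for word, a in study.items():
        b = ref.get(word, 0)
        # Expected counts if the word were equally likely in both corpora.
        e1 = n1 * (a + b) / (n1 + n2)
        e2 = n2 * (a + b) / (n1 + n2)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        if a / n1 > b / n2:  # keep only positively key words
            scores[word] = ll
    return scores


def ranked(scores):
    """Words ordered from most to least key (ties broken alphabetically)."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Toy corpora standing in for a study corpus and a reference corpus.
study = ["headache"] * 10 + ["the"] * 10
reference = ["the"] * 50 + ["of"] * 50 + ["headache"]
keywords = ranked(log_likelihood_keyness(study, reference))
```

Running the same function on an original and a spelling-corrected version of a corpus, then comparing the two `ranked` lists, would be one way to check how much spelling variation shifts keyword ranks.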

    DARIAH and the Benelux
