Crawling microblogging services to gather language-classified URLs: Workflow and case study
We present a way to extract links from messages published on microblogging platforms and to classify them according to the language and likely relevance of their target, in order to build a text corpus. Three platforms are taken into consideration: FriendFeed, identi.ca and Reddit, as they account for a relative diversity of user profiles and, more importantly, user languages. To explore them, we introduce a traversal algorithm based on user pages. As we target lesser-known languages, we focus on non-English posts by filtering out English text. Using mature open-source software from the NLP research field, a spell checker (aspell) and a language identification system (langid.py), our case study and our benchmarks give insight into the linguistic structure of the considered services.
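The pipeline described above (extract links from posts, then keep only those from non-English messages) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `classify` is a naive stopword heuristic standing in for langid.py's `langid.classify(text)`, which returns a `(language, score)` pair.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def extract_urls(message):
    """Pull all http(s) links out of a microblog message."""
    return URL_RE.findall(message)

def classify(text):
    # Stand-in for langid.py's langid.classify(text) -> (lang, score);
    # a crude English-stopword ratio, for illustration only.
    english_stopwords = {"the", "and", "of", "to", "a", "in", "is"}
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = len(words & english_stopwords) / max(len(words), 1)
    return ("en", score) if score > 0.15 else ("other", score)

def non_english_urls(messages):
    """Keep URLs only from messages not identified as English."""
    kept = []
    for msg in messages:
        lang, _ = classify(msg)
        if lang != "en":
            kept.extend(extract_urls(msg))
    return kept
```

In the real workflow the language filter would be langid.py (optionally cross-checked with aspell), applied before the crawler follows or stores a link.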
Proceedings of the Research Data And Humanities (RDHUM) 2019 Conference: Data, Methods And Tools
Analytical bibliography aims to understand the production of books, and systematic methods can be used to build an overall view of publication history. In this paper, we present a state-of-the-art analytical approach to determining editions using ESTC metadata. Preliminary results illustrate that metadata cleanup and analysis can open opportunities for edition determination, which would significantly help projects aiming at large-scale text mining.
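One simple form the metadata cleanup step could take is normalizing title strings so that spelling and punctuation variants of the same work cluster together as candidate editions. The records and field layout below are hypothetical, purely to illustrate the idea; the ESTC schema and the authors' actual method are not shown here.

```python
import re
from collections import defaultdict

# Hypothetical ESTC-style records: (record_id, title, year).
records = [
    ("r1", "The Pilgrim's Progress", 1678),
    ("r2", "The Pilgrims Progress.", 1679),
    ("r3", "Robinson Crusoe", 1719),
]

def normalize(title):
    """Collapse case and punctuation so spelling variants match."""
    return re.sub(r"[^a-z ]", "", title.lower()).strip()

def group_editions(records):
    """Cluster records sharing a normalized title as candidate editions."""
    groups = defaultdict(list)
    for rec_id, title, year in records:
        groups[normalize(title)].append((rec_id, year))
    return dict(groups)
```

Here the two Pilgrim's Progress records fall into one group despite differing punctuation, while Robinson Crusoe stays separate; real edition determination would of course draw on more fields (imprint, format, pagination) than the title alone.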