Crawling microblogging services to gather language-classified URLs: Workflow and case study
We present a way to extract links from messages published on microblogging platforms and to classify them according to the language and likely relevance of their target, in order to build a text corpus. Three platforms are taken into consideration: FriendFeed, identi.ca and Reddit, as they account for a relative diversity of user profiles and, more importantly, user languages. To explore them, we introduce a traversal algorithm based on user pages. As we target lesser-known languages, we focus on non-English posts by filtering out English text. Using mature open-source software from the NLP research field, a spell checker (aspell) and a language identification system (langid.py), our case study and our benchmarks give insight into the linguistic structure of the considered services.
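The pipeline described above (extract links from posts, then keep only those from non-English messages) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `classify` is a naive stopword heuristic standing in for langid.py's `langid.classify(text)`, which returns a `(language, score)` pair.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def extract_urls(message):
    """Pull all http(s) links out of a microblog message."""
    return URL_RE.findall(message)

def classify(text):
    # Stand-in for langid.py's langid.classify(text) -> (lang, score);
    # a crude English-stopword ratio, for illustration only.
    english_stopwords = {"the", "and", "of", "to", "a", "in", "is"}
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = len(words & english_stopwords) / max(len(words), 1)
    return ("en", score) if score > 0.15 else ("other", score)

def non_english_urls(messages):
    """Keep URLs only from messages not identified as English."""
    kept = []
    for msg in messages:
        lang, _ = classify(msg)
        if lang != "en":
            kept.extend(extract_urls(msg))
    return kept
```

In the real workflow the language filter would be langid.py (optionally cross-checked with aspell), applied before the crawler follows or stores a link.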
Proceedings of the Research Data And Humanities (RDHUM) 2019 Conference: Data, Methods And Tools
Analytical bibliography aims to understand the production of books, and systematic methods can be used to build an overall view of publication history. In this paper, we present a state-of-the-art analytical approach to determining editions using ESTC metadata. Preliminary results illustrate that metadata cleanup and analysis can open opportunities for edition determination, which would significantly help projects aiming at large-scale text mining.
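One simple form the metadata cleanup step could take is normalizing title strings so that spelling and punctuation variants of the same work cluster together as candidate editions. The records and field layout below are hypothetical, purely to illustrate the idea; the ESTC schema and the authors' actual method are not shown here.

```python
import re
from collections import defaultdict

# Hypothetical ESTC-style records: (record_id, title, year).
records = [
    ("r1", "The Pilgrim's Progress", 1678),
    ("r2", "The Pilgrims Progress.", 1679),
    ("r3", "Robinson Crusoe", 1719),
]

def normalize(title):
    """Collapse case and punctuation so spelling variants match."""
    return re.sub(r"[^a-z ]", "", title.lower()).strip()

def group_editions(records):
    """Cluster records sharing a normalized title as candidate editions."""
    groups = defaultdict(list)
    for rec_id, title, year in records:
        groups[normalize(title)].append((rec_id, year))
    return dict(groups)
```

Here the two Pilgrim's Progress records fall into one group despite differing punctuation, while Robinson Crusoe stays separate; real edition determination would of course draw on more fields (imprint, format, pagination) than the title alone.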