Creating a contemporary corpus of similes in Serbian by using natural language processing
A simile is a figure of speech that compares two things by means of connecting
words, where the comparison is not intended to be taken literally.
Similes are often used in everyday communication, but they are also a part of
linguistic cultural heritage. In this paper we present a methodology for
semi-automated collection of similes from the World Wide Web using text mining
and machine learning techniques. We collected 442 similes from the internet
and added them to an existing corpus of 333 similes compiled by Vuk Stefanovic
Karadzic. We also introduce crowdsourcing to the collection of figures of
speech, which helped us build a corpus containing 787 unique similes.

Comment: 15 pages. Submitted to the journal Slovo but later withdrawn for
correction. No additional work has been done on it, so it is still waiting to
be extended. Output of the system can be seen here:
http://ezbirka.starisloveni.com/. arXiv admin note: text overlap with
arXiv:1605.0631