    Brazilian Portuguese Hotel’s Reviews Corpus

    Corpus composed of hotel reviews from TripAdvisor. On TripAdvisor, when writing a review, users must enter the number of circles corresponding to their overall evaluation of the hotel, give the evaluation a title, write the evaluation itself with at least 200 characters, and choose the month and year of the visit; other information is optional. We collected four pieces of information from each evaluation: the number of circles, the title, the evaluation text and the date on which the evaluation was made. The data were collected only from accommodations classified as hotels. The reviews were collected from February to March 2018, so the most recent review dates from March 20, 2018. Reviews were taken from hotels in the 26 capitals of the Brazilian states and in the Federal District. We chose to collect reviews of hotels in the capitals in order to have a clear criterion for delimiting the number of cities while still covering all Brazilian states. We gathered a total of 730,069 reviews. The first normalization we applied to the corpus was the removal of excess punctuation and of sequences of repeated letters that did not form words. Because TripAdvisor requires a comment of at least 200 characters, users often padded their comments with punctuation sequences and random characters that did not form words, so we removed several of these occurrences. We kept ellipses and sequences of up to three exclamation or question marks (which may carry some meaning for sentiment analysis). We developed a lexical dataset to help us reduce the number of words that were run together (e.g. “Bomcafédamanhã”). We separated terms such as numbers or hyphens (preceded and followed by spaces) that were attached to words (e.g. “8Limpeza” to “8 Limpeza” and “-Gostei” to “- Gostei”). In addition, we kept the words the way they were written, even if incorrectly, whether due to typing errors or intentionally.
For example, we kept terms such as “adoreeeeei” and “lliiiixxxoooo” (similar to “I loveeeeed it” and “trrraaaaaassshh”, respectively). On the Internet it is common to write terms in uppercase as a way of emphasizing something, either positively or negatively; for this reason we also kept the texts' capitalization intact. After carrying out the normalizations, we obtained a count of 56,743,114 tokens and 246,307 types. Counting stop words and punctuation marks, each review consists, on average, of about 77 tokens, with the longest review having 2,857 tokens and the shortest having only 2 tokens. More details can be found in the paper: SOUZA, J. G. R.; OLIVEIRA, A. P.; MOREIRA, A. Development of a Brazilian Portuguese Hotel's Reviews Corpus. In: PROPOR - International Conference on Computational Processing of the Portuguese Language, 2018, Canela. PROPOR - International Conference on Computational Processing of the Portuguese Language, 2018. v. 11122. p. 353-361.
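
The punctuation and spacing normalizations described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `normalize_review` and every regular expression in it are assumptions chosen to reproduce the examples given in the description. It does not cover the lexicon-based splitting of run-together words or the removal of random character sequences, and it deliberately leaves elongated words and capitalization untouched.

```python
import re


def normalize_review(text: str) -> str:
    """Hypothetical sketch of the punctuation/spacing normalization."""
    # Keep at most three consecutive exclamation/question marks.
    text = re.sub(r"([!?]{3})[!?]+", r"\1", text)
    # Reduce runs of four or more dots to a plain ellipsis.
    text = re.sub(r"\.{4,}", "...", text)
    # Assumption: other repeated punctuation collapses to one character.
    text = re.sub(r"([,;:\-_*#=~])\1+", r"\1", text)
    # Separate a number glued to the start of a word: "8Limpeza" -> "8 Limpeza".
    text = re.sub(r"(?<=\d)(?=[^\W\d_])", " ", text)
    # Separate a leading hyphen glued to a word: "-Gostei" -> "- Gostei".
    # Hyphens inside words (e.g. "bem-vindo") are not affected.
    text = re.sub(r"(?<!\S)-(?=[^\W\d_])", "- ", text)
    return text


print(normalize_review("8Limpeza"))        # 8 Limpeza
print(normalize_review("-Gostei"))         # - Gostei
print(normalize_review("adoreeeeei!!!!!!"))  # adoreeeeei!!!
```

Rules of this kind are order-sensitive and may over-split legitimate tokens (for instance "4G"), which is presumably one reason a lexical resource was used for the harder cases.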