Semantically Analyzed Metadata of Tumblr Posts and Bloggers
Abstract
The dataset "Tumblr.zip" is the first-ever published dataset on Tumblr. It contains Tumblr metadata of posts and bloggers collected via a bootstrapping method. The dataset also contains various features extracted by semantically analyzing the text of each post.
The dataset contains three files: Tumblr.sql, semtags.txt, and a README file. Tumblr.sql creates 8 tables in MySQL: blogger, blogger_desc, Document_Sentiment_Feature, Post_desc, Posts, Semantic_Tagging, Tone, and Topic_Classification.
semtags.txt is a lexicon of the tags/codes used for semantic tagging of each post. This list was created by USAS (UCREL Semantic Analysis System). The following is a list and description of all attributes and tables used in the dataset. Attributes that appear in more than one table are listed only once.
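As a minimal sketch of how the lexicon might be loaded, the Python snippet below assumes that each line of semtags.txt holds one code, followed by whitespace, followed by its description. The actual layout of the file is not specified in this description, so the function name `load_semtags`, the line format, and the sample lines are all assumptions to be adjusted after inspecting the file.

```python
def load_semtags(lines):
    """Map each semantic code to its description.

    Assumption: one entry per line, written as a code, whitespace,
    then the description. Blank lines are ignored.
    """
    lexicon = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        code = parts[0]
        description = parts[1] if len(parts) > 1 else ""
        lexicon[code] = description
    return lexicon


# Illustrative sample lines (hypothetical, not copied from semtags.txt).
sample = ["A1.1.1\tGeneral actions", "", "Z99 Unmatched"]
lexicon = load_semtags(sample)
```

With the real file, the same function could be called on an open file handle, for example `load_semtags(open("semtags.txt", encoding="utf-8"))`, provided the one-entry-per-line assumption holds.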
1. Table- Posts, Post_desc
Post_ID- unique id of each post
Timestamp- Timestamp of when the post was created
gmt- GMT timestamp of each post
blogger- unique id of the author of the post
url- short url to original Tumblr post
tags- tags/keywords associated with the posts.
num_tags- number of tags in each post
type- type of a post (Text, Quote, Chat, ...)
notes- number of notes (likes + reblogs) on a post
rebloggedfrom- id of the blogger from whose profile the post was retrieved. Null if the post was originally created by 'blogger'.
title- title of the post
Desc- description or body content of the post.
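The rebloggedfrom attribute described above can be used to separate original posts from reblogs. The following Python sketch illustrates this on a toy in-memory SQLite table that stands in for the MySQL Posts table. The column types, the sample rows, and the helper name `count_posts_by_origin` are assumptions for illustration, not part of the dataset.

```python
import sqlite3

# Toy stand-in for the Posts table; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Posts (Post_ID INTEGER PRIMARY KEY, blogger INTEGER, "
    "type TEXT, notes INTEGER, rebloggedfrom INTEGER)"
)
conn.executemany(
    "INSERT INTO Posts VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "Text", 5, None),   # original post by blogger 10
        (2, 11, "Quote", 12, 10),   # retrieved from blogger 10's profile
        (3, 10, "Chat", 0, None),   # original post by blogger 10
    ],
)


def count_posts_by_origin(connection):
    """Count original posts (rebloggedfrom IS NULL) and reblogs."""
    row = connection.execute(
        "SELECT SUM(rebloggedfrom IS NULL), SUM(rebloggedfrom IS NOT NULL) "
        "FROM Posts"
    ).fetchone()
    return {"original": row[0], "reblogged": row[1]}


counts = count_posts_by_origin(conn)
```

The same SELECT statement should also run against the MySQL tables created by Tumblr.sql, since it relies only on the NULL convention stated for rebloggedfrom.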
2. Table- blogger, blogger_desc
blogger_id- unique id of a blogger
ask- whether the blogger allows questions on their profile
ask_anon- whether the blogger allows anonymous questions from other bloggers
like_count- number of likes on the blogger's page
post_count- number of posts made by the blogger (including re-blogged posts)
title- title of blogger's page
desc- description of blogger
3. Table- Document_Sentiment_Feature
score- sentiment score of a post
label- sentiment label assigned based on the score value
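Because attributes shared between tables are listed only once, the feature tables presumably carry the same Post_ID as Posts. Under that assumption, the sketch below joins sentiment scores to post metadata and averages the score per post type. It uses a toy in-memory SQLite database in place of MySQL, and the column types, sample rows, and label values ("positive", "negative") are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-ins for two of the dataset's tables; schemas are assumptions.
conn.execute("CREATE TABLE Posts (Post_ID INTEGER PRIMARY KEY, type TEXT)")
conn.execute(
    "CREATE TABLE Document_Sentiment_Feature "
    "(Post_ID INTEGER, score REAL, label TEXT)"
)
conn.executemany(
    "INSERT INTO Posts VALUES (?, ?)",
    [(1, "Text"), (2, "Quote"), (3, "Text")],
)
conn.executemany(
    "INSERT INTO Document_Sentiment_Feature VALUES (?, ?, ?)",
    [(1, 0.5, "positive"), (2, -0.5, "negative"), (3, 0.25, "positive")],
)

# Join on the shared Post_ID and average the sentiment score per post type.
rows = conn.execute(
    "SELECT p.type, AVG(s.score) "
    "FROM Posts p JOIN Document_Sentiment_Feature s ON s.Post_ID = p.Post_ID "
    "GROUP BY p.type ORDER BY p.type"
).fetchall()
mean_score_by_type = dict(rows)
```

If the join key in Tumblr.sql is named differently, only the ON clause needs to change.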
4. Table- Tone
Emotion- label and confidence score of an emotion tone in a post (joy, fear, sadness, ...)
Writing- label and confidence score of a writing tone in a post (analytical, confident, ...)
Social- label and confidence score of a social tone in a post (openness, conscientiousness)
lang_post- language of a post (English, Arabic, German, Italian, ...)
taxonomy- topics discussed or mentioned in the post
Class- label assigned to a post by our classifier
Tagged_Posts- contains the original text post encoded with semantic tags (codes available in the semtags.txt file)