946 research outputs found

    Popularity Prediction of Reddit Texts

    Popularity prediction is a useful technique for marketers to anticipate the success of marketing campaigns, to build recommendation systems that suggest new products to consumers, and to develop targeted advertising. Researchers likewise use popularity prediction to measure how popularity changes within a community or within a given timespan. In this paper, I explore ways to predict the popularity of posts on reddit.com, which is a blend of news aggregator and community forum. I frame popularity prediction as a text classification problem and attempt to solve it by first identifying topics in the text and then classifying whether the identified topics are more characteristic of popular or unpopular texts. This classifier is then used to label unseen texts as popular or not, depending on the topics found in these new posts. I explore the use of Latent Dirichlet Allocation and term frequency-inverse document frequency for topic identification, and naïve Bayes classifiers and support vector machines for classification. The relation between topics and popularity is dynamic: topics in Reddit communities can wax and wane in popularity. Despite this inherent variability, the methods explored in the paper are effective, showing prediction accuracy between 60% and 75%. The study contributes to the field in various ways. For example, it provides novel data for research and development, not only for text classification but also for the study of the relation between topics and popularity in general. The study also helps us better understand different topic identification and classification methods by illustrating their effectiveness on real-life data from a fast-changing and multi-purpose website.
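
    The classification step described in this abstract can be illustrated with a minimal sketch. It assumes a plain bag-of-words multinomial naïve Bayes classifier with add-one smoothing, which is one common variant and not necessarily the configuration used in the paper. The topic identification step (LDA or tf-idf) is left out, and the toy posts, labels, and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(posts, labels):
    """Collect the counts needed by a multinomial naive Bayes classifier."""
    class_counts = Counter(labels)        # number of documents per class
    word_counts = defaultdict(Counter)    # token counts per class
    vocab = set()
    for text, label in zip(posts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, not taken from the paper's Reddit corpus.
TOY_POSTS = [
    "cute cat picture",
    "funny cat video",
    "tax form question",
    "boring tax rules",
]
TOY_LABELS = ["popular", "popular", "unpopular", "unpopular"]

model = train_naive_bayes(TOY_POSTS, TOY_LABELS)
print(predict(model, "cat video"))
print(predict(model, "tax question"))
```

    In a fuller pipeline the raw tokens would be replaced by topic features, and a support vector machine could be swapped in for the classifier, as the abstract mentions.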

    Topic modelling of Finnish Internet discussion forums as a tool for trend identification and marketing applications

    The increasing availability of public discussion text data on the Internet motivates the study of methods for identifying current themes and trends. Being able to extract and summarize relevant information from public data in real time gives a company a competitive advantage and enables applications in its marketing actions. This thesis presents a method of topic modelling and trend identification to extract information from Finnish Internet discussion forums. The development of text analytics, and especially of topic modelling techniques, is reviewed, and suitable methods are identified from the literature. The Latent Dirichlet Allocation topic model and the Dynamic Topic Model are applied to find underlying topics in the Internet discussion forum data. The collection of discussion data with web scraping and the text preprocessing methods are presented. Trends are identified with a method derived from outlier detection. Real-world events, such as the news about the Finnish army's vegetarian meal day and the Helsinki summit of Presidents Trump and Putin, were identified in an unsupervised manner. Applications for marketing are considered, e.g. automatic generation of search engine advert keywords and website content recommendation. Future prospects for further improving the developed topical trend identification method are proposed. These include the use of more complex topic models, an extensive framework for tuning trend identification parameters, and the use of more domain-specific text data sources such as blogs, social media feeds, or customer feedback.
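
    The abstract does not say which outlier detection method the trend identification is derived from. As one possible reading, the sketch below flags a time step as a trend point when a topic's share rises well above its recent history, using a rolling z-score. The function name, window size, threshold, and the toy series are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_trend_points(series, window=5, threshold=3.0):
    """Return indices whose value is an upward outlier versus the previous `window` values.

    `series` is assumed to hold one topic's share of the discussion per time step.
    """
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined, so skip this step
        z_score = (series[i] - mu) / sigma
        if z_score > threshold:
            flagged.append(i)
    return flagged


# Hypothetical weekly share of one topic; the spike at index 6 mimics a news event.
topic_share = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.45, 0.12]
print(find_trend_points(topic_share))
```

    Only upward deviations are flagged here, on the assumption that a trend means a sudden rise in attention; a drop in a topic's share is ignored.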

    Analyzing the Language of Food on Social Media

    We investigate the predictive power behind the language of food on social media. We collect a corpus of over three million food-related posts from Twitter and demonstrate that many latent population characteristics can be directly predicted from this data: overweight rate, diabetes rate, political leaning, and home geographical location of authors. For all tasks, our language-based models significantly outperform the majority-class baselines. Performance is further improved with more complex natural language processing, such as topic modeling. We analyze which textual features have the most predictive power for these datasets, providing insight into the connections between the language of food, geographic locale, and community characteristics. Lastly, we design and implement an online system for real-time query and visualization of the dataset. Visualization tools, such as geo-referenced heatmaps, semantics-preserving wordclouds and temporal histograms, allow us to discover more complex, global patterns mirrored in the language of food.
    Comment: An extended abstract of this paper will appear in IEEE Big Data 201

    Sentiment analysis in hospitality using text mining: the case of a Portuguese eco-hotel

    JEL Classification System: Z32 Tourism and Development; M30 Marketing and Advertising
    The rapid development of the Internet and mobile devices enabled the emergence of travel and hospitality review sites, leading to a large number of customer opinion posts. While such comments may influence future demand for the targeted hotels, they can also be used by hotel managers to improve customer experience. Nevertheless, this trend poses a problem: the information is widely scattered, making it almost impossible to extract useful knowledge from it. In this study, with the aim of facilitating this process, sentiment classification of an eco-hotel is assessed through a text mining approach using several different sources of customer reviews. Two dictionaries are compiled to build the lexicon used to parse the 401 reviews collected for a Portuguese eco-hotel between January and August of 2015. Then, the latent Dirichlet allocation (LDA) modeling algorithm is applied to gather relevant topics that characterize a given hospitality issue by a sentiment. The findings indicate that accuracy is influenced by the interaction between the LDA-generated topic models and the correct construction of both dictionaries. The results also reveal that text mining can generate new insights into variables that have been extensively studied in the hospitality industry, including that hotel food generates ordinary positive sentiments for the case studied, while hospitality generates both ordinary and strong positive feelings. Such results are valuable for hospitality management, validating the approach proposed.

    Recommender Systems

    The ongoing rapid expansion of the Internet greatly increases the need for effective recommender systems to filter the abundant information. Extensive research on recommender systems is conducted by a broad range of communities, including social and computer scientists, physicists, and interdisciplinary researchers. Despite substantial theoretical and practical achievements, unification and comparison of different approaches are lacking, which impedes further advances. In this article, we review recent developments in recommender systems and discuss the major challenges. We compare and evaluate available algorithms and examine their roles in future developments. In addition to algorithms, physical aspects are described to illustrate the macroscopic behavior of recommender systems. Potential impacts and future directions are discussed. We emphasize that recommendation has great scientific depth and combines diverse research fields, which makes it of interest to physicists as well as interdisciplinary researchers.
    Comment: 97 pages, 20 figures (To appear in Physics Reports)

    LDA-Based Industry Classification

    Industry classification is a crucial step in financial analysis. However, existing industry classification schemes have several limitations. To overcome these limitations, we propose an industry classification methodology based on business commonalities, using the topic features learned by Latent Dirichlet Allocation (LDA) from firms’ business descriptions. Two types of classification, firm-centric and industry-centric, were explored. Preliminary evaluation results showed the effectiveness of our method.

    Modeling Dynamic User Interests: A Neural Matrix Factorization Approach

    In recent years, there has been significant interest in understanding users' online content consumption patterns. However, the unstructured, high-dimensional, and dynamic nature of such data makes extracting valuable insights challenging. Here we propose a model that combines the simplicity of matrix factorization with the flexibility of neural networks to efficiently extract nonlinear patterns from massive text data collections relevant to consumers' online consumption patterns. Our model decomposes a user's content consumption journey into nonlinear user and content factors that are used to model their dynamic interests. This natural decomposition allows us to summarize each user's content consumption journey with a dynamic probabilistic weighting over a set of underlying content attributes. The model is fast to estimate, easy to interpret, and can harness external data sources as an empirical prior. These advantages make our method well suited to the challenges posed by modern datasets. We use our model to understand the dynamic news consumption interests of Boston Globe readers over five years. Thorough qualitative studies, including a crowdsourced evaluation, highlight our model's ability to accurately identify nuanced and coherent consumption patterns. These results are supported by our model's superior and robust predictive performance over several competitive baseline methods.