2 research outputs found

    Keywords at Work: Investigating Keyword Extraction in Social Media Applications

    This dissertation examines a long-standing problem in Natural Language Processing (NLP), keyword extraction, from a new angle. We investigate how keyword extraction can be formulated on social media data such as emails, product reviews, student discussions, and student statements of purpose. We design novel graph-based features for supervised and unsupervised keyword extraction from emails, and successfully apply the resulting system to uncover patterns in a new dataset of student statements of purpose. Furthermore, the system is used with new features on the problem of usage expression extraction from product reviews, where we obtain interesting insights. Applied to student discussions, the system uncovers new and exciting patterns. While each of the above problems is conceptually distinct, they share two key elements: keywords and social data. Social data can be messy, hard to interpret, and not easily amenable to existing NLP resources. We show that our system is robust enough in the face of such challenges to discover useful and important patterns. We also show that the problem definition of keyword extraction itself can be expanded to accommodate new and challenging research questions and datasets.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/145929/1/lahiri_1.pd

    Keyword extraction and named entity recognition on Reddit submissions

    The goal of this thesis was to create a pipeline for extracting valuable information from short natural-language texts, specifically Reddit submissions. The two main areas of research were keyword extraction and named entity recognition, the latter applied to recognizing actors and movie titles in the texts. We implemented and evaluated four approaches to keyword extraction (RAKE, TextRank, LSTM and biLSTM networks) and three approaches to named entity recognition (spaCy library models, Stanford NER, and fine-tuned BERT models). The analysis showed that the best results were achieved with a three-layer biLSTM network for keyword extraction, an uncased BERT model fine-tuned on the MIT movie corpus dataset for the recognition of actors, and a BERT model fine-tuned on the OntoNotes 5 dataset for the recognition of movie titles.