
    Recent Advances in Social Data and Artificial Intelligence 2019

    The importance and usefulness of subjects and topics involving social data and artificial intelligence are becoming widely recognized. This book contains invited review, expository, and original research articles dealing with, and presenting state-of-the-art accounts of, the recent advances in the subjects of social data and artificial intelligence, and potentially their links to cyberspace.

    Automatic keyphrase extraction on Amazon reviews

    People are facing severe challenges posed by big data. As an important type of online text, product reviews have evoked much research interest because of their commercial potential. This thesis takes Amazon camera reviews as its research focus and implements an automatic keyphrase extraction system. The system consists of three modules: the Crawler module, the Extraction module, and the Web module. The Crawler module is responsible for capturing Amazon product reviews. The Web module is responsible for obtaining user input and displaying the final results. The Extraction module is the core processing module of the system, which analyzes product reviews in the following sequence: (1) Pre-processing of review data, including removal of stop words and segmentation. (2) Candidate keyphrase extraction: through the spaCy part-of-speech tagger and dependency parser, the dependency relationships of each review sentence are obtained, and the feature and opinion words are then extracted based on several predefined dependency rules. (3) Candidate keyphrase clustering: using a Latent Dirichlet Allocation (LDA) model, the candidate keyphrases are clustered according to their topics. (4) Candidate keyphrase ranking: two different algorithms, LDA-TFIDF and LDA-MT, are applied to rank the keyphrases in different clusters to obtain the representative keyphrases. The experimental results show that the system performs well on the task of keyphrase extraction.
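
    As a rough illustration of step (2), below is a minimal sketch of rule-based candidate extraction with spaCy. The two dependency rules and the example sentence are illustrative assumptions, not the thesis's actual rule set.

```python
# A minimal sketch of dependency-rule candidate extraction, assuming
# spaCy's small English model is installed (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_candidates(review: str):
    """Return (feature, opinion) pairs from one review sentence."""
    pairs = []
    for token in nlp(review):
        # Rule 1 (assumed): adjectival modifier, e.g. "sharp lens" -> (lens, sharp)
        if token.dep_ == "amod" and token.head.pos_ == "NOUN":
            pairs.append((token.head.lemma_, token.lemma_))
        # Rule 2 (assumed): copular predicate, e.g. "the battery is weak"
        if token.dep_ == "acomp" and token.head.pos_ in ("VERB", "AUX"):
            for subj in (t for t in token.head.lefts if t.dep_ == "nsubj"):
                pairs.append((subj.lemma_, token.lemma_))
    return pairs

print(extract_candidates("The zoom lens is sharp but the battery life is disappointing."))
```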

    Keywords at Work: Investigating Keyword Extraction in Social Media Applications

    This dissertation examines a long-standing problem in Natural Language Processing (NLP), keyword extraction, from a new angle. We investigate how keyword extraction can be formulated on social media data, such as emails, product reviews, student discussions, and student statements of purpose. We design novel graph-based features for supervised and unsupervised keyword extraction from emails, and use the resulting system with success to uncover patterns in a new dataset: student statements of purpose. Furthermore, the system is used with new features on the problem of usage expression extraction from product reviews, where we obtain interesting insights. The system, when used on student discussions, uncovers new and exciting patterns. While each of the above problems is conceptually distinct, they share two key common elements: keywords and social data. Social data can be messy, hard to interpret, and not easily amenable to existing NLP resources. We show that our system is robust enough in the face of such challenges to discover useful and important patterns. We also show that the problem definition of keyword extraction itself can be expanded to accommodate new and challenging research questions and datasets.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies.
    https://deepblue.lib.umich.edu/bitstream/2027.42/145929/1/lahiri_1.pd
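
    As an illustration of the graph-based idea, here is a minimal sketch (not the dissertation's feature set): words become nodes, co-occurrence within a small sliding window becomes edges, and PageRank scores rank candidates. The tokenizer and window size are assumptions.

```python
# A minimal TextRank-style keyword ranker over a word co-occurrence graph.
import re
import networkx as nx

def keywords(text: str, window: int = 3, top_k: int = 5):
    tokens = re.findall(r"[a-z]+", text.lower())
    graph = nx.Graph()
    # Link each word to its neighbors within the sliding window.
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            graph.add_edge(tokens[i], tokens[j])
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(keywords("graph based keyword extraction ranks words by their centrality in the graph"))
```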

    Constructing and modeling text-rich information networks: a phrase mining-based approach

    A lot of digital ink has been spilled on "big data" over the past few years, which is often characterized by an explosion of information. Most of this surge owes its origin to unstructured data in the wild, like words, images, and video, as compared to the structured information stored in fielded form in databases. The proliferation of text-heavy data is particularly overwhelming, reflected in everyone's daily life in the form of web documents, business reviews, news, social posts, etc. In the meantime, textual data and structured entities often come intertwined, such as authors/posters, document categories and tags, and document-associated geo locations. Against this background, a core research challenge presents itself: how to turn massive, (semi-)unstructured data into structured knowledge. One promising paradigm studied in this dissertation is to integrate structured and unstructured data, constructing an organized heterogeneous information network, and developing powerful modeling mechanisms on such an organized network. We name it a text-rich information network, since it is an integrated representation of both structured and unstructured textual data. To thoroughly develop the construction and modeling paradigm, this dissertation focuses on forming a scalable data-driven framework and proposes a new line of techniques relying on the idea of phrase mining to bridge textual documents and structured entities. We first introduce the phrase mining method named SegPhrase+ to globally discover semantically meaningful phrases from massive textual data, providing a high-quality dictionary for text structuralization. Clearly distinct from previous works that mostly focused on raw statistics of string matching, SegPhrase+ looks into the phrase context and effectively rectifies raw statistics to significantly boost performance. Next, a novel algorithm based on latent keyphrases is developed and adopted to largely eliminate irregularities in massive text by providing a consistent and interpretable document representation. As a critical process in constructing the network, it uses the quality phrases generated in the previous step as candidates. From them, a set of keyphrases is extracted to represent a particular document, with inferred strength, through a statistical model. After this step, documents become more structured and are consistently represented in the form of a bipartite network connecting documents with quality keyphrases. A more heterogeneous text-rich information network can be constructed by incorporating different types of document-associated entities as additional nodes. Lastly, a general and scalable framework, Tensor2vec, is added to traditional data mining mechanisms, as the latter cannot readily handle an organized heterogeneous network whose nodes have different types. Tensor2vec is expected to elegantly handle relevance search, entity classification, summarization, and recommendation problems by making use of higher-order link information and projecting multi-typed nodes into a shared low-dimensional vector space such that node proximity can be easily computed and accurately predicted.
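
    A minimal sketch of the bipartite document-keyphrase network described above, using networkx; the toy documents and phrase lists are placeholders standing in for SegPhrase+/latent-keyphrase output.

```python
# Build a bipartite network linking documents to quality keyphrases.
import networkx as nx

doc_phrases = {
    "doc1": ["information network", "phrase mining"],
    "doc2": ["phrase mining", "topic model"],
}

G = nx.Graph()
G.add_nodes_from(doc_phrases, bipartite="doc")
for doc, phrases in doc_phrases.items():
    for phrase in phrases:
        G.add_node(phrase, bipartite="phrase")
        G.add_edge(doc, phrase, weight=1.0)  # weight = inferred keyphrase strength

# Documents sharing quality phrases become reachable from one another,
# the structure later exploited for search and recommendation.
print(list(nx.common_neighbors(G, "doc1", "doc2")))
```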

    Statistical natural language processing methods for intelligent process automation

    Nowadays, digitization is transforming the way businesses work. Recently, Artificial Intelligence (AI) techniques have become an essential part of the automation of business processes: in addition to cost advantages, these techniques offer fast processing times and higher customer satisfaction rates, thus ultimately increasing sales. One of the intelligent approaches for accelerating digital transformation in companies is Robotic Process Automation (RPA). An RPA system is a software tool that robotizes routine and time-consuming responsibilities such as email assessment, various calculations, or the creation of documents and reports (Mohanty and Vyas, 2018). Its main objective is to organize a smart workflow and thereby assist employees by offering them more scope for cognitively demanding and engaging work. Intelligent Process Automation (IPA) offers all these advantages as well; however, it goes beyond RPA by adding AI components such as machine and deep learning techniques to conventional automation solutions. Previously, IPA approaches were primarily employed within the computer vision domain. In recent times, however, Natural Language Processing (NLP) has become one of the potential applications for IPA as well, due to its ability to understand and interpret human language. Usually, NLP methods are used to analyze large amounts of unstructured textual data and to respond to various inquiries. One of the central applications of NLP within the IPA domain is conversational interfaces (e.g., chatbots, virtual agents) that enable human-to-machine communication. Nowadays, conversational agents are in enormous demand due to their ability to support a large number of users simultaneously while communicating in a natural language. The implementation of a conversational agent comprises multiple stages and involves diverse types of NLP sub-tasks, starting with natural language understanding (e.g., intent recognition, named entity extraction) and going towards dialogue management (i.e., determining the next possible bot action) and response generation. A typical dialogue system for IPA purposes handles straightforward customer support requests (e.g., FAQs), allowing human workers to focus on more complicated inquiries. In this thesis, we address two potential Intelligent Process Automation (IPA) applications and employ statistical Natural Language Processing (NLP) methods for their implementation. The first block of this thesis (Chapter 2 – Chapter 4) deals with the development of a conversational agent for IPA purposes within the e-learning domain. As already mentioned, chatbots are one of the central applications for the IPA domain since they can effectively perform time-consuming tasks while communicating in a natural language. Within this thesis, we realized an IPA conversational bot that takes care of routine and time-consuming tasks regularly performed by human tutors of an online mathematical course. This bot is deployed in a real-world setting within the OMB+ mathematical platform. Conducting experiments for this part, we observed two possibilities to build the conversational agent in industrial settings: first, with purely rule-based methods, considering the missing training data and individual aspects of the target domain (i.e., e-learning).
Second, we re-implemented two of the main system components (i.e., the Natural Language Understanding (NLU) and Dialogue Manager (DM) units) using the current state-of-the-art deep learning architecture (i.e., Bidirectional Encoder Representations from Transformers (BERT)) and investigated their performance and potential use as part of a hybrid model (i.e., one containing both rule-based and machine learning methods). The second part of the thesis (Chapter 5 – Chapter 6) considers an IPA subproblem within the predictive analytics domain and addresses the task of scientific trend forecasting. Predictive analytics forecasts future outcomes based on historical and current data. Therefore, using the benefits of advanced analytics models, an organization can, for instance, reliably determine trends and emerging topics and then leverage them when making significant business decisions (i.e., investments). In this work, we dealt with the trend detection task; specifically, we addressed the lack of publicly available benchmarks for evaluating trend detection algorithms. We assembled a benchmark for the detection of both scientific trends and downtrends (i.e., topics that become less frequent over time). To the best of our knowledge, the task of downtrend detection has not been addressed before. The resulting benchmark is based on a collection of more than one million documents, which is among the largest that has been used for trend detection, and therefore offers a realistic setting for the development of trend detection algorithms.

Robotic Process Automation (RPA) is a type of software bot that mimics manual human activities such as entering data into a system, logging into user accounts, or executing simple but repetitive workflows (Mohanty and Vyas, 2018). One of the main advantages, and at the same time a drawback, of RPA bots is their ability to fulfil the given task with pinpoint accuracy. On the one hand, such a system can execute a task accurately, carefully, and quickly. On the other hand, it is very susceptible to changes in the defined scenarios. Since an RPA bot is designed for one specific task, it is often impossible to adapt it to other domains or even to simple changes in a workflow (Mohanty and Vyas, 2018). This inability to adapt to changing conditions led to a further area of improvement for RPA bots: Intelligent Process Automation (IPA) systems. IPA bots combine RPA with Artificial Intelligence (AI) and can perform complex and cognitively more demanding tasks that require, among other things, reasoning and natural language understanding. These systems take over time-consuming and routine tasks, thereby enabling an intelligent workflow and freeing professionals to carry out more complicated work. So far, IPA techniques have mainly been applied in the field of computer vision. Recently, however, Natural Language Processing (NLP) has also become one of the potential applications for IPA, owing to its ability to interpret human language. NLP methods are used to analyze large amounts of text data and to respond to various inquiries. They are likewise applied when the available data are unstructured or have no predefined format (e.g., emails), or when they come in a variable format (e.g., invoices, legal documents), in order to extract the relevant information, which can then be used to solve various problems. NLP within IPA is not limited to extracting relevant data from text documents, however. One of the central applications of IPA is conversational agents, which are used for human-machine interaction. Conversational agents are in enormous demand because they can support a large number of users simultaneously while communicating in natural language. The implementation of a chat system comprises various types of NLP subtasks, starting with natural language understanding (e.g., intent recognition, entity extraction), through dialogue management (e.g., determining the next possible bot action), up to response generation. A typical dialogue system for IPA purposes usually handles straightforward customer support requests (e.g., answering FAQs), so that employees can focus on more complex inquiries. This dissertation comprises two areas united by the broader topic of Intelligent Process Automation (IPA) using statistical Natural Language Processing (NLP) methods. The first block of this work (Chapter 2 – Chapter 4) deals with the implementation of a conversational agent for IPA purposes within the e-learning domain. As already mentioned, chatbots are one of the central applications for the IPA domain, since they can effectively perform time-consuming tasks while communicating in natural language. The IPA conversational bot realized in this work likewise takes care of routine and time-consuming tasks otherwise performed by tutors in a German-language online mathematics course. This bot is deployed in daily use within the OMB+ mathematical platform. While conducting experiments, we observed two possibilities for developing the conversational agent in an industrial setting: first, with purely rule-based methods, given the missing training data and the particular aspects of the target domain (i.e., e-learning). Second, we re-implemented two of the main system components (the natural language understanding module and the dialogue manager) with the currently most advanced deep learning algorithm and investigated the performance of these components. The second part of the doctoral thesis (Chapter 5 – Chapter 6) considers an IPA problem within the predictive analytics domain. Predictive analytics aims to produce forecasts of future outcomes on the basis of historical and current data. A company can therefore use prediction systems, for example, to reliably determine trends or emerging topics and then use this information in important business decisions (e.g., investments). In this part of the work we deal with the subproblem of trend forecasting, in particular with the lack of publicly available benchmarks for the evaluation of trend detection algorithms. We assembled and published a benchmark for detecting both trends and downtrends. To the best of our knowledge, the task of downtrend detection has not been addressed before. The resulting benchmark is based on a collection of more than one million documents, which is among the largest used for trend detection so far, and thus offers a realistic setting for the development of trend detection algorithms.
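
    A minimal sketch of the hybrid NLU idea described above: hand-written rules fire first (useful when training data is missing), with a learned classifier as fallback. The intents, patterns, and fallback stub are illustrative assumptions, not the OMB+ bot's actual configuration.

```python
# Rule-based intent recognition with a hook for a learned fallback model.
import re

RULES = {
    "greeting": re.compile(r"\b(hallo|hi|guten tag)\b", re.I),
    "deadline_question": re.compile(r"\b(frist|deadline|abgabe)\b", re.I),
}

def rule_based_intent(utterance: str):
    for intent, pattern in RULES.items():
        if pattern.search(utterance):
            return intent
    return None

def classify(utterance: str):
    intent = rule_based_intent(utterance)
    if intent is not None:
        return intent, "rule"
    # Fallback: any trained classifier can slot in here; a BERT model
    # fine-tuned on logged utterances would mirror the thesis's hybrid setup.
    return "out_of_scope", "model"

print(classify("Hallo, bis wann ist die Abgabe?"))
```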

    Predictive Modeling for Navigating Social Media

    Social media changes the way people use the Web. It has transformed ordinary Web users from information consumers to content contributors. One popular form of content contribution is social tagging, in which users assign tags to Web resources. By the collective efforts of the social tagging community, a new information space has been created for information navigation. Navigation allows serendipitous discovery of information by examining the information objects linked to one another in the social tagging space. In this dissertation, we study prediction tasks that facilitate navigation in social tagging systems. For social tagging systems to meet the complex navigation needs of users, two issues are fundamental, namely link sparseness and object selection. Link sparseness is observed for many resources that are untagged or inadequately tagged, hindering navigation to those resources. Object selection arises when a large number of information objects are linked to the current object, requiring the system to select the more interesting or relevant ones for guiding navigation effectively. This dissertation focuses on three dimensions, namely the semantic, social, and temporal dimensions, to address link sparseness and object selection. To address link sparseness, we study the task of tag prediction. This task aims to enrich tags for the untagged or inadequately tagged resources, such that the predicted tags can serve as navigable links to these resources. For this task, we take a topic modeling approach to exploit the latent semantic relationships between resource content and tags. To address object selection, we study the tasks of personalized tag recommendation and trend discovery using social annotations. Personalized tag recommendation leverages the collective wisdom of the social tagging community to recommend tags that are semantically relevant to the target resource, while being tailored to the tagging preferences of individual users. For this task, we propose a probabilistic framework which leverages the implicit social links between like-minded users, i.e., users who show similar tagging preferences, to recommend suitable tags. Social tags capture the interest of users in the annotated resources at different times. These social annotations allow us to construct temporal profiles for the annotated resources. By analyzing these temporal profiles, we unveil the non-trivial temporal trends of the annotated resources, which provide novel metrics for selecting relevant and interesting resources for guiding navigation. For trend discovery using social annotations, we propose a trend discovery process which enables us to analyze trends for a multitude of semantics encapsulated in the temporal profiles of the annotated resources.
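
    A minimal sketch of the topic-modeling approach to tag prediction, assuming gensim's LdaModel: an untagged resource is mapped into topic space, and words dominating its top topic become candidate tags. The toy corpus is a placeholder, and this simplification omits the dissertation's full model.

```python
# Predict tags for an untagged resource via its inferred topic mixture.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["python", "code", "tutorial"],
    ["recipe", "cooking", "dinner"],
    ["python", "scripting", "automation"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

untagged = dictionary.doc2bow(["python", "automation"])
topics = sorted(lda.get_document_topics(untagged), key=lambda t: -t[1])
top_topic = topics[0][0]
# Words that dominate the resource's top topic serve as predicted tags.
print([word for word, _ in lda.show_topic(top_topic, topn=3)])
```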

    A Comprehensive Survey of Artificial Intelligence Techniques for Talent Analytics

    In today's competitive and fast-evolving business environment, it is a critical time for organizations to rethink how to make talent-related decisions in a quantitative manner. Indeed, the recent development of Big Data and Artificial Intelligence (AI) techniques has revolutionized human resource management. The availability of large-scale talent and management-related data provides unparalleled opportunities for business leaders to comprehend organizational behaviors and gain tangible knowledge from a data science perspective, which in turn delivers intelligence for real-time decision-making and effective talent management at work for their organizations. In the last decade, talent analytics has emerged as a promising field in applied data science for human resource management, garnering significant attention from AI communities and inspiring numerous research efforts. To this end, we present an up-to-date and comprehensive survey on AI technologies used for talent analytics in the field of human resource management. Specifically, we first provide the background knowledge of talent analytics and categorize various pertinent data. Subsequently, we offer a comprehensive taxonomy of relevant research efforts, categorized based on three distinct application-driven scenarios: talent management, organization management, and labor market analysis. In conclusion, we summarize the open challenges and potential prospects for future research directions in the domain of AI-driven talent analytics.
    Comment: 30 pages, 15 figures

    Textual Analysis of Intangible Information

    Traditionally, equity investors have relied upon the information reported in firms' financial accounts to make their investment decisions. Due to the conservative nature of accounting standards, firms cannot value their intangible assets such as corporate culture, brand value, and reputation. Investors' efforts to collect such information have been hampered by the voluntary nature of Corporate Social Responsibility (CSR) reporting standards, which has resulted in the publication of inconsistent, stale, and incomplete information across firms. In short, information on intangible assets is less salient to investors than accounting information because it is more costly to collect, process, and analyse. In this thesis we design an automated approach to collect and quantify information on firms' intangible assets by drawing upon techniques commonly adopted in the fields of Natural Language Processing (NLP) and Information Retrieval. The exploitation of unstructured data available on the Web holds promise for investors seeking to integrate a wider variety of information into their investment processes. The objectives of this research are: 1) to draw upon textual analysis methodologies to measure intangible information from a range of unstructured data sources, 2) to integrate intangible information and accounting information into an investment analysis framework, and 3) to evaluate the merits of unstructured data for the prediction of firms' future earnings.
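
    One way to make objective 1) concrete is a simple lexicon-based measure; the sketch below scores a disclosure against a hand-built list of intangible-asset terms. The lexicon and normalization are illustrative assumptions, not the thesis's method.

```python
# Score a firm's text against a lexicon of intangible-asset signals.
import re

CULTURE_LEXICON = {"integrity", "diversity", "innovation", "teamwork", "reputation"}

def intangible_score(document: str) -> float:
    tokens = re.findall(r"[a-z]+", document.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in CULTURE_LEXICON)
    return hits / len(tokens)  # lexicon hits per token

print(intangible_score("Our culture of innovation and integrity drives our reputation."))
```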

    Neural information extraction from natural language text

    Natural language processing (NLP) deals with building computational techniques that allow computers to automatically analyze and meaningfully represent human language. With an exponential growth of data in this digital era, the advent of NLP-based systems has enabled us to easily access relevant information via a wide range of applications, such as web search engines, voice assistants, etc. To achieve this, research spanning decades has focused on techniques at the intersection of NLP and machine learning. In recent years, deep learning techniques have exploited the expressive power of Artificial Neural Networks (ANNs) and achieved state-of-the-art performance in a wide range of NLP tasks. One of their vital properties is that Deep Neural Networks (DNNs) can automatically extract complex features from the input data, thus providing an alternative to the manual process of handcrafted feature engineering. Besides ANNs, Probabilistic Graphical Models (PGMs), a coupling of graph theory and probabilistic methods, can describe the causal structure between random variables of a system and capture a principled notion of uncertainty. Given the characteristics of DNNs and PGMs, they are advantageously combined to build powerful neural models in order to understand the underlying complexity of data. Traditional machine learning based NLP systems employed shallow computational methods (e.g., SVM or logistic regression) and relied on handcrafted features, which is time-consuming, complex, and often incomplete. However, deep learning and neural network based methods have recently shown superior results on various NLP tasks, such as machine translation, text classification, named-entity recognition, relation extraction, textual similarity, etc. These neural models can automatically extract an effective feature representation from training data. This dissertation focuses on two NLP tasks: relation extraction and topic modeling. The former aims at identifying semantic relationships between entities or nominals within a sentence or document. Successfully extracting semantic relationships greatly contributes to building structured knowledge bases, useful in downstream NLP application areas such as web search, question answering, recommendation engines, etc. On the other hand, the task of topic modeling aims at understanding the thematic structures underlying a collection of documents. Topic modeling is a popular text-mining tool for automatically analyzing a large collection of documents and understanding topical semantics without actually reading them. In doing so, it generates word clusters (i.e., topics) and document representations useful in document understanding and information retrieval, respectively. Essentially, the tasks of relation extraction and topic modeling are built upon the quality of representations learned from text. In this dissertation, we have developed task-specific neural models for learning representations, coupled with relation extraction and topic modeling tasks in the realms of supervised and unsupervised machine learning paradigms, respectively. More specifically, we make the following contributions in developing neural models for NLP tasks: 1. Neural Relation Extraction: Firstly, we have proposed a novel recurrent neural network based architecture for table-filling in order to jointly perform entity and relation extraction within sentences.
Then, we have further extended our scope to extracting relationships between entities across sentence boundaries, and presented a novel dependency-based neural network architecture. These two contributions lie in the supervised paradigm of machine learning. Moreover, we have contributed to building a robust relation extractor constrained by the lack of labeled data, where we have proposed a novel weakly-supervised bootstrapping technique. Building on these contributions, we have further explored the interpretability of recurrent neural networks to explain their predictions for the relation extraction task. 2. Neural Topic Modeling: Besides the supervised neural architectures, we have also developed unsupervised neural models to learn meaningful document representations within topic modeling frameworks. Firstly, we have proposed a novel dynamic topic model that captures topics over time. Next, we have contributed to building static topic models without considering temporal dependencies, where we have presented neural topic modeling architectures that also exploit external knowledge, i.e., word embeddings, to address data sparsity. Moreover, we have developed neural topic models that incorporate knowledge transfer using both word embeddings and latent topics from many sources. Finally, we have shown how to improve neural topic modeling by introducing language structures (e.g., word ordering, local syntactic and semantic information) that address the bag-of-words issues of traditional topic models. The class of neural NLP models proposed in this section is based on techniques at the intersection of PGMs, deep learning, and ANNs. Here, the task of neural relation extraction employs neural networks to learn representations typically at the sentence level, without access to the broader document context, whereas topic models have access to statistical information across documents. Therefore, we advantageously combine the two complementary learning paradigms in a neural composite model, consisting of a neural topic model and a neural language model, that enables us to jointly learn thematic structures in a document collection via the topic model, and word relations within a sentence via the language model. Overall, our research contributions in this dissertation extend NLP-based systems for relation extraction and topic modeling tasks with state-of-the-art performance.
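
    As a rough illustration of the unsupervised side, below is a minimal neural topic model sketch in PyTorch: an encoder maps a bag-of-words vector to a topic distribution, and a decoder (a topic-word matrix) reconstructs the document. This simplified autoencoder stands in for, and does not reproduce, the dissertation's actual architectures.

```python
# A toy neural topic model trained to reconstruct bag-of-words documents.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTopicModel(nn.Module):
    def __init__(self, vocab_size: int, num_topics: int):
        super().__init__()
        self.encoder = nn.Linear(vocab_size, num_topics)
        self.topic_word = nn.Linear(num_topics, vocab_size, bias=False)

    def forward(self, bow):
        theta = F.softmax(self.encoder(bow), dim=-1)       # document-topic mixture
        log_probs = F.log_softmax(self.topic_word(theta), dim=-1)
        return -(bow * log_probs).sum(-1).mean()           # reconstruction loss

vocab_size, num_topics = 2000, 20
model = NeuralTopicModel(vocab_size, num_topics)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

bow_batch = torch.randint(0, 3, (8, vocab_size)).float()  # toy word counts
for _ in range(5):
    optimizer.zero_grad()
    loss = model(bow_batch)
    loss.backward()
    optimizer.step()
print(float(loss))
```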