
    A framework for automated rating of online reviews against the underlying topics

    Although most online review systems offer a star rating in addition to free-text reviews, the rating applies only to the review as a whole. Different users, however, may care about different aspects of a product or service and may struggle to extract the relevant information from the massive volume of consumer reviews available online. In this paper, we present a framework for extracting prevalent topics from online reviews and automatically rating them on a 5-star scale. It consists of five modules: linguistic pre-processing, topic modelling, text classification, sentiment analysis, and rating. Topic modelling is used to extract prevalent topics, against which individual sentences are then classified. A state-of-the-art word-embedding method is used to measure the sentiment of each sentence. The two types of information associated with each sentence, its topic and its sentiment, are combined to aggregate the sentiment associated with each topic, and the overall topic sentiment is then projected onto the 5-star rating scale. We use a dataset of Airbnb online reviews to demonstrate a proof of concept. The proposed framework is simple, fully unsupervised and domain independent, and is therefore applicable to other domains of products and services.
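    The pipeline above can be sketched end to end in a few lines. This is a minimal illustration, not the paper's implementation: the hand-written topic keywords stand in for the topic-model classifier, the tiny lexicon stands in for the word-embedding sentiment module, and only the final step, projecting mean topic sentiment in [-1, 1] onto the 1-5 star scale, follows the framework's description directly.

```python
from collections import defaultdict

# Hypothetical stand-ins for the framework's learned components: the
# topic keywords replace the topic-model classifier and the lexicon
# replaces the word-embedding sentiment module.
TOPIC_KEYWORDS = {
    "location": {"location", "neighbourhood", "walk"},
    "cleanliness": {"clean", "dirty", "spotless"},
}
LEXICON = {"great": 1.0, "spotless": 1.0, "dirty": -1.0, "noisy": -0.5}

def stars(sentiments):
    # clamp the mean sentiment to [-1, 1], then map linearly onto [1, 5]
    s = max(-1.0, min(1.0, sum(sentiments) / len(sentiments)))
    return round(1 + 2 * (s + 1), 1)

def rate_review(sentences):
    per_topic = defaultdict(list)
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        polarity = sum(LEXICON.get(t, 0.0) for t in tokens)
        for topic, keywords in TOPIC_KEYWORDS.items():
            if tokens & keywords:          # sentence mentions this topic
                per_topic[topic].append(polarity)
    return {topic: stars(vals) for topic, vals in per_topic.items()}

ratings = rate_review([
    "The location was great for a walk",
    "The bathroom was dirty and noisy",
])
# ratings -> {"location": 5.0, "cleanliness": 1.0}
```

    Each review thus yields one aggregated star rating per topic it mentions, rather than a single overall score.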

    The Syntactic Structure of News Photographs and Media Bias: A Visual Data Analysis of Conflict Coverage in European and Middle Eastern Media

    Master's thesis, Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies, August 2017. Advisor: Bongwon Suh. Machine visual perception has developed around the reading of patterns: current machine learning analyses large-scale data through features (feature engineering) and predicts outcomes from them. As Manovich observed of pattern reading, the analysis of large-scale visual data has now become feasible. This study investigated whether, through pattern reading, a machine can grasp the context of a photograph and identify the human judgement embedded in it; in other words, whether a machine can detect human bias or stereotypes in photographs that people have consciously or unconsciously selected. The data comprised news photographs covering terrorism in the Middle East and Europe, drawn from the Middle Eastern outlet Al Jazeera and the European outlet Reuters. These photographs were collected at scale, the two outlets' coverage of terrorism in their own and in the other cultural sphere was compared, and the presence of bias was assessed on that basis. To capture the context of a photograph, the analysis considered not only the conventional recognition of depicted elements but also photographic composition: borrowing from syntax in linguistics, it examined the function of elements according to their position within the frame. The analysis found that people appeared most frequently in the regions that most strongly attract the human gaze, namely the upper left, the upper centre and the very centre, and that the emotions of the people shown in these positions differed considerably between the two groups. The study thus demonstrates that qualitatively apprehended factors such as human bias are open to machine analysis. This opens the way for machines to take over work previously done by human coders in social-science research, allowing much larger volumes of data to be handled objectively, and it makes it possible to study, in a cross-cutting way, the previously unclear grounds on which humans perceive bias. Building on the positional analysis of photographic context presented here, applications such as data clustering and bias detection appear feasible; in particular, the approach could serve as a tool for flagging visually slanted media coverage. As Eco remarked in his critique of the media, this may be the only free choice available to us.
    Contents: Chapter 1 Introduction (background; research goals). Chapter 2 Related work (perspectives in journalism: local media, bias in photojournalism; image analysis: computer vision, affective computing, colour semantics; the structure of photographs: human visual perception, philosophical discussions of visual representation). Chapter 3 Data collection and machine analysis (data overview; bias in the data: object and scene analysis, emotion analysis; pilot test for reliability). Chapter 4 Position-based image analysis (prominence and position of people; emotion by position; conclusions on the syntactic structure of photographs). Chapter 5 Conclusion (summary; implications; limitations and suggestions). References. Appendix. Abstract.
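    The positional analysis described above can be sketched as a simple grid lookup. This is an illustrative assumption, not the thesis's exact scheme: a detected element's bounding-box centre is mapped to one of nine cells of a 3x3 grid, and the set of "high-salience" cells (upper left, upper centre, centre) mirrors the regions the findings highlight.

```python
# High-salience cells are an illustrative assumption based on the
# findings summarised above, not the thesis's exact definition.
HIGH_SALIENCE = {("top", "left"), ("top", "center"), ("middle", "center")}

def grid_cell(cx, cy, width, height):
    # map a point to one of nine cells of a 3x3 grid over the image
    col = ("left", "center", "right")[min(2, int(3 * cx / width))]
    row = ("top", "middle", "bottom")[min(2, int(3 * cy / height))]
    return row, col

def is_salient(box, width, height):
    # box is (x0, y0, x1, y1); use its centre point
    x0, y0, x1, y1 = box
    return grid_cell((x0 + x1) / 2, (y0 + y1) / 2, width, height) in HIGH_SALIENCE

# a face whose bounding-box centre lies in the middle of a 960x720 frame
central = is_salient((400, 250, 560, 410), 960, 720)
```

    Aggregating such cell assignments over many photographs, together with per-face emotion labels, would reproduce the kind of position-by-emotion comparison the thesis performs.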

    Diffusion of Falsehoods on Social Media

    Misinformation has captured the interest of academia in recent years, with several studies examining the topic broadly. However, these studies have mostly focused on rumors, which are social in nature and can be classified as either false or real. In this research, we attempt to bridge the gap in the literature by examining the impact of user characteristics and content features on the diffusion of (mis)information, using verified true and false information. We apply a topic allocation model, augmented by both supervised and unsupervised machine learning algorithms, to identify tweets on novel topics. We find that retweet counts are higher for fake news, novel tweets, and tweets with negative sentiment and simpler lexical structure. In addition, our results show that the impact of sentiment is opposite for fake news versus real news. We also find that tweets on the environment have a lower retweet count than the baseline of religious news, and that real social news tweets are shared more often than fake social news. Furthermore, our results highlight the counterintuitive nature of current correction efforts by FEMA and other fact-checking organizations in combating falsehoods: even though fake news triggers an increase in correction messages, those messages in turn influence the propagation of falsehoods. Finally, our empirical results reveal that correction messages, positive tweets and emotionally charged tweets morph faster over time; word count and past morphing history also positively affect morphing behavior.
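    One way to flag tweets on novel topics, as the topic-allocation step above does, is to compare each incoming tweet against centroids of previously seen topics and call it novel when no centroid is similar enough. This sketch uses plain term-count vectors and a hypothetical similarity threshold in place of the paper's actual model.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse term-count vectors
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def is_novel(tweet, topic_centroids, threshold=0.3):
    # novel = not close enough to any previously seen topic centroid;
    # the 0.3 threshold is an illustrative assumption
    vec = Counter(tweet.lower().split())
    return max((cosine(vec, c) for c in topic_centroids), default=0.0) < threshold

topics = [Counter("storm flood evacuation warning".split()),
          Counter("election vote ballot count".split())]
novel = is_novel("new vaccine trial results released", topics)
```

    In practice the centroids would come from the fitted topic model rather than raw token counts, but the thresholded-similarity decision has the same shape.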

    Approximate Bayesian Inference for Count Data Modeling

    Bayesian inference allows conclusions to be drawn from antecedents that depend on prior knowledge. It also allows uncertainty to be quantified, which is important in machine learning for making better predictions and for model interpretability. In real applications, however, we often deal with complicated models for which full Bayesian inference is infeasible. This thesis explores approximate Bayesian inference for count data modeling using Expectation Propagation and Stochastic Expectation Propagation. In Chapter 2, we develop an Expectation Propagation approach to learning a finite mixture model of EDCM distributions. The EDCM distribution is an exponential-family approximation to the widely used Dirichlet Compound Multinomial distribution and has been shown to offer excellent modeling capabilities for sparse count data. Chapter 3 develops an efficient generative mixture model of EMSD distributions, using Stochastic Expectation Propagation, which reduces memory consumption, an important property when performing inference on large datasets. Finally, Chapter 4 develops a probabilistic topic model based on the generalized Dirichlet distribution (LGDA) in order to capture topic correlation while maintaining conjugacy. We use Expectation Propagation to approximate the posterior, resulting in a model that achieves more accurate inference than variational inference. We show that latent topics can be used as a proxy for improving supervised tasks.
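    The Expectation Propagation loop used throughout the thesis can be illustrated on a toy model where moment matching happens to be exact: a Gaussian prior N(0, 1) over a mean and Gaussian likelihoods. This is not the EDCM/EMSD setting of the thesis, just the bare EP mechanics (cavity, tilted-moment match, site update) in natural parameters.

```python
# Each site is a Gaussian in natural parameters: precision r and
# precision-times-mean b. With Gaussian likelihoods the tilted moment
# match is a closed-form step, so EP recovers the exact posterior.
def ep_gaussian_mean(xs, sigma2=1.0, iterations=3):
    sites = [[0.0, 0.0] for _ in xs]          # per-site (r_i, b_i)
    r, b = 1.0, 0.0                           # global approx = prior N(0, 1)
    for _ in range(iterations):
        for i, x in enumerate(xs):
            r_cav, b_cav = r - sites[i][0], b - sites[i][1]   # cavity
            # tilted = cavity * N(x | theta, sigma2); match its moments
            r_new, b_new = r_cav + 1.0 / sigma2, b_cav + x / sigma2
            sites[i] = [r_new - r_cav, b_new - b_cav]         # new site
            r, b = r_new, b_new                               # new global
    return b / r, 1.0 / r                     # posterior mean, variance

mean, var = ep_gaussian_mean([1.0, 2.0, 3.0])
# exact posterior: precision 1 + 3 = 4, mean 6/4 = 1.5, variance 0.25
```

    Stochastic Expectation Propagation, used in Chapter 3, would replace the list of per-observation sites with a single averaged site factor, which is exactly the memory saving the abstract refers to.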

    Content Enrichment of Digital Libraries: Methods, Technologies and Implementations

    Parallel to the establishment of the concept of a "digital library", there have been rapid developments in the fields of semantic technologies, information retrieval and artificial intelligence. The idea is to make use of these three fields to crosslink bibliographic data, i.e., library content, and to enrich it "intelligently" with additional, especially non-library, information. By linking the contents of a library, it becomes possible to offer users access to semantically similar content held in different digital libraries. For instance, starting from a given publication, a list of semantically similar publications, possibly from completely different subject areas and different digital libraries, can be made accessible. In addition, users can view an enriched author profile with information such as biographical details, name alternatives, images, job titles, institute affiliations, etc. This information comes from a wide variety of sources, most of them non-library sources. To make such scenarios a reality, this dissertation follows two approaches. The first approach crosslinks digital library content in order to offer semantically similar publications based on additional information about a publication, using publication-related metadata as its basis. Aligned terms across linked open data repositories and thesauri serve as an important starting point, with narrower, broader and related concepts taken into account through semantic data models such as SKOS. Information retrieval methods are applied to identify publications with high semantic similarity. For this purpose, vector space models and word-embedding approaches are applied and analyzed comparatively. The analyses are performed on digital libraries with different thematic focuses (e.g., economics and agriculture). Using machine learning techniques, the metadata are enriched, e.g., with synonyms for content keywords, in order to further improve the similarity calculations.
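    The SKOS-style use of broader, narrower and related concepts described above can be sketched as keyword expansion before matching. The tiny in-memory thesaurus below is a hypothetical stand-in for a real linked-open-data vocabulary (AGROVOC, for instance, in the agricultural case), and the overlap-count ranking stands in for the dissertation's vector-space and word-embedding similarity measures.

```python
# Hypothetical SKOS-like thesaurus fragment; keys and relations mimic
# skos:broader / skos:narrower / skos:related.
THESAURUS = {
    "wheat": {"broader": ["cereals"], "narrower": ["durum wheat"],
              "related": ["flour"]},
    "cereals": {"broader": ["crops"], "narrower": ["wheat", "barley"],
                "related": []},
}

def expand(keywords):
    # add broader, narrower and related concepts to a keyword set
    expanded = set(keywords)
    for kw in keywords:
        for relation in ("broader", "narrower", "related"):
            expanded.update(THESAURUS.get(kw, {}).get(relation, []))
    return expanded

def related_publications(query_keywords, catalogue):
    # rank catalogue entries by overlap between expanded keyword sets
    q = expand(query_keywords)
    ranked = sorted(catalogue.items(),
                    key=lambda item: len(q & expand(item[1])), reverse=True)
    return [title for title, kws in ranked if q & expand(kws)]

hits = related_publications({"wheat"}, {
    "Milling quality of flour": {"flour"},
    "Deep learning for images": {"neural networks"},
})
```

    Because "flour" is skos:related to "wheat" in this fragment, the milling publication is found even though the literal keywords never overlap, which is the point of the term-alignment step.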
    To ensure quality, the proposed approaches are analyzed comparatively on different metadata sets and assessed by experts. Combining different information retrieval methods further improves the quality of the results, especially when user interaction offers possibilities for adjusting the search properties. In the second approach, author-related data are harvested in order to generate a comprehensive author profile for a digital library. For this purpose, both non-library sources, such as linked data repositories (e.g., WIKIDATA), and library sources, such as authority data, are used. When such heterogeneous sources are combined, author names must be disambiguated via already existing persistent identifiers. To this end, we offer an algorithmic approach to author disambiguation that makes use of authority data such as the Virtual International Authority File (VIAF). With respect to computer science, the methodological value of this dissertation lies in the combination of semantic technologies with methods of information retrieval and artificial intelligence to increase interoperability between digital libraries and between libraries and non-library sources. By positioning this dissertation as an application-oriented contribution to improving interoperability, two major contributions are made in the context of digital libraries: (1) information from different digital libraries can be retrieved via a single access point; and (2) existing information about authors is collected from different sources and aggregated into one author profile.
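    The identifier-based disambiguation step described above can be sketched as grouping records by a shared persistent identifier, with a name-string fallback for records that lack one. The record field names and the VIAF identifier are illustrative assumptions, not the dissertation's actual data model or algorithm.

```python
from collections import defaultdict

def merge_author_records(records):
    # group records by persistent identifier where available (VIAF here);
    # records without one fall back to the normalised name string
    profiles = defaultdict(list)
    for rec in records:
        if rec.get("viaf"):
            key = ("viaf", rec["viaf"])
        else:
            key = ("name", rec["name"].lower().strip())
        profiles[key].append(rec)
    return profiles

# hypothetical records; the VIAF number is made up for illustration
records = [
    {"name": "J. Smith", "viaf": "102333412", "source": "library"},
    {"name": "John Smith", "viaf": "102333412", "source": "WIKIDATA"},
    {"name": "J. Smith", "viaf": None, "source": "repository"},
]
profiles = merge_author_records(records)
```

    The two records sharing a VIAF identifier merge into one profile despite their different name forms, while the identifier-less record stays separate; a fuller approach would then try to attach such leftovers using further evidence.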