
    A framework for automated rating of online reviews against the underlying topics

    Although most online review systems offer a star rating in addition to free-text reviews, the rating applies only to the review as a whole. Different users, however, may care about different aspects of a product or service and may struggle to extract the relevant information from the massive volume of consumer reviews available online. In this paper, we present a framework for extracting prevalent topics from online reviews and automatically rating them on a 5-star scale. It consists of five modules: linguistic pre-processing, topic modelling, text classification, sentiment analysis, and rating. Topic modelling is used to extract prevalent topics, against which individual sentences are then classified. A state-of-the-art word-embedding method is used to measure the sentiment of each sentence. The two types of information associated with each sentence, its topic and its sentiment, are combined to aggregate the sentiment associated with each topic, and the overall topic sentiment is then projected onto the 5-star rating scale. We use a dataset of Airbnb online reviews to demonstrate a proof of concept. The proposed framework is simple, fully unsupervised and domain independent, and is therefore applicable to other domains of products and services.
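    The pipeline above can be sketched end to end in a few lines. This is a minimal illustration, not the paper's implementation: the hand-written topic keywords stand in for the topic-model classifier, the tiny lexicon stands in for the word-embedding sentiment module, and only the final step, projecting mean topic sentiment in [-1, 1] onto the 1-5 star scale, follows the framework's description directly.

```python
from collections import defaultdict

# Hypothetical stand-ins for the framework's learned components: the
# topic keywords replace the topic-model classifier and the lexicon
# replaces the word-embedding sentiment module.
TOPIC_KEYWORDS = {
    "location": {"location", "neighbourhood", "walk"},
    "cleanliness": {"clean", "dirty", "spotless"},
}
LEXICON = {"great": 1.0, "spotless": 1.0, "dirty": -1.0, "noisy": -0.5}

def stars(sentiments):
    # clamp the mean sentiment to [-1, 1], then map linearly onto [1, 5]
    s = max(-1.0, min(1.0, sum(sentiments) / len(sentiments)))
    return round(1 + 2 * (s + 1), 1)

def rate_review(sentences):
    per_topic = defaultdict(list)
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        polarity = sum(LEXICON.get(t, 0.0) for t in tokens)
        for topic, keywords in TOPIC_KEYWORDS.items():
            if tokens & keywords:          # sentence mentions this topic
                per_topic[topic].append(polarity)
    return {topic: stars(vals) for topic, vals in per_topic.items()}

ratings = rate_review([
    "The location was great for a walk",
    "The bathroom was dirty and noisy",
])
# ratings -> {"location": 5.0, "cleanliness": 1.0}
```

    Each review thus yields one aggregated star rating per topic it mentions, rather than a single overall score.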

    The Syntactic Structure of News Photographs and Media Bias: A Visual Data Analysis of Conflict Coverage in European and Middle Eastern Media

    Master's thesis, Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies, August 2017. Advisor: Bongwon Suh. Machine visual perception has developed around the reading of patterns: current machine learning analyses large-scale data through features (feature engineering) and predicts outcomes from them. As Manovich observed of pattern reading, the analysis of large-scale visual data has now become feasible. This study investigated whether, through pattern reading, a machine can grasp the context of a photograph and identify the human judgement embedded in it; in other words, whether a machine can detect human bias or stereotypes in photographs that people have consciously or unconsciously selected. The data comprised news photographs covering terrorism in the Middle East and Europe, drawn from the Middle Eastern outlet Al Jazeera and the European outlet Reuters. These photographs were collected at scale, the two outlets' coverage of terrorism in their own and in the other cultural sphere was compared, and the presence of bias was assessed on that basis. To capture the context of a photograph, the analysis considered not only the conventional recognition of depicted elements but also photographic composition: borrowing from syntax in linguistics, it examined the function of elements according to their position within the frame. The analysis found that people appeared most frequently in the regions that most strongly attract the human gaze, namely the upper left, the upper centre and the very centre, and that the emotions of the people shown in these positions differed considerably between the two groups. The study thus demonstrates that qualitatively apprehended factors such as human bias are open to machine analysis. This opens the way for machines to take over work previously done by human coders in social-science research, allowing much larger volumes of data to be handled objectively, and it makes it possible to study, in a cross-cutting way, the previously unclear grounds on which humans perceive bias. Building on the positional analysis of photographic context presented here, applications such as data clustering and bias detection appear feasible; in particular, the approach could serve as a tool for flagging visually slanted media coverage. As Eco remarked in his critique of the media, this may be the only free choice available to us.
    Contents: Chapter 1 Introduction (background; research goals). Chapter 2 Related work (perspectives in journalism: local media, bias in photojournalism; image analysis: computer vision, affective computing, colour semantics; the structure of photographs: human visual perception, philosophical discussions of visual representation). Chapter 3 Data collection and machine analysis (data overview; bias in the data: object and scene analysis, emotion analysis; pilot test for reliability). Chapter 4 Position-based image analysis (prominence and position of people; emotion by position; conclusions on the syntactic structure of photographs). Chapter 5 Conclusion (summary; implications; limitations and suggestions). References. Appendix. Abstract.
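    The positional analysis described above can be sketched as a simple grid lookup. This is an illustrative assumption, not the thesis's exact scheme: a detected element's bounding-box centre is mapped to one of nine cells of a 3x3 grid, and the set of "high-salience" cells (upper left, upper centre, centre) mirrors the regions the findings highlight.

```python
# High-salience cells are an illustrative assumption based on the
# findings summarised above, not the thesis's exact definition.
HIGH_SALIENCE = {("top", "left"), ("top", "center"), ("middle", "center")}

def grid_cell(cx, cy, width, height):
    # map a point to one of nine cells of a 3x3 grid over the image
    col = ("left", "center", "right")[min(2, int(3 * cx / width))]
    row = ("top", "middle", "bottom")[min(2, int(3 * cy / height))]
    return row, col

def is_salient(box, width, height):
    # box is (x0, y0, x1, y1); use its centre point
    x0, y0, x1, y1 = box
    return grid_cell((x0 + x1) / 2, (y0 + y1) / 2, width, height) in HIGH_SALIENCE

# a face whose bounding-box centre lies in the middle of a 960x720 frame
central = is_salient((400, 250, 560, 410), 960, 720)
```

    Aggregating such cell assignments over many photographs, together with per-face emotion labels, would reproduce the kind of position-by-emotion comparison the thesis performs.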

    Diffusion of Falsehoods on Social Media

    Misinformation has captured the interest of academia in recent years, with several studies examining the topic broadly. However, these studies have mostly focused on rumors, which are social in nature and can be classified as either false or real. In this research, we attempt to bridge the gap in the literature by examining the impact of user characteristics and content features on the diffusion of (mis)information, using verified true and false information. We apply a topic allocation model, augmented by both supervised and unsupervised machine learning algorithms, to identify tweets on novel topics. We find that retweet counts are higher for fake news, novel tweets, and tweets with negative sentiment and simpler lexical structure. In addition, our results show that the impact of sentiment is opposite for fake news versus real news. We also find that tweets on the environment have a lower retweet count than the baseline of religious news, and that real social news tweets are shared more often than fake social news. Furthermore, our results highlight the counterintuitive nature of current correction efforts by FEMA and other fact-checking organizations in combating falsehoods: even though fake news triggers an increase in correction messages, those messages in turn influence the propagation of falsehoods. Finally, our empirical results reveal that correction messages, positive tweets and emotionally charged tweets morph faster over time; word count and past morphing history also positively affect morphing behavior.
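    One way to flag tweets on novel topics, as the topic-allocation step above does, is to compare each incoming tweet against centroids of previously seen topics and call it novel when no centroid is similar enough. This sketch uses plain term-count vectors and a hypothetical similarity threshold in place of the paper's actual model.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse term-count vectors
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def is_novel(tweet, topic_centroids, threshold=0.3):
    # novel = not close enough to any previously seen topic centroid;
    # the 0.3 threshold is an illustrative assumption
    vec = Counter(tweet.lower().split())
    return max((cosine(vec, c) for c in topic_centroids), default=0.0) < threshold

topics = [Counter("storm flood evacuation warning".split()),
          Counter("election vote ballot count".split())]
novel = is_novel("new vaccine trial results released", topics)
```

    In practice the centroids would come from the fitted topic model rather than raw token counts, but the thresholded-similarity decision has the same shape.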

    Approximate Bayesian Inference for Count Data Modeling

    Bayesian inference allows conclusions to be drawn from antecedents that depend on prior knowledge. It also allows uncertainty to be quantified, which is important in machine learning for making better predictions and for model interpretability. In real applications, however, we often deal with complicated models for which full Bayesian inference is infeasible. This thesis explores approximate Bayesian inference for count data modeling using Expectation Propagation and Stochastic Expectation Propagation. In Chapter 2, we develop an Expectation Propagation approach to learning a finite mixture model of EDCM distributions. The EDCM distribution is an exponential-family approximation to the widely used Dirichlet Compound Multinomial distribution and has been shown to offer excellent modeling capabilities for sparse count data. Chapter 3 develops an efficient generative mixture model of EMSD distributions, using Stochastic Expectation Propagation, which reduces memory consumption, an important property when performing inference on large datasets. Finally, Chapter 4 develops a probabilistic topic model based on the generalized Dirichlet distribution (LGDA) in order to capture topic correlation while maintaining conjugacy. We use Expectation Propagation to approximate the posterior, resulting in a model that achieves more accurate inference than variational inference. We show that latent topics can be used as a proxy for improving supervised tasks.
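    The Expectation Propagation loop used throughout the thesis can be illustrated on a toy model where moment matching happens to be exact: a Gaussian prior N(0, 1) over a mean and Gaussian likelihoods. This is not the EDCM/EMSD setting of the thesis, just the bare EP mechanics (cavity, tilted-moment match, site update) in natural parameters.

```python
# Each site is a Gaussian in natural parameters: precision r and
# precision-times-mean b. With Gaussian likelihoods the tilted moment
# match is a closed-form step, so EP recovers the exact posterior.
def ep_gaussian_mean(xs, sigma2=1.0, iterations=3):
    sites = [[0.0, 0.0] for _ in xs]          # per-site (r_i, b_i)
    r, b = 1.0, 0.0                           # global approx = prior N(0, 1)
    for _ in range(iterations):
        for i, x in enumerate(xs):
            r_cav, b_cav = r - sites[i][0], b - sites[i][1]   # cavity
            # tilted = cavity * N(x | theta, sigma2); match its moments
            r_new, b_new = r_cav + 1.0 / sigma2, b_cav + x / sigma2
            sites[i] = [r_new - r_cav, b_new - b_cav]         # new site
            r, b = r_new, b_new                               # new global
    return b / r, 1.0 / r                     # posterior mean, variance

mean, var = ep_gaussian_mean([1.0, 2.0, 3.0])
# exact posterior: precision 1 + 3 = 4, mean 6/4 = 1.5, variance 0.25
```

    Stochastic Expectation Propagation, used in Chapter 3, would replace the list of per-observation sites with a single averaged site factor, which is exactly the memory saving the abstract refers to.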

    Content Enrichment of Digital Libraries: Methods, Technologies and Implementations

    Parallel to the establishment of the concept of a "digital library", there have been rapid developments in the fields of semantic technologies, information retrieval and artificial intelligence. The idea is to make use of these three fields to crosslink bibliographic data, i.e., library content, and to enrich it "intelligently" with additional, especially non-library, information. By linking the contents of a library, it becomes possible to offer users access to semantically similar content held in different digital libraries. For instance, starting from a given publication, a list of semantically similar publications, possibly from completely different subject areas and different digital libraries, can be made accessible. In addition, users can view an enriched author profile with information such as biographical details, name alternatives, images, job titles, institute affiliations, etc. This information comes from a wide variety of sources, most of them non-library sources. To make such scenarios a reality, this dissertation follows two approaches. The first approach crosslinks digital library content in order to offer semantically similar publications based on additional information about a publication, using publication-related metadata as its basis. Aligned terms across linked open data repositories and thesauri serve as an important starting point, with narrower, broader and related concepts taken into account through semantic data models such as SKOS. Information retrieval methods are applied to identify publications with high semantic similarity. For this purpose, vector space models and word-embedding approaches are applied and analyzed comparatively. The analyses are performed on digital libraries with different thematic focuses (e.g., economics and agriculture). Using machine learning techniques, the metadata are enriched, e.g., with synonyms for content keywords, in order to further improve the similarity calculations.
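    The SKOS-style use of broader, narrower and related concepts described above can be sketched as keyword expansion before matching. The tiny in-memory thesaurus below is a hypothetical stand-in for a real linked-open-data vocabulary (AGROVOC, for instance, in the agricultural case), and the overlap-count ranking stands in for the dissertation's vector-space and word-embedding similarity measures.

```python
# Hypothetical SKOS-like thesaurus fragment; keys and relations mimic
# skos:broader / skos:narrower / skos:related.
THESAURUS = {
    "wheat": {"broader": ["cereals"], "narrower": ["durum wheat"],
              "related": ["flour"]},
    "cereals": {"broader": ["crops"], "narrower": ["wheat", "barley"],
                "related": []},
}

def expand(keywords):
    # add broader, narrower and related concepts to a keyword set
    expanded = set(keywords)
    for kw in keywords:
        for relation in ("broader", "narrower", "related"):
            expanded.update(THESAURUS.get(kw, {}).get(relation, []))
    return expanded

def related_publications(query_keywords, catalogue):
    # rank catalogue entries by overlap between expanded keyword sets
    q = expand(query_keywords)
    ranked = sorted(catalogue.items(),
                    key=lambda item: len(q & expand(item[1])), reverse=True)
    return [title for title, kws in ranked if q & expand(kws)]

hits = related_publications({"wheat"}, {
    "Milling quality of flour": {"flour"},
    "Deep learning for images": {"neural networks"},
})
```

    Because "flour" is skos:related to "wheat" in this fragment, the milling publication is found even though the literal keywords never overlap, which is the point of the term-alignment step.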
    To ensure quality, the proposed approaches are analyzed comparatively on different metadata sets and assessed by experts. Combining different information retrieval methods further improves the quality of the results, especially when user interaction offers possibilities for adjusting the search properties. In the second approach, author-related data are harvested in order to generate a comprehensive author profile for a digital library. For this purpose, both non-library sources, such as linked data repositories (e.g., WIKIDATA), and library sources, such as authority data, are used. When such heterogeneous sources are combined, author names must be disambiguated via already existing persistent identifiers. To this end, we offer an algorithmic approach to author disambiguation that makes use of authority data such as the Virtual International Authority File (VIAF). With respect to computer science, the methodological value of this dissertation lies in the combination of semantic technologies with methods of information retrieval and artificial intelligence to increase interoperability between digital libraries and between libraries and non-library sources. By positioning this dissertation as an application-oriented contribution to improving interoperability, two major contributions are made in the context of digital libraries: (1) information from different digital libraries can be retrieved via a single access point; and (2) existing information about authors is collected from different sources and aggregated into one author profile.
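    The identifier-based disambiguation step described above can be sketched as grouping records by a shared persistent identifier, with a name-string fallback for records that lack one. The record field names and the VIAF identifier are illustrative assumptions, not the dissertation's actual data model or algorithm.

```python
from collections import defaultdict

def merge_author_records(records):
    # group records by persistent identifier where available (VIAF here);
    # records without one fall back to the normalised name string
    profiles = defaultdict(list)
    for rec in records:
        if rec.get("viaf"):
            key = ("viaf", rec["viaf"])
        else:
            key = ("name", rec["name"].lower().strip())
        profiles[key].append(rec)
    return profiles

# hypothetical records; the VIAF number is made up for illustration
records = [
    {"name": "J. Smith", "viaf": "102333412", "source": "library"},
    {"name": "John Smith", "viaf": "102333412", "source": "WIKIDATA"},
    {"name": "J. Smith", "viaf": None, "source": "repository"},
]
profiles = merge_author_records(records)
```

    The two records sharing a VIAF identifier merge into one profile despite their different name forms, while the identifier-less record stays separate; a fuller approach would then try to attach such leftovers using further evidence.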