1,145 research outputs found
Automated image tagging through tag propagation
Work presented as part of the Master's programme in Informatics Engineering, as a partial requirement for obtaining the degree of Master in Informatics Engineering.
Today, more and more data is becoming available on the Web. In particular, we have recently witnessed an exponential increase of multimedia content within various content sharing websites. While this content is widely available, searching and browsing such a vast amount of content effectively has become a great challenge. A solution to this problem is to annotate information, a task that without computer aid requires a large-scale human effort. The goal of this thesis is to automate the task of annotating multimedia information with machine learning algorithms.
We propose the development of a machine learning framework capable of doing automated image annotation in large-scale consumer photos. To this end, a study of state-of-the-art algorithms was conducted, which concluded with a baseline implementation of a k-nearest neighbor algorithm. This baseline was used to implement a more advanced algorithm capable of annotating images in situations with limited training images and a large set of test images, that is, a semi-supervised approach.
Further studies were conducted on the feature spaces used to describe images towards a successful integration in the developed framework. We first analyzed the semantic gap between the visual feature spaces and concepts present in an image, and how to avoid or mitigate this gap. Moreover, we examined how users perceive images by performing a statistical analysis of the image tags inserted by users. A linguistic and statistical expansion of image tags was also implemented.
The developed framework withstands the uneven data distributions that occur in consumer datasets, and scales accordingly, requiring little previously annotated data. The principal mechanism that allows easier scaling is the propagation of information between the annotated data and the un-annotated data.
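As a rough illustration of the k-nearest neighbor baseline mentioned in the abstract above, the following minimal sketch lets each un-annotated image take the most frequent tags among its nearest annotated neighbors. It is not the thesis's implementation: the function name `knn_propagate_tags`, the two-dimensional toy features, the Euclidean distance, and the simple majority vote are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_propagate_tags(labeled, unlabeled, k=3, top_n=2):
    """Give each unlabeled vector the most common tags of its k nearest labeled neighbors.

    labeled:   list of (feature_vector, set_of_tags) pairs (hypothetical toy format)
    unlabeled: list of feature vectors
    """
    results = []
    for query in unlabeled:
        # Rank annotated items by Euclidean distance to the query and keep the k closest.
        neighbors = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
        # Every neighbor votes once for each of its tags.
        votes = Counter(tag for _, tags in neighbors for tag in tags)
        # Highest vote count first; ties broken alphabetically so the result is deterministic.
        ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([tag for tag, _ in ranked[:top_n]])
    return results


# Toy data: two visual clusters with overlapping tag sets (made up for illustration).
labeled = [
    ((0.0, 0.0), {"beach", "sea"}),
    ((0.1, 0.2), {"beach", "sand"}),
    ((5.0, 5.0), {"city", "night"}),
    ((5.2, 4.9), {"city", "street"}),
]
predicted = knn_propagate_tags(labeled, [(0.05, 0.1), (5.1, 5.0)], k=2, top_n=1)
print(predicted)
```

A semi-supervised variant, as the abstract describes, would additionally let newly tagged images influence other un-annotated images; this sketch covers only the supervised voting step.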
Large-scale structure of a nation-wide production network
Production in an economy is a set of firms' activities as suppliers and
customers; a firm buys goods from other firms, adds value, and sells
products to others in a giant network of production. Empirical study is lacking
despite the fact that the structure of the production network is important for
understanding and modeling many aspects of economic dynamics. We study a
nation-wide production network comprising a million firms and millions of
supplier-customer links by using recent statistical methods developed in
physics. Our empirical analysis shows a scale-free degree distribution,
disassortativity, a correlation between degree and firm size, and a community
structure with sectoral and regional modules. Since suppliers usually provide credit to
their customers, who supply it to theirs in turn, each link is actually a
creditor-debtor relationship. We also study chains of failures or bankruptcies
that take place along those links in the network, and the corresponding
avalanche-size distribution.
Comment: 17 pages with 8 figures; revised section VI and references added
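One of the properties named in the abstract above, disassortativity, is commonly measured as the Pearson correlation between the degrees at the two ends of each link. The sketch below computes it for a small undirected edge list. Treating supplier-customer links as undirected, the function name `degree_assortativity`, and the toy star graph are assumptions for illustration; this is not the paper's procedure or data.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of each undirected edge."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Count every edge in both orientations so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # All nodes have the same degree, so the correlation is undefined.
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


# Toy star graph: one hub linked to four low-degree nodes (hypothetical example).
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
r = degree_assortativity(star)
print(r)
```

A negative value means that high-degree nodes tend to link to low-degree nodes, which is what disassortativity refers to; the star graph is the extreme case.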
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and points of interest, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on a daily basis. Due to the world-wide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we also briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.
Comment: Accepted to TKDE. 30 pages, 1 figure
Doctor of Philosophy
dissertation
Due to the popularity of Web 2.0 and social media in the last decade, the percolation of user-generated content (UGC) has rapidly increased. In the financial realm, this has resulted in the emergence of virtual investing communities (VIC) for the investing public. There is an ongoing debate among scholars and practitioners on whether such UGC contains valuable investing information or mainly noise. I conduct two major studies in my dissertation. First, I examine the relationship between peer influence and information quality in the context of individual characteristics in stock microblogging. Surprisingly, I discover that the set of individual characteristics that relate to peer influence is not synonymous with those that relate to high information quality. With regard to information quality, influentials who are frequently mentioned by peers due to their name value are likely to possess higher information quality, while those who are better at diffusing information via retweets are likely to be associated with lower information quality. Second, I propose a study to explore the predictive power of stock microblog dimensions and features over stock price directional movements using data mining classification techniques. I find that the author-ticker-day dimension produces the highest predictive accuracy, suggesting that this dimension is able to capture both relevant author and ticker information, as compared to author-day and ticker-day. In addition to these two studies, I also explore two topics: the network structure of co-tweeted tickers and sentiment annotation via crowdsourcing. I do this in order to understand and uncover new features as well as new outcome indicators, with the objective of improving the predictive accuracy of the classification or the saliency of the explanatory models. My dissertation work extends the frontier in understanding the relationship between financial UGC, specifically stock microblogging, and relevant phenomena as well as predictive outcomes.
Collective dynamics of social annotation
The enormous increase of popularity and use of the WWW has led in recent
years to important changes in the ways people communicate. An interesting
example of this fact is provided by the now very popular social annotation
systems, through which users annotate resources (such as web pages or digital
photographs) with text keywords dubbed tags. Understanding the rich emerging
structures resulting from the uncoordinated actions of users calls for an
interdisciplinary effort. In particular concepts borrowed from statistical
physics, such as random walks, and the complex networks framework, can
effectively contribute to the mathematical modeling of social annotation
systems. Here we show that the process of social annotation can be seen as a
collective but uncoordinated exploration of an underlying semantic space,
pictured as a graph, through a series of random walks. This modeling framework
reproduces several aspects, so far unexplained, of social annotation, among
which the peculiar growth of the size of the vocabulary used by the community
and its complex network structure that represents an externalization of
semantic structures grounded in cognition and typically hard to access.
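The modeling idea in the abstract above, annotation as a series of random walks over a semantic graph, can be sketched as follows. Each simulated post is a short random walk, the visited nodes stand for the tags used, and the running count of distinct visited nodes stands for vocabulary size. The ring-shaped graph, the uniformly random starting node, the fixed walk length, and the name `simulate_vocabulary_growth` are assumptions for illustration; the abstract does not specify these details.

```python
import random


def simulate_vocabulary_growth(graph, n_posts, walk_length, seed=0):
    """Return the number of distinct nodes (tags) seen after each simulated post.

    graph: dict mapping each node to a non-empty list of neighbor nodes
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    nodes = sorted(graph)
    seen = set()
    growth = []
    for _ in range(n_posts):
        # Assumption: every post starts from a uniformly random node.
        current = rng.choice(nodes)
        seen.add(current)
        for _ in range(walk_length):
            # One random-walk step: move to a uniformly chosen neighbor.
            current = rng.choice(graph[current])
            seen.add(current)
        growth.append(len(seen))
    return growth


# Toy semantic graph: a ring of 50 concepts (hypothetical, for illustration only).
n = 50
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
growth = simulate_vocabulary_growth(ring, n_posts=20, walk_length=3)
print(growth)
```

The resulting curve never decreases and is capped by the number of nodes in the graph; how quickly it rises depends on the graph structure and the walk parameters chosen.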
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, and a host of
more specialized professional network communities has intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics.
Comment: 96 pages, 14 figures, 333 references
Effects of Network Connectivity and Diversity Distribution on Human Collective Ideation
Human collectives, e.g., teams and organizations, increasingly require
participation of members with diverse backgrounds working in networked social
environments. However, little is known about how network structure and the
diversity of member backgrounds would affect collective processes. Here we
conducted three sets of human-subject experiments which involved 617
participants who collaborated anonymously in a collective ideation task on a
custom-made online social network platform. We found that spatially clustered
collectives with clustered background distribution tended to explore more
diverse ideas than in other conditions, whereas collectives with random
background distribution consistently generated ideas with the highest utility.
We also found that higher network connectivity may improve individuals' overall
experience but may not improve the collective performance regarding idea
generation, idea diversity, and final idea quality.
Comment: 43 pages, 19 figures, 4 tables