On the discovery of social roles in large scale social systems
The social role of a participant in a social system is a label
conceptualizing the circumstances under which she interacts within it. Roles may
be used as a theoretical tool that explains why and how users participate in an
online social system. Social role analysis also serves practical purposes, such
as reducing the structure of complex systems to relationships among roles
rather than alters, and enabling a comparison of social systems that emerge in
similar contexts. This article presents a data-driven approach for the
discovery of social roles in large scale social systems. Motivated by an
analysis of the present art, the method discovers roles by the conditional
triad censuses of user ego-networks, which are a promising tool because they
capture the degree to which basic social forces push upon a user to interact
with others. Clusters of censuses, inferred from samples of large scale networks
carefully chosen to preserve local structural properties, define the social
roles. The promise of the method is demonstrated by discussing and discovering
the roles that emerge in both Facebook and Wikipedia. The article concludes
with a discussion of the challenges and future opportunities in the discovery
of social roles in large social systems.
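As a rough illustration of the idea of a triad census over an ego-network, the sketch below counts triads by their number of edges in an undirected graph. This is a simplified, unconditional census and not the conditional triad census the abstract describes; the helper names `ego_network` and `triad_census` and the toy graph are assumptions made for illustration only.

```python
from itertools import combinations


def ego_network(adj, ego):
    """Return the subgraph induced by `ego` and its direct neighbours."""
    nodes = {ego} | adj[ego]
    return {u: adj[u] & nodes for u in nodes}


def triad_census(adj):
    """Count node triples by how many of their 3 possible edges exist.

    Returns a list `counts` where counts[k] is the number of triads
    with exactly k edges (k = 0..3). Undirected simplification.
    """
    counts = [0, 0, 0, 0]
    for a, b, c in combinations(sorted(adj), 3):
        edges = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        counts[edges] += 1
    return counts


# Toy undirected graph (hypothetical data), stored as adjacency sets.
graph = {
    "u": {"a", "b", "c"},
    "a": {"u", "b"},
    "b": {"u", "a"},
    "c": {"u", "d"},
    "d": {"c"},
}

census = triad_census(ego_network(graph, "u"))
print(census)
```

Census vectors computed this way for many users could then be fed to any clustering routine; which census variant and clustering method the authors use is not stated in the abstract.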
Graph Summarization
The continuous and rapid growth of highly interconnected datasets, which are
both voluminous and complex, calls for the development of adequate processing
and analytical techniques. One method for condensing and simplifying such
datasets is graph summarization. It denotes a series of application-specific
algorithms designed to transform graphs into more compact representations while
preserving structural patterns, query answers, or specific property
distributions. As this problem is common to several areas studying graph
topologies, different approaches, such as clustering, compression, sampling, or
influence detection, have been proposed, primarily based on statistical and
optimization methods. Our chapter pinpoints the main graph summarization
methods, with particular attention to the most recent approaches and novel
research trends on this topic that previous surveys do not yet cover.
Comment: To appear in the Encyclopedia of Big Data Technologie
Handling oversampling in dynamic networks using link prediction
Oversampling is a common characteristic of data representing dynamic
networks. It introduces noise into representations of dynamic networks, but
there has been little work so far to compensate for it. Oversampling can affect
the quality of many important algorithmic problems on dynamic networks,
including link prediction. Link prediction seeks to predict edges that will be
added to the network given previous snapshots. We show that not only does
oversampling affect the quality of link prediction, but that we can use link
prediction to recover from the effects of oversampling. We also introduce a
novel generative model of noise in dynamic networks that represents
oversampling. We demonstrate the results of our approach on both synthetic and
real-world data.
Comment: ECML/PKDD 201
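To make the link prediction task concrete, here is a minimal sketch that scores unconnected node pairs in a single snapshot by their number of common neighbours. This is a standard baseline heuristic, not the method or the noise model of the paper; the function name `common_neighbor_scores` and the toy snapshot are assumptions.

```python
from itertools import combinations


def common_neighbor_scores(adj):
    """Score each non-adjacent pair by the number of shared neighbours."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    return scores


# One snapshot of a small undirected network (hypothetical data).
snapshot = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5},
    5: {4},
}

scores = common_neighbor_scores(snapshot)
# Rank candidate edges: highest score first, ties broken by node ids.
ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

The top-ranked pairs are the edges this heuristic would predict to appear in the next snapshot.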
apk2vec: Semi-supervised multi-view representation learning for profiling Android applications
Building behavior profiles of Android applications (apps) with holistic, rich
and multi-view information (e.g., incorporating several semantic views of an
app such as API sequences, system calls, etc.) would help serve downstream
analytics tasks such as app categorization, recommendation and malware analysis
significantly better. Towards this goal, we design a semi-supervised
Representation Learning (RL) framework named apk2vec to automatically generate
a compact representation (aka profile/embedding) for a given app. More
specifically, apk2vec has the following three unique characteristics which make
it an excellent choice for large-scale app profiling: (1) it encompasses
information from multiple semantic views such as API sequences, permissions,
etc., (2) being a semi-supervised embedding technique, it can make use of
labels associated with apps (e.g., malware family or app category labels) to
build high quality app profiles, and (3) it combines RL and feature hashing
which allows it to efficiently build profiles of apps that stream over time
(i.e., online learning). The resulting semi-supervised multi-view hash
embeddings of apps could then be used for a wide variety of downstream tasks
such as the ones mentioned above. Our extensive evaluations with more than
42,000 apps demonstrate that apk2vec's app profiles could significantly
outperform state-of-the-art techniques in four app analytics tasks, namely
malware detection, familial clustering, app clone detection and app
recommendation.
Comment: International Conference on Data Mining, 201
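The feature hashing component mentioned in point (3) can be sketched in isolation as follows. This is plain signed feature hashing over a bag of tokens, not the apk2vec framework itself; the function name `hashed_profile`, the dimension, and the example tokens are assumptions chosen for illustration.

```python
import hashlib


def hashed_profile(tokens, dim=16):
    """Map a bag of string tokens to a fixed-length vector via hashing.

    Each token is hashed to a bucket index and a sign, so the vector size
    stays constant no matter how many distinct tokens appear over time.
    """
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


# Hypothetical tokens drawn from two "views" of an app.
api_tokens = [
    "api:getDeviceId",
    "api:sendTextMessage",
    "perm:android.permission.INTERNET",
    "perm:android.permission.SEND_SMS",
]

profile = hashed_profile(api_tokens)
print(profile)
```

Because the vector length is fixed up front, previously unseen tokens from newly arriving apps can be folded in without rebuilding a vocabulary, which is what makes hashing attractive for online settings.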
A Similarity Measure for Material Appearance
We present a model to measure the similarity in appearance between different
materials, which correlates with human similarity judgments. We first create a
database of 9,000 rendered images depicting objects with varying materials,
shape and illumination. We then gather data on perceived similarity from
crowdsourced experiments; our analysis of over 114,840 answers suggests that
indeed a shared perception of appearance similarity exists. We feed this data
to a deep learning architecture with a novel loss function, which learns a
feature space for materials that correlates with such perceived appearance
similarity. Our evaluation shows that our model outperforms existing metrics.
Last, we demonstrate several applications enabled by our metric, including
appearance-based search for material suggestions, database visualization,
clustering and summarization, and gamut mapping.
Comment: 12 pages, 17 figure
Statistical Traffic State Analysis in Large-scale Transportation Networks Using Locality-Preserving Non-negative Matrix Factorization
Statistical traffic data analysis is a hot topic in traffic management and
control. In this field, current research focuses on analyzing traffic
flows of individual links or local regions in a transportation network. Less
attention is paid to the global view of traffic states over the entire
network, which is important for modeling large-scale traffic scenes. Our aim is
precisely to propose a new methodology for extracting spatio-temporal traffic
patterns, ultimately for modeling large-scale traffic dynamics, and long-term
traffic forecasting. We address this issue by using Locality-Preserving
Non-negative Matrix Factorization (LPNMF) to derive a low-dimensional
representation of network-level traffic states. Clustering is performed on the
compact LPNMF projections to unveil typical spatial patterns and temporal
dynamics of network-level traffic states. We have tested the proposed method on
simulated traffic data generated for a large-scale road network, and the reported
experimental results validate the ability of our approach to extract
meaningful large-scale space-time traffic patterns. Furthermore, the derived
clustering results provide an intuitive understanding of spatial-temporal
characteristics of traffic flows in the large-scale network, and a basis for
potential long-term forecasting.
Comment: IET Intelligent Transport Systems (2013
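The factorization step can be illustrated with plain non-negative matrix factorization using multiplicative updates. Note that this sketch omits the locality-preserving term of LPNMF and is not the authors' implementation; the function names, the rank, and the toy matrix are assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Plain NMF (V ~ W H) with multiplicative updates; no locality term."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return w, h


# Rows = time slots, columns = road links (hypothetical flow values).
states = [
    [5.0, 5.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 6.0, 6.0],
]

w, h = nmf(states, rank=2)
# Crude grouping: assign each time slot to its dominant latent component.
labels = [row.index(max(row)) for row in w]
print(labels, round(frobenius_error(states, w, h), 4))
```

Each row of `w` is a low-dimensional representation of one network-level state; a dedicated clustering algorithm would normally replace the crude dominant-component grouping used here.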
Oceanographic drivers of deep-sea coral species distribution and community assembly on seamounts, islands, atolls, and reefs within the Phoenix Islands Protected Area
© The Author(s), 2020. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Auscavitch, S. R., Deere, M. C., Keller, A. G., Rotjan, R. D., Shank, T. M., & Cordes, E. E. Oceanographic drivers of deep-sea coral species distribution and community assembly on seamounts, islands, atolls, and reefs within the Phoenix Islands Protected Area. Frontiers in Marine Science, 7, (2020): 42, doi:10.3389/fmars.2020.00042.
The Phoenix Islands Protected Area, in the central Pacific waters of the Republic of Kiribati, is a model for large marine protected area (MPA) development and maintenance, but baseline records of the protected biodiversity in its largest environment, the deep sea (>200 m), have not yet been determined. In general, the equatorial central Pacific lacks biogeographic perspective on deep-sea benthic communities compared to more well-studied regions of the North and South Pacific Ocean. In 2017, explorations by the NOAA ship Okeanos Explorer and R/V Falkor were among the first to document the diversity and distribution of deep-water benthic megafauna on numerous seamounts, islands, shallow coral reef banks, and atolls in the region. Here, we present baseline deep-sea coral species distribution and community assembly patterns within the Scleractinia, Octocorallia, Antipatharia, and Zoantharia with respect to different seafloor features and abiotic environmental variables across bathyal depths (200–2500 m). Remotely operated vehicle (ROV) transects were performed on 17 features throughout the Phoenix Islands and Tokelau Ridge Seamounts, resulting in the observation of 12,828 deep-water corals and 167 identifiable morphospecies. Anthozoan assemblages were largely octocoral-dominated, with octocorals accounting for 78% of all observations, and seamounts had a greater number of observed morphospecies than other feature types. Overlying water masses were observed to have significant effects on community assembly across bathyal depths. Revised species inventories further suggest that the protected area is an area of biogeographic overlap for Pacific deep-water corals, containing species observed across bathyal provinces in the North Pacific, Southwest Pacific, and Western Pacific. These results underscore the significant geographic and environmental complexity associated with deep-sea coral communities that remain under-characterized in the equatorial central Pacific, but also highlight the additional efforts that need to be brought forth to effectively establish baseline ecological metrics in data-deficient bathyal provinces.
Funding for this work was provided by NOAA Office of Ocean Exploration and Research (Grant No. NA17OAR0110083) to RR, EC, TS, and David Gruber
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details about how this can be achieved. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.