Exploring Student Check-In Behavior for Improved Point-of-Interest Prediction
With the availability of vast amounts of user visitation history on
location-based social networks (LBSN), the problem of Point-of-Interest (POI)
prediction has been extensively studied. However, much of the research has been
conducted solely on voluntary check-in datasets collected from social apps such
as Foursquare or Yelp. While these data contain rich information about
recreational activities (e.g., restaurants, nightlife, and entertainment),
information about more prosaic aspects of people's lives is sparse. This not
only limits our understanding of users' daily routines, but more importantly
the modeling assumptions developed based on characteristics of recreation-based
data may not be suitable for richer check-in data. In this work, we present an
analysis of education "check-in" data using WiFi access logs collected at
Purdue University. We propose a heterogeneous graph-based method to encode the
correlations between users, POIs, and activities, and then jointly learn
embeddings for the vertices. We compare our method against previous
state-of-the-art POI prediction methods and show that the assumptions made by
previous methods significantly degrade performance on our data with dense(r)
activity signals. We also show how our learned embeddings could be used to
identify similar students (e.g., for friend suggestions).
Comment: published in KDD'1
A Bayesian Tensor Factorization Model via Variational Inference for Link Prediction
Probabilistic approaches for tensor factorization aim to extract meaningful
structure from incomplete data by postulating low rank constraints. Recently,
variational Bayesian (VB) inference techniques have successfully been applied
to large scale models. This paper presents full Bayesian inference via VB on
both single and coupled tensor factorization models. Our method can be run even
for very large models and is easily implemented. It exhibits better prediction
performance than existing approaches based on maximum likelihood on several
real-world datasets for the missing link prediction problem.
Comment: arXiv admin note: substantial text overlap with arXiv:1409.808
ReviewQA: a relational aspect-based opinion reading dataset
Deep reading models for question-answering have demonstrated promising
performance over the last couple of years. However, current systems tend to
learn how to cleverly extract a span of the source document, based on its
similarity with the question, instead of seeking the appropriate answer.
Indeed, a reading machine should be able to detect relevant passages in a
document regarding a question, but more importantly, it should be able to
reason over the important pieces of the document in order to produce an answer
when it is required. To motivate this purpose, we present ReviewQA, a
question-answering dataset based on hotel reviews. The questions of this
dataset are linked to a set of relational understanding competencies that we
expect a model to master. Indeed, each question comes with an associated type
that characterizes the required competency. With this framework, it is possible
to benchmark the main families of models and to get an overview of the
strengths and weaknesses of a given model on the set of tasks evaluated in
this dataset. Our corpus contains more than 500,000 questions in natural
language over 100,000 hotel reviews. Our setup is projective: the answer to a
question does not need to be extracted from a document, as in most
recent datasets, but is selected from a set of candidates that contains all the
possible answers to the questions of the dataset. Finally, we present several
baselines over this dataset.
Comment: Accepted at Conférence sur l'apprentissage automatique (CAp 2018
Source Code Properties of Defective Infrastructure as Code Scripts
Context: In continuous deployment, software and services are rapidly deployed
to end-users using an automated deployment pipeline. Defects in infrastructure
as code (IaC) scripts can hinder the reliability of the automated deployment
pipeline. We hypothesize that certain properties of IaC source code, such as
lines of code and hard-coded strings used as configuration values, correlate
with defective IaC scripts. Objective: The objective of this paper
is to help practitioners increase the quality of infrastructure as code
(IaC) scripts through an empirical study that identifies source code properties
of defective IaC scripts. Methodology: We apply qualitative analysis to
defect-related commits mined from open source software repositories to identify
source code properties that correlate with defective IaC scripts. Next, we
survey practitioners to assess their level of agreement with the
identified properties. We also construct defect prediction models using the
identified properties for 2,439 scripts collected from four datasets. Results:
We identify 10 source code properties that correlate with defective IaC
scripts. Of the 10 identified properties, lines of code and
hard-coded string show the strongest correlation with defective IaC scripts.
Hard-coded string is the property of specifying a configuration value as
a hard-coded string. According to our survey analysis, the majority of
practitioners agree with two properties: include, the property of
executing external modules or scripts, and hard-coded string. Using the
identified properties, our constructed defect prediction models show a
precision of 0.70 to 0.78 and a recall of 0.54 to 0.67.
Comment: arXiv admin note: text overlap with arXiv:1809.0793
How did the discussion go: Discourse act classification in social media conversations
We propose a novel attention-based hierarchical LSTM model to classify
discourse act sequences in social media conversations, aimed at mining data
from online discussions using textual meaning beyond the sentence level. What
makes the task unique is the complete categorization of possible pragmatic
roles in informal textual discussions, in contrast to the extraction of
question-answer pairs, stance detection, or sarcasm identification, which are
very role-specific tasks. An earlier attempt was made on a Reddit discussion
dataset. We train our model on the same data, and present test results on two
different datasets, one from Reddit and one from Facebook. Our proposed model
outperformed the previous one in terms of domain independence; without using
platform-dependent structural features, our hierarchical LSTM with a word
relevance attention mechanism achieved F1-scores of 71% and 66%, respectively,
in predicting the discourse roles of comments in Reddit and Facebook discussions.
The efficiency of recurrent and convolutional architectures in learning
discursive representations on the same task is presented and analyzed,
with different word and comment embedding schemes. Our attention mechanism
enables us to inquire into the relevance ordering of text segments according to
their roles in discourse. We present a human annotator experiment to unveil
important observations about modeling and data annotation. Equipped with our
text-based discourse identification model, we inquire into how heterogeneous
non-textual features such as location, time, and leaning of information play
their roles in characterizing online discussions on Facebook.
Comment on Clark et al. (2019) "The Physical Nature of Neutral Hydrogen Intensity Structure"
A recent publication by Clark et al. (2019, CX19) uses both GALFA-HI
observational data and numerical simulations to address the nature of intensity
fluctuations in Position-Position-Velocity (PPV) space. The study questions the
validity and applicability of the statistical theory of PPV space fluctuations
formulated in Lazarian & Pogosyan (2000, LP00) to HI gas and concludes that
"a significant reassessment of many observational and theoretical
studies of turbulence in HI" is needed. This implies that dozens of papers that used
LP00 theory to explore interstellar turbulence as well as the ongoing research
based on LP00 theory are in error. This situation motivates the urgency of our
public response. In our Comment we explain why we believe the criticism in CX19
is based on an incorrect understanding of the LP00 theory. In particular, we
illustrate that the correlation between PPV slices and dust emissions in CX19
does not properly reveal the relative importance of velocity and density
fluctuations in velocity channel maps. While CX19 provides an explanation of
the change of the spectral index with respect to the thickness of the PPV slice
based on the two-phase nature of HI gas, we failed to see any observational
support for this idea. On the contrary, we show that the observations both in
two-phase HI and one-phase CO show similar results. Moreover, the observed
change is in good agreement with LP00 predictions and spectral indexes of
velocity and density spectra that are obtained following LP00 procedures are in
good agreement with the numerically confirmed expectations of compressible MHD
turbulence theory. In short, we could not find any justification of the
criticism of LP00 theory that is provided in CX19. On the contrary, our
analysis testifies that both available observational and numerical data agree
well with the predictions of LP00 theory.
Comment: 10 pages, 4 figure
Asterias: a parallelized web-based suite for the analysis of expression and aCGH data
Asterias (http://www.asterias.info) is an integrated collection of
freely-accessible web tools for the analysis of gene expression and aCGH data.
Most of the tools use parallel computing (via MPI). Most of our applications
allow the user to obtain additional information for user-selected genes by
using clickable links in tables and/or figures. Our tools include:
normalization of expression and aCGH data; converting between different types
of gene/clone and protein identifiers; filtering and imputation; finding
differentially expressed genes related to patient class and survival data;
searching for models of class prediction; using random forests to search for
minimal models for class prediction or for large subsets of genes with
predictive capacity; searching for molecular signatures and predictive genes
with survival data; detecting regions of genomic DNA gain or loss. The
capability to send results between different applications, access to additional
functional information, and parallelized computation make our suite unique and
exploit features only available to web-based applications.
Comment: web based application; 3 figure
Noise and vibration from building-mounted micro wind turbines Part 1: Review and proposed methodology
Description
To research the quantification of vibration from a micro turbine, and to develop a method of prediction of vibration and structure borne noise in a wide variety of installations in the UK.
Objective
The objectives of the study are as follows:
1) Develop a methodology to quantify the amount of source vibration from a building mounted micro wind turbine installation, and to predict the level of vibration and structure-borne noise impact within such buildings in the UK.
2) Test and validate the hypothesis on a statistically robust sample size.
3) Report the developed methodology in a form suitable for widespread adoption by industry and regulators, and report back on the suitability of the method as a basis for policy decisions on a future inclusion of building-mounted turbines in the GPDO.
Towards using social media to identify individuals at risk for preventable chronic illness
We describe a strategy for the acquisition of training data necessary to
build a social-media-driven early detection system for individuals at risk for
(preventable) type 2 diabetes mellitus (T2DM). The strategy uses a game-like
quiz with data and questions acquired semi-automatically from Twitter. The
questions are designed to inspire participant engagement and collect relevant
data to train a public-health model applied to individuals. Prior systems
designed to use social media such as Twitter to predict obesity (a risk factor
for T2DM) operate on entire communities such as states, counties, or cities,
based on statistics gathered by government agencies. Because there is
considerable variation among individuals within these groups, training data on
the individual level would be more effective, but this data is difficult to
acquire. The approach proposed here aims to address this issue. Our strategy
has two steps. First, we trained a random forest classifier on data gathered
from (public) Twitter statuses and state-level statistics with state-of-the-art
accuracy. We then converted this classifier into a 20-questions-style quiz and
made it available online. In doing so, we achieved high engagement with
individuals who took the quiz, while also building a training set of
voluntarily supplied individual-level data for future classification.
Comment: This paper will appear in LREC 201
What is the Nature of Chinese MicroBlogging: Unveiling the Unique Features of Tencent Weibo
China has the largest number of online users in the world, and about 20% of
internet users are from China. This is a huge, as well as a mysterious, market
for the IT industry for various reasons, such as cultural differences. Twitter is
the largest microblogging service in the world and Tencent Weibo is one of the
largest microblogging services in China. Using these two data sets as sources
in our study, we try to unveil the unique behaviors of Chinese users. We have
collected the entire Tencent Weibo from 10 Oct 2011 to 5 Jan 2012 and
obtained 320 million user profiles and 5.15 billion user actions. We study Tencent
Weibo at both the macro and micro levels. At the macro level, Tencent users are
more active in forwarding messages but have fewer reciprocal relationships than
Twitter users; their topic preferences differ markedly from those of Twitter
users in both content and time consumption; in addition, information diffuses more
efficiently in Tencent Weibo. At the micro level, we mainly evaluate users'
social influence using two indexes, "Forward" and "Follower". We study how
users' actions contribute to their social influence, and further identify
unique features of Tencent users. According to our studies, Tencent users'
actions are more personalized and diverse, and the influential users play a
more important part in the whole network. Based on the above analysis, we
design a graphical model for predicting users' forwarding behaviors. Our
experimental results on the large Tencent Weibo data validate the correctness
of the discoveries and the effectiveness of the proposed model. To the best of
our knowledge, this work is the first quantitative study on the entire
Tencentsphere and information diffusion on it.
Comment: WWW2013(submitted