Digital Stylometry: Linking Profiles Across Social Networks
There is an ever-growing number of users with accounts on multiple social
media and networking sites. Consequently, there is increasing interest in
matching user accounts and profiles across different social networks in order
to create aggregate profiles of users. In this paper, we present models for
Digital Stylometry, which is a method for matching users through
stylometry-inspired techniques. We experimented with linguistic, temporal, and combined
temporal-linguistic models for matching user accounts, using standard and novel
techniques. Using publicly available data, our best model, a combined
temporal-linguistic one, was able to correctly match the accounts of 31% of
5,612 distinct users across Twitter and Facebook.
Comment: SocInfo'15, Beijing, China. In proceedings of the 7th International
Conference on Social Informatics (SocInfo 2015). Beijing, Chin
MODELLING PRICE DYNAMICS IN THE HONG KONG PROPERTY MARKET
The property market in Hong Kong plays an important role in the political, social and economic life of this vibrant city. Understanding the dynamics of the market is essential to guide government policy making and investment decisions. Using data collected between 1993 and 2006, this study investigates the monthly returns, volatilities, and time-varying correlations in the residential, office, and retail property markets in Hong Kong. A vector autoregressive (VAR) model is used to examine the conditional mean, and a multivariate generalized autoregressive conditional heteroscedasticity (MGARCH) model is adopted to analyze the conditional variance. The dynamic conditional correlation (DCC) approach is utilized to specify the MGARCH model. All of the property types show strong auto- and cross-correlations, which indicates that the sectors relate to each other closely. All three sectors have higher volatilities when major political and economic events occur. The findings reveal the possibility of balancing investment portfolios between the three sectors in the Hong Kong property market. However, exposure to the residential sector may reduce the chance of investment diversification because of the higher correlation of this sector with the other property sectors.
Keywords: return, volatility, dynamic conditional correlation.
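For orientation, the DCC construction named in this abstract is usually written as below. This is the textbook form due to Engle (2002), given as a sketch only; the abstract does not state the exact specification used in the study, and the symbols (a, b, \bar{Q}, z_t, k = 3 sectors) are assumptions introduced here for illustration.

```latex
% r_t: vector of monthly returns for the k = 3 property sectors
% \mu_t: conditional mean (here supplied by the VAR model)
\begin{aligned}
r_t &= \mu_t + \varepsilon_t, \qquad \varepsilon_t \mid \mathcal{F}_{t-1} \sim (0, H_t), \\
H_t &= D_t R_t D_t, \qquad D_t = \operatorname{diag}\!\left(\sqrt{h_{11,t}}, \dots, \sqrt{h_{kk,t}}\right), \\
z_t &= D_t^{-1} \varepsilon_t, \\
Q_t &= (1 - a - b)\,\bar{Q} + a\, z_{t-1} z_{t-1}^{\top} + b\, Q_{t-1}, \\
R_t &= \operatorname{diag}(Q_t)^{-1/2}\, Q_t \,\operatorname{diag}(Q_t)^{-1/2}.
\end{aligned}
```

Each h_{ii,t} follows a univariate GARCH process, \bar{Q} is the unconditional covariance of the standardized residuals z_t, and a, b >= 0 with a + b < 1. The off-diagonal entries of R_t are the time-varying correlations between sectors that the abstract refers to.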
Robustly detecting differential expression in RNA sequencing data using observation weights
A popular approach for comparing gene expression levels between (replicated) conditions of RNA sequencing data relies on counting reads that map to features of interest. Within such count-based methods, many flexible and advanced statistical approaches now exist and offer the ability to adjust for covariates (e.g. batch effects). Often, these methods include some sort of ‘sharing of information’ across features to improve inferences in small samples. It is important to achieve an appropriate tradeoff between statistical power and protection against outliers. Here, we study the robustness of existing approaches for count-based differential expression analysis and propose a new strategy based on observation weights that can be used within existing frameworks. The results suggest that outliers can have a global effect on differential analyses. We demonstrate the effectiveness of our new approach with real data and simulated data that reflects properties of real datasets (e.g. dispersion-mean trend) and develop an extensible framework for comprehensive testing of current and future methods. In addition, we explore the origin of such outliers, in some cases highlighting additional biological or technical factors within the experiment. Further details can be downloaded from the project website: http://imlspenticton.uzh.ch/robinson_lab/edgeR_robus
Domain Adaptation under Missingness Shift
Rates of missing data often depend on record-keeping policies and thus may
change across times and locations, even when the underlying features are
comparatively stable. In this paper, we introduce the problem of Domain
Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and
(unlabeled) target data would be exchangeable but for different missing data
mechanisms. We show that when missing data indicators are available, DAMS can
reduce to covariate shift. Focusing on the setting where missing data
indicators are absent, we establish the following theoretical results for
underreporting completely at random: (i) covariate shift is violated
(adaptation is required); (ii) the optimal source predictor can perform worse
on the target domain than a constant one; (iii) the optimal target predictor
can be identified, even when the missingness rates themselves are not; and (iv)
for linear models, a simple analytic adjustment yields consistent estimates of
the optimal target parameters. In experiments on synthetic and semi-synthetic
data, we demonstrate the promise of our methods when assumptions hold. Finally,
we discuss a rich family of future extensions.
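Result (i) above can be illustrated with a toy single-feature linear model. The sketch below is not the paper's analytic adjustment: it assumes a missing value is recorded as 0 with no indicator, that missingness is independent of everything else with rate m, and that y = beta * x + noise. Under those assumptions the best linear slope on the observed feature is beta * sigma^2 / (sigma^2 + m * mu^2), so it changes whenever the missingness rate changes. All function names are hypothetical.

```python
import random


def expected_slope(beta, mu, sigma, m):
    """Population OLS slope of y on the zero-filled feature (toy model)."""
    return beta * sigma ** 2 / (sigma ** 2 + m * mu ** 2)


def ols_slope(xs, ys):
    """Ordinary least-squares slope for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_slope(beta, mu, sigma, m, n=100_000, seed=0):
    """Draw data, zero out x with probability m, and fit the slope."""
    rng = random.Random(seed)
    observed, labels = [], []
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        y = beta * x + rng.gauss(0.0, 1.0)
        # Assumption: an unreported value shows up as 0, with no indicator.
        observed.append(0.0 if rng.random() < m else x)
        labels.append(y)
    return ols_slope(observed, labels)


if __name__ == "__main__":
    for rate in (0.0, 0.2, 0.5):
        print(rate, expected_slope(2.0, 1.0, 1.0, rate),
              simulate_slope(2.0, 1.0, 1.0, rate))
```

A predictor fitted at a source missingness rate of 0.2 therefore uses a different slope from the one that is optimal at a target rate of 0.5, even though the underlying relationship between x and y is identical in both domains.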
Evaluating Model Performance in Medical Datasets Over Time
Machine learning (ML) models deployed in healthcare systems must face data
drawn from continually evolving environments. However, researchers proposing
such models typically evaluate them in a time-agnostic manner, splitting
datasets according to patients sampled randomly throughout the entire study
time period. This work proposes the Evaluation on Medical Datasets Over Time
(EMDOT) framework, which evaluates the performance of a model class across
time. Inspired by the concept of backtesting, EMDOT simulates possible training
procedures that practitioners might have been able to execute at each point in
time and evaluates the resulting models on all future time points. Evaluating
both linear and more complex models on six distinct medical data sources
(tabular and imaging), we show how depending on the dataset, using all
historical data may be ideal in many cases, whereas using a window of the most
recent data could be advantageous in others. In datasets where models suffer
from sudden degradations in performance, we investigate plausible explanations
for these shocks. We release the EMDOT package to help facilitate further works
in deployment-oriented evaluation over time.
Comment: To appear at Conference on Health, Inference, and Learning (CHIL)
2023. arXiv admin note: substantial text overlap with arXiv:2211.0716
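The backtesting idea described in this abstract can be sketched as a generator of train/test splits over ordered time periods. This is an illustrative sketch, not the EMDOT package's API: the function name `backtest_splits` and its `window` parameter are hypothetical, and the choice to test only on strictly later periods is an assumption based on the phrase "all future time points".

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def backtest_splits(
    periods: Sequence[int], window: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_periods, test_periods) for each simulated deployment date.

    window=None trains on all history up to the deployment date;
    window=k trains only on the k most recent periods before it.
    """
    ordered = sorted(periods)
    for i in range(1, len(ordered)):
        start = 0 if window is None else max(0, i - window)
        # Train on the past, evaluate on every strictly later period.
        yield ordered[start:i], ordered[i:]


if __name__ == "__main__":
    for train, test in backtest_splits([2010, 2011, 2012, 2013], window=2):
        print("train on", train, "-> evaluate on", test)
```

Running the same model class through both regimes (`window=None` and a fixed `window`) gives the all-historical versus recent-window comparison the abstract describes.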
Enhanced Twitter Sentiment Classification Using Contextual Information
The rise in popularity and ubiquity of Twitter has made sentiment analysis of
tweets an important and well-covered area of research. However, the 140-character limit imposed on tweets makes it hard to use standard linguistic methods for sentiment classification. On the other hand, what tweets lack in structure they make up for in sheer volume and rich metadata. This metadata includes geolocation, temporal and author information. We hypothesize that sentiment is dependent on all these contextual factors. Different locations, times and authors have different emotional valences. In this paper, we explored this hypothesis by utilizing distant supervision to collect millions of labelled tweets from different locations, times and authors. We used this data to analyse the variation of tweet sentiments across different authors, times and locations. Once we explored and understood the relationship between these variables and sentiment, we used a Bayesian approach to combine these variables with more standard linguistic features such as n-grams to create a Twitter sentiment classifier. This combined classifier outperforms the purely linguistic classifier, showing that integrating the rich contextual information available on Twitter into sentiment classification is a promising direction of research.
Keywords: Twitter (Firm
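One simple way to combine a linguistic classifier with contextual signals in a Bayesian manner is sketched below. The abstract does not give the authors' exact formulation, so this is an assumption: it treats the text and each contextual variable as conditionally independent given the sentiment, which yields P(s | text, context) proportional to P(s) times the product of P(s | source) / P(s) over all sources. The function name and the example numbers are hypothetical.

```python
from typing import Dict, Iterable


def combine_posteriors(
    prior: Dict[str, float], posteriors: Iterable[Dict[str, float]]
) -> Dict[str, float]:
    """Fuse per-source posteriors P(s | source) under conditional independence.

    prior maps each sentiment label to P(s); every entry of posteriors maps
    the same labels to P(s | one source), e.g. text, location, time, author.
    """
    scores = dict(prior)
    for posterior in posteriors:
        for label in scores:
            # Each source contributes its likelihood ratio P(s | source) / P(s).
            scores[label] *= posterior[label] / prior[label]
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


if __name__ == "__main__":
    prior = {"pos": 0.5, "neg": 0.5}
    text_model = {"pos": 0.6, "neg": 0.4}      # e.g. from an n-gram classifier
    location_model = {"pos": 0.7, "neg": 0.3}  # e.g. from per-location statistics
    print(combine_posteriors(prior, [text_model, location_model]))
```

With the made-up inputs shown, two mildly positive signals reinforce each other into a more confident positive estimate than either gives alone.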
The cold adapted and temperature sensitive influenza A/Ann Arbor/6/60 virus, the master donor virus for live attenuated influenza vaccines, has multiple defects in replication at the restrictive temperature
We have previously determined that the temperature sensitive (ts) and attenuated (att) phenotypes of the cold adapted influenza A/Ann Arbor/6/60 strain (MDV-A), the master donor virus for the live attenuated influenza A vaccines (FluMist®), are specified by the five amino acids in the PB1, PB2 and NP gene segments. To understand how these loci control the ts phenotype of MDV-A, replication of MDV-A at the non-permissive temperature (39 °C) was compared with recombinant wild-type A/Ann Arbor/6/60 (rWt). The mRNA and protein synthesis of MDV-A in the infected MDCK cells were not significantly reduced at 39 °C during a single-step replication; however, vRNA synthesis was reduced and the nuclear–cytoplasmic export of viral RNP (vRNP) was blocked. In addition, the virions released from MDV-A infected cells at 39 °C exhibited irregular morphology and had a greatly reduced amount of the M1 protein incorporated. The reduced M1 protein incorporation and vRNP export blockage correlated well with the virus ts phenotype because these defects could be partially alleviated by removing the three ts loci from the PB1 gene. The virions and vRNPs isolated from the MDV-A infected cells contained a higher level of heat shock protein 70 (Hsp70) than those of rWt; however, whether Hsp70 is involved in thermal inhibition of MDV-A replication remains to be determined. Our studies demonstrate that restrictive replication of MDV-A at the non-permissive temperature occurs in multiple steps of the virus replication cycle.