A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and points-of-interest, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on a daily basis. Due to the worldwide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts have been devoted to
the new challenges and opportunities posed by the noisy, short, and
context-rich nature of tweets. In this survey, we aim to offer an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
the Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions. Comment: Accepted to TKDE. 30 pages, 1 figure.
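The evaluation metrics such surveys review commonly include median error distance and accuracy within 161 km (Acc@161, roughly 100 miles, a convention from prior Twitter-geolocation work rather than something stated in this abstract). A minimal sketch of both, using the haversine great-circle distance:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def location_metrics(pred, true):
    # pred, true: equal-length lists of (lat, lon) pairs, one per user or tweet.
    errs = sorted(haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(pred, true))
    n = len(errs)
    median = errs[n // 2] if n % 2 else 0.5 * (errs[n // 2 - 1] + errs[n // 2])
    acc161 = sum(e <= 161.0 for e in errs) / n  # fraction within ~100 miles
    return median, acc161
```

The function names and the 161 km threshold are illustrative assumptions; individual papers vary in the exact thresholds they report.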
Confounds and Consequences in Geotagged Twitter Data
Twitter is often used in quantitative studies that identify
geographically-preferred topics, writing styles, and entities. These studies
rely on either GPS coordinates attached to individual messages, or on the
user-supplied location field in each profile. In this paper, we compare these
data acquisition techniques and quantify the biases that they introduce; we
also measure their effects on linguistic analysis and text-based geolocation.
GPS-tagging and self-reported locations yield measurably different corpora, and
these linguistic differences are partially attributable to differences in
dataset composition by age and gender. Using a latent variable model to induce
age and gender, we show how these demographic variables interact with geography
to affect language use. We also show that the accuracy of text-based
geolocation varies with population demographics, giving the best results for
men above the age of 40. Comment: final version for EMNLP 201
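The two acquisition channels the paper compares can be read directly off a tweet object. A minimal sketch, assuming Twitter API v1.1-style field names (`coordinates` as a GeoJSON Point, `user.location` as the free-text profile field); the helper name is hypothetical:

```python
def acquisition_channel(tweet):
    """Classify a tweet dict by how a location could be acquired from it:
    an attached GPS point, a self-reported profile string, or neither."""
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]  # GeoJSON order is (lon, lat)
        return ("gps", (lat, lon))
    profile = (tweet.get("user") or {}).get("location") or ""
    if profile.strip():
        return ("self_reported", profile.strip())
    return ("none", None)
```

Partitioning a corpus with a function like this is the precondition for the kind of bias comparison the paper performs, since the two channels select different user populations.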
Towards a People's Social Epidemiology: Envisioning a More Inclusive and Equitable Future for Social Epi Research and Practice in the 21st Century.
Social epidemiology has made critical contributions to understanding population health. However, translation of social epidemiology science into action remains a challenge, raising concerns about the impacts of the field beyond academia. With so much focus on issues related to social position, discrimination, racism, power, and privilege, there has been surprisingly little deliberation about the extent and value of social inclusion and equity within the field itself. Indeed, the challenge of translation/action might be more readily met through re-envisioning the role of the people within the research/practice enterprise, reimagining what "social" could, or even should, mean for the future of the field. A potential path forward rests at the nexus of social epidemiology, community-based participatory research (CBPR), and information and communication technology (ICT). Here, we draw from the social epidemiology, CBPR, and ICT literatures to introduce A People's Social Epi, a multi-tiered framework for guiding social epidemiology in becoming more inclusive, equitable, and actionable for 21st-century practice. In presenting this framework, we suggest the value of taking participatory, collaborative approaches anchored in CBPR and ICT principles and technological affordances, especially within the context of place-based and environmental research. We believe that such approaches present opportunities to create a social epidemiology that is of, with, and by the people, not simply about them. In this spirit, we suggest 10 ICT tools to "socialize" social epidemiology and outline 10 ways to move towards A People's Social Epi in practice.
Earth observations from DSCOVR EPIC instrument
The National Oceanic and Atmospheric Administration (NOAA) Deep Space Climate Observatory (DSCOVR) spacecraft was launched on 11 February 2015 and in June 2015 achieved its orbit at the first Lagrange point (L1), 1.5 million km from Earth toward the sun. There are two National Aeronautics and Space Administration (NASA) Earth-observing instruments on board: the Earth Polychromatic Imaging Camera (EPIC) and the National Institute of Standards and Technology Advanced Radiometer (NISTAR). The purpose of this paper is to describe various capabilities of the DSCOVR EPIC instrument. EPIC views the entire sunlit Earth from sunrise to sunset at the backscattering direction (scattering angles between 168.5° and 175.5°) with 10 narrowband filters: 317, 325, 340, 388, 443, 552, 680, 688, 764, and 779 nm. We discuss a number of preprocessing steps necessary for EPIC calibration, including the geolocation algorithm and the radiometric calibration for each wavelength channel in terms of EPIC counts per second for conversion to reflectance units. The principal EPIC products are total ozone (O3) amount, scene reflectivity, erythemal irradiance, ultraviolet (UV) aerosol properties, sulfur dioxide (SO2) for volcanic eruptions, surface spectral reflectance, vegetation properties, and cloud products including cloud height. Finally, we describe the observation of horizontally oriented ice crystals in clouds and the unexpected use of the O2 B-band absorption for vegetation properties. The NASA GSFC DSCOVR project is funded by the NASA Earth Science Division. We gratefully acknowledge the work by S. Taylor and B. Fisher for help with the SO2 retrievals and Marshall Sutton, Carl Hostetter, and the EPIC NISTAR project for help with EPIC data. We would also like to thank the EPIC Cloud Algorithm team, especially Dr. Gala Wind, for their contribution to the EPIC cloud products. (NASA Earth Science Division) Accepted manuscript.
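The radiometric calibration step described above is, in its simplest form, a per-channel linear map from counts per second to reflectance units. A minimal sketch; the calibration factors below are invented placeholders, not the published EPIC coefficients:

```python
# Hypothetical per-channel calibration factors K (reflectance units per count/s).
# The real EPIC factors are determined per wavelength by the DSCOVR project;
# these values are illustrative only.
K = {317: 1.2e-4, 388: 8.5e-5, 680: 3.1e-5}

def counts_to_reflectance(channel_nm, counts_per_second):
    """Convert EPIC counts/s in one narrowband channel to reflectance units
    via a linear calibration factor (sketch of the step described above)."""
    return K[channel_nm] * counts_per_second
```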
The astrometric Gaia-FUN-SSO observation campaign of 99 942 Apophis
Astrometric observations performed by the Gaia Follow-Up Network for Solar
System Objects (Gaia-FUN-SSO) play a key role in ensuring that moving objects
first detected by ESA's Gaia mission remain recoverable after their discovery.
An observation campaign on the potentially hazardous asteroid (99 942) Apophis
was conducted during the asteroid's latest period of visibility, from
21 December 2012 to 2 May 2013, to test the coordination and evaluate the overall
performance of the Gaia-FUN-SSO. The 2732 high-quality astrometric
observations acquired during the Gaia-FUN-SSO campaign were reduced with the
Platform for Reduction of Astronomical Images Automatically (PRAIA), using the
USNO CCD Astrograph Catalogue 4 (UCAC4) as a reference. The astrometric
reduction process and the precision of the newly obtained measurements are
discussed. We compare the residuals of astrometric observations that we
obtained using this reduction process to data sets that were individually
reduced by observers and accepted by the Minor Planet Center. We obtained 2103
previously unpublished astrometric positions and provide these to the
scientific community. Using these data we show that our reduction of this
astrometric campaign with a reliable stellar catalog substantially improves the
quality of the astrometric results. We present evidence that the new data will
help to reduce the orbit uncertainty of Apophis during its close approach in
2029. We show that uncertainties due to geolocations of observing stations, as
well as the rounding of astrometric data, can introduce an unnecessary degradation
in the quality of the resulting astrometric positions. Finally, we discuss the
impact of our campaign reduction on the recovery process of newly discovered
asteroids. Comment: Accepted for publication in A&A.
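The degradation from rounding that the authors quantify admits a back-of-envelope sketch. Assuming positions are rounded to 0.01 s of time in right ascension and 0.1 arcsec in declination (the traditional reporting precision of the Minor Planet Center's 80-column format; an assumption here, not a claim about this campaign's data), the worst-case on-sky error is:

```python
import math

def rounding_error_arcsec(ra_step_s, dec_step_arcsec, dec_deg):
    """Worst-case positional error (arcsec, on the sky) introduced by rounding
    RA to ra_step_s seconds of time and Dec to dec_step_arcsec arcseconds."""
    # 1 second of time in RA = 15 arcsec, foreshortened by cos(dec) on the sky.
    ra_err = 0.5 * ra_step_s * 15.0 * math.cos(math.radians(dec_deg))
    dec_err = 0.5 * dec_step_arcsec
    return math.hypot(ra_err, dec_err)
```

At the celestial equator this gives roughly 0.09 arcsec, comparable to the precision of good ground-based CCD astrometry, which is why unrounded positions matter for orbit refinement.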
Passport: Enabling Accurate Country-Level Router Geolocation using Inaccurate Sources
When does Internet traffic cross international borders? This question has
major geopolitical, legal and social implications and is surprisingly difficult
to answer. A critical stumbling block is a dearth of tools that accurately map
routers traversed by Internet traffic to the countries in which they are
located. This paper presents Passport: a new approach for efficient, accurate
country-level router geolocation and a system that implements it. Passport
provides location predictions with limited active measurements, using machine
learning to combine information from IP geolocation databases, router
hostnames, whois records, and ping measurements. We show that Passport
substantially outperforms existing techniques, and identify cases where paths
traverse countries with implications for security, privacy, and performance.
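One of the inputs the abstract names, ping measurements, constrains location through a speed-of-light bound: a round-trip time limits how far the router can be from the vantage point. A minimal sketch of that single constraint (the constant and helper are illustrative, not Passport's implementation):

```python
# Light in optical fibre travels at roughly 2/3 c, i.e. about 200 km per ms
# one-way, so a round-trip time of rtt_ms bounds the distance at ~100*rtt_ms km.
KM_PER_RTT_MS = 100.0

def feasible_countries(rtt_ms, candidate_distances_km):
    """Keep only candidate countries reachable within the measured RTT.
    candidate_distances_km maps a country to the distance (km) from the
    vantage point to that country's nearest point."""
    max_km = rtt_ms * KM_PER_RTT_MS
    return {c for c, d in candidate_distances_km.items() if d <= max_km}
```

A system in Passport's style would intersect constraints like this across several vantage points, then let a learned model arbitrate among the remaining candidates using the database, hostname, and whois features.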
Investigating Full-Waveform Lidar Data for Detection and Recognition of Vertical Objects
A recent innovation in commercially-available topographic lidar systems is the ability to record return waveforms at high sampling frequencies. These “full-waveform” systems provide up to two orders of magnitude more data than “discrete-return” systems. However, due to the relatively limited capabilities of current processing and analysis software, more data does not always translate into more or better information for object extraction applications. In this paper, we describe a new approach for exploiting full waveform data to improve detection and recognition of vertical objects, such as trees, poles, buildings, towers, and antennas. Each waveform is first deconvolved using an expectation-maximization (EM) algorithm to obtain a train of spikes in time, where each spike corresponds to an individual laser reflection. The output is then georeferenced to create extremely dense, detailed X,Y,Z,I point clouds, where I denotes intensity. A tunable parameter is used to control the number of spikes in the deconvolved waveform, and, hence, the point density of the output point cloud. Preliminary results indicate that the average number of points on vertical objects using this method is several times higher than using discrete-return lidar data. The next steps in this ongoing research will involve voxelizing the lidar point cloud to obtain a high-resolution volume of intensity values and computing a 3D wavelet representation. The final step will entail performing vertical object detection/recognition in the wavelet domain using a multiresolution template matching approach
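The deconvolution step can be sketched with the Richardson-Lucy iteration, the classic EM algorithm for deconvolution under Poisson noise, which is one concrete instance of the EM approach the abstract describes (the authors' exact formulation may differ; the pulse and waveform below are synthetic, not lidar data):

```python
import numpy as np

def richardson_lucy(observed, psf, n_iter=200):
    """EM (Richardson-Lucy) deconvolution: recover a spike train whose
    convolution with the system pulse `psf` best explains `observed`
    under a Poisson noise model."""
    psf = psf / psf.sum()
    estimate = np.full_like(observed, observed.mean(), dtype=float)
    psf_flip = psf[::-1]
    for _ in range(n_iter):
        blurred = np.convolve(estimate, psf, mode="same")
        ratio = observed / np.maximum(blurred, 1e-12)
        estimate *= np.convolve(ratio, psf_flip, mode="same")
    return estimate

# Synthetic waveform: two reflections 6 samples apart, blurred by a Gaussian pulse.
t = np.arange(-5, 6)
pulse = np.exp(-0.5 * (t / 1.5) ** 2)
truth = np.zeros(40)
truth[14] = 5.0
truth[20] = 3.0
waveform = np.convolve(truth, pulse / pulse.sum(), mode="same")
spikes = richardson_lucy(waveform, pulse)
```

Each recovered spike corresponds to one laser reflection; capping `n_iter` (or thresholding the result) plays the role of the tunable parameter that controls how many spikes, and hence how many output points, survive.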
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Anonymized electronic medical records are an increasingly popular source of
research data. However, these datasets often lack race and ethnicity
information. This creates problems for researchers modeling human disease, as
race and ethnicity are powerful confounders for many health exposures and
treatment outcomes; race and ethnicity are closely linked to
population-specific genetic variation. We showed that deep neural networks
generate more accurate estimates for missing racial and ethnic information than
competing methods (e.g., logistic regression, random forest). RIDDLE yielded
significantly better classification performance across all metrics that were
considered: accuracy, cross-entropy loss (error), and area under the
receiver operating characteristic (ROC) curve. We made specific
efforts to interpret the trained neural network models to identify, quantify,
and visualize medical features which are predictive of race and ethnicity. We
used these characterizations of informative features to perform a systematic
comparison of differential disease patterns by race and ethnicity. The fact
that clinical histories are informative for imputing race and ethnicity could
reflect (1) a skewed distribution of blue- and white-collar professions across
racial and ethnic groups, (2) uneven accessibility and subjective importance of
prophylactic health, (3) possible variation in lifestyle, such as dietary
habits, and (4) differences in background genetic variation which predispose to
diseases.
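The three metrics the abstract names can each be computed in a few lines. A minimal sketch for the binary case (RIDDLE itself is multi-class and these helpers are not its implementation):

```python
import math

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred)) / len(y_true)

def cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy (the 'error' metric named above)."""
    eps = 1e-12  # guard against log(0)
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(y_true, p_pred)) / len(y_true)

def auc(y_true, p_pred):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the probability that a random positive outscores a random negative."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```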