Internet data has emerged as a primary source for investigating different
aspects of human behavior. A crucial step in such studies is finding a suitable
cohort (i.e., a set of users) that shares a common trait of interest to
researchers. However, direct identification of users sharing this trait is
often impossible, as the data available to researchers is usually anonymized to
preserve user privacy. To facilitate research on specific topics of interest,
especially in medicine, we introduce an algorithm for identifying a trait of
interest in anonymous users. We illustrate how a small set of labeled examples,
together with statistical information about the entire population, can be
aggregated to obtain labels on unseen examples. We validate our approach using
labeled data from the political domain.
We provide two applications of the proposed algorithm to the medical domain.
In the first, we demonstrate how to identify users whose search patterns
indicate they might be suffering from certain types of cancer. In the second,
we detail an algorithm to predict the distribution of diseases given their
incidence in a subset of the population under study, making it possible to predict
disease spread from partial epidemiological data.