A Method for the Automated, Reliable Retrieval of Publication-Citation Records
BACKGROUND: Publication records and citation indices are often used to evaluate academic performance. For this reason, obtaining or computing them accurately is important. This can be difficult, largely due to a lack of complete knowledge of an individual's publication list and/or lack of time available to manually obtain or construct the publication-citation record. While online publication search engines have somewhat addressed these problems, using raw search results can yield inaccurate estimates of publication-citation records and citation indices.
METHODOLOGY: In this paper, we present a new, automated method that produces estimates of an individual's publication-citation record from an individual's name and a set of domain-specific vocabulary that may occur in the individual's publication titles. Because this vocabulary can be harvested directly from a research web page or online (partial) publication list, our method delivers an easy way to obtain estimates of a publication-citation record and the relevant citation indices. Our method works by applying a series of stringent name and content filters to the raw publication search results returned by an online publication search engine. In this paper, our method is run using Google Scholar, but the underlying filters can be easily applied to any existing publication search engine. When compared against a manually constructed data set of individuals and their publication-citation records, our method provides significant improvements over raw search results. The estimated publication-citation records returned by our method have an average sensitivity of 98% and specificity of 72% (in contrast to raw search result specificity of less than 10%). When citation indices are computed using these records, the estimated indices are within 10% of the true value, compared to raw search results, which overestimate by 75% on average.
CONCLUSIONS: These results confirm that our method provides significantly improved estimates over raw search results, and these can either be used directly for large-scale (departmental or university) analysis or further refined manually to quickly give accurate publication-citation records.
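The filtering idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Record` type, the surname-plus-first-initial name rule, the single-word vocabulary match, and the choice of the h-index as the citation index are all assumptions made for illustration, and no search engine is queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical container for one raw search result."""
    title: str
    authors: tuple
    citations: int


def name_matches(record, surname, initial):
    """Assumed name filter: some author has this surname and first initial."""
    for author in record.authors:
        parts = author.lower().split()
        if parts and parts[-1] == surname.lower() and parts[0][0] == initial.lower():
            return True
    return False


def content_matches(record, vocabulary):
    """Assumed content filter: the title shares at least one vocabulary word."""
    words = {w.strip(".,:;()").lower() for w in record.title.split()}
    return bool(words & {v.lower() for v in vocabulary})


def filter_records(records, surname, initial, vocabulary):
    """Keep only records that pass both the name and the content filter."""
    return [r for r in records
            if name_matches(r, surname, initial) and content_matches(r, vocabulary)]


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)


def sensitivity_specificity(kept_titles, raw_titles, true_titles):
    """Score the filtered set against a manually constructed true set."""
    kept, raw, truth = set(kept_titles), set(raw_titles), set(true_titles)
    negatives = raw - truth
    sensitivity = len(kept & truth) / len(truth) if truth else 1.0
    specificity = len(negatives - kept) / len(negatives) if negatives else 1.0
    return sensitivity, specificity


# Made-up raw results for a hypothetical author "J. Smith".
raw = [
    Record("Gap gene network inference", ("Jane Smith",), 10),
    Record("Gene regulation models", ("J Smith", "A Lee"), 3),
    Record("Medieval pottery survey", ("John Smith",), 50),
    Record("Gene expression atlas", ("Bob Smith",), 7),
]
kept = filter_records(raw, "smith", "j", {"gene", "regulation"})
raw_h = h_index([r.citations for r in raw])
kept_h = h_index([r.citations for r in kept])
```

In this toy data the unfiltered results inflate the index because a highly cited paper by a different "J. Smith" is counted, which is the kind of overestimate the abstract attributes to raw search results.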
Demographic Inference and Representative Population Estimates from Multilingual Social Media Data
Social media provide access to behavioural data at an unprecedented scale and
granularity. However, using these data to understand phenomena in a broader
population is difficult due to their non-representativeness and the bias of
statistical inference tools towards dominant languages and groups. While
demographic attribute inference could be used to mitigate such bias, current
techniques are almost entirely monolingual and fail to work in a global
environment. We address these challenges by combining multilingual demographic
inference with post-stratification to create a more representative population
sample. To learn demographic attributes, we create a new multimodal deep neural
architecture for joint classification of age, gender, and organization-status
of social media users that operates in 32 languages. This method substantially
outperforms current state of the art while also reducing algorithmic bias. To
correct for sampling biases, we propose fully interpretable multilevel
regression methods that estimate inclusion probabilities from inferred joint
population counts and ground-truth population counts. In a large experiment
over multilingual heterogeneous European regions, we show that our demographic
inference and bias correction together allow for more accurate estimates of
populations and make a significant step towards representative social sensing
in downstream applications with multilingual social media.
Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19)
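The bias-correction step can be illustrated with a simplified sketch. It uses plain cell-based post-stratification, which is a simpler technique than the multilevel regression the abstract proposes, and every stratum label, count, and function name below is an invented assumption.

```python
def inclusion_probabilities(sample_counts, population_counts):
    """Per-stratum inclusion probability: inferred sample count / true population count."""
    return {
        stratum: sample_counts[stratum] / population_counts[stratum]
        for stratum in sample_counts
        if population_counts.get(stratum, 0) > 0
    }


def sample_mean(stratum_means, sample_counts):
    """Uncorrected estimate: strata weighted by how often they appear in the sample."""
    total = sum(sample_counts[s] for s in stratum_means)
    return sum(stratum_means[s] * sample_counts[s] for s in stratum_means) / total


def poststratified_mean(stratum_means, population_counts):
    """Corrected estimate: strata weighted by their share of the true population."""
    total = sum(population_counts[s] for s in stratum_means)
    return sum(stratum_means[s] * population_counts[s] for s in stratum_means) / total


# Invented strata keyed by (gender, age band): inferred user counts vs. census counts.
sample_counts = {("f", "18-29"): 300, ("m", "40+"): 100}
population_counts = {("f", "18-29"): 1000, ("m", "40+"): 3000}
stratum_means = {("f", "18-29"): 0.8, ("m", "40+"): 0.4}

probs = inclusion_probabilities(sample_counts, population_counts)
naive = sample_mean(stratum_means, sample_counts)
corrected = poststratified_mean(stratum_means, population_counts)
```

Here the over-represented young stratum dominates the uncorrected mean, while reweighting by population share moves the estimate toward the under-sampled group.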
Relating the expression-based and sequence-based estimates of regulation in the gap gene system of Drosophila melanogaster
Quantitative analysis of Drosophila melanogaster gap gene expression data reveals valuable information about the nature and strengths of interactions in the gap gene network. We first explore different models for fitting the spatiotemporal gene expression data of the Drosophila gap gene system and validate our results by computational analysis and comparison with the existing literature. A fundamental problem in systems biology is to associate these results with the inherent cause of gene regulation, namely the binding of the transcription factors (TFs) to their respective binding sites. In order to relate these expression-based estimates of gap gene regulation with the sequence-based information of TF binding site composition, we also explore two related problems of (i) finding a set of regulatory weights that is proportional to the binding site occupancy matrix of the transcription factors in current literature and (ii) finding a set of position weight matrices of the TFs that produce a new binding site occupancy matrix showing a greater level of proportionality with our regulatory weights. Our solution to the first problem yielded a regulatory weight matrix incapable of explaining the true causes of the gene expression profile despite its relative numerical accuracy in predicting the gene expressions. On the other hand, the second optimization problem could be solved up to a reasonable level of accuracy, but further analysis of the result demonstrated that this optimization problem may be under-constrained. We devise a simple regularization strategy that helps us to reduce the under-constrained nature of the problem.
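The abstract does not say how proportionality between the regulatory weights and the binding site occupancy matrix is measured. One plausible way to quantify it, shown here purely as an assumption, is the least-squares scale factor together with the cosine similarity of the flattened matrices. The matrices are made-up numbers, not gap gene data.

```python
import math


def flatten(matrix):
    """Turn a list-of-rows matrix into a flat list of entries."""
    return [x for row in matrix for x in row]


def best_scale(weights, occupancy):
    """Scalar a minimizing the squared error between weights and a * occupancy."""
    w, b = flatten(weights), flatten(occupancy)
    return sum(x * y for x, y in zip(w, b)) / sum(y * y for y in b)


def proportionality(weights, occupancy):
    """Cosine similarity of the flattened matrices; magnitude 1 means exactly proportional."""
    w, b = flatten(weights), flatten(occupancy)
    dot = sum(x * y for x, y in zip(w, b))
    return dot / (math.sqrt(sum(x * x for x in w)) * math.sqrt(sum(y * y for y in b)))


# Made-up 2x2 example in which the weights are exactly twice the occupancy.
weights = [[2.0, -4.0], [6.0, 0.0]]
occupancy = [[1.0, -2.0], [3.0, 0.0]]

scale = best_scale(weights, occupancy)
score = proportionality(weights, occupancy)
```

A score well below 1 in magnitude would indicate that no single scale factor relates the two matrices, which is the situation problem (ii) tries to improve by adjusting the position weight matrices.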
Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors
In this paper, we extend existing work on latent attribute inference by leveraging the principle of homophily: we evaluate the inference accuracy gained by augmenting the user features with features derived from the Twitter profiles and postings of her friends. We consider three attributes which have varying degrees of assortativity: gender, age, and political affiliation. Our approach yields a significant and robust increase in accuracy for both age and political affiliation, indicating that our approach boosts performance for attributes with moderate to high assortativity. Furthermore, different neighborhood subsets yielded optimal performance for different attributes, suggesting that different subsamples of the user's neighborhood characterize different aspects of the user herself. Finally, inferences using only the features of a user's neighbors outperformed those based on the user's features alone. This suggests that the neighborhood context alone carries substantial information about the user.
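The augmentation step can be sketched as follows, assuming each user is already represented by a fixed-length numeric feature vector. Averaging the neighbors' vectors and appending the result is one simple choice for illustration. The abstract does not specify how neighbor features are aggregated, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def augment_with_neighbors(user_features, neighbor_features):
    """Append the averaged neighbor features to the user's own features.

    A user with no neighbors gets zeros in the neighbor half (an assumption).
    """
    if not neighbor_features:
        return list(user_features) + [0.0] * len(user_features)
    return list(user_features) + mean_vector(neighbor_features)


# Made-up two-dimensional features for one user and two of her friends.
user = [1.0, 0.0]
friends = [[0.0, 1.0], [1.0, 1.0]]

augmented = augment_with_neighbors(user, friends)
neighbors_only = mean_vector(friends)
```

The `neighbors_only` vector corresponds to the setting in the abstract where inference uses only the features of a user's neighbors, and passing a subset of `friends` would model the different neighborhood subsets it mentions.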