
    Examining Information on Social Media: Topic Modelling, Trend Prediction and Community Classification

    In the past decade, the use of social media networks (e.g. Twitter) has increased dramatically, and they have become the main channels through which the public expresses opinions, ideas and preferences, especially during an election or a referendum. Both researchers and the public are interested in understanding which topics are discussed during a real social event, what the trends of those topics are, and how they will evolve in the future. Indeed, modelling such topics and trends offers social scientists an opportunity to continue a long-standing line of research, namely examining the information exchanged between people in different communities. We argue that computing science approaches can adequately assist social scientists in extracting topics from social media data, predicting their topical trends, and classifying a social media user (e.g. a Twitter user) into a community. However, while topic modelling approaches and classification techniques have been widely used, challenges remain: 1) existing topic modelling approaches can generate topics that lack coherence for social media data; 2) it is not easy to evaluate the coherence of topics; and 3) it can be challenging to generate a large training dataset for developing a social media user classifier. Hence, we identify four tasks to address these problems and assist social scientists. First, we aim to propose topic coherence metrics that effectively evaluate the coherence of topics generated by topic modelling approaches. Such metrics must align with human judgements. Since topic modelling approaches cannot always generate useful topics, it is necessary to present users with the most coherent topics using the coherence metrics. Moreover, an effective coherence metric helps us evaluate the performance of our proposed topic modelling approaches. The second task is to propose a topic modelling approach that generates more coherent topics for social media data. We argue that using the time dimension of social media posts helps a topic modelling approach to distinguish differences in word usage over time, and thus allows it to generate topics with higher coherence, together with their trends. A more coherent topic with its trend allows social scientists to quickly identify the topic's subject and to focus on analysing the connections between the extracted topics and social events, e.g. an election. Third, we aim to model and predict the topical trend. Given the timestamps of the social media posts within a topic, its trend can be modelled as a continuous distribution over time. Therefore, we argue that the future trends of topics can be predicted by estimating the density function of their continuous time distribution. By examining future topical trends, social scientists can ensure the timeliness of the events they study, while politicians and policymakers can keep abreast of the topics that remain salient over time. Finally, we aim to offer a general method that can quickly obtain a large training dataset for constructing a social media user classifier. A social media post contains hashtags and entities, and these hashtags (e.g. "#YesScot" in the Scottish Independence Referendum) and entities (e.g. job titles or party names) can reflect the community affiliation of a social media user. We argue that a large and reliable training dataset can be obtained by distinguishing the usage of these hashtags and entities. Using the obtained training dataset, a social media user community classifier can be trained quickly and then used to assist in examining the topics discussed in different communities. In conclusion, we have identified four tasks to assist social scientists in better understanding the topics discussed on social media networks. We believe that the proposed tools and approaches can help examine the exchange of topics among communities on social media networks.
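
    As a concrete illustration of the third task, a topical trend can be sketched as a kernel density estimate over post timestamps. The snippet below is a minimal sketch under stated assumptions (Gaussian KDE over POSIX timestamps with the default bandwidth), not the thesis's exact estimator:

```python
import numpy as np
from scipy.stats import gaussian_kde

def topical_trend(timestamps, eval_times):
    """Model a topic's trend as a continuous density over time and
    evaluate it at eval_times (which may lie in the future)."""
    kde = gaussian_kde(np.asarray(timestamps, dtype=float))
    return kde(eval_times)  # estimated posting density at each time

# Toy example: a topic with two bursts of posting activity.
posts = [100.0, 105.0, 110.0, 400.0, 405.0, 410.0, 415.0]
grid = np.linspace(0, 600, 7)
print(topical_trend(posts, grid))
```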

    Stem-cell-based gene therapy for HIV infection.

    Despite the enormous success of combined anti-retroviral therapy, HIV infection is still a lifelong disease and continues to spread rapidly worldwide. There is a pressing need to develop a treatment that will cure HIV infection. Recent progress in stem cell manipulation and advances in humanized mouse models have enabled rapid development of gene therapy for HIV treatment. In this review, we discuss two aspects of HIV gene therapy using human hematopoietic stem cells: the first is to generate an immune system resistant to HIV infection, while the second strategy involves enhancing anti-HIV immunity to eliminate HIV-infected cells.

    Using Word Embedding to Evaluate the Coherence of Topics from Twitter Data

    Scholars often seek to understand topics discussed on Twitter using topic modelling approaches. Several coherence metrics have been proposed for evaluating the coherence of the topics generated by these approaches, including metrics based on the pre-calculated Pointwise Mutual Information (PMI) of word pairs and on Latent Semantic Analysis (LSA) word representation vectors. As Twitter data contains abbreviations and a number of peculiarities (e.g. hashtags), it can be challenging to obtain reliable PMI statistics or effective LSA word representations. Recently, Word Embedding (WE) has emerged as a particularly effective approach for capturing the similarity among words. Hence, in this paper, we propose new Word Embedding-based topic coherence metrics. To determine the usefulness of these new metrics, we compare them with the previous PMI/LSA-based metrics. We also conduct a large-scale crowdsourced user study to determine whether the new Word Embedding-based metrics better align with human preferences. Using two Twitter datasets, our results show that the WE-based metrics can capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.
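
    A minimal sketch of a WE-based coherence score, assuming a topic's top words are scored by the mean pairwise cosine similarity of their embedding vectors; the toy 2-dimensional embeddings and this exact formulation are illustrative assumptions rather than the paper's precise metrics:

```python
from itertools import combinations
import numpy as np

def we_coherence(topic_words, embeddings):
    """Score a topic as the mean pairwise cosine similarity of the
    embedding vectors of its top words (a hypothetical variant of
    the WE-based metrics described above).

    embeddings: dict mapping word -> 1-D numpy vector.
    """
    sims = []
    for w1, w2 in combinations(topic_words, 2):
        v1, v2 = embeddings[w1], embeddings[w2]
        sims.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return float(np.mean(sims))

# Toy embeddings; in practice these would come from a model
# trained on tweets (e.g. word2vec or GloVe).
emb = {"vote": np.array([0.9, 0.1]),
       "election": np.array([0.8, 0.2]),
       "banana": np.array([0.1, 0.9])}
print(we_coherence(["vote", "election", "banana"], emb))
```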

    Real-Time Vehicle Emission Estimation Using Traffic Data

    The current state of climate change should be addressed by all sectors that contribute to it. One of the major contributors is the transportation sector, which generates a quarter of greenhouse gas emissions in North America. Most of these transportation-related emissions come from road vehicles; as a result, managing and controlling vehicular emissions has become a major concern for governments, the public and transportation authorities. A key requirement for emission management and control is the ability to quantify the emissions generated by traffic on an existing or future network under specific road plans, designs and traffic management schemes. Unfortunately, vehicular traffic emissions are difficult to quantify or predict, which has led to a significant number of efforts over the past decades to address this challenge. Three general methods have been proposed in the literature. The first determines the traffic emissions of an existing road network by measuring the tail-pipe emissions of individual vehicles directly. This approach, while the most accurate, is costly and difficult to scale, as it would require every vehicle to be equipped with tail-pipe emission sensors. The second approach applies ambient pollutant sensors to measure the emissions generated by the traffic near the sensors. This method is only approximate, as the vehicle-generated emissions can easily be confounded by other nearby emitters as well as weather and environmental conditions. Note that both of these methods are measurement-based and can only be used to evaluate existing conditions (e.g. after a traffic project is implemented), which means they cannot be used to evaluate alternative transportation projects at the planning stage. The last method is model-based: models are developed that link the amount of emissions generated by a group of vehicles to their operational details as well as other influencing factors such as weather, fuel and road geometry. This last method is the most scalable, both spatially and temporally, and also the most flexible, as it can meet the needs of both monitoring (using field data) and prediction. Typically, traffic emissions are modelled on a macroscopic scale based on the distance travelled by vehicles and their average speeds. However, for traffic management applications, a model of higher granularity would be preferred so that the impacts of different traffic control schemes can be captured. Furthermore, recent advances in vehicle detection technology have significantly increased the spatiotemporal resolution of traffic data. For example, video-based vehicle detection can provide more details about vehicle movements and vehicle types than previous methods such as inductive loop detection. Using such detection data, vehicle movements, referred to as trajectories, can be determined on a second-by-second basis, and these trajectories can then be used to estimate the emissions produced by the vehicles. In this research, we have proposed a new approach for estimating traffic-generated emissions in real time using high-resolution traffic data. The essential component of the proposed method is the process of reconstructing vehicle trajectories based on available data and some assumptions about expected vehicle motions, including cruising, acceleration and deceleration, and car-following. The reconstructed trajectories, containing instantaneous speed and acceleration data, are then used to estimate emissions with the MOVES emission simulator. Furthermore, a simplified rate-based module was developed to replace the MOVES software for direct emission calculation, leading to a significant improvement in the computational efficiency of the proposed method. The proposed method was tested in a simulated environment using the well-known traffic simulator Vissim. In the Vissim model, traffic activities, signal timing and vehicle detection were simulated, and both the original vehicle trajectories and the detection data were recorded. To evaluate the proposed method, two sets of emission estimates were compared: a "ground truth" set computed from the originally simulated vehicle trajectories, and a set computed from trajectories reconstructed using the detection data. Results show that the performance of the proposed method depends on many factors, such as traffic volumes, the placement of detectors, and which greenhouse gas is being estimated. Sensitivity analyses were performed to determine whether the proposed method is sufficiently sensitive to the impacts of traffic control schemes. The results indicate that the proposed method can capture the impacts of signal timing changes and signal coordination but is insufficiently sensitive to speed limit changes. Further research is recommended to validate the proposed method using field studies. Another recommendation, which falls outside this area of research, would be to investigate the feasibility of equipping vehicles with devices that can record their instantaneous fuel consumption and location data. With this information, traffic controllers would be better informed for emission estimation than they would be with detection data alone.
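
    A minimal sketch of a simplified rate-based emission calculation, assuming each second of a reconstructed trajectory is mapped to a coarse operating mode with a hypothetical grams-per-second rate; MOVES itself uses far finer operating-mode bins, so treat this as an illustration of the idea rather than the thesis's module:

```python
def estimate_emissions(speeds, rate_table):
    """Simplified rate-based emission estimate for one vehicle.

    speeds:     reconstructed second-by-second speeds (m/s).
    rate_table: hypothetical grams-per-second rates keyed by a
                coarse operating mode, standing in for a MOVES-style
                rate lookup.
    """
    total = 0.0
    for t in range(1, len(speeds)):
        accel = speeds[t] - speeds[t - 1]   # m/s^2 over a 1 s step
        if speeds[t] < 0.5:
            mode = "idle"
        elif accel > 0.5:
            mode = "accelerating"
        elif accel < -0.5:
            mode = "decelerating"
        else:
            mode = "cruising"
        total += rate_table[mode]           # grams emitted this second
    return total

# Hypothetical CO2 rates in g/s for each operating mode.
rates = {"idle": 1.0, "accelerating": 6.0,
         "decelerating": 0.8, "cruising": 3.0}
trajectory = [0, 0, 2, 5, 9, 13, 13, 13, 10, 6, 2, 0]  # m/s
print(f"{estimate_emissions(trajectory, rates):.1f} g CO2")
```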

    The Relationship of Delivery Method, Birth Weight and Race on Infant Mortality

    Infant mortality is defined as the number of infant deaths per 1,000 live births. The U.S. infant mortality rate in 2014 was reported as 5.8 deaths per 1,000 births, which is very high compared to other countries such as Japan, where the rate was 2.1 deaths per 1,000 births. The leading causes of infant death are congenital malformations, SIDS, low birthweight, pre-term birth and maternal complications. For this project, I analyze birthweight in addition to other factors related to infant death. My research aims to determine how delivery method, birthweight and race influence infant mortality, how it can be reduced, and which groups are most vulnerable to high infant death rates. To evaluate this, I analyzed 2007-2016 U.S. infant mortality data from the CDC and created bar charts relating race, birthweight and delivery method to the death rate. I also ran ANOVAs to find significant differences between the variables. I found that the vaginal delivery method has a lower death rate than the C-section delivery method, and the ANOVAs revealed a significant difference in death rate across races. American Indians born through C-sections have the highest death rate of all race and delivery-method groups. Small infants delivered by C-section are associated with lower death rates, as are large infants delivered vaginally. These results can serve as the beginning of a more comprehensive look into infant mortality.
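
    A minimal sketch of the ANOVA step using scipy, with illustrative death rates grouped by race; the numbers are invented for the example, not the CDC values analyzed in the project:

```python
from scipy.stats import f_oneway

# Hypothetical death rates (deaths per 1,000 births) per year, by race.
rates_by_race = {
    "American Indian": [9.1, 8.7, 9.4],
    "White":           [5.1, 4.9, 5.3],
    "Asian":           [4.0, 4.2, 3.9],
}

# One-way ANOVA: do the group means differ significantly?
f_stat, p_value = f_oneway(*rates_by_race.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> means differ
```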

    On the Reproducibility and Generalisation of the Linear Transformation of Word Embeddings

    Linear transformation is a way to learn a linear relationship between two word embedding spaces, such that words in the two different spaces can be semantically related. In this paper, we examine the reproducibility and generalisation of the linear transformation of word embeddings. Linear transformation is particularly useful for translating between word embedding models in different languages, since it can capture the semantic relationships between the two models. We first reproduce two linear transformation approaches: a recent one using orthogonal transformation and the original one using a simple matrix transformation. Previous findings on a machine translation task are re-examined, validating that linear transformation is indeed an effective way to transform word embedding models across languages. In particular, we show that the orthogonal transformation can better relate the different embedding models. Following the verification of previous findings, we then study the generalisation of linear transformation in a multi-language Twitter election classification task. We observe that the orthogonal transformation outperforms the matrix transformation; in particular, it significantly outperforms the random classifier by at least 10% under the F1 metric across English and Spanish datasets. In addition, we provide best practices for using linear transformation in multi-language Twitter election classification.
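
    Both reproduced approaches admit short closed-form sketches. Assuming paired rows of source- and target-space embeddings for a bilingual dictionary, the orthogonal transformation is the classic SVD (Procrustes) solution and the matrix transformation is ordinary least squares; the dimensions and data below are toy values, not the paper's setup:

```python
import numpy as np

def orthogonal_map(X, Y):
    """Orthogonal W minimising ||XW - Y||_F, where rows of X and Y
    are embeddings of translation-pair words in the source and
    target spaces (SVD / Procrustes solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def linear_map(X, Y):
    """Unconstrained least-squares matrix transformation."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

# Toy bilingual dictionary: 4 word pairs in 3-dimensional spaces,
# where the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))               # source-language vectors
W_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = X @ W_true                            # target-language vectors

W = orthogonal_map(X, Y)
print(np.allclose(W @ W.T, np.eye(3)))    # True: W is orthogonal
print(np.allclose(X @ W, Y))              # True: mapping recovered
print(np.allclose(linear_map(X, Y), W))   # True: same solution here
```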

    On Refining Twitter Lists as Ground Truth Data for Multi-Community User Classification

    To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities a given user belongs to, e.g. business or politics. Obtaining high-quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground-truth data is to extract users from existing public Twitter lists, where those lists represent different communities, e.g. a list of journalists. However, ground-truth datasets obtained from such lists can be noisy, since not all users that belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground-truth dataset generated using Twitter lists. We discuss how some categories of users collected from these public Twitter lists can negatively affect classification performance and therefore should not be used for training. Through experiments with 3 classifiers and 5 communities, we show that removing ambiguous users based on their tweets and profiles can indeed result in a 10% increase in F1 performance.
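
    A minimal sketch of the refinement idea, assuming a user is flagged as ambiguous when their profile matches the keyword sets of two or more communities; the keyword sets and the matching rule are illustrative assumptions, not the paper's actual filtering criteria:

```python
# Hypothetical keyword sets characterising two communities.
COMMUNITY_KEYWORDS = {
    "journalists": {"reporter", "journalist", "news"},
    "politicians": {"mp", "senator", "candidate"},
}

def is_ambiguous(profile_text, keywords=COMMUNITY_KEYWORDS):
    """Return True if the profile matches two or more communities."""
    tokens = set(profile_text.lower().split())
    hits = sum(1 for kws in keywords.values() if tokens & kws)
    return hits >= 2

# Refine a list-derived ground truth by dropping ambiguous users.
users = {"alice": "reporter covering senator races",
         "bob": "mp for the 5th district"}
clean = {u: p for u, p in users.items() if not is_ambiguous(p)}
print(clean)  # bob kept; alice matches both communities
```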