    Improving microblog retrieval from exterior corpus by automatically constructing a microblogging corpus

    A large-scale training corpus consisting of microblogs belonging to a desired category is important for high-accuracy microblog retrieval. Obtaining such a large-scale microblogging corpus manually is very time- and labor-consuming. Therefore, some models for the automatic retrieval of microblogs from an exterior corpus have been proposed. However, these approaches may fail to consider microblog-specific features. To alleviate this issue, we propose a methodology that constructs a simulated microblogging corpus rather than directly building a model from the exterior corpus. The performance of our model is better because the retrieval model ultimately uses the microblog-specific knowledge of the microblogging corpus. Experimental results on real-world microblogs demonstrate the superiority of our technique compared to previous approaches.

    The role of geographic knowledge in sub-city level geolocation algorithms

    Geolocation of microblog messages has been widely investigated in the literature. Many solutions have been proposed that achieve good results at the city level. Existing approaches are mainly data-driven (i.e., they rely on a training phase). However, the development of algorithms for geolocation at sub-city level is still an open problem, partly due to the absence of good training datasets. In this thesis, we investigate the role that external geographic knowledge can play in geolocation approaches. We show how different geographical data sources can be combined with a semantic layer to achieve reasonably accurate sub-city level geolocation. Moreover, we propose a knowledge-based method, called Sherloc, to accurately geolocate messages at sub-city level by exploiting the presence in the message of toponyms possibly referring to specific places in the target geographical area. Sherloc exploits the semantics associated with toponyms contained in gazetteers and embeds them into a metric space that captures the semantic distance among them. This allows toponyms to be represented as points and indexed by a spatial access method, so that we can identify the terms semantically closest to a microblog message that also form a cluster with respect to their spatial locations. In contrast to state-of-the-art methods, Sherloc requires no prior training, is not limited to geolocating on a fixed spatial grid, and has been shown experimentally to infer locations at sub-city level with higher accuracy.

    Spatial and Temporal Sentiment Analysis of Twitter data

    The public uses Twitter worldwide to express opinions. This study focuses on the spatio-temporal variation of the sentiment polarity of georeferenced Tweets, with a view to understanding how opinions evolve on Twitter over space and time and across communities of users. More specifically, the question this study tests is whether sentiment polarity on Twitter exhibits specific time-location patterns. The aim of the study is to investigate the spatial and temporal distribution of georeferenced Twitter sentiment polarity within a 1 km buffer around the Curtin Bentley campus boundary in Perth, Western Australia. Tweets posted on campus were assigned to six spatial zones and four time zones. A sentiment analysis was then conducted for each zone using the sentiment analyser tool in the Starlight Visual Information System software. The Feature Manipulation Engine was employed to convert non-spatial files into spatial and temporal feature classes. The spatial and temporal distribution of Twitter sentiment polarity was mapped using Geographic Information Systems (GIS). Some interesting results were identified. For example, the highest percentage of positive Tweets occurred in the social science area, while the science and engineering and dormitory areas had the highest percentage of negative postings. The number of negative Tweets increases in the library and the science and engineering areas as the end of the semester approaches, reaching a peak around the exam period, while the percentage of negative Tweets drops at the end of the semester in the entertainment and sport and dormitory areas. This study provides some insight into the variation of students' and staff's sentiment on Twitter, which could be useful for university teaching and learning management.

    European Handbook of Crowdsourced Geographic Information

    This book focuses on the study of the remarkable new source of geographic information that has become available in the form of user-generated content accessible over the Internet through mobile and Web applications. The exploitation, integration and application of these sources, termed volunteered geographic information (VGI) or crowdsourced geographic information (CGI), offer scientists an unprecedented opportunity to conduct research on a variety of topics at multiple scales and for diversified objectives. The Handbook is organized in five parts, addressing the fundamental questions: What motivates citizens to provide such information in the public domain, and what factors govern/predict its validity? What methods might be used to validate such information? Can VGI be framed within the larger domain of sensor networks, in which inert and static sensors are replaced or combined by intelligent and mobile humans equipped with sensing devices? What limitations are imposed on VGI by differential access to broadband Internet, mobile phones, and other communication technologies, and by concerns over privacy? How do VGI and crowdsourcing enable innovative applications to benefit human society? Chapters examine how crowdsourcing techniques and methods, and the VGI phenomenon, have motivated a multidisciplinary research community to identify both fields of application and quality criteria depending on the use of VGI. Besides harvesting tools and storage of these data, research has paid remarkable attention to these information resources, in an age when information and participation are among the most important drivers of development. The collection opens questions and points to new research directions in addition to the findings that each of the authors demonstrates. Despite rapid progress in VGI research, this Handbook also shows that there are technical, social, political and methodological challenges that require further study and research.

    Gender Labels in Flux: The Role of Women in Gender Discourse in Post-Reform China

    With the boom of networked digital communication, verbal misogyny permeates Chinese social media, reflecting and reinforcing a sexist gender order in society at large. At the same time, a new generation of Chinese women is seizing digital platforms to counter linguistic sexism in a gender discourse warfare. How has the role of Chinese women in gender discourse changed from passive targets of gender labeling to active agents of feminist activism? My dissertation attempts to answer this question by analyzing the changing gender dynamics in the shifting social labels in contemporary China (1980 to present). Following the groundwork on verbal sexism in wireless China by Jing-Schmidt and Peng (2018), I take an interdisciplinary approach to gender label analysis, integrating the sociolinguistic principle of the mutual constitution of language and society, critical discourse analysis of gender labels as vehicles of power, feminist linguistics, and a socio-technological view on grassroots digital communication. Not only will this dissertation fill a gap in the interdisciplinary research on gender and language in Chinese, it is also the first to use data mining from digitized press and social media, supplemented with survey data on the perceptions of the social meanings of gender labels. This dissertation is an interdisciplinary digital humanities project and has implications for both gender research and social actions toward gender equality.

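    Más allá del corpus: Big Data en la investigación lingüística. Evolución, análisis y predicción del uso de la lengua a través de Twitter [Beyond the corpus: Big Data in linguistic research. Evolution, analysis and prediction of language use through Twitter]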

    This doctoral thesis belongs to a line of research concerned with new technologies applied to Corpus Linguistics and Big Data. It aims to use the information conceptually represented as Big Data (the tangible, unstructured product of human interrelations through new technologies and social networks) to analyse the use and evolution of language. The starting hypothesis of this research is that the information available on social networks, and specifically on Twitter as the main vehicle, is useful for studying the historical evolution and the current state of language, as well as for making predictions about its future behaviour and the applications this may have in any field of linguistic study. Since all this information is in digital form, we will design a tool based on this idea to demonstrate the validity of the hypothesis.

    Property price prediction: a model utilising sentiment analysis

    The increase in the use of social media has led many researchers and companies to investigate the potential uses of the data generated by these platforms. This research study investigates how sentiment variables obtained from the social media platform Twitter can be used to augment housing transfer data in order to develop a predictive model. The Design Science Research (DSR) methodology was followed, guided by a Social Media Framework. Experimentation was required within the Design Cycle of the DSR methodology, which led to the adoption of the Experimental Research methodology within this cycle. An initial literature review identified regression models for property price prediction. Through experimentation, Gradient Boosting regression was identified as an optimal regression model for this purpose. Thereafter, a review of sentiment analysis models was conducted, which resulted in the proposal of a CNN-LSTM model for the classification of Tweets. Initial experimentation with this proposed model yielded an accuracy comparable to that of the top-performing sentiment analysis models identified. A dataset obtained through SemEval, a series of evaluations of computational semantic analysis systems, was used for this phase. For the final experimentation, the CNN-LSTM model was used to obtain sentiment variables from Tweets collected from the Western Cape Province in 2017. The property dataset was augmented with the sentiment variables, after which experimentation was conducted by applying Gradient Boosting regression. The augmentation was done in two ways, based either on the suburb of the property or on the month in which the property was transferred. The results indicate that a model for Property Price Prediction Utilising Sentiment Analysis demonstrates a small improvement when suburb-based sentiment, obtained from Tweets with a minimum threshold per suburb, is utilised. An important finding was that, when geo-coordinates are removed from the dataset, the sentiment variables replace them in the regression results, producing the same level of accuracy as when the coordinates are included.
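
    The suburb-based augmentation described in this abstract can be illustrated with a minimal sketch. It assumes each Tweet has already been reduced to a (suburb, sentiment score) pair and each property record is a dictionary with a "suburb" key. The function names, the field names and the threshold value are hypothetical, not taken from the study, and the regression step itself is omitted.

```python
from collections import defaultdict


def suburb_sentiment(tweets, min_tweets):
    """Mean sentiment per suburb, kept only where enough Tweets exist.

    tweets: iterable of (suburb, score) pairs (hypothetical input shape).
    min_tweets: minimum number of Tweets a suburb needs to be included.
    """
    scores = defaultdict(list)
    for suburb, score in tweets:
        scores[suburb].append(score)
    return {
        suburb: sum(values) / len(values)
        for suburb, values in scores.items()
        if len(values) >= min_tweets
    }


def augment(properties, sentiment, default=None):
    """Return copies of the property records with a suburb sentiment field."""
    return [
        {**record, "suburb_sentiment": sentiment.get(record["suburb"], default)}
        for record in properties
    ]


# Illustrative data only; not from the study.
tweets = [("A", 1.0), ("A", 0.0), ("B", 1.0)]
properties = [{"suburb": "A", "price": 1}, {"suburb": "B", "price": 2}]
augmented = augment(properties, suburb_sentiment(tweets, min_tweets=2))
```

    Suburbs below the threshold receive the default value here, so a downstream model would need to handle the missing entries explicitly.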

    Tune your Brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyperparameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has an impact on any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
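
    As a rough illustration of the mutual-information criterion mentioned in this abstract, the sketch below computes the average mutual information between the classes of adjacent tokens for a given word-to-class assignment. This is the quantity the standard bigram formulation of Brown clustering tries to preserve when merging classes. The function name and the toy data are assumptions, and the greedy merging procedure itself is not shown.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between adjacent class labels."""
    classes = [word_to_class[token] for token in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0

    # Marginal counts of each class as the left and right bigram element.
    left, right = Counter(), Counter()
    for (first, second), count in bigrams.items():
        left[first] += count
        right[second] += count

    ami = 0.0
    for (first, second), count in bigrams.items():
        joint = count / total
        independent = (left[first] / total) * (right[second] / total)
        ami += joint * math.log2(joint / independent)
    return ami


# Toy corpus (illustrative only): two alternating words.
tokens = "a b a b a b".split()
split_classes = average_mutual_information(tokens, {"a": 0, "b": 1})
merged_classes = average_mutual_information(tokens, {"a": 0, "b": 0})
```

    Merging both words into one class removes all information about which class follows which, so the merged assignment scores zero while the split assignment scores above zero. The number of classes therefore directly limits how much of this information a clustering can retain.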