9 research outputs found

    Temporal Topic Analysis with Endogenous and Exogenous Processes

    Full text link
    We consider the problem of modeling temporal textual data while taking endogenous and exogenous processes into account. Such text documents arise in real-world applications, including job advertisements and economic news articles, which are influenced by fluctuations in the general economy. We propose a hierarchical Bayesian topic model that imposes a "group-correlated" hierarchical structure on the evolution of topics over time, incorporating both processes, and show that this model can be estimated using Markov chain Monte Carlo sampling methods. We further demonstrate that the model captures the intrinsic relationships between the topic distribution and time-dependent factors, and compare its performance with latent Dirichlet allocation (LDA) and two other related models. The model is applied to two collections of documents to illustrate its empirical performance: online job advertisements from the DirectEmployers Association and journalists' postings on BusinessInsider.com.
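The LDA baseline the abstract compares against is typically estimated with exactly the kind of Markov chain Monte Carlo sampling mentioned above. A minimal, pure-Python collapsed Gibbs sampler for plain LDA is sketched below; the toy corpus, hyperparameters, and topic count are illustrative, and this is the baseline model, not the paper's group-correlated extension.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for a toy LDA model.

    docs: list of token lists. Returns per-document topic counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})               # vocabulary size
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]  # token-topic labels
    n_dk = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_kw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment, then resample from the
                # full conditional p(z = k | all other assignments)
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wgt in enumerate(weights):
                    r -= wgt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return n_dk

docs = [["jobs", "hiring", "salary"] * 3,
        ["markets", "stocks", "economy"] * 3,
        ["jobs", "salary", "hiring"] * 3]
theta = gibbs_lda(docs, n_topics=2)  # doc-topic counts after sampling
```

The per-document counts in `theta` (smoothed by `alpha`) estimate each document's topic distribution; the paper's model would additionally correlate these distributions across groups and time.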

    Improving Construction Project Schedules before Execution

    Get PDF
    The construction industry has long been blighted by delay and disruption. To address this problem, this study proposes the Fitzsimmons Method (FM method) to improve the scheduling performance of activities on the critical path before project execution. The proposed FM method integrates Bayesian networks, to estimate the conditional probability of activity delay given its predecessors, with support vector machines, to estimate the length of the delay. The FM method was trained on 302 completed infrastructure construction projects and validated on a £40 million completed road construction project. Compared with traditional Monte Carlo simulation, the proposed FM method is 52% more accurate in predicting projects' time delay. The FM method thus helps leverage the vast quantities of data available to improve the estimation of time risk on infrastructure and construction projects.
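The two ingredients of the FM method can be illustrated on toy data: a one-edge Bayesian network reduces to a conditional probability estimated from historical counts, and the delay-magnitude model is sketched here with ordinary least squares standing in for the paper's support vector machine. All data and numbers below are invented for illustration.

```python
def conditional_delay_prob(history):
    """P(activity delayed | predecessor delayed), from historical records.

    history: list of (predecessor_delayed, activity_delayed) booleans."""
    given_pred = [act for pred, act in history if pred]
    return sum(given_pred) / len(given_pred)

def fit_line(xs, ys):
    """Least-squares slope/intercept: a stand-in for the SVM regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented project records: did the predecessor slip, did the activity slip?
history = [(True, True), (True, True), (True, False), (False, False)]
p = conditional_delay_prob(history)  # P(delay | predecessor delayed) = 2/3

# Invented (predecessor delay, activity delay) pairs in days.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept    # predicted delay for a 5-day slip
```

Chaining such conditional estimates along the critical path, activity by activity, is the structure the Bayesian network contributes; the regressor then converts "will it slip?" into "by how much?".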

    Development of a national-scale real-time Twitter data mining pipeline for social geodata on the potential impacts of flooding on communities

    Get PDF
    Social media, particularly Twitter, is increasingly used to improve resilience during extreme weather events and emergency-management situations, including floods: by communicating potential risks and their impacts, and by informing agencies and responders. In this paper, we develop a prototype national-scale Twitter data mining pipeline for improved stakeholder situational awareness during flooding events across Great Britain, retrieving relevant social geodata grounded in environmental data sources (flood warnings and river levels). With potential users we identified and addressed three research questions to develop this application, whose components constitute a modular architecture for real-time dashboards: first, polling national flood-warning and river-level Web data sources to obtain at-risk locations; second, real-time retrieval of geotagged tweets proximate to at-risk areas; third, filtering flood-relevant tweets with natural language processing and machine learning libraries, using word embeddings of tweets. We demonstrated the national-scale social geodata pipeline using over 420,000 georeferenced tweets obtained between 20 and 29 June 2016.
    Highlights:
    • Prototype real-time social geodata pipeline for flood events and demonstration dataset
    • National-scale flood warnings/river levels set 'at-risk areas' in Twitter API queries
    • Monitoring multiple locations (without keywords) retrieved current, geotagged tweets
    • Novel application of word embeddings in flooding context identified relevant tweets
    • Pipeline extracts tweets to visualise using open-source libraries (scikit-learn/Gensim)
    Keywords: flood management; Twitter; volunteered geographic information; natural language processing; word embeddings; social geodata.
    Hardware required: Intel i3 or mid-performance PC with multicore processor and SSD main drive; 8 GB memory recommended.
    Software required: Python and library dependencies specified in Appendix A1.2.1, (viii) environment.yml.
    Software availability: All source code can be found at a public GitHub repository.
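The embedding-based relevance filtering step can be sketched in a few lines: average the word vectors of a tweet and score it by cosine similarity against a flood-related query vector. The paper trains embeddings with Gensim; the tiny hand-made 2-d lookup table below is an assumption used purely for illustration.

```python
import math

# Hypothetical 2-d word embeddings standing in for Gensim-trained vectors.
EMB = {"flood": (0.9, 0.1), "river": (0.8, 0.3), "warning": (0.7, 0.2),
       "football": (0.1, 0.9), "match": (0.2, 0.8)}

def tweet_vector(tokens):
    """Average the embeddings of the tokens we have vectors for."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(axis) / len(vecs) for axis in zip(*vecs))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = tweet_vector(["flood", "warning"])
relevant = cosine(query, tweet_vector(["river", "flood"]))
irrelevant = cosine(query, tweet_vector(["football", "match"]))
```

Tweets scoring above a similarity threshold would be kept for the dashboard; in the real pipeline the threshold and embeddings would come from the labelled training data, not from a hand-built table.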

    Exploring acceptance of autonomous vehicle policies using KeyBERT and SNA: Targeting engineering students

    Full text link
    This study explores user acceptance of Autonomous Vehicle (AV) policies with improved text-mining methods. Recently, South Korean policymakers have viewed the Autonomous Driving Car (ADC) and the Autonomous Driving Robot (ADR) as next-generation means of transportation that will reduce the cost of transporting passengers and goods. They support the construction of V2I and V2V communication infrastructure for ADCs, and they recognize ADRs as equivalent to pedestrians in order to promote their deployment on sidewalks. To fill the gap where end-user acceptance of these policies is not well considered, this study applied two text-mining methods to the comments of graduate students in the fields of Industrial, Mechanical, and Electronics-Electrical-Computer engineering. One is Co-occurrence Network Analysis (CNA), based on TF-IWF and the Dice coefficient; the other is Contextual Semantic Network Analysis (C-SNA), based on both KeyBERT, which extracts keywords that contextually represent the comments, and double cosine similarity. These approaches are compared in order to balance interest in the implications for AV policies with the need to apply high-quality text mining to this research domain. In particular, the limitation of frequency-based text mining, which does not reflect textual context, and the trade-off of adjusting thresholds in Semantic Network Analysis (SNA) were considered. Comparing the two approaches, the C-SNA provided the information necessary to understand users' voices using fewer nodes and features than the CNA. The users, who pre-emptively understood the AV policies based on their engineering literacy and the given texts, revealed potential risks of the AV accident policies. This study adds suggestions for managing these risks to support the successful deployment of AVs on public roads.
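The Dice-coefficient weighting used by the CNA branch is simple to make concrete: two terms are linked in proportion to how often they appear in the same comment, normalized by how often each appears overall. The sketch below omits the TF-IWF term-selection step and uses an invented three-comment corpus.

```python
from itertools import combinations

def dice_network(comments, min_dice=0.5):
    """Co-occurrence edges weighted by the Dice coefficient:
    Dice(a, b) = 2 * |docs with both| / (|docs with a| + |docs with b|)."""
    occur = {}  # term -> set of comment indices containing it
    for i, tokens in enumerate(comments):
        for t in set(tokens):
            occur.setdefault(t, set()).add(i)
    edges = {}
    for a, b in combinations(sorted(occur), 2):
        shared = len(occur[a] & occur[b])
        dice = 2 * shared / (len(occur[a]) + len(occur[b]))
        if dice >= min_dice:          # threshold prunes weak edges
            edges[(a, b)] = dice
    return edges

# Invented tokenized comments, for illustration only.
comments = [["robot", "sidewalk", "risk"],
            ["robot", "sidewalk"],
            ["car", "infrastructure"]]
edges = dice_network(comments)
```

The `min_dice` threshold is exactly the trade-off the abstract mentions for SNA: raising it yields a sparser, cleaner network at the cost of dropping weaker but possibly meaningful associations.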

    Topic Discovery from Spanish-Language Texts on Diseases in Mexico

    Get PDF
    65 pages. Master's thesis in Computer Science. Social networks hold a large amount of potentially valuable information on numerous topics. For example, in the domain of diseases, many people around the world publish information about conditions, signs, symptoms, procedures, medications, and treatments. This information appears in texts in a disorganized way, making it difficult for readers to find what is valuable, and manual analysis is a tedious, difficult, and time-consuming process. We therefore turn to computational systems, algorithms, and text-analysis methods to find topics of interest. This work presents an approach for topic discovery from Spanish-language texts on three diseases (diabetes, cancer, and COVID-19) in Mexico, using LDA (Latent Dirichlet Allocation), an algorithm widely used in the literature, and BTM (Biterm Topic Model), an alternative that groups pairs of terms to find topics. The hypothesis of this approach is that feeding the algorithms phrases rather than single words improves topic coherence. An evaluation of experimental results was carried out based on the topic coherence metric; it showed that using phrases is more effective than using single words for discovering topics. The best coherence results per disease were: 0.7421 with BTM for 100 topics on COVID-19; 0.6755 with BTM for 80 topics on cancer; and 0.6357 with BTM for 80 topics on diabetes.
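BTM's distinguishing move, grouping pairs of terms, can be shown with the extraction step alone: each short text contributes all its unordered word pairs (biterms), and the model is then fit on the pooled biterm counts rather than on per-document word counts as in LDA. The sketch below covers only this extraction step, on an invented corpus; the actual topic inference is omitted.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(docs):
    """Pool unordered word pairs (biterms) across all short texts.

    BTM models these corpus-wide pairs, which mitigates the sparsity of
    per-document counts that hurts LDA on short texts."""
    biterms = Counter()
    for tokens in docs:
        # sorted(set(...)) gives each unordered pair a canonical key
        for a, b in combinations(sorted(set(tokens)), 2):
            biterms[(a, b)] += 1
    return biterms

# Invented tokenized posts, for illustration only.
docs = [["diabetes", "glucosa", "insulina"],
        ["diabetes", "glucosa"],
        ["cancer", "tumor"]]
counts = extract_biterms(docs)
```

The thesis's phrases-over-words hypothesis would enter here: tokenizing into multiword phrases before pairing changes which biterms are counted, which in turn changes the coherence of the discovered topics.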

    Endless Data

    Get PDF
    Small and Medium Enterprises (SMEs), as well as micro teams, face an uphill task when delivering software to the Cloud. While rapid-release methods such as Continuous Delivery can speed up the delivery cycle, software quality, application uptime, and information management remain key concerns. This work looks at four aspects of software delivery: crowdsourced testing, Cloud outage modelling, collaborative chat discourse modelling, and collaborative chat discourse segmentation. For each aspect, we consider business-related questions about how to improve software quality and gain more significant insights into collaborative data while respecting the rapid-release paradigm.

    The Palgrave Handbook of Digital Russia Studies

    Get PDF
    This open access handbook presents a multidisciplinary and multifaceted perspective on how the ‘digital’ is simultaneously changing Russia and the research methods scholars use to study Russia. It provides a critical update on how Russian society, politics, economy, and culture are reconfigured in the context of ubiquitous connectivity and accounts for the political and societal responses to digitalization. In addition, it answers practical and methodological questions in handling Russian data and a wide array of digital methods. The volume makes a timely intervention in our understanding of the changing field of Russian Studies and is an essential guide for scholars, advanced undergraduate and graduate students studying Russia today
