9 research outputs found
Temporal Topic Analysis with Endogenous and Exogenous Processes
We consider the problem of modeling temporal textual data taking endogenous
and exogenous processes into account. Such text documents arise in real world
applications, including job advertisements and economic news articles, which
are influenced by the fluctuations of the general economy. We propose a
hierarchical Bayesian topic model which imposes a "group-correlated"
hierarchical structure on the evolution of topics over time, incorporating both
processes, and show that this model can be estimated via Markov chain Monte
Carlo sampling methods. We further demonstrate that this model captures the
intrinsic relationships between the topic distribution and the time-dependent
factors, and compare its performance with latent Dirichlet allocation (LDA) and
two other related models. The model is applied to two collections of documents
to illustrate its empirical performance: online job advertisements from
DirectEmployers Association and journalists' postings on BusinessInsider.com.
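As a point of reference for the LDA baseline this abstract compares against (not the proposed group-correlated model), a minimal scikit-learn sketch might look like the following; the corpus and topic count are invented for illustration:

```python
# Minimal LDA baseline sketch: fit plain LDA on a toy corpus and recover
# per-document topic mixtures. This is the comparison model, not the
# paper's hierarchical Bayesian model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "hiring software engineer cloud experience",
    "job opening data analyst finance",
    "economy slows hiring freezes announced",
    "markets rally as economic outlook improves",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# One row per document; each row is a topic distribution summing to 1.
theta = lda.transform(counts)
print(theta.shape)  # (4, 2)
```

The paper's model would additionally tie these per-document mixtures to time-dependent exogenous factors, which plain LDA ignores.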
Improving Construction Project Schedules before Execution
The construction industry has long been
blighted by delay and disruption. To address this
problem, this study proposes the Fitzsimmons
Method (FM method) to improve the scheduling
performance of activities on the Critical Path before
the project execution. The proposed FM method
integrates Bayesian Networks to estimate the
conditional probability of activity delay given its
predecessor and Support Vector Machines to
estimate the time delay. The FM method was trained
on 302 completed infrastructure construction
projects and validated on a £40 million completed
road construction project. Compared with
traditional Monte Carlo Simulation results, the
proposed FM method is 52% more accurate in
predicting the projects’ time delay. The proposed
FM method contributes to leveraging the vast
quantities of data available to improve the
estimation of time risk on infrastructure and
construction projects.
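The Bayesian-network step described above — estimating the conditional probability of an activity's delay given its predecessor's — can be illustrated with a toy frequency calculation; the records below are invented, not data from the study:

```python
# Toy illustration of the conditional-probability step: from historical
# records of (predecessor delayed?, activity delayed?), estimate
# P(activity delayed | predecessor delayed). All values are invented.
records = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]

with_pred_delay = [activity for pred, activity in records if pred]
p_delay_given_pred = sum(with_pred_delay) / len(with_pred_delay)
print(p_delay_given_pred)  # 2 of 3 delayed -> 0.666...
```

In the FM method this conditional structure feeds a Support Vector Machine that regresses the magnitude of the delay, rather than stopping at the probability.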
Development of a national-scale real-time Twitter data mining pipeline for social geodata on the potential impacts of flooding on communities
Social media, particularly Twitter, is increasingly used to improve resilience during extreme weather events and emergency management situations, including floods: by communicating potential risks and their impacts, and by informing agencies and responders. In this paper, we developed a prototype national-scale Twitter data mining pipeline for improved stakeholder situational awareness during flooding events across Great Britain, retrieving relevant social geodata grounded in environmental data sources (flood warnings and river levels). With potential users we identified and addressed three research questions to develop this application, whose components constitute a modular architecture for real-time dashboards: first, polling national flood warning and river level Web data sources to obtain at-risk locations; secondly, real-time retrieval of geotagged tweets proximate to at-risk areas; thirdly, filtering flood-relevant tweets with natural language processing and machine learning libraries, using word embeddings of tweets. We demonstrated the national-scale social geodata pipeline using over 420,000 georeferenced tweets obtained between 20-29 June 2016.
Highlights:
• Prototype real-time social geodata pipeline for flood events and demonstration dataset
• National-scale flood warnings/river levels set 'at-risk areas' in Twitter API queries
• Monitoring multiple locations (without keywords) retrieved current, geotagged tweets
• Novel application of word embeddings in flooding context identified relevant tweets
• Pipeline extracts tweets to visualise using open-source libraries (SciKit Learn/Gensim)
Keywords: flood management; Twitter; volunteered geographic information; natural language processing; word embeddings; social geodata.
Hardware required: Intel i3 or mid-performance PC with multicore processor and SSD main drive; 8 GB memory recommended.
Software required: Python and library dependencies specified in Appendix A1.2.1, (viii) environment.yml. Software availability: All source code can be found in a public GitHub repository.
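The pipeline's second step — keeping only geotagged tweets proximate to at-risk areas, with no keyword filtering — reduces to a great-circle distance check. A minimal sketch, with invented coordinates and an assumed 10 km radius:

```python
# Sketch of the proximity filter: keep geotagged tweets within a radius of
# an at-risk location (e.g. a flood-warning centroid). Coordinates, tweets,
# and the 10 km radius are illustrative, not from the paper.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

at_risk = (53.48, -2.24)  # hypothetical flood-warning centroid
tweets = [
    {"text": "river rising fast", "coords": (53.49, -2.25)},  # ~1.3 km away
    {"text": "sunny in London",   "coords": (51.51, -0.13)},  # ~260 km away
]

nearby = [t for t in tweets if haversine_km(*t["coords"], *at_risk) <= 10]
print([t["text"] for t in nearby])  # ['river rising fast']
```

The retained tweets then pass to the word-embedding relevance filter, which is the third stage of the pipeline.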
Exploring acceptance of autonomous vehicle policies using KeyBERT and SNA: Targeting engineering students
This study aims to explore user acceptance of Autonomous Vehicle (AV)
policies with improved text-mining methods. Recently, South Korean policymakers
have viewed Autonomous Driving Car (ADC) and Autonomous Driving Robot (ADR) as
next-generation means of transportation that will reduce the cost of
transporting passengers and goods. They support the construction of V2I and V2V
communication infrastructure for ADCs and treat ADRs as equivalent to
pedestrians to promote their deployment on sidewalks. To fill the gap where
end-user acceptance of these policies is not well considered, this study
applied two text-mining methods to the comments of graduate students in the
fields of Industrial, Mechanical, and Electronics-Electrical-Computer. One is
the Co-occurrence Network Analysis (CNA) based on TF-IWF and Dice coefficient,
and the other is the Contextual Semantic Network Analysis (C-SNA) based on both
KeyBERT, which extracts keywords that contextually represent the comments, and
double cosine similarity. These approaches were compared to balance interest
in the implications for AV policies with the need to apply sound text mining
to this research domain. In particular, we considered the limitation of
frequency-based text mining, which does not reflect textual context, and the
trade-off of adjusting thresholds in Semantic Network Analysis (SNA). In the
comparison, C-SNA provided the information necessary to understand users'
voices using fewer nodes and features than CNA. The users who pre-emptively understood
the AV policies based on their engineering literacy and the given texts
revealed potential risks of the AV accident policies. This study adds
suggestions for managing these risks to support the successful deployment of
AVs on public roads. Comment: 29 pages with 11 figures
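The Dice-coefficient edge weighting in the CNA approach can be sketched in a few lines of plain Python; the comments, tokenization, and document-level co-occurrence definition below are illustrative assumptions, not the study's exact preprocessing:

```python
# Toy sketch of Dice-coefficient co-occurrence weighting:
# Dice(a, b) = 2 * |docs containing both| / (|docs with a| + |docs with b|).
# Edges with high Dice weight connect terms that tend to appear together.
comments = [
    "autonomous robot on sidewalk",
    "robot sidewalk safety",
    "vehicle policy risk",
]
docs = [set(c.split()) for c in comments]

def dice(a, b):
    both = sum(1 for d in docs if a in d and b in d)
    either = sum(1 for d in docs if a in d) + sum(1 for d in docs if b in d)
    return 2 * both / either

print(dice("robot", "sidewalk"))  # 1.0: the two terms always co-occur here
```

The C-SNA alternative replaces such frequency-based weights with cosine similarities between KeyBERT-extracted contextual keyword embeddings, which is what lets it use fewer nodes and features.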
Descubrimiento de tópicos a partir de textos en español sobre enfermedades en México (Topic discovery from Spanish-language texts on diseases in Mexico)
65 pages. Master's thesis in Computer Science (Maestría en Ciencias de la Computación). Social networks contain a large amount of potentially valuable information on numerous topics. For example, in the disease domain, many people around the world publish information including conditions, signs, symptoms, procedures, medications, and treatments. This information appears in texts in a disorganized way, which makes it difficult for readers to find valuable information, and analysing it manually is a tedious, difficult, and time-consuming process. We therefore turn to computational systems, algorithms, and text-analysis methods to find topics of interest. This work presents an approach for topic discovery from Spanish-language texts about three diseases (diabetes, cancer, and COVID-19) in Mexico, using LDA (Latent Dirichlet Allocation), an algorithm widely used in the literature, and BTM (Biterm Topic Model), an alternative that groups pairs of terms to find topics. The hypothesis is that feeding phrases, rather than single words, into the algorithms improves topic coherence. An experimental evaluation based on the topic coherence metric showed that using phrases is more effective than using single words for discovering topics. The best coherence results per disease were: 0.7421 with BTM for 100 topics on COVID-19; 0.6755 with BTM for 80 topics on cancer; and 0.6357 with BTM for 80 topics on diabetes.
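BTM's core move — modelling unordered word pairs (biterms) within each short text, rather than single word occurrences as LDA does — can be sketched as follows; the example texts are invented, not the thesis data:

```python
# Sketch of biterm extraction, the preprocessing step at the heart of BTM:
# every unordered pair of words inside one short text is a biterm, and the
# model learns topics over biterms instead of single words.
from itertools import combinations

def biterms(text):
    """Return all unordered word pairs from a single short text."""
    return [tuple(sorted(pair)) for pair in combinations(text.split(), 2)]

print(biterms("diabetes treatment symptoms"))
# [('diabetes', 'treatment'), ('diabetes', 'symptoms'), ('symptoms', 'treatment')]
```

Because each short document contributes all its word pairs, BTM sidesteps the sparsity that hurts LDA on social-media-length texts, which is consistent with the coherence advantage reported above.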
Endless Data
Small and Medium Enterprises (SMEs), as well as micro teams, face an uphill
task when delivering software to the Cloud. While rapid release methods
such as Continuous Delivery can speed up the delivery cycle, software quality,
application uptime, and information management remain key concerns. This
work looks at four aspects of software delivery: crowdsourced testing, Cloud
outage modelling, collaborative chat discourse modelling, and collaborative
chat discourse segmentation. For each aspect, we consider business-related
questions around how to improve software quality and gain more significant
insights into collaborative data while respecting the rapid release paradigm.
The Palgrave Handbook of Digital Russia Studies
This open access handbook presents a multidisciplinary and multifaceted perspective on how the ‘digital’ is simultaneously changing Russia and the research methods scholars use to study Russia. It provides a critical update on how Russian society, politics, economy, and culture are reconfigured in the context of ubiquitous connectivity and accounts for the political and societal responses to digitalization. In addition, it answers practical and methodological questions in handling Russian data and a wide array of digital methods. The volume makes a timely intervention in our understanding of the changing field of Russian Studies and is an essential guide for scholars, advanced undergraduate and graduate students studying Russia today