A Survey on Forensics and Compliance Auditing for Critical Infrastructure Protection
Modern societies' broadening dependence on essential services provided by Critical Infrastructures is increasing the importance of their trustworthiness. However, Critical Infrastructures are attractive targets for cyberattacks, due to the potential for considerable impact, not just at the economic level but also in terms of physical damage and even loss of human life. Complementing traditional security mechanisms, forensics and compliance auditing processes play an important role in ensuring Critical Infrastructure trustworthiness. Compliance auditing contributes to checking whether security measures are in place and compliant with standards and internal policies. Forensics supports the investigation of past security incidents. Since these two areas significantly overlap in terms of data sources, tools, and techniques, they can be merged into unified Forensics and Compliance Auditing (FCA) frameworks. In this paper, we survey the latest developments, methodologies, challenges, and solutions addressing forensics and compliance auditing in the scope of Critical Infrastructure Protection. This survey focuses on relevant contributions capable of tackling the requirements imposed by massively distributed and complex Industrial Automation and Control Systems, in terms of handling large volumes of heterogeneous data (which can be noisy, ambiguous, and redundant) for analytic purposes, with adequate performance and reliability. The survey produced a taxonomy for the field of FCA whose key categories denote the relevant topics in the literature. The collected knowledge also resulted in a reference FCA architecture, proposed as a generic template for a converged platform. These results are intended to guide future research on forensics and compliance auditing for Critical Infrastructure Protection.
MapIntel: A visual analytics platform for competitive intelligence
Silva, D., & Bação, F. (2023). MapIntel: A visual analytics platform for competitive intelligence. Expert Systems, e13445. https://doi.org/10.1111/exsy.13445. Funding: this work was supported by Fundação para a Ciência e a Tecnologia of Ministério da Ciência, Tecnologia e Ensino Superior (research grant under the DSAIPA/DS/0116/2019 project).
Competitive Intelligence allows an organization to keep up with market trends and foresee business opportunities. This practice is mainly performed by analysts scanning for any piece of valuable information in a myriad of dispersed and unstructured sources. Here we present MapIntel, a system for acquiring intelligence from vast collections of text data by representing each document as a multidimensional vector that captures its semantics. The system is designed to handle complex natural language queries and visual exploration of the corpus, potentially aiding overburdened analysts in finding meaningful insights to support decision-making. The system's search module uses a retriever and re-ranker engine that first finds the closest neighbours to the query embedding and then sifts the results through a cross-encoder model that identifies the most relevant documents. The browsing or visualization module also leverages the embeddings, projecting them onto two dimensions while preserving the multidimensional landscape, resulting in a map where semantically related documents form topical clusters, which we capture using topic modelling. This map aims at promoting a fast overview of the corpus while allowing more detailed exploration and an interactive information-encountering process. We evaluate the system and its components on the 20 Newsgroups data set, using the semantic document labels provided, and demonstrate the superiority of Transformer-based components.
Finally, we present a prototype of the system in Python and show how some of its features can be used to acquire intelligence from a news article corpus we collected over a period of eight months.
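The retrieve-then-re-rank search described above can be sketched in a few lines. This is a minimal NumPy illustration, not the MapIntel implementation: the function names are hypothetical, and `rerank_fn` is a stand-in for the cross-encoder, which in practice would score raw query/document text rather than embeddings.

```python
import numpy as np

def retrieve_then_rerank(query_vec, doc_vecs, rerank_fn, k=5, top_n=2):
    """Two-stage search: cosine-similarity retrieval, then re-ranking.

    query_vec: (d,) query embedding; doc_vecs: (n, d) document embeddings;
    rerank_fn: stand-in for a cross-encoder scoring (query, doc) pairs.
    """
    # Stage 1: retrieve the k nearest neighbours by cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    candidates = np.argsort(-(d @ q))[:k]
    # Stage 2: re-score only the candidates with the (more expensive) re-ranker.
    scores = [rerank_fn(query_vec, doc_vecs[i]) for i in candidates]
    order = np.argsort(scores)[::-1][:top_n]
    return [int(candidates[i]) for i in order]
```

In a production system, stage 1 would typically query an approximate-nearest-neighbour index instead of scoring every document, which is what makes the cheap-retriever/expensive-re-ranker split worthwhile.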
Enhancing Cloud System Runtime to Address Complex Failures
As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them.
This dissertation aims to enhance distributed systems’ capabilities to detect, localize, and react to complex failures at runtime. To this end, this dissertation makes contributions to address three emerging categories of failures in cloud systems. The first part delves into the investigation of partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with silent semantic failures prevalent in cloud systems, showcasing our study findings, and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems.
MemoriEase at the NTCIR-17 Lifelog-5 Task
We present the MemoriEase retrieval system used for our participation in the NTCIR-17 Lifelog-5 Task. We report our method for addressing the lifelog retrieval problem and discuss the official results of MemoriEase at the Lifelog-5 task. We originally introduced the MemoriEase system for the Lifelog Search Challenge (LSC) as an interactive lifelog retrieval system, and we have modified it into an automatic retrieval system to address the NTCIR Lifelog-5 Task. We adopt the BLIP-2 model as the core embedding model to retrieve lifelog images from textual queries. The open-source Elasticsearch search engine serves as the main engine in the MemoriEase system. Some pre-processing and post-processing techniques are applied to adapt the system to an automatic version and improve the accuracy of retrieval results. Finally, we discuss the results of the system on the task, some limitations of the system, and lessons learned from participating in the Lifelog-5 task, for further improvement of the system in the future.
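Because the system pairs embedding-based retrieval with Elasticsearch, the retrieval step could be served by a request of roughly the following shape. This is a hedged sketch of an Elasticsearch 8.x approximate-kNN search body, not the system's actual query: the index field names (`image_embedding`, `image_id`, `timestamp`) are hypothetical.

```python
def build_knn_query(query_embedding, k=10, num_candidates=100):
    """Build an Elasticsearch 8.x approximate-kNN search body.

    query_embedding: a text embedding of the query (e.g. from a model
    like BLIP-2), compared against pre-indexed image embeddings stored
    in a dense_vector field.
    """
    return {
        "knn": {
            "field": "image_embedding",  # hypothetical dense_vector field
            "query_vector": list(query_embedding),
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["image_id", "timestamp"],  # hypothetical metadata fields
    }
```

The `num_candidates` parameter trades recall for speed: Elasticsearch examines that many candidates per shard before returning the `k` best.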
Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks
Although large language models have achieved impressive zero-shot ability, their huge model size generally incurs high cost. Recently, semi-parametric language models, which augment a smaller language model with an external retriever, have demonstrated promising language modeling capabilities. However, it remains unclear whether such semi-parametric language models can perform as competitively as their fully-parametric counterparts in zero-shot generalization to downstream tasks. In this work, we introduce Zemi, a zero-shot semi-parametric language model. To the best of our knowledge, this is the first semi-parametric language model to demonstrate strong zero-shot performance on a wide range of held-out unseen tasks. We train Zemi with a novel semi-parametric multitask prompted training paradigm, which shows significant improvement compared with the parametric multitask training proposed by T0. Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus. In order to incorporate multiple potentially noisy retrieved augmentations, we further propose a novel module leveraging a perceiver resampler and gated cross-attention. Notably, our proposed Zemi outperforms T0-3B by 16% on all seven evaluation tasks while being 3.9x smaller in model size. (Accepted as a conference paper at Findings of ACL 2023.)
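The gating idea behind such fusion modules can be illustrated in miniature. The NumPy toy below shows only gated cross-attention (single head, no learned projections, perceiver resampler omitted), so it is an assumption-laden sketch rather than the paper's actual module; the point it demonstrates is that a tanh gate initialised at zero leaves the base representations untouched, so noisy retrievals cannot disrupt the model early in training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(x, retrieved, gate):
    """Toy cross-attention from input tokens x (n, d) to retrieved
    augmentation tokens (m, d), blended into x via a tanh gate.

    With gate == 0, tanh(0) == 0 and the layer is an identity mapping.
    """
    attn = softmax(x @ retrieved.T / np.sqrt(x.shape[1]), axis=-1)
    return x + np.tanh(gate) * (attn @ retrieved)
```

A real implementation would add learned query/key/value projections and make `gate` a trainable scalar per layer; this sketch keeps only the residual-plus-gate structure.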
Stress detection in lifelog data for improved personalized lifelog retrieval system
Stress can be categorized into acute and chronic types, with acute stress having short-term positive effects in managing hazardous situations, while chronic stress can adversely impact mental health. In a biological context, stress elicits a physiological response indicative of the fight-or-flight mechanism, accompanied by measurable changes in physiological signals such as blood volume pulse (BVP), galvanic skin response (GSR), and skin temperature (TEMP). While clinical-grade devices have traditionally been used to measure these signals, recent advancements in sensor technology enable their capture using consumer-grade wearable devices, providing opportunities for research in acute stress detection. Despite these advancements, there has been limited focus on utilizing low-resolution data obtained from sensor technology for early stress detection and evaluating stress detection models under real-world conditions. Moreover, the potential of physiological signals to infer mental stress information remains largely unexplored in lifelog retrieval systems. This thesis addresses these gaps through empirical investigations and explores the potential of utilizing physiological signals for stress detection and their integration within the state-of-the-art (SOTA) lifelog retrieval system. The main contributions of this thesis are as follows. Firstly, statistical analyses are conducted to investigate the feasibility of using low-resolution data for stress detection and emphasize the superiority of subject-dependent models over subject-independent models, thereby proposing the optimal approach to training stress detection models with low-resolution data. Secondly, longitudinal stress lifelog data is collected to evaluate stress detection models in real-world settings. It is proposed that training lifelog models on physiological signals in real-world settings is crucial to avoid detection inaccuracies caused by differences between laboratory and free-living conditions. 
Finally, a state-of-the-art lifelog interactive retrieval system called LifeSeeker is developed, incorporating the stress-moment filter function. Experimental results demonstrate that integrating this function improves the overall performance of the system in both interactive and non-interactive modes. In summary, this thesis contributes to the understanding of stress detection applied in real-world settings and showcases the potential of integrating stress information for enhancing personalized lifelog retrieval system performance.
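Before any stress classifier can be trained on wearable signals like GSR or BVP, the raw streams are typically segmented into windows and summarised as features. A minimal sketch of such windowed feature extraction is below; the window and step sizes and the mean/std/slope feature set are illustrative assumptions, not the thesis's exact pipeline.

```python
import numpy as np

def window_features(signal, fs, win_s=60, step_s=30):
    """Extract simple per-window statistics from a 1-D physiological
    signal (e.g. GSR) sampled at fs Hz.

    Returns an (n_windows, 3) array of [mean, std, linear slope],
    using overlapping windows of win_s seconds every step_s seconds.
    """
    win, step = int(win_s * fs), int(step_s * fs)
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        slope = np.polyfit(np.arange(len(w)), w, 1)[0]  # trend over window
        feats.append([w.mean(), w.std(), slope])
    return np.array(feats)
```

Features like these would then feed the subject-dependent models the thesis favours, trained per person rather than across people.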
Using Weak Supervision and Data Augmentation in Question Answering
The onset of the COVID-19 pandemic accentuated the need for access to biomedical literature to answer timely and disease-specific questions. During the early days of the pandemic, one of the biggest challenges we faced was the lack of peer-reviewed biomedical articles on COVID-19 that could be used to train machine learning models for question answering (QA). In this paper, we explore the roles weak supervision and data augmentation play in training deep neural network QA models. First, we investigate whether labels generated automatically from the structured abstracts of scholarly papers using an information retrieval algorithm, BM25, provide a weak supervision signal to train an extractive QA model. We also curate new QA pairs using information retrieval techniques, guided by the clinicaltrials.gov schema and the structured abstracts of articles, in the absence of annotated data from biomedical domain experts. Furthermore, we explore augmenting the training data of a deep neural network model with linguistic features from external sources such as lexical databases to account for variations in word morphology and meaning. To better utilize our training data, we apply curriculum learning to domain adaptation, fine-tuning our QA model in stages based on characteristics of the QA pairs. We evaluate our methods in the context of QA models at the core of a system to answer questions about COVID-19.
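To make the BM25-based weak supervision concrete, the sketch below scores candidate passages against a question with a minimal Okapi BM25 implementation; the top-scoring passage could then supply a weak answer label. This is a generic illustration of BM25, not the authors' labeling pipeline, and the whitespace/lowercase tokenisation is a simplifying assumption.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score whitespace-tokenised docs against a query with Okapi BM25."""
    N = len(docs)
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()                      # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)                 # term frequency in this doc
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores
```

The `k1` and `b` defaults match common practice (term-frequency saturation and document-length normalisation respectively); real systems would also apply stemming so that, e.g., "spreads" matches "spread".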
Artificial Intelligence in the Construction Industry: A Systematic Review of the Entire Construction Value Chain Lifecycle
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY), https://creativecommons.org/licenses/by/4.0/.
In recent years, there has been a surge in the global digitization of corporate processes, and digital technology is developing at such a pace that the construction industry is struggling to catch up with the latest developments. A formidable digital technology, artificial intelligence (AI) is recognized as an essential element within the paradigm of digital transformation and has been widely adopted across different industries. AI is also anticipated to open a slew of new possibilities for how construction projects are designed and built. To obtain a better knowledge of the trend and trajectory of research concerning AI technology application in the construction industry, this research presents an exhaustive systematic review of seventy articles on AI applicability to the entire lifecycle of the construction value chain, identified via the guidelines outlined by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). The review’s findings show, foremost, that AI technologies are mostly used in facility management, creating a huge opportunity for the industry to profit by allowing facility managers to take proactive action. Second, they show the potential for design expansion as a key benefit according to most of the selected literature. Finally, they identify data augmentation as one of the quickest prospects for technical improvement. This knowledge will assist construction companies across the world in recognizing the efficiency and productivity advantages that AI technologies can provide, while helping them make smarter technology investment decisions.
UTDRM: unsupervised method for training debunked-narrative retrieval models
A key task in the fact-checking workflow is to establish whether the claim under investigation has already been debunked or fact-checked before. This is essentially a retrieval task where a misinformation claim is used as a query to retrieve from a corpus of debunks. Prior debunk retrieval methods have typically been trained on annotated pairs of misinformation claims and debunks. The novelty of this paper is an Unsupervised Method for Training Debunked-Narrative Retrieval Models (UTDRM) in a zero-shot setting, eliminating the need for human-annotated pairs. This approach leverages fact-checking articles for the generation of synthetic claims and employs a neural retrieval model for training. Our experiments show that UTDRM tends to match or exceed the performance of state-of-the-art methods on seven datasets, which demonstrates its effectiveness and broad applicability. The paper also analyses the impact of various factors on UTDRM’s performance, such as the quantity of fact-checking articles utilised, the number of synthetically generated claims employed, the proposed entity inoculation method, and the usage of large language models for retrieval.
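Once synthetic (claim, debunk) pairs are available, a neural retriever can be trained contrastively on them. The sketch below shows a generic InfoNCE-style loss with in-batch negatives, written in NumPy; it is an illustrative assumption, not UTDRM's actual training objective, and the temperature value is arbitrary.

```python
import numpy as np

def in_batch_contrastive_loss(claim_embs, debunk_embs, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives.

    claim_embs[i] should score highest against its paired debunk_embs[i];
    every other debunk in the batch acts as a negative.
    """
    c = claim_embs / np.linalg.norm(claim_embs, axis=1, keepdims=True)
    d = debunk_embs / np.linalg.norm(debunk_embs, axis=1, keepdims=True)
    logits = c @ d.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal
```

In practice the embeddings would come from a trainable dual encoder and this loss would be backpropagated; here it only illustrates why aligned pairs yield a lower loss than mismatched ones.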
Current Challenges in the Application of Algorithms in Multi-institutional Clinical Settings
The Coronavirus disease pandemic has highlighted the importance of artificial intelligence in multi-institutional clinical settings. Particularly in situations where the healthcare system is overloaded, and a lot of data is generated, artificial intelligence has great potential to provide automated solutions and to unlock the untapped potential of acquired data. This includes the areas of care, logistics, and diagnosis. For example, automated decision support applications could tremendously help physicians in their daily clinical routine. Especially in radiology and oncology, the exponential growth of imaging data, triggered by a rising number of patients, leads to a permanent overload of the healthcare system, making the use of artificial intelligence inevitable. However, the efficient and advantageous application of artificial intelligence in multi-institutional clinical settings faces several challenges, such as accountability and regulation hurdles, implementation challenges, and fairness considerations. This work focuses on the implementation challenges, which include the following questions: How to ensure well-curated and standardized data, how do algorithms from other domains perform on multi-institutional medical datasets, and how to train more robust and generalizable models? Also, questions of how to interpret results and whether there exist correlations between the performance of the models and the characteristics of the underlying data are part of the work. Therefore, besides presenting a technical solution for manual data annotation and tagging for medical images, a real-world federated learning implementation for image segmentation is introduced. Experiments on a multi-institutional prostate magnetic resonance imaging dataset showcase that models trained by federated learning can achieve similar performance to training on pooled data. 
Furthermore, Natural Language Processing algorithms with the tasks of semantic textual similarity, text classification, and text summarization are applied to multi-institutional, structured and free-text, oncology reports. The results show that performance gains are achieved by customizing state-of-the-art algorithms to the peculiarities of the medical datasets, such as the occurrence of medications, numbers, or dates. In addition, performance influences are observed depending on the characteristics of the data, such as lexical complexity. The generated results, human baselines, and retrospective human evaluations demonstrate that artificial intelligence algorithms have great potential for use in clinical settings. However, due to the difficulty of processing domain-specific data, there still exists a performance gap between the algorithms and the medical experts. In the future, it is therefore essential to improve the interoperability and standardization of data, as well as to continue working on algorithms to perform well on medical, possibly domain-shifted, data from multiple clinical centers.
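The federated learning result above rests on the idea that institutions share model parameters rather than patient data. A minimal sketch of the FedAvg aggregation step is below; it is a generic illustration, not the implementation used in this work, and the names `client_weights`/`client_sizes` are hypothetical.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg aggregation: size-weighted average of per-client parameters.

    client_weights: list of per-client parameter lists (NumPy arrays);
    client_sizes: number of training samples at each client. Only these
    parameters cross institutional boundaries, never the raw data.
    """
    total = sum(client_sizes)
    return [
        sum((n / total) * w[i] for w, n in zip(client_weights, client_sizes))
        for i in range(len(client_weights[0]))
    ]
```

A full round alternates local training at each institution with this central averaging; clients with more data pull the global model proportionally harder.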