1,167 research outputs found
Dataflow Programming and Acceleration of Computationally-Intensive Algorithms
The volume of unstructured textual information continues to grow due to recent technological advancements. This has resulted in an exponential growth of information generated in various formats, including blogs, posts, social networking, and enterprise documents. Numerous Enterprise Architecture (EA) documents are also created daily, such as reports, contracts, agreements, frameworks, architecture requirements, designs, and operational guides. The processing and computation of this massive amount of unstructured information necessitate substantial computing capabilities and the implementation of new techniques. It is critical to manage this unstructured information through a centralized knowledge management platform. Knowledge management is the process of managing information within an organization. This involves creating, collecting, organizing, and storing information in a way that makes it easily accessible and usable. The research involved the development of a textual knowledge management system, and two use cases were considered for extracting textual knowledge from documents. The first case study focused on the safety-critical documents of a railway enterprise. Safety is of paramount importance in the railway industry. There are several EA documents, including manuals, operational procedures, and technical guidelines, that contain critical information. Digitalization of these documents is essential for analysing the vast amounts of textual knowledge that exist in them to improve the safety and security of railway operations. A case study was conducted between the University of Huddersfield and the Railway Safety Standard Board (RSSB) to analyse EA safety documents using natural language processing (NLP). A graphical user interface was developed that includes various document processing features such as semantic search, document mapping, text summarization, and visualization of key trends.
For the second case study, open-source data was utilized, and textual knowledge was extracted. Several features were also developed, including kernel distribution, analysis of key trends, and sentiment analysis of words (such as unique, positive, and negative) within the documents. Additionally, a heterogeneous framework was designed using CPU/GPU and FPGAs to analyse the computational performance of document mapping
Unifying context with labeled property graph: A pipeline-based system for comprehensive text representation in NLP
Extracting valuable insights from vast amounts of unstructured digital text presents significant challenges across diverse domains. This research addresses this challenge by proposing a novel pipeline-based system that generates domain-agnostic and task-agnostic text representations. The proposed approach leverages labeled property graphs (LPG) to encode contextual information, facilitating the integration of diverse linguistic elements into a unified representation. The proposed system enables efficient graph-based querying and manipulation by addressing the crucial aspect of comprehensive context modeling and fine-grained semantics. The effectiveness of the proposed system is demonstrated through the implementation of NLP components that operate on LPG-based representations. Additionally, the proposed approach introduces specialized patterns and algorithms to enhance specific NLP tasks, including nominal mention detection, named entity disambiguation, event enrichments, event participant detection, and temporal link detection. The evaluation of the proposed approach, using the MEANTIME corpus comprising manually annotated documents, provides encouraging results and valuable insights into the system's strengths. The proposed pipeline-based framework serves as a solid foundation for future research, aiming to refine and optimize LPG-based graph structures to generate comprehensive and semantically rich text representations, addressing the challenges associated with efficient information extraction and analysis in NLP
Explainable text-based features in predictive models of crowdfunding campaigns
Reward-Based Crowdfunding offers an opportunity for innovative ventures that would not be supported through traditional financing. A key problem for those seeking funding is understanding which features of a crowdfunding campaign will sway the decisions of a sufficient number of funders. Predictive models of fund-raising campaigns used in combination with Explainable AI methods promise to provide such insights. However, previous work on Explainable AI has largely focused on quantitative structured data. In this study, our aim is to construct explainable models of human decisions based on analysis of natural language text, thus contributing to a fast-growing body of research on the use of Explainable AI for text analytics. We propose a novel method to construct predictions based on text via semantic clustering of sentences, which, compared with traditional methods using individual words and phrases, allows complex meaning contained in the text to be operationalised. Using experimental evaluation, we compare our proposed method to keyword extraction and topic modelling, which have traditionally been used in similar applications. Our results demonstrate that the sentence clustering method produces features with significant predictive power, compared to keyword-based methods and topic models, but which are much easier to interpret for human raters. We furthermore conduct a SHAP analysis of the models incorporating sentence clusters, demonstrating concrete insights into the types of natural language content that influence the outcome of crowdfunding campaigns
A Simple and Effective Method of Cross-Lingual Plagiarism Detection
We present a simple cross-lingual plagiarism detection method applicable to a
large number of languages. The presented approach leverages open multilingual
thesauri for the candidate retrieval task and pre-trained multilingual BERT-based
language models for detailed analysis. The method does not rely on machine
translation and word sense disambiguation when in use, and therefore is
suitable for a large number of languages, including under-resourced languages.
The effectiveness of the proposed approach is demonstrated for several existing
and new benchmarks, achieving state-of-the-art results for French, Russian, and
Armenian languages
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Transformer is a deep neural network that employs a self-attention mechanism
to comprehend the contextual relationships within sequential data. Unlike
conventional neural networks or updated versions of Recurrent Neural Networks
(RNNs) such as Long Short-Term Memory (LSTM), transformer models excel in
handling long dependencies between input sequence elements and enable parallel
processing. As a result, transformer-based models have attracted substantial
interest among researchers in the field of artificial intelligence. This can be
attributed to their immense potential and remarkable achievements, not only in
Natural Language Processing (NLP) tasks but also in a wide range of domains,
including computer vision, audio and speech processing, healthcare, and the
Internet of Things (IoT). Although several survey papers have been published
highlighting the transformer's contributions in specific fields, architectural
differences, or performance evaluations, there is still a significant absence
of a comprehensive survey paper encompassing its major applications across
various domains. Therefore, we undertook the task of filling this gap by
conducting an extensive survey of proposed transformer models from 2017 to
2022. Our survey encompasses the identification of the top five application
domains for transformer-based models, namely: NLP, Computer Vision,
Multi-Modality, Audio and Speech Processing, and Signal Processing. We analyze
the impact of highly influential transformer-based models in these domains and
subsequently classify them based on their respective tasks using a proposed
taxonomy. Our aim is to shed light on the existing potential and future
possibilities of transformers for enthusiastic researchers, thus contributing
to the broader understanding of this groundbreaking technology
Talking about personal recovery in bipolar disorder: Integrating health research, natural language processing, and corpus linguistics to analyse peer online support forum posts
Background: Personal recovery, ‘living a satisfying, hopeful and contributing life even with the limitations caused by the illness’ (Anthony, 1993), is of particular value in bipolar disorder where symptoms often persist despite treatment. So far, personal recovery has only been studied in researcher-constructed environments (interviews, focus groups). Support forum posts can serve as a complementary naturalistic data source. Objective: The overarching aim of this thesis was to study personal recovery experiences that people living with bipolar disorder have shared in online support forums through integrating health research, NLP, and corpus linguistics in a mixed methods approach within a pragmatic research paradigm, while considering ethical issues and involving people with lived experience. Methods: This mixed-methods study analysed: 1) previous qualitative evidence on personal recovery in bipolar disorder from interviews and focus groups; 2) who self-reports a bipolar disorder diagnosis on the online discussion platform Reddit; 3) the relationship of mood and posting in mental health-specific Reddit forums (subreddits); 4) discussions of personal recovery in bipolar disorder subreddits. Results: A systematic review of qualitative evidence resulted in the first framework for personal recovery in bipolar disorder, POETIC (Purpose & meaning, Optimism & hope, Empowerment, Tensions, Identity, Connectedness). Mainly young or middle-aged US-based adults self-report a bipolar disorder diagnosis on Reddit. Of these, those experiencing more intense emotions appear to be more likely to post in mental health support subreddits. Their personal recovery-related discussions in bipolar disorder subreddits primarily focussed on three domains: Purpose & meaning (particularly reproductive decisions, work), Connectedness (romantic relationships, social support), Empowerment (self-management, personal responsibility).
Support forum data highlighted personal recovery issues that exclusively or more frequently came up online compared to previous evidence from interviews and focus groups. Conclusion: This project is the first to analyse non-reactive data on personal recovery in bipolar disorder. Indicating the key areas that people focus on in personal recovery when posting freely and the language they use provides a helpful starting point for formal and informal carers to understand the concerns of people diagnosed with bipolar disorder and to consider how best to offer support
Workshop Proceedings of the 12th edition of the KONVENS conference
The 2014 issue of KONVENS is even more a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies such interaction, cooperation and integrated views can produce. This topic, at the crossroads of different research traditions which deal with natural language as a container of knowledge and with methods to extract and manage knowledge that is linguistically represented, is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years
Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work
We evaluate four count-based and predictive distributional semantic models of Ancient Greek against AGREE, a composite benchmark of human judgements, to assess their ability to retrieve semantic relatedness. On the basis of the observations deriving from the analysis of the results, we design a procedure for a larger-scale intrinsic evaluation of count-based and predictive language models, including syntactic embeddings. We also propose possible ways of exploiting the different layers of the whole AGREE benchmark (including both human- and machine-generated data) and different evaluation metrics
Driving Context into Text-to-Text Privatization
Metric Differential Privacy enables text-to-text privatization by
adding calibrated noise to the vector of a word derived from an embedding space
and projecting this noisy vector back to a discrete vocabulary using a nearest
neighbor search. Since words are substituted without context, this mechanism is
expected to fall short at finding substitutes for words with ambiguous
meanings, such as 'bank'. To account for these ambiguous words, we
leverage a sense embedding and incorporate a sense disambiguation step prior to
noise injection. We encompass our modification to the privatization mechanism
with an estimation of privacy and utility. For word sense disambiguation on the
Words in Context dataset, we demonstrate a substantial increase in
classification accuracy by
A Tale of Two Approaches: Comparing Top-Down and Bottom-Up Strategies for Analyzing and Visualizing High-Dimensional Data
The proliferation of high-throughput and sensory technologies in various fields has led to a considerable increase in data volume, complexity, and diversity. Traditional data storage, analysis, and visualization methods are struggling to keep pace with the growth of modern data sets, necessitating innovative approaches to overcome the challenges of managing, analyzing, and visualizing data across various disciplines.
One such approach is utilizing novel storage media, such as deoxyribonucleic acid (DNA), which presents an efficient, stable, compact, and energy-saving storage option. Researchers are exploring the potential use of DNA as a storage medium for long-term storage of significant cultural and scientific materials.
In addition to novel storage media, scientists are also focussing on developing new techniques that can integrate multiple data modalities and leverage machine learning algorithms to identify complex relationships and patterns in vast data sets. These newly-developed data management and analysis approaches have the potential to unlock previously unknown insights into various phenomena and to facilitate more effective translation of basic research findings to practical and clinical applications.
Addressing these challenges necessitates different problem-solving approaches. Researchers are developing novel tools and techniques that require different viewpoints. Top-down and bottom-up approaches are essential techniques that offer valuable perspectives for managing, analyzing, and visualizing complex high-dimensional multi-modal data sets. This cumulative dissertation explores the challenges associated with handling such data and highlights top-down, bottom-up, and integrated approaches that are being developed to manage, analyze, and visualize this data. The work is conceptualized in two parts, each reflecting the two problem-solving approaches and their uses in published studies. The proposed work showcases the importance of understanding both approaches, the steps of reasoning about the problem within them, and their concretization and application in various domains