Linking social media, medical literature, and clinical notes using deep learning.
Researchers analyze data, information, and knowledge through many sources, formats, and methods. The dominant data formats are text and images. In the healthcare industry, professionals generate a large quantity of unstructured data. The complexity of this data and the lack of computational power cause delays in analysis. However, with emerging deep learning algorithms and access to computational power such as graphics processing units (GPUs) and tensor processing units (TPUs), processing text and images is becoming more accessible. Deep learning algorithms achieve remarkable results in natural language processing (NLP) and computer vision. In this study, we focus on NLP in the healthcare industry and collect data not only from electronic medical records (EMRs) but also from medical literature and social media. We propose a framework for linking social media, medical literature, and EMR clinical notes using deep learning algorithms. Connecting data sources requires defining a link between them, and our key is finding concepts in the medical text. The National Library of Medicine (NLM) introduced the Unified Medical Language System (UMLS), and we use this system as the foundation of our own system. We recognize social media's dynamic nature and apply supervised and semi-supervised methodologies to generate concepts. Named entity recognition (NER) allows efficient extraction of information, or entities, from medical literature, and we extend the model to process the EMRs' clinical notes via transfer learning. The results include an integrated, end-to-end, web-based system solution that unifies social media, literature, and clinical notes, and improves access to medical knowledge for the public and experts.
High-Performance Modelling and Simulation for Big Data Applications
This open access book was prepared as a Final Publication of the COST Action IC1406 "High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)" project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. When their level of abstraction rises to give a better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction of High Performance Computing with Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.
An online adaptive learning algorithm for optimal trade execution in high-frequency markets
A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy
in the Faculty of Science, School of Computer Science and Applied Mathematics
University of the Witwatersrand. October 2016.
Automated algorithmic trade execution is a central problem in modern financial markets,
however finding and navigating optimal trajectories in this system is a non-trivial
task. Many authors have developed exact analytical solutions by making simplifying
assumptions regarding governing dynamics, however for practical feasibility and robustness,
a more dynamic approach is needed to capture the spatial and temporal system
complexity and adapt as intraday regimes change.
This thesis aims to consolidate four key ideas: 1) the financial market as a complex
adaptive system, where purposeful agents with varying system visibility collectively and
simultaneously create and perceive their environment as they interact with it; 2) spin
glass models as a tractable formalism to model phenomena in this complex system; 3) the
multivariate Hawkes process as a candidate governing process for limit order book events;
and 4) reinforcement learning as a framework for online, adaptive learning. Combined
with the data and computational challenges of developing an efficient, machine-scale
trading algorithm, we present a feasible scheme which systematically encodes these ideas.
We first determine the efficacy of the proposed learning framework, under the conjecture
of approximate Markovian dynamics in the equity market. We find that a simple lookup
table Q-learning algorithm, with discrete state attributes and discrete actions, is able
to improve post-trade implementation shortfall by adapting a typical static arrival-price
volume trajectory with respect to prevailing market microstructure features streaming
from the limit order book.
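The lookup-table Q-learning idea described above can be illustrated with a minimal sketch. This is not the thesis implementation: the class name `TabularQLearner`, the hyperparameter values, and the way states and actions are encoded are all assumptions made for illustration. It shows only the standard tabular update with discrete states and discrete actions.

```python
import random
from collections import defaultdict


class TabularQLearner:
    """Generic lookup-table Q-learning (hypothetical helper, not the thesis code)."""

    def __init__(self, actions, learning_rate=0.1, discount=0.95, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.learning_rate = learning_rate
        self.discount = discount
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # The table maps (state, action) pairs to values; unseen pairs start at 0.
        self.q = defaultdict(float)

    def choose(self, state):
        """Epsilon-greedy selection over the discrete action set."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, done=False):
        """One-step Q-learning update towards reward + discount * max_a Q(next_state, a)."""
        best_next = 0.0 if done else max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.discount * best_next
        self.q[(state, action)] += self.learning_rate * (target - self.q[(state, action)])
```

In an execution setting, a state could be a tuple of discretised features (for example a time bucket and a spread bucket) and an action a multiplier applied to the baseline trade volume; both encodings are hypothetical here.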
To enumerate a scale-specific state space whilst avoiding the curse of dimensionality, we
propose a novel approach to detect the intraday temporal financial market state at each
decision point in the Q-learning algorithm, inspired by the complex adaptive system
paradigm. A physical analogy to the ferromagnetic Potts model at thermal equilibrium
is used to develop a high-speed maximum likelihood clustering algorithm, appropriate
for measuring critical or near-critical temporal states in the financial system. State
features are studied to extract time-scale-specific state signature vectors, which serve as
low-dimensional state descriptors and enable online state detection.
To assess the impact of agent interactions on the system, a multivariate Hawkes process is
used to measure the resiliency of the limit order book with respect to liquidity-demand
events of varying size. By studying the branching ratios associated with key quote
replenishment intensities following trades, we ensure that the limit order book is expected
to be resilient with respect to the maximum permissible trade executed by the agent.
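As a rough illustration of what a branching ratio measures, the sketch below assumes exponential excitation kernels of the form alpha * exp(-beta * t), which the abstract does not specify. Under that assumption, the expected number of type-i events directly triggered by one type-j event is alpha[i][j] / beta[i][j], and the process is stable when the spectral radius of this matrix is below 1. The function names and the parameter values are hypothetical.

```python
def branching_matrix(alpha, beta):
    """Entry [i][j] is alpha[i][j] / beta[i][j], the integral of the assumed
    exponential kernel alpha * exp(-beta * t) over t >= 0."""
    return [[a / b for a, b in zip(alpha_row, beta_row)]
            for alpha_row, beta_row in zip(alpha, beta)]


def spectral_radius(matrix, iterations=200):
    """Power iteration; adequate for a square matrix with strictly positive entries."""
    n = len(matrix)
    v = [1.0] * n
    radius = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        radius = max(abs(x) for x in w)
        if radius == 0.0:
            return 0.0
        v = [x / radius for x in w]  # rescale so the largest entry is 1
    return radius


# Hypothetical two-event-type example (e.g. trades and quote replenishments).
alpha = [[0.5, 0.2], [0.1, 0.4]]
beta = [[1.0, 1.0], [1.0, 1.0]]
ratios = branching_matrix(alpha, beta)
is_stable = spectral_radius(ratios) < 1.0
```

A spectral radius close to 1 would indicate that events keep triggering further events for a long time, which is the kind of quantity a resiliency check could inspect.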
Finally we present a feasible scheme for unsupervised state discovery, state detection
and online learning for high-frequency quantitative trading agents faced with a multifeatured,
asynchronous market data feed. We provide a technique for enumerating the
state space at the scale at which the agent interacts with the system, incorporating the
effects of a live trading agent on limit order book dynamics into the market data feed,
and hence the perceived state evolution.
Online semi-supervised learning in non-stationary environments
Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and
balanced data, immediately or after some delay, to extract worthwhile knowledge from the
continuous and rapid data streams. However, in many real-world applications such as
Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer
Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of
Things sensors and real-time data on the Internet. Manual labelling of these data streams
is not practical due to time consumption and the need for domain expertise. Another
challenge is learning under Non-Stationary Environments (NSEs), which occurs due to
changes in the data distributions in a set of input variables and/or class labels. The problem
of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms
have no access to the true class labels directly when the concept evolves. Several approaches
exist that deal with NSE and EVL in isolation. However, few algorithms address both issues
simultaneously. This research directly responds to ILNSE's challenge by proposing two
novel algorithms: "Predictor for Streaming Data with Scarce Labels" (PSDSL) and
Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label
scarcity issues in online machine learning.
The key capabilities of PSDSL include learning from a small amount of labelled data in an
incremental or online manner and being available to predict at any time. To achieve this,
PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it
continuously learns from incoming data and updates the model as new labelled or
unlabelled data becomes available over time. Furthermore, it can predict under NSE
conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier,
which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch
and adapt to the conditions. The PSDSL adapts to learning states between self-learning,
micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of
the data stream. HDWM makes use of "seed" learners of different types in an ensemble to
maintain its diversity. The ensembles are simply the combination of predictive models
grouped to improve the predictive performance of a single classifier.
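The ensemble idea can be sketched with a plain weighted-majority vote. This is not HDWM or PSDSL: it omits the seeding mechanism, the dynamic addition and removal of learners, and any semi-supervised step. The class name `WeightedMajority`, the penalty factor `beta`, and the toy experts are assumptions for illustration.

```python
from collections import defaultdict


class WeightedMajority:
    """Weighted vote over fixed experts with a multiplicative penalty for mistakes."""

    def __init__(self, experts, beta=0.5):
        self.experts = list(experts)  # each expert is a callable: x -> label
        self.beta = beta              # factor applied to the weight of a wrong expert
        self.weights = [1.0] * len(self.experts)

    def predict(self, x):
        """Return the label with the largest total weight behind it."""
        votes = defaultdict(float)
        for expert, weight in zip(self.experts, self.weights):
            votes[expert(x)] += weight
        return max(votes, key=votes.get)

    def update(self, x, label):
        """Penalise experts that mispredicted, then renormalise the weights."""
        for i, expert in enumerate(self.experts):
            if expert(x) != label:
                self.weights[i] *= self.beta
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
```

Because each expert is just a callable, learners of different types can be mixed in one ensemble, which is the sense in which such an ensemble is heterogeneous.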
PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification
on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than
existing approaches on most real-time data streams including randomised data instances.
PSDSL performed significantly better than "Static", i.e. a classifier that is not updated after it is
trained with the first examples in the data streams. When applied to MOA-generated data
streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC,
while SCARGC performed the same as the Static. PSDSL achieved better average prediction
accuracies in a short time than SCARGC.
The HDWM algorithm is evaluated on artificial and real-world data streams against existing
well-known approaches such as the heterogeneous WMA and the homogeneous
DWM algorithm. The results showed that HDWM performed significantly better than WMA
and DWM. Also, when recurring concept drifts were present, the predictive performance of
HDWM showed an improvement over DWM. In both drift and real-world streams,
significance tests and post hoc comparisons found significant differences between
algorithms: HDWM performed significantly better than DWM and WMA when applied to
MOA data streams and four real-world datasets (Electric, Spam, Sensor and Forest cover). The
seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithms
benefit from the use of both forgetting and retaining the models. The algorithm also
provides the independence of selecting the optimal base classifier in its ensemble depending
on the problem.
A new approach, Envelope-Clustering, is introduced to resolve the cluster overlap conflicts
during the cluster labelling process. In this process, PSDSL transforms the centroids'
information of micro-clusters into micro-instances and generates new clusters called
Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and
successfully guide the cluster labelling process after the concept drifts in the absence of true
class labels. PSDSL has been evaluated on the real-world problem of "keystroke dynamics", and
the results show that PSDSL achieved higher prediction accuracy (85.3%) than SCARGC
(81.6%), while the performance of Static (49.0%) degrades significantly due to changes in
the users' typing patterns. Furthermore, the predictive accuracy of SCARGC was found to
fluctuate widely (from 41.1% to 81.6%) depending on the value of the parameter "k"
(number of clusters), while PSDSL automatically determines the best value for this
parameter.
TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels
The widespread usage of social networks during mass convergence events, such
as health emergencies and disease outbreaks, provides instant access to
citizen-generated data that carry rich information about public opinions,
sentiments, urgent needs, and situational reports. Such information can help
authorities understand the emergent situation and react accordingly. Moreover,
social media plays a vital role in tackling misinformation and disinformation.
This work presents TBCOV, a large-scale Twitter dataset comprising more than
two billion multilingual tweets related to the COVID-19 pandemic collected
worldwide over a continuous period of more than one year. More importantly,
several state-of-the-art deep learning models are used to enrich the data with
important attributes, including sentiment labels, named-entities (e.g.,
mentions of persons, organizations, locations), user types, and gender
information. Last but not least, a geotagging method is proposed to assign
country, state, county, and city information to tweets, enabling a myriad of
data analysis tasks to understand real-world issues. Our sentiment and trend
analyses reveal interesting insights and confirm TBCOV's broad coverage of
important topics.
Comment: 20 pages, 13 figures, 8 tables
Building Blocks for IoT Analytics Internet-of-Things Analytics
Internet-of-Things (IoT) analytics is an integral element of most IoT applications, as it provides the means to extract knowledge, drive actuation services and optimize decision making. IoT analytics will be a major contributor to IoT business value in the coming years, as it will enable organizations to process and fully leverage large amounts of IoT data, which are nowadays largely underutilized. The Building Blocks of IoT Analytics is devoted to the presentation of the main technology building blocks that comprise advanced IoT analytics systems. It introduces IoT analytics as a special case of BigData analytics and accordingly presents leading edge technologies that can be deployed in order to successfully confront the main challenges of IoT analytics applications. Special emphasis is placed on the presentation of technologies for IoT streaming and semantic interoperability across diverse IoT streams. Furthermore, the role of cloud computing and BigData technologies in IoT analytics is presented, along with practical tools for implementing, deploying and operating non-trivial IoT applications. Along with the main building blocks of IoT analytics systems and applications, the book presents a series of practical applications, which illustrate the use of these technologies in the scope of pragmatic applications. Technical topics discussed in the book include: Cloud Computing and BigData for IoT analytics; Searching the Internet of Things; Development Tools for IoT Analytics Applications; IoT Analytics-as-a-Service; Semantic Modelling and Reasoning for IoT Analytics; IoT analytics for Smart Buildings; IoT analytics for Smart Cities; Operationalization of IoT analytics; Ethical aspects of IoT analytics. This book contains both research oriented and applied articles on IoT analytics, including several articles reflecting work undertaken in the scope of recent European Commission funded projects in the scope of the FP7 and H2020 programmes.
These articles present results of these projects on IoT analytics platforms and applications. Even though several articles have been contributed by different authors, they are structured in a well-thought-out order that allows the reader either to follow the progression of the book or to focus on specific topics depending on his/her background and interest in IoT and IoT analytics technologies. The compilation of these articles in this edited volume has been largely motivated by the close collaboration of the co-authors in the scope of working groups and IoT events organized by the Internet-of-Things Research Cluster (IERC), which is currently a part of the EU's Alliance for Internet of Things Innovation (AIOTI).
Improving Computer Network Operations Through Automated Interpretation of State
Networked systems today are hyper-scaled entities that provide core functionality for distributed services and applications spanning personal, business, and government use. It is critical to maintain correct operation of these networks to avoid adverse business outcomes. The advent of programmable networks has provided much needed fine-grained network control, enabling providers and operators alike to build some innovative networking architectures and solutions. At the same time, they have given rise to new challenges in network management. These architectures, coupled with a multitude of devices, protocols, virtual overlays on top of physical data-plane etc. make network management a highly challenging task. Existing network management methodologies have not evolved at the same pace as the technologies and architectures. Current network management practices do not provide adequate solutions for highly dynamic, programmable environments. We have a long way to go in developing management methodologies that can meaningfully contribute to networks becoming self-healing entities. The goal of my research is to contribute to the design and development of networks towards transforming them into self-healing entities.
Network management includes a multitude of tasks, not limited to diagnosis and troubleshooting, but also performance engineering and tuning, security analysis etc. This research explores novel methods of utilizing network state to enhance networking capabilities. It is constructed around hypotheses based on careful analysis of practical deficiencies in the field. I try to generate real-world impact with my research by tackling problems that are prevalent in deployed networks, and that bear practical relevance to the current state of networking. The overarching goal of this body of work is to examine various approaches that could help enhance network management paradigms, providing administrators with a better understanding of the underlying state of the network, thus leading to more informed decision-making. The research looks into two distinct areas of network management, troubleshooting and routing, presenting novel approaches to accomplishing certain goals in each of these areas, demonstrating that they can indeed enhance the network management experience