Using Neural Networks for Relation Extraction from Biomedical Literature
Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is the biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, notably approaches based on neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead to even higher evaluation scores in relation extraction tasks. Here, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been shown to improve previous state-of-the-art results.

Comment: Artificial Neural Networks book (Springer) - Chapter 1
STEM: stacked threshold-based entity matching for knowledge base generation
One of the major issues encountered in the generation of knowledge bases is the integration of data coming from a collection of heterogeneous data sources. A key task when integrating data instances is entity matching. Entity matching is based on the definition of a similarity measure between entities and on the classification of an entity pair as a match if the similarity exceeds a certain threshold. This parameter introduces a trade-off between the precision and the recall of the algorithm, as higher values of the threshold lead to higher precision and lower recall, while lower values lead to higher recall and lower precision. In this paper, we propose a stacking approach for threshold-based classifiers. It runs several instances of classifiers corresponding to different thresholds and uses their predictions as a feature vector for a supervised learner. We show that this approach is able to break the trade-off between the precision and recall of the algorithm, increasing both at the same time and enhancing the overall performance of the algorithm. We also show that this hybrid approach performs better and is less dependent on the amount of available training data than a supervised learning approach that directly uses the properties' similarity values. To test the generality of the claim, we have run experiments using two different threshold-based classifiers on two different data sets. Finally, we show a concrete use case describing the implementation of the proposed approach in the generation of the 3cixty Nice knowledge base.
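To illustrate the stacking idea, the sketch below (a simplified stand-in, not the exact STEM implementation) turns a single similarity score into one binary vote per threshold and feeds the resulting vote vector to a logistic-regression meta-learner; the similarity scores and labels are synthetic.

```python
# Illustrative sketch of stacking threshold-based matchers: each threshold t turns
# the raw similarity into a binary match/non-match vote, and the votes become the
# feature vector of a supervised meta-classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Assumed inputs: one similarity score per candidate entity pair, plus gold labels
# (here both are synthetic stand-ins).
similarities = rng.random(1000)
labels = (similarities + rng.normal(0, 0.2, 1000) > 0.6).astype(int)

thresholds = np.linspace(0.1, 0.9, 9)

# Each column is the decision of one threshold-based classifier.
votes = np.column_stack([(similarities >= t).astype(int) for t in thresholds])

# The meta-learner combines the individual decisions instead of picking one threshold.
meta = LogisticRegression()
meta.fit(votes[:800], labels[:800])
print("held-out accuracy:", meta.score(votes[800:], labels[800:]))
```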
Knowledge Discovery and Management within Service Centers
These days, most enterprise service centers deploy Knowledge Discovery and Management (KDM) systems to address the challenge of delivering a resourceful service request resolution in a timely manner while efficiently utilizing the huge amount of available data. These KDM systems facilitate prompt responses to critical service requests and, where possible, try to prevent service requests from being triggered in the first place. Nevertheless, in most cases, the information required for a request resolution is dispersed and buried under a mountain of irrelevant information spread over the Internet in unstructured and heterogeneous formats. These heterogeneous data sources and formats complicate access to reusable knowledge and increase the response time required to reach a resolution. Moreover, state-of-the-art methods neither support effective integration of domain knowledge with KDM systems nor promote the assimilation of reusable knowledge or Intellectual Capital (IC). With the goal of providing an improved service request resolution within the shortest possible time, this research proposes an IC Management System. The proposed tool efficiently utilizes domain knowledge in the form of semantic web technology to extract the most valuable information from raw unstructured data and uses that knowledge to formulate a service resolution model as a combination of efficient data search, classification, clustering, and recommendation methods. Our proposed solution also handles the technology categorization of a service request, which is crucial in the request resolution process. The system has been extensively evaluated with several experiments and has been used in a real enterprise customer service center.
Site-Specific Rules Extraction in Precision Agriculture
Sustainably increasing food production to meet the needs of a growing world population is a real challenge when we take into account the constant impact of pests and diseases on crops. Because of the significant economic losses they cause, the use of chemical treatments is too high, leading to environmental pollution and resistance to different treatments. In this context, the agricultural community envisions the application of more site-specific treatments, as well as automatic validation of legal compliance. However, the specification of these treatments is found in regulations expressed in natural language. For this reason, translating regulations into a machine-processable representation is becoming increasingly important in precision agriculture. Currently, the requirements for translating regulations into formal rules are far from being met, and with the rapid development of agricultural science, manual verification of legal compliance is becoming intractable. In this thesis, the goal is to build and evaluate a rule extraction system that effectively distils the relevant information from regulations and transforms natural-language rules into a structured, machine-processable format. To this end, we split rule extraction into two steps. The first is to build a domain ontology: a model to describe the disorders that diseases produce in crops and their treatments. The second step is to extract information to populate the ontology. Since we use machine learning techniques, we implemented the MATTER methodology to carry out the regulation annotation process. Once the corpus was created, we built a rule category classifier that distinguishes between obligations and prohibitions, and a rule constraint extraction system that recognises the relevant information needed to retain isomorphism with the original regulation. For these components we employed, among other deep learning techniques, convolutional neural networks and Long Short-Term Memory networks. In addition, we used more traditional algorithms such as support-vector machines and random forests as baselines. As a result, we present the PCT-O ontology, which has been aligned with other ontologies such as NCBI, PubChem, ChEBI and Wikipedia. The model can be used for the identification of disorders, the analysis of conflicts between treatments, and the comparison of legislation across countries. Regarding the extraction systems, we empirically evaluated their behaviour with different metrics, but the F1 metric was used to select the best systems. For the rule category classifier, the best system obtains a macro F1 of 92.77% and a binary F1 of 85.71%. This system uses a bidirectional long short-term memory network with word embeddings as input. For the rule constraint extractor, the best system obtains a micro F1 of 88.3%. This extractor uses as input a combination of character embeddings and word embeddings together with a bidirectional long short-term memory neural network.
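As an illustration of the kind of rule-category classifier described above, the following is a minimal Keras sketch of a bidirectional LSTM over word embeddings deciding between obligations and prohibitions; the vocabulary size, sequence length, and layer sizes are assumptions, not the thesis's actual settings.

```python
# Hedged sketch of a BiLSTM rule-category classifier over word embeddings.
# All hyperparameters below are illustrative assumptions.
from tensorflow.keras import layers, models

MAX_LEN = 60       # tokens per regulation sentence (assumption)
VOCAB_SIZE = 8000  # word vocabulary size (assumption)

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 100, mask_zero=True),  # word embeddings
    layers.Bidirectional(layers.LSTM(64)),               # BiLSTM encoder
    layers.Dense(1, activation="sigmoid"),                # obligation (1) vs. prohibition (0)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```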
Mining complex trees for hidden fruit: a graph-based computational solution to detect latent criminal networks: a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Technology at Massey University, Albany, New Zealand.
The detection of crime is a complex and difficult endeavour. Public and private organisations focusing on law enforcement, intelligence, and compliance commonly apply the rational isolated actor approach premised on observability and materiality. This is manifested largely as conducting entity-level risk management, sourcing "leads" from reactive covert human intelligence sources and/or proactive sources by applying simple rules-based models. Focusing on discrete observable and material actors simply ignores that criminal activity exists within a complex system deriving its fundamental structural fabric from the complex interactions between actors, with those most unobservable likely to be both criminally proficient and influential. The graph-based computational solution developed to detect latent criminal networks is a response to the inadequacy of the rational isolated actor approach that ignores the connectedness and complexity of criminality.
The core computational solution, written in the R language, consists of novel entity resolution, link discovery, and knowledge discovery technology. Entity resolution enables the fusion of multiple datasets with high accuracy (mean F-measure of 0.986 versus competitors' 0.872), generating a graph-based expressive view of the problem. Link discovery comprises link prediction and link inference, enabling the high-performance detection (accuracy of ~0.8 versus relevant published models' ~0.45) of unobserved relationships such as identity fraud. Knowledge discovery takes the fused graph and applies the "GraphExtract" algorithm to create a set of subgraphs representing latent functional criminal groups, and a mesoscopic graph representing how this set of criminal groups is interconnected. Latent knowledge is generated from a range of metrics, including the "Super-broker" metric and attitude prediction.
The computational solution has been evaluated on a range of datasets that mimic an applied setting, demonstrating a scalable (tested on ~18 million node graphs) and performant (~33 hours runtime on a non-distributed platform) solution that successfully detects relevant latent functional criminal groups in around 90% of cases sampled and enables contextual understanding of the broader criminal system through the mesoscopic graph and associated metadata. The augmented data assets generated provide a multi-perspective systems view of criminal activity that enables advanced, informed decision making across the microscopic, mesoscopic, and macroscopic spectrum.
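The sketch below illustrates the general shape of this knowledge discovery step using off-the-shelf components: standard modularity-based community detection stands in for the thesis's own "GraphExtract" algorithm, and a toy graph stands in for the fused entity graph; only the construction of the group-level mesoscopic graph follows the description above.

```python
# Illustrative stand-in for latent group detection plus mesoscopic-graph construction.
# Community detection here is a generic substitute, not the "GraphExtract" algorithm.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Assumed input: a fused graph whose nodes are resolved entities and whose edges
# are observed or inferred relationships (a toy graph is used here).
G = nx.karate_club_graph()

groups = list(greedy_modularity_communities(G))
membership = {node: i for i, group in enumerate(groups) for node in group}

# Mesoscopic graph: one node per detected group, edges weighted by inter-group links.
meso = nx.Graph()
meso.add_nodes_from(range(len(groups)))
for u, v in G.edges():
    gu, gv = membership[u], membership[v]
    if gu != gv:
        w = meso.get_edge_data(gu, gv, {"weight": 0})["weight"]
        meso.add_edge(gu, gv, weight=w + 1)

print(f"{len(groups)} groups; mesoscopic edges: {list(meso.edges(data=True))}")
```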
Record linkage of banks and municipalities through multiple criteria and neural networks
Record linkage aims to identify records from multiple data sources that refer to the same real-world entity. It is a well-known data quality process, studied since the second half of the last century, with an established pipeline and a rich literature of case studies mainly covering census, administrative, or health domains. In this paper, a method to recognize matching records from real municipalities and banks through multiple similarity criteria and a Neural Network classifier is proposed: starting from a labeled subset of the available data, first several similarity measures are combined and weighted to build a feature vector, then a Multi-Layer Perceptron (MLP) network is trained and tested to find matching pairs. For validation, seven real datasets have been used (three from banks and four from municipalities), purposely chosen in the same geographical area to increase the probability of matches. The training only involved two municipalities, while testing involved all sources (municipalities vs. municipalities, banks vs. banks, and municipalities vs. banks). The proposed method scored remarkable results in terms of both precision and recall, clearly outperforming threshold-based competitors.
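A minimal sketch of the matching step (not the paper's exact feature set or network): a handful of cheap similarity criteria between two record strings form the feature vector, and a small scikit-learn MLP classifies the pair as match or non-match; the example pairs are hypothetical.

```python
# Hypothetical record-linkage sketch: similarity features per record pair + MLP classifier.
import difflib
from sklearn.neural_network import MLPClassifier

def features(a, b):
    """Toy feature vector: a few cheap similarity criteria between two name strings."""
    ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    exact = float(a.lower() == b.lower())
    len_diff = abs(len(a) - len(b)) / max(len(a), len(b), 1)
    return [ratio, exact, 1.0 - len_diff]

# Labelled training pairs (hypothetical examples, 1 = match, 0 = non-match).
pairs = [("Banca di Roma", "Banca di Roma SpA", 1),
         ("Comune di Milano", "Milano, Comune di", 1),
         ("Rossi Mario", "Mario Rossi", 1),
         ("Banca di Roma", "Comune di Napoli", 0),
         ("Bianchi Luca", "Verdi Anna", 0),
         ("Comune di Torino", "Banca Intesa", 0)]

X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict([features("Banca di Roma S.p.A.", "Banca di Roma")]))
```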
Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline
From medical charts to national censuses, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation bringing healthcare into the digital age. Ranging from electronic health records to digitized imaging and laboratory reports to public health datasets, healthcare now generates an incredible amount of digital information. Such a wealth of data presents an exciting opportunity for
integrated machine learning solutions to address problems across multiple
facets of healthcare practice and administration. Unfortunately, the ability to
derive accurate and informative insights requires more than the ability to
execute machine learning models. Rather, a deeper understanding of the data on
which the models are run is imperative for their success. While a significant
effort has been undertaken to develop models able to process the volume of data
obtained during the analysis of millions of digitized patient records, it is
important to remember that volume represents only one aspect of the data. In
fact, drawing on data from an increasingly diverse set of sources, healthcare
data presents an incredibly complex set of attributes that must be accounted
for throughout the machine learning pipeline. This chapter focuses on
highlighting such challenges, and is broken down into three distinct
components, each representing a phase of the pipeline. We begin with attributes
of the data accounted for during preprocessing, then move to considerations
during model building, and end with challenges to the interpretation of model
output. For each component, we present a discussion around data as it relates
to the healthcare domain and offer insight into the challenges each may impose
on the efficiency of machine learning techniques.

Comment: Healthcare Informatics, Machine Learning, Knowledge Discovery: 20 pages, 1 figure