124 research outputs found
A graph-based framework for data retrieved from criminal-related documents
The digitalization of companies and services has boosted the processing and analysis of a growing volume
of data from heterogeneous sources, raising new challenges, notably in knowledge representation.
Criminal Police Bodies (Órgãos de Polícia Criminal, OPC) face the same challenge,
given the volume of unstructured data from police reports that criminal investigators analyze
manually, consuming time and resources.
There is therefore a need to extract and represent, automatically, the unstructured data found in crime-related
documents, reducing the manual analysis carried out by
criminal investigators. This is a challenge for computer science, opening the possibility
of proposing a computational alternative that extracts and represents the data by adapting
existing computational methods or proposing new ones.
Several computational methods are currently applied to the criminal domain, notably the identification
and classification of named entities, for example narcotics, or the extraction of relations between
entities relevant to criminal investigation. These methods are mostly applied to the English
language, and in Portugal research in this area receives little attention, which prevents their application
in criminal investigation.
This thesis proposes an integrated solution for representing the unstructured data found in
documents, using a set of computational methods: Document Preprocessing, which
groups an Extraction, Transformation and Loading task adapted to crime-related
documents, followed by a Natural Language Processing pipeline applied to the Portuguese language
for syntactic and semantic analysis of the textual data; the 5W1H Information Extraction Method,
which groups Named-Entity Recognition, semantic role detection and
criminal term extraction; and Graph Database Population and Enrichment, which
represents the resulting data in a Neo4j graph database. Overall, the integrated solution shows promising results, validated using
prototypes developed for the purpose. The feasibility of extracting the
unstructured data, interpreting it syntactically and semantically, and representing it in the graph
database was also demonstrated.
Abstract:
The digitalization of company processes has enhanced the treatment and analysis of a growing volume
of data from heterogeneous sources, with emerging challenges, namely those related to knowledge representation.
The Criminal Police faces similar challenges, considering the amount of unstructured data from
police reports that criminal investigators analyze manually, which consumes time and resources.
There is a need to automatically extract and represent the unstructured data existing in criminal-related
documents and reduce the manual analysis by criminal investigators. Computer science faces a challenge
to apply emergent computational models that can be an alternative to extract and represent the data using
new or existing methods.
A broad set of computational methods has been applied to the criminal domain, such as the identification
and classification of named entities (NEs), for example narcotics, or the extraction of relations between
entities that are relevant for criminal investigation. However, these methods have mainly been applied to the English
language. In Portugal, little research has applied computational methods to this domain,
making their application in criminal investigation unfeasible.
This thesis proposes an integrated solution for the representation of unstructured data retrieved from
documents, using a set of computational methods. First, the Preprocessing Criminal-Related Documents
module is supported by Extraction, Transformation, and Loading tasks, followed by a
Natural Language Processing pipeline applied to the Portuguese language, for syntactic and semantic
analysis of textual data. Next, the 5W1H Information Extraction Method combines the Named-Entity
Recognition, Semantic Role Labelling, and Criminal Terms Extraction tasks. Finally, the Graph Database
Population and Enrichment represents the retrieved data in a Neo4j graph database.
Globally, the framework presents promising results that were validated using prototypes developed for this
purpose. In addition, the feasibility of extracting unstructured data, its syntactic and semantic interpretation,
and the graph database representation has also been demonstrated.
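As a rough illustration of the last step described above, the sketch below turns one 5W1H record into graph nodes and edges held in plain Python structures. It is not the thesis's implementation: the `FiveW1H` class, the `to_graph` function, the node labels, the edge names, and the sample values are all hypothetical, and no Neo4j driver is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveW1H:
    """One extracted event. Field names are hypothetical, not from the thesis."""
    who: str
    what: str
    when: str
    where: str
    why: str
    how: str


ROLES = ("who", "what", "when", "where", "why", "how")


def to_graph(event_id, record):
    """Return (nodes, edges) for one record.

    nodes: set of (value, label) pairs
    edges: list of (event_id, RELATION, value) triples
    Empty roles are skipped so no blank nodes are created.
    """
    nodes = {(event_id, "Event")}
    edges = []
    for role in ROLES:
        value = getattr(record, role)
        if not value:
            continue
        nodes.add((value, role.capitalize()))
        edges.append((event_id, role.upper(), value))
    return nodes, edges


# Made-up sample record, for illustration only.
record = FiveW1H(
    who="suspect A",
    what="sold narcotics",
    when="2020-01-05",
    where="Lisbon",
    why="",
    how="street deal",
)
nodes, edges = to_graph("event-1", record)
```

Each triple in `edges` could then be written to a graph database as a relationship between an event node and a value node; how the thesis actually models this is not stated in the abstract.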
Towards generic relation extraction
A vast amount of usable electronic data is in the form of unstructured text. The relation
extraction task aims to identify useful information in text (e.g., PersonW works
for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational
database that can be more effectively used for querying and automated reasoning.
However, adapting conventional relation extraction systems to new domains
or tasks requires significant effort from annotators and developers. Furthermore, previous
adaptation approaches based on bootstrapping start from example instances of
the target relations, thus requiring that the correct relation type schema be known in
advance. Generic relation extraction (GRE) addresses the adaptation problem by applying
generic techniques that achieve comparable accuracy when transferred, without
modification of model parameters, across domains and tasks.
Previous work on GRE has relied extensively on various lexical and shallow syntactic
indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency
information. I also introduce a dimensionality reduction step into the GRE
relation characterisation sub-task, which serves to capture latent semantic information
and leads to significant improvements over an unreduced model. Comparison of dimensionality
reduction techniques suggests that latent Dirichlet allocation (LDA) – a
probabilistic generative approach – successfully incorporates a larger and more interdependent
feature set than a model based on singular value decomposition (SVD) and
performs as well as or better than SVD in all experimental settings. Finally, I
introduce multi-document summarisation as an extrinsic test bed for GRE and present
results which demonstrate that the relative performance of GRE models is consistent
across tasks and that the GRE-based representation leads to significant improvements
over a standard baseline from the literature.
Taken together, the experimental results 1) show that GRE can be improved using
dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE
for the content selection step of extractive summarisation and 3) validate the GRE
claim of modification-free adaptation for the first time with respect to both domain and
task. This thesis also introduces data sets derived from publicly available corpora for
the purpose of rigorous intrinsic evaluation in the news and biomedical domains.
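To make the dimensionality reduction step more concrete, here is a minimal sketch of the SVD variant only, assuming NumPy is available. The toy instance-by-feature matrix, the function names, and the choice of `k` are assumptions for illustration; the thesis's actual features and models are not given in the abstract.

```python
import numpy as np

# Toy matrix: rows are relation instances, columns are feature counts.
# Rows 0-1 share one group of features, rows 2-3 share another.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])


def reduce_svd(matrix, k):
    """Project rows onto the top-k singular directions (truncated SVD)."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


Z = reduce_svd(X, k=2)

raw_sim = cosine(X[0], X[1])      # similarity in the original feature space
latent_sim = cosine(Z[0], Z[1])   # similarity after reduction
cross_sim = cosine(Z[0], Z[2])    # instances from different feature groups
```

On this toy matrix the two instances in the same group have a raw cosine similarity of 0.8, which rises to approximately 1.0 in the reduced space, while instances from different groups stay at approximately 0. This is the kind of latent grouping a reduced representation is meant to expose.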
Tackling Different Business Process Perspectives
Business Process Management (BPM) has emerged as a discipline to design, control, analyze, and optimize business operations. Conceptual models lie at the core of BPM. In particular, business process models have been taken up by organizations as a means to describe the main activities that are performed to achieve a specific business goal. Process models generally cover different perspectives that underlie separate yet interrelated representations for analyzing and presenting process information. Being primarily driven by process improvement objectives, traditional business process modeling languages focus on capturing the control flow perspective of business processes, that is, the temporal and logical coordination of activities. Such approaches are usually characterized as “activity-centric”. Nowadays, activity-centric process modeling languages, such as the Business Process Model and Notation (BPMN) standard, are still the most used in practice and benefit from industrial tool support. Nevertheless, evidence shows that such process modeling languages still lack support for modeling non-control-flow perspectives, such as the temporal, informational, and decision perspectives, among others. This thesis centres on the BPMN standard and addresses the modeling of the temporal, informational, and decision perspectives of process models, with particular attention to processes enacted in healthcare domains. Despite being partially interrelated, the main contributions of this thesis may be partitioned according to the modeling perspective they concern. The temporal perspective deals with the specification, management, and formal verification of temporal constraints. In this thesis, we address the specification and run-time management of temporal constraints in BPMN, by taking advantage of process modularity and of event handling mechanisms included in the standard. 
Then, we propose three different mappings from BPMN to formal models, to validate the behavior of the proposed process models and to check whether they are dynamically controllable. The informational perspective represents the information entities consumed, produced or manipulated by a process. This thesis focuses on the conceptual connection between processes and data, borrowing concepts from the database domain to enable the representation of which part of a database schema is accessed by a certain process activity. This novel conceptual view is then employed to detect potential data inconsistencies arising when the same data are accessed erroneously by different process activities. The decision perspective encompasses the modeling of the decision-making related to a process, considering where decisions are made in the process and how decision outcomes affect process execution. In this thesis, we investigate the use of the Decision Model and Notation (DMN) standard in conjunction with BPMN starting from a pattern-based approach to ease the derivation of DMN decision models from the data represented in BPMN processes. Besides, we propose a methodology that focuses on the integrated use of BPMN and DMN for modeling decision-intensive care pathways in a real-world application domain
The Birth of Musicology from the Spirit of Evolution: Ernst Haeckel's Entwicklungslehre as Central Component of Guido Adler's Methodology for Musicology
Between about 1860 and the First World War, musicology became an academic discipline, practiced by scholars and supported by the university infrastructure. The decisive methodological change that allowed for this transition from mostly private scholarship to "academicization" was the declared adoption of the scientific method, especially in German-language music research. Among other "music scientists" like Hermann von Helmholtz and Friedrich Chrysander, the Viennese musicologist Guido Adler (1855-1941) is particularly important because, in 1885, he codified the research methods of this new academic discipline in the article "Umfang, Methode und Ziel der Musikwissenschaft" (The Scope, Method, and Aim of Music Science). Adler's methodological proposals have shaped musicological research habits ever since, perhaps most famously by separating what he calls "historical" and "systematic" musicologies. While his portrayal of musicology as a science, and therefore as worthy of inclusion in the academy, was successful, Adler's scientific inspiration for this methodological move has been obscured, partly because the later incarnations of his methodology, such as style criticism, drew heavily on contemporary art history rather than on any model from the natural sciences. In this dissertation, I show that Adler's initial methodological stimulus derived from biology, and in that discipline from a restructuring of research methods in the wake of Charles Darwin's proposal of evolution by natural selection. Adler was aware of Darwin's achievements, but his direct sources of biological information were popular and scholarly publications by the German biologist Ernst Haeckel (1834-1919). Passages copied from one of Haeckel's early articles are preserved in Adler's hand, he was friends with several of Haeckel's students, and, most importantly, his early methodology strongly resembles Haeckel's methodological suggestions for biology. 
Adler's early musicology was conceived in the spirit of evolution, which promised natural scientists an empirically valid way of reconstructing history by comparative, systematic study. This dissertation demonstrates on what biographical grounds and through which methodical conceits Adler transformed Haeckel's biology into a working model for musicological research.