124 research outputs found

    A graph-based framework for data retrieved from criminal-related documents

    The digitalization of companies and services has increased the processing and analysis of a growing volume of data from heterogeneous sources, raising new challenges, notably in knowledge representation. Criminal police bodies (Órgãos de Polícia Criminal, OPC) face the same challenge, given the volume of unstructured data in police reports that criminal investigators analyze manually, at a considerable cost in time and resources. There is therefore a need to automatically extract and represent the unstructured data in criminal-related documents, reducing the manual analysis carried out by investigators. This is a challenge for computer science, which can offer a computational alternative for extracting and representing the data by adapting existing methods or proposing new ones.
A broad set of computational methods has been applied to the criminal domain, such as the identification and classification of named entities (NEs), for example narcotics, and the extraction of relations between entities relevant to criminal investigation. However, these methods have mainly been applied to English. In Portugal, research in this area has received little attention, which makes applying such methods in criminal investigation unfeasible. This thesis proposes an integrated solution for representing the unstructured data found in documents, built from a set of computational methods. First, the Preprocessing Criminal-Related Documents module combines an Extraction, Transformation, and Loading task adapted to criminal-related documents with a Natural Language Processing pipeline for Portuguese, which performs syntactic and semantic analysis of the text. Next, the 5W1H Information Extraction Method combines Named-Entity Recognition, Semantic Role Labelling, and Criminal Terms Extraction. Finally, Graph Database Population and Enrichment represents the extracted data in a Neo4j graph database. Overall, the framework shows promising results, which were validated using prototypes developed for this purpose. The feasibility of extracting unstructured data, interpreting it syntactically and semantically, and representing it in the graph database has also been demonstrated.
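As a rough illustration of the last step described in this abstract, the sketch below turns one 5W1H record into graph nodes and edges. It is not the thesis's implementation: the `Event5W1H` class, the `ROLE_LABELS` mapping, the node labels, and the `to_graph` function are all hypothetical names chosen for this example, and no Neo4j driver is used.

```python
from dataclasses import dataclass

# Hypothetical mapping from each 5W1H role to a node label (an assumption,
# not taken from the thesis).
ROLE_LABELS = {
    "who": "Person",
    "what": "Action",
    "when": "Time",
    "where": "Location",
    "why": "Reason",
    "how": "Method",
}


@dataclass(frozen=True)
class Event5W1H:
    """One extracted event; empty strings mark roles that were not found."""
    who: str = ""
    what: str = ""
    when: str = ""
    where: str = ""
    why: str = ""
    how: str = ""


def to_graph(event_id, event):
    """Return (nodes, edges) for one event.

    nodes is a set of (name, label) pairs; edges is a list of
    (event_id, RELATION, name) triples, one per non-empty role.
    """
    nodes = {(event_id, "Event")}
    edges = []
    for role, label in ROLE_LABELS.items():
        value = getattr(event, role)
        if not value:
            continue  # skip roles the extractor could not fill
        nodes.add((value, label))
        edges.append((event_id, role.upper(), value))
    return nodes, edges


example = Event5W1H(who="suspect A", what="sold narcotics", where="Lisbon")
example_nodes, example_edges = to_graph("e1", example)
```

The resulting node and edge lists could then be loaded into a graph database in whatever way the target system supports; that loading step is deliberately left out here.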

    Towards generic relation extraction

    A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format, such as a relational database, that can be used more effectively for querying and automated reasoning. However, adapting conventional relation extraction systems to new domains or tasks requires significant effort from annotators and developers. Furthermore, previous adaptation approaches based on bootstrapping start from example instances of the target relations, thus requiring that the correct relation type schema be known in advance. Generic relation extraction (GRE) addresses the adaptation problem by applying generic techniques that achieve comparable accuracy when transferred, without modification of model parameters, across domains and tasks. Previous work on GRE has relied extensively on various lexical and shallow syntactic indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency information. I also introduce a dimensionality reduction step into the GRE relation characterisation sub-task, which serves to capture latent semantic information and leads to significant improvements over an unreduced model. Comparison of dimensionality reduction techniques suggests that latent Dirichlet allocation (LDA), a probabilistic generative approach, successfully incorporates a larger and more interdependent feature set than a model based on singular value decomposition (SVD) and performs as well as or better than SVD in all experimental settings. Finally, I introduce multi-document summarisation as an extrinsic test bed for GRE and present results which demonstrate that the relative performance of GRE models is consistent across tasks and that the GRE-based representation leads to significant improvements over a standard baseline from the literature. 
Taken together, the experimental results 1) show that GRE can be improved using dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE for the content selection step of extractive summarisation and 3) validate the GRE claim of modification-free adaptation for the first time with respect to both domain and task. This thesis also introduces data sets derived from publicly available corpora for the purpose of rigorous intrinsic evaluation in the news and biomedical domains

    Tackling Different Business Process Perspectives

    Business Process Management (BPM) has emerged as a discipline to design, control, analyze, and optimize business operations. Conceptual models lie at the core of BPM. In particular, business process models have been taken up by organizations as a means to describe the main activities that are performed to achieve a specific business goal. Process models generally cover different perspectives that underlie separate yet interrelated representations for analyzing and presenting process information. Being primarily driven by process improvement objectives, traditional business process modeling languages focus on capturing the control flow perspective of business processes, that is, the temporal and logical coordination of activities. Such approaches are usually characterized as “activity-centric”. Nowadays, activity-centric process modeling languages, such as the Business Process Model and Notation (BPMN) standard, are still the most used in practice and benefit from industrial tool support. Nevertheless, evidence shows that such process modeling languages still lack support for modeling non-control-flow perspectives, such as the temporal, informational, and decision perspectives, among others. This thesis centres on the BPMN standard and addresses the modeling of the temporal, informational, and decision perspectives of process models, with particular attention to processes enacted in healthcare domains. Despite being partially interrelated, the main contributions of this thesis may be partitioned according to the modeling perspective they concern. The temporal perspective deals with the specification, management, and formal verification of temporal constraints. In this thesis, we address the specification and run-time management of temporal constraints in BPMN, by taking advantage of process modularity and of event handling mechanisms included in the standard. 
Then, we propose three different mappings from BPMN to formal models, to validate the behavior of the proposed process models and to check whether they are dynamically controllable. The informational perspective represents the information entities consumed, produced or manipulated by a process. This thesis focuses on the conceptual connection between processes and data, borrowing concepts from the database domain to enable the representation of which part of a database schema is accessed by a certain process activity. This novel conceptual view is then employed to detect potential data inconsistencies arising when the same data are accessed erroneously by different process activities. The decision perspective encompasses the modeling of the decision-making related to a process, considering where decisions are made in the process and how decision outcomes affect process execution. In this thesis, we investigate the use of the Decision Model and Notation (DMN) standard in conjunction with BPMN starting from a pattern-based approach to ease the derivation of DMN decision models from the data represented in BPMN processes. Besides, we propose a methodology that focuses on the integrated use of BPMN and DMN for modeling decision-intensive care pathways in a real-world application domain
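The informational perspective described in this abstract can be made concrete with a small sketch. The code below is only one simple reading of "the same data accessed by different process activities", not the thesis's method: each activity is assumed to declare the schema attributes it reads and writes, and a pair of activities is flagged when one writes an attribute the other reads or writes. The function name `find_conflicts`, the input layout, and the example activities are all hypothetical.

```python
from itertools import combinations


def find_conflicts(accesses):
    """Flag activity pairs that touch the same schema attribute.

    accesses maps an activity name to {"read": set, "write": set}, where
    each set holds attribute names such as "Patient.status".
    A pair is reported when at least one of the two activities writes an
    attribute that the other reads or writes (an assumed, simplified rule).
    """
    conflicts = []
    for a, b in combinations(sorted(accesses), 2):
        write_a, write_b = accesses[a]["write"], accesses[b]["write"]
        read_a, read_b = accesses[a]["read"], accesses[b]["read"]
        shared = (write_a & write_b) | (write_a & read_b) | (write_b & read_a)
        if shared:
            conflicts.append((a, b, sorted(shared)))
    return conflicts


# Hypothetical care-pathway activities and the attributes they access.
example_accesses = {
    "Admit patient": {"read": set(), "write": {"Patient.status"}},
    "Discharge patient": {
        "read": {"Patient.status"},
        "write": {"Patient.status"},
    },
    "Issue invoice": {"read": {"Invoice.total"}, "write": set()},
}
example_conflicts = find_conflicts(example_accesses)
```

A flagged pair is only a candidate for review: whether the overlap is actually erroneous depends on the control flow between the two activities, which this sketch does not model.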

    The Birth of Musicology from the Spirit of Evolution: Ernst Haeckel's Entwicklungslehre as Central Component of Guido Adler's Methodology for Musicology

    Between about 1860 and the First World War, musicology became an academic discipline, practiced by scholars and supported by the university infrastructure. The decisive methodological change that allowed for this transition from mostly private scholarship to "academicization" was the declared adoption of the scientific method, especially in German-language music research. Among other "music scientists" like Hermann von Helmholtz and Friedrich Chrysander, the Viennese musicologist Guido Adler (1855-1941) is particularly important because, in 1885, he codified the research methods of this new academic discipline in the article "Umfang, Methode und Ziel der Musikwissenschaft" (The Scope, Method, and Aim of Music Science). Adler's methodological proposals have shaped musicological research habits ever since, perhaps most famously by separating what he calls "historical" and "systematic" musicologies. While his portrayal of musicology as a science, and therefore as worthy of inclusion in the academy, was successful, Adler's scientific inspiration for this methodological move has been obscured, partly because the later incarnations of his methodology, such as style criticism, drew heavily on contemporary art history rather than on any model from the natural sciences. In this dissertation, I show that Adler's initial methodological stimulus derived from biology, and within that discipline from a restructuring of research methods in the wake of Charles Darwin's proposal of evolution by natural selection. Adler was aware of Darwin's achievements, but his direct sources of biological information were popular and scholarly publications by the German biologist Ernst Haeckel (1834-1919). Passages copied from one of Haeckel's early articles are preserved in Adler's hand; he was friends with several of Haeckel's students; and, most importantly, his early methodology strongly resembles Haeckel's methodological suggestions for biology. 
Adler's early musicology was conceived in the spirit of evolution, which promised natural scientists an empirically valid way of reconstructing history by comparative, systematic study. This dissertation demonstrates on what biographical grounds and through which methodical conceits Adler transformed Haeckel's biology into a working model for musicological research