An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
Criminal investigation collects and analyzes the facts related to a crime, from which investigators can deduce evidence to be used in court. It is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other investigative methods and techniques. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents supports criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, because an annotated corpus for that purpose does not exist. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. In an evaluation on classifying the annotated named entities present in the crime-related documents, the dataset yielded a mean precision of 0.808, recall of 0.722, and F1-score of 0.733. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
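The reported F1-score (0.733) is lower than the harmonic mean of the reported mean precision and recall (about 0.763). One possible reading, which the abstract does not confirm, is that each metric was averaged separately over entity classes. The Python sketch below illustrates that effect under this assumption; the entity labels and counts are made up for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """True positives, false positives, and false negatives for one entity class."""
    tp: int
    fp: int
    fn: int


def prf(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a single class."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_class: dict[str, Counts]) -> tuple[float, float, float]:
    """Average precision, recall, and F1 independently over all classes."""
    scores = [prf(c) for c in per_class.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical counts, NOT results from the corpus.
toy_counts = {
    "PERSON": Counts(tp=90, fp=10, fn=10),
    "LOCATION": Counts(tp=30, fp=10, fn=30),
    "LICENSE_PLATE": Counts(tp=8, fp=2, fn=12),
}

macro_p, macro_r, macro_f1 = macro_average(toy_counts)
# F1 computed from the two averaged values, for comparison.
harmonic_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"macro P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
print(f"harmonic mean of macro P and R={harmonic_of_means:.3f}")
```

With these toy counts the mean of the per-class F1 values falls below the harmonic mean of the averaged precision and recall, which is the same direction as the gap in the reported figures.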
A graph-based framework for data retrieved from criminal-related documents
The digitalization of companies and services has fostered the processing and analysis of a growing volume
of data from heterogeneous sources, bringing emerging challenges, namely in knowledge representation.
Criminal Police Bodies (Órgãos de Polícia Criminal, OPC) face the same challenge,
given the volume of unstructured data from police reports that criminal investigators analyze
manually, consuming time and resources.
Hence the need to automatically extract and represent the unstructured data found in crime-related
documents, thereby reducing the manual analysis carried out by criminal investigators.
This is a challenge for computer science, and an opportunity
to propose a computational alternative for extracting and representing the data, by adapting
existing computational methods or proposing new ones.
Several computational methods are currently applied to the criminal domain, namely the identification
and classification of named entities, for example narcotics, and the extraction of relations between
entities relevant to the criminal investigation. These methods are mostly applied to the English
language, and in Portugal research in this area receives little attention, which prevents their application in the
context of criminal investigation.
This thesis proposes an integrated solution for representing the unstructured data found in
documents, using a set of computational methods: Document Preprocessing, which
groups an Extraction, Transformation, and Loading task adapted to crime-related documents,
followed by a Natural Language Processing pipeline applied to the Portuguese language,
for syntactic and semantic analysis of the textual data; the 5W1H Information Extraction Method,
which groups Named-Entity Recognition, semantic role detection, and
criminal term extraction; and Graph Database Population and Enrichment, which
represents the data obtained in a Neo4j graph database.
Overall, the integrated solution shows promising results, which were validated using
prototypes developed for this purpose. The feasibility of extracting the unstructured
data, interpreting it syntactically and semantically, and representing it in the graph
database was also demonstrated; Abstract:
The digitalization of company processes has increased the processing and analysis of a growing volume
of data from heterogeneous sources, raising new challenges, namely in knowledge representation.
Criminal police forces face similar challenges, given the amount of unstructured data from
police reports that criminal investigators analyze manually, at a corresponding cost in time and resources.
There is a need to automatically extract and represent the unstructured data in criminal-related
documents and thereby reduce the manual analysis done by criminal investigators. The challenge for computer science
is to provide a computational alternative for extracting and representing these data, using
new or adapted methods.
A broad set of computational methods has been applied to the criminal domain, such as the identification
and classification of named entities (NEs), like narcotics, or the extraction of relations between entities that are relevant to
the criminal investigation. However, these methods have mainly been applied to the English
language. In Portugal, research applying computational methods to this domain is scarce,
which makes their application in criminal investigation unfeasible.
This thesis proposes an integrated solution for representing the unstructured data retrieved from
documents, using a set of computational methods. First, the Preprocessing Criminal-Related Documents
module runs Extraction, Transformation, and Loading tasks, followed by a
Natural Language Processing pipeline applied to the Portuguese language, for syntactic and semantic
analysis of the textual data. Next, the 5W1H Information Extraction Method combines the Named-Entity
Recognition, Semantic Role Labelling, and Criminal Terms Extraction tasks. Finally, the Graph Database
Population and Enrichment module represents the retrieved data in a Neo4j graph database.
Overall, the framework shows promising results, which were validated using prototypes developed for this
purpose. The feasibility of extracting the unstructured data, interpreting it syntactically and semantically,
and representing it in the graph database has also been demonstrated.
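As a rough illustration of the last two stages named in the abstract (5W1H extraction feeding a Neo4j graph), the Python sketch below turns one extracted 5W1H record into parameterized Cypher `MERGE` statements. The graph schema is an assumption made for this example: the `Event` node, the slot-to-label mapping in `SLOT_LABELS`, the relationship names, and the sample values are all hypothetical and are not described in the thesis abstract.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from 5W1H slots to node labels (not from the thesis).
SLOT_LABELS = {
    "who": "Person",
    "what": "CrimeTerm",
    "where": "Location",
    "when": "Date",
    "why": "Motive",
    "how": "Method",
}


@dataclass
class Event5W1H:
    """One extracted event: an identifier plus the values found for each slot."""
    event_id: str
    slots: dict[str, list[str]] = field(default_factory=dict)


def to_cypher(event: Event5W1H) -> list[tuple[str, dict[str, str]]]:
    """Build (query, parameters) pairs that link each slot value to the event.

    Cypher cannot take labels or relationship types as parameters, so they are
    interpolated from the fixed SLOT_LABELS table; values stay parameterized.
    """
    statements = []
    for slot, values in event.slots.items():
        label = SLOT_LABELS[slot]  # raises KeyError for an unknown slot
        relationship = slot.upper()
        query = (
            f"MERGE (e:Event {{id: $event_id}}) "
            f"MERGE (n:{label} {{name: $name}}) "
            f"MERGE (e)-[:{relationship}]->(n)"
        )
        for value in values:
            statements.append((query, {"event_id": event.event_id, "name": value}))
    return statements


# Made-up, anonymized sample record.
example = Event5W1H(
    event_id="report-1",
    slots={"who": ["Suspect A"], "where": ["Lisboa"], "what": ["furto"]},
)
statements = to_cypher(example)
for query, params in statements:
    print(query, params)
```

Using `MERGE` rather than `CREATE` means an entity mentioned in several reports maps to a single node, which is what lets relations across documents surface in the graph. Each pair could then be passed to the official Neo4j Python driver, for example via `session.run(query, params)`.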