4,852 research outputs found
Oblivion: Mitigating Privacy Leaks by Controlling the Discoverability of Online Information
Search engines are the most commonly used tools for collecting information about
individuals on the Internet. Search results typically comprise a variety of
sources that contain personal information -- either intentionally released by
the person herself, or unintentionally leaked or published by third parties,
often with detrimental effects on the individual's privacy. To grant
individuals the ability to regain control over their disseminated personal
information, the European Court of Justice recently ruled that EU citizens have
a right to be forgotten in the sense that indexing systems must offer them
technical means to request removal of links from search results that point to
sources violating their data protection rights. As of now, these technical
means consist of a web form that requires a user to manually identify all
relevant links upfront and to insert them into the web form, followed by a
manual evaluation by employees of the indexing system to assess if the request
is eligible and lawful.
We propose Oblivion, a universal framework that supports the automation of the
right to be forgotten in a scalable, provable and privacy-preserving manner.
First, Oblivion enables a user to automatically find and tag her disseminated
personal information using natural language processing and image recognition
techniques and file a request in a privacy-preserving manner. Second, Oblivion
provides indexing systems with an automated and provable eligibility mechanism,
asserting that the author of a request is indeed affected by an online
resource. The automated ligibility proof ensures censorship-resistance so that
only legitimately affected individuals can request the removal of corresponding
links from search results. We have conducted comprehensive evaluations, showing
that Oblivion is capable of handling 278 removal requests per second, and is
hence suitable for large-scale deployment.
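Oblivion's first step, automatically finding and tagging a user's disseminated personal information, can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the authors' implementation: where Oblivion applies NLP and image recognition, simple patterns over page text stand in, and the function name and inputs are invented for illustration.

```python
import re

# Hypothetical sketch of the tagging step: scan a crawled page for
# personal identifiers belonging to the requesting user. A real system
# would use NER and image recognition; simple patterns stand in here.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def tag_personal_info(text, user_names):
    """Return (kind, match) pairs for identifiers found in `text`."""
    hits = [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    for name in user_names:
        if name.lower() in text.lower():
            hits.append(("name", name))
    return hits

page = "Contact Alice Doe at alice.doe@example.org for details."
print(tag_personal_info(page, ["Alice Doe"]))
# -> [('email', 'alice.doe@example.org'), ('name', 'Alice Doe')]
```

The tagged hits would then back a removal request, which the eligibility mechanism verifies before any link is delisted.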
Collaborative Development and Evaluation of Text-processing Workflows in a UIMA-supported Web-based Workbench
Challenges in creating comprehensive text-processing workflows include a lack of interoperability among individual components coming from different providers and the requirement that end users know programming techniques to compose such workflows. In this paper we demonstrate Argo, a web-based system that addresses these issues in several ways. It supports the widely adopted Unstructured Information Management Architecture (UIMA), which handles the problem of interoperability; it provides a web browser-based interface for developing workflows by drawing diagrams composed of a selection of available processing components; and it provides novel user-interactive analytics such as the annotation editor, which constitutes a bridge between automatic processing and manual correction. These features extend the target audience of Argo to users with limited or no technical background. Here, we focus specifically on the construction of advanced workflows, involving multiple branching and merging points, to facilitate various comparative evaluations. Together with the user-collaboration capabilities supported in Argo, we demonstrate several use cases including visual inspections, comparisons of multiple processing segments or complete solutions against a reference standard, inter-annotator agreement, and shared-task mass evaluations. Ultimately, Argo emerges as a one-stop workbench for defining, processing, editing and evaluating text-processing tasks.
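The comparative-evaluation use case can be sketched in a few lines: one branch of a workflow produces annotation spans, which are scored against a reference standard by exact span match. This is an illustrative sketch of the kind of scoring such a workflow performs, not Argo's code; the function and example spans are invented.

```python
# Illustrative sketch (not Argo's code) of comparative evaluation against
# a reference standard: exact-match scoring over annotation spans.

def span_f1(predicted, reference):
    """Precision, recall and F1 over sets of (start, end, label) spans."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)  # spans matching the gold standard exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f1

gold = [(0, 5, "Protein"), (10, 14, "Gene")]
pred = [(0, 5, "Protein"), (20, 24, "Gene")]
print(span_f1(pred, gold))  # -> (0.5, 0.5, 0.5)
```

In a branching-and-merging workflow, one such scorer sits at each merge point where an automatic branch meets the manually corrected reference.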
A Web-Based Tool for Analysing Normative Documents in English
Our goal is to use formal methods to analyse normative documents written in
English, such as privacy policies and service-level agreements. This requires
the combination of a number of different elements, including information
extraction from natural language, formal languages for model representation,
and an interface for property specification and verification. We have worked on
a collection of components for this task: a natural language extraction tool, a
suitable formalism for representing such documents, an interface for building
models in this formalism, and methods for answering queries asked of a given
model. In this work, each of these concerns is brought together in a web-based
tool, providing a single interface for analysing normative texts in English.
Through the use of a running example, we describe each component and
demonstrate the workflow established by our tool.
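The model-and-query idea can be illustrated with a toy sketch. The paper's own formalism is richer; here normative clauses are reduced to invented (modality, party, action) triples, with a query asking what a given party is obliged, permitted or forbidden to do.

```python
# Toy sketch of the representation/query idea (not the paper's formalism):
# normative clauses as (modality, party, action) triples.

clauses = [
    ("obligation", "provider", "encrypt personal data"),
    ("prohibition", "provider", "share data with third parties"),
    ("permission", "client", "request data deletion"),
]

def norms_for(party, modality):
    """List the actions a party has under the given modality."""
    return [action for m, p, action in clauses
            if p == party and m == modality]

print(norms_for("provider", "obligation"))
# -> ['encrypt personal data']
```

A verification query over such a model then reduces to checking whether a proposed action conflicts with any prohibition or unmet obligation.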
Combining information seeking services into a meta supply chain of facts
The World Wide Web has become a vital supplier of information that allows organizations to carry out tasks such as business intelligence, security monitoring, and risk assessment. A quick and reliable supply of correct facts is often mission-critical. By following design science guidelines, we have explored ways to recombine facts from multiple sources, each with possibly different levels of responsiveness and accuracy, into one robust supply chain. Inspired by prior research on keyword-based meta-search engines (e.g., metacrawler.com), we have adapted existing question answering algorithms for the task of analysis and triangulation of facts. We present a first prototype for a meta approach to fact seeking. Our meta engine sends a user's question to several fact seeking services that are publicly available on the Web (e.g., ask.com, brainboost.com, answerbus.com, NSIR, etc.) and analyzes the returned results jointly to identify and present to the user those that are most likely to be factually correct. The results of our evaluation on the standard test sets widely used in prior research provide evidence for the following: 1) the value added by the meta approach: its performance surpasses that of each individual supplier; 2) the importance of using fact seeking services, rather than keyword-driven search portals, as suppliers to the meta engine; and 3) the resilience of the meta approach: eliminating a single service does not noticeably impact the overall performance. We show that these properties make the meta approach a more reliable supplier of facts than any of the currently available stand-alone services.
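The triangulation step can be sketched as rank-weighted voting: each service returns ranked candidate answers, the meta engine pools them with higher-ranked candidates weighted more, and the consensus wins. This is a hypothetical sketch of the idea, not the prototype's actual algorithm; the weighting scheme and example are invented.

```python
from collections import Counter

# Hypothetical sketch of fact triangulation: pool ranked candidates from
# several services, weighting rank r with 1/(r+1), and pick the consensus.

def triangulate(service_answers):
    """service_answers: one ranked candidate list per service."""
    votes = Counter()
    for ranked in service_answers:
        for rank, answer in enumerate(ranked):
            votes[answer.strip().lower()] += 1.0 / (rank + 1)
    return votes.most_common(1)[0][0]

answers = [["Paris", "Lyon"], ["Paris"], ["Marseille", "Paris"]]
print(triangulate(answers))  # -> paris
```

The resilience property falls out of the pooling: dropping any one list barely moves the totals when the remaining services agree.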
Automated anonymisation of legal contracts in Portuguese (Anonimização automatizada de contratos jurídicos em português)
With the introduction of the General Data Protection Regulation, many organizations
were left with a large number of documents containing public information
that should have been private. Given the sheer volume of documents involved,
editing them manually would be a waste of resources. The
objective of this dissertation is the development of an autonomous system for the
anonymization of sensitive information in contracts written in Portuguese.
This system uses Google Cloud Vision, an OCR API, to
extract any text present in a document. As these documents
may be poorly legible, image pre-processing is performed using the OpenCV
library to increase the readability of the text present in the images. Among other
techniques, binarization, skew correction and noise removal algorithms were
explored.
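The binarization step can be illustrated with Otsu's method, which OpenCV exposes through `cv2.threshold` with the `cv2.THRESH_OTSU` flag. The stdlib-only sketch below shows the idea behind that method on a tiny invented grayscale sample, picking the threshold that maximises between-class variance; it is an illustration, not the dissertation's code.

```python
# Sketch of the idea behind Otsu binarization (OpenCV's cv2.threshold with
# cv2.THRESH_OTSU does this efficiently over a 256-bin histogram): choose
# the threshold maximising the between-class variance of the two classes.

def otsu_threshold(pixels):
    total = len(pixels)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        fg = [p for p in pixels if p >= t]   # paper (bright)
        bg = [p for p in pixels if p < t]    # ink (dark)
        if not fg or not bg:
            continue
        w_f, w_b = len(fg) / total, len(bg) / total
        var = w_f * w_b * (sum(fg)/len(fg) - sum(bg)/len(bg)) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

pixels = [12, 15, 20, 200, 210, 220]   # dark ink vs. bright paper
t = otsu_threshold(pixels)
binary = [255 if p >= t else 0 for p in pixels]
print(t, binary)  # -> 21 [0, 0, 0, 255, 255, 255]
```

On a real scan the same decision is made per pixel over the full image, which is why a clean bimodal histogram (helped by prior noise removal) matters.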
Once the text has been extracted, it is interpreted by an NLP library. In this
project we chose spaCy, which contains a Portuguese pipeline trained on
the WikiNer and UD Portuguese Bosque datasets. This library not only provides
very complete part-of-speech identification, but also recognises four
categories of named entities in its model. In addition to the processing
carried out with the spaCy library, and since the Portuguese language is not
well supported, some rule-based algorithms were implemented in order to
identify other, more specific types of information such as identification numbers and
postal codes. In the end, the information considered confidential is covered by
a black rectangle drawn with OpenCV at the coordinates returned by Google
Cloud Vision OCR, and a new PDF is generated.
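The rule-based layer can be sketched with regular expressions. The patterns below are hypothetical illustrations, not the thesis code: Portuguese postal codes follow the NNNN-NNN format, and tax identification numbers (NIF) are nine digits, neither of which spaCy's NER categories cover.

```python
import re

# Hypothetical sketch of the rule-based layer: regex patterns for
# Portuguese postal codes (NNNN-NNN) and nine-digit ID numbers (NIF),
# returning spans that a redaction step could black out.

PATTERNS = {
    "postal_code": re.compile(r"\b\d{4}-\d{3}\b"),
    "id_number": re.compile(r"\b\d{9}\b"),
}

def find_sensitive(text):
    """Return (kind, match, (start, end)) triples found in `text`."""
    return [(kind, m.group(), m.span())
            for kind, rx in PATTERNS.items()
            for m in rx.finditer(text)]

text = "Morada: 3810-193 Aveiro, NIF 123456789."
print(find_sensitive(text))
# -> [('postal_code', '3810-193', (8, 16)), ('id_number', '123456789', (29, 38))]
```

Each returned span would then be mapped to the word coordinates reported by the OCR so that OpenCV can draw the covering rectangle.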