43 research outputs found
Doctor of Philosophy dissertation

In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of this data requires an efficient storage solution. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate chunks. While these two approaches have brought great improvements in space efficiency, they still have limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range, since they search for redundant data at a fine granularity (the string level). For deduplication, metadata embedded in an input file changes more frequently than the data itself, which introduces unnecessary unique chunks and leads to poor deduplication. In addition, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse granularity (the block level) and then groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor.
We find that metadata have a huge impact in reducing the benefit of deduplication. To isolate this impact, we propose separating metadata from data. Three approaches are presented for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and also a new tar format that deduplicates better. We also present a case study in which we use deduplication to reduce storage consumption for storing disk images, while at the same time achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity to IO scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
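To illustrate the chunk-level deduplication described above, here is a minimal fixed-size-chunk sketch in Python. This is a generic illustration, not the dissertation's implementation; the 4 KB chunk size and the dictionary-based chunk store are assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size

def deduplicate(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks and store each unique chunk
    once, keyed by its SHA-256 fingerprint. Returns the recipe
    (list of fingerprints) needed to reconstruct the input."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)  # duplicate chunks are stored only once
        recipe.append(fp)
    return recipe

def reconstruct(recipe: list, store: dict) -> bytes:
    """Rebuild the original byte stream from its fingerprint recipe."""
    return b"".join(store[fp] for fp in recipe)
```

With highly repetitive input, the store holds far fewer bytes than the input itself, which is exactly the space saving deduplication exploits; metadata scattered through a file breaks up such repeats, motivating the metadata separation proposed above.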
A Big Data Architecture for Early Identification and Categorization of Dark Web Sites
The dark web has become notorious for its association with illicit activities,
and there is a growing need for systems to automate the monitoring of this
space. This paper proposes an end-to-end scalable architecture for the early
identification of new Tor sites and the daily analysis of their content. The
solution is built using an Open Source Big Data stack for data serving with
Kubernetes, Kafka, Kubeflow, and MinIO, continuously discovering onion
addresses in different sources (threat intelligence, code repositories, web-Tor
gateways, and Tor repositories), downloading the HTML from Tor and
deduplicating the content using MinHash LSH, and categorizing it with BERTopic
modeling (SBERT embeddings, UMAP dimensionality reduction, HDBSCAN document
clustering, and c-TF-IDF topic keywords). In 93 days, the system identified
80,049 onion services and characterized 90% of them, addressing the challenge
of Tor volatility. A disproportionate amount of repeated content is found, with
only 6.1% unique sites. From the HTML files of the dark sites, 31 different
low-level topics are extracted, manually labeled, and grouped into 11 high-level
topics. The most popular included sexual and violent content, repositories,
search engines, carding, cryptocurrencies, and marketplaces.
During the experiments, we identified 14 sites with 13,946 clones that shared a
suspiciously similar mirroring rate per day, suggesting an extensive common
phishing network. Among the related works, this study is the most
representative topic-based characterization of onion services to date.
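The MinHash deduplication step named above can be sketched in pure Python as follows. This is a generic illustration of the technique, not the paper's pipeline; the shingle size and number of hash permutations are assumptions.

```python
import random

random.seed(0)
NUM_PERM = 128
PRIME = (1 << 61) - 1  # large Mersenne prime for the universal hash family
PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(NUM_PERM)]

def shingles(text: str, k: int = 4) -> set:
    """Character k-shingles of the input text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items: set) -> list:
    """One minimum per hash function: the MinHash signature of the set."""
    hashes = [hash(x) & 0xFFFFFFFF for x in items]
    return [min((a * h + b) % PRIME for h in hashes) for a, b in PARAMS]

def estimated_jaccard(sig1: list, sig2: list) -> float:
    """The fraction of agreeing signature positions estimates the
    Jaccard similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

Two mirrored onion pages produce nearly identical signatures, so banding the signatures (the LSH step) surfaces clone candidates without comparing every pair of pages, which matters at the scale of 80,049 services.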
Entities with quantities : extraction, search, and ranking
Quantities are more than numeric values. They denote measures of the world's entities, such as heights of buildings, running times of athletes, energy efficiency of car models, or energy production of power plants, all expressed as numbers with associated units. Entity-centric search and question answering (QA) are well supported by modern search engines. However, they do not work well when queries involve quantity filters, such as searching for athletes who ran 200m under 20 seconds or companies with quarterly revenue above $2 billion. State-of-the-art systems fail to understand the quantities, including the condition (less than, above, etc.), the unit of interest (seconds, dollars, etc.), and the context of the quantity (200m race, quarterly revenue, etc.). QA systems based on structured knowledge bases (KBs) also fail, as quantities are poorly covered by state-of-the-art KBs. In this dissertation, we developed new methods to advance the state of the art in quantity knowledge extraction and search.
Our main contributions are the following: • First, we present Qsearch [Ho et al., 2019, Ho et al., 2020], a system that can handle advanced queries with quantity filters by using cues present in both the query and the text sources. Qsearch comprises two main contributions: a deep neural network model designed to extract quantity-centric tuples from text sources, and a novel query-matching model for finding and ranking matching tuples. • Second, to incorporate heterogeneous tables into the process, we present QuTE [Ho et al., 2021a, Ho et al., 2021b], a system for extracting quantity information from Web sources, in particular ad-hoc Web tables in HTML pages. QuTE's contributions include a method for linking quantity and entity columns that exploits external text sources. For question answering, we contextualize the extracted entity-quantity pairs with informative cues from the table and introduce a new method for consolidating and re-ranking answer candidates via inter-fact consistency. • Third, we present QL [Ho et al., 2022], a recall-oriented method for enriching knowledge bases (KBs) with quantity facts. Modern KBs such as Wikidata or YAGO cover many entities and their relevant information but often overlook important quantity properties. QL is query-driven and based on iterative learning, with two main contributions to improve KB coverage: a method for query expansion that captures a larger pool of fact candidates, and a technique for self-consistency that takes the value distributions of quantities into account.
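A hypothetical sketch of the kind of quantity-filter parsing such a system needs, extracting the condition, value, and unit from a query phrase. The grammar, vocabularies, and function name below are illustrative assumptions, not Qsearch's actual implementation.

```python
import re

# Illustrative condition and scale-word vocabularies
CONDITIONS = {"under": "<", "below": "<", "less than": "<",
              "above": ">", "over": ">", "more than": ">"}
SCALE = {"billion": 1e9, "million": 1e6}

def parse_quantity_filter(query: str):
    """Extract (comparator, value, unit) from a quantity phrase
    such as 'under 20 seconds' or 'above $2 billion'."""
    pattern = r"(under|below|less than|above|over|more than)\s+\$?([\d.]+)\s*(\w+)?"
    m = re.search(pattern, query.lower())
    if not m:
        return None
    cond, value, unit = m.groups()
    value = float(value)
    if unit in SCALE:                   # scale words act as multipliers
        value *= SCALE[unit]
        unit = "dollars" if "$" in query else unit
    return CONDITIONS[cond], value, unit
```

Even this toy parser shows why keyword search fails on such queries: the condition, the magnitude, and the unit each need to be recognized and normalized before any matching against extracted facts can happen.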
An Evaluation on Large Language Model Outputs: Discourse and Memorization
We present an empirical evaluation of various outputs generated by nine of
the most widely-available large language models (LLMs). Our analysis is done
with off-the-shelf, readily-available tools. We find a correlation between
percentage of memorized text, percentage of unique text, and overall output
quality, when measured with respect to output pathologies such as
counterfactual and logically-flawed statements, and general failures like not
staying on topic. Overall, 80.0% of the outputs evaluated contained memorized
data, but outputs containing the most memorized content were also more likely
to be considered of high quality. We discuss and evaluate mitigation
strategies, showing that, in the models evaluated, they reduce the rate at
which memorized text is output. We conclude with a discussion of the potential
implications for what it means to learn, to memorize, and to evaluate
quality text.

Comment: Preprint. Under review.
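Measuring the percentage of memorized text can be sketched as a verbatim n-gram overlap check against a reference corpus. This is a generic, off-the-shelf-style proxy of the kind the abstract alludes to, not the paper's actual tooling; the n-gram length is an assumption.

```python
def ngrams(tokens, n=5):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorized_fraction(output: str, corpus: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that appear verbatim in the
    reference corpus: a crude proxy for memorized content."""
    out_grams = ngrams(output.split(), n)
    if not out_grams:
        return 0.0
    corpus_grams = ngrams(corpus.split(), n)
    return len(out_grams & corpus_grams) / len(out_grams)
```

In practice the reference side would be an indexed snapshot of the training data rather than a single string, but the overlap-ratio idea is the same.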
Named Entity Resolution in Personal Knowledge Graphs
Entity Resolution (ER) is the problem of determining when two entities refer
to the same underlying entity. The problem has been studied for over 50 years,
and most recently, has taken on new importance in an era of large,
heterogeneous 'knowledge graphs' published on the Web and used widely in
domains as wide-ranging as social media, e-commerce, and search. This chapter
will discuss the specific problem of named ER in the context of personal
knowledge graphs (PKGs). We begin with a formal definition of the problem, and
the components necessary for doing high-quality and efficient ER. We also
discuss some challenges that are expected to arise for Web-scale data. Next, we
provide a brief literature review, with a special focus on how existing
techniques can potentially apply to PKGs. We conclude the chapter by covering
some applications, as well as promising directions for future research.

Comment: To appear as a book chapter by the same name in an upcoming (Oct.
2023) book `Personal Knowledge Graphs (PKGs): Methodology, tools and
applications' edited by Tiwari et al.
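The two components the chapter names for efficient, high-quality ER, blocking followed by pairwise similarity, can be sketched minimally as follows. The blocking key, similarity measure, and threshold are illustrative assumptions, not the chapter's prescription.

```python
from itertools import combinations

def block_key(name: str) -> str:
    """Cheap blocking: group name mentions by their first token,
    lowercased, so only plausible matches are compared pairwise."""
    return name.lower().split()[0]

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two name mentions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def resolve(mentions, threshold=0.5):
    """Return pairs of mentions judged to refer to the same entity."""
    blocks = {}
    for m in mentions:
        blocks.setdefault(block_key(m), []).append(m)
    matches = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if jaccard(a, b) >= threshold:
                matches.append((a, b))
    return matches
```

Blocking turns the quadratic all-pairs comparison into many small within-block comparisons, which is what makes ER feasible at Web scale; the trade-off is that a bad blocking key can silently drop true matches.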
Using Semantic Linking to Understand Persons' Networks Extracted from Text
In this work, we describe a methodology to interpret large persons' networks extracted from text by classifying cliques using the DBpedia ontology. The approach relies on a combination of NLP, Semantic Web technologies, and network analysis. The classification methodology, which first starts from single nodes and then generalizes to cliques, is effective in terms of performance and is also able to deal with nodes that are not linked to Wikipedia. The gold standard manually developed for evaluation shows that groups of co-occurring entities share, in most cases, a category that can be automatically assigned. This holds for both languages considered in this study. The outcome of this work may be of interest for enhancing the readability of large networks and for providing an additional semantic layer on top of cliques. This would greatly help humanities scholars when dealing with large amounts of textual data that need to be interpreted or categorized. Furthermore, it represents an unsupervised approach to automatically extend DBpedia starting from a corpus.
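The node-to-clique generalization step could be sketched as a majority vote over per-node categories. This is a hypothetical simplification; the actual classifier, and the category names used below, are assumptions made for illustration.

```python
from collections import Counter

def classify_clique(clique, node_categories):
    """Generalize from single nodes to a clique: each node votes with
    its DBpedia-style category; unlinked nodes (category None, e.g.
    not resolvable to Wikipedia) simply contribute no vote."""
    votes = Counter(node_categories.get(n) for n in clique
                    if node_categories.get(n) is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

The vote makes the method robust to the unlinked nodes mentioned above: a clique can still receive a category as long as some of its members are resolvable.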
SoK: Memorization in General-Purpose Large Language Models
Large Language Models (LLMs) are advancing at a remarkable pace, with myriad
applications under development. Unlike most earlier machine learning models,
they are no longer built for one specific application but are designed to excel
in a wide range of tasks. A major part of this success is due to their huge
training datasets and the unprecedented number of model parameters, which allow
them to memorize large amounts of information contained in the training data.
This memorization goes beyond mere language, and encompasses information only
present in a few documents. This is often desirable since it is necessary for
performing tasks such as question answering, and therefore an important part of
learning, but also brings a whole array of issues, from privacy and security to
copyright and beyond. LLMs can memorize short secrets in the training data, but
can also memorize concepts like facts or writing styles that can be expressed
in text in many different ways. We propose a taxonomy for memorization in LLMs
that covers verbatim text, facts, ideas and algorithms, writing styles,
distributional properties, and alignment goals. We describe the implications of
each type of memorization - both positive and negative - for model performance,
privacy, security and confidentiality, copyright, and auditing, and ways to
detect and prevent memorization. We further highlight the challenges that arise
from the predominant way of defining memorization with respect to model
behavior instead of model weights, due to LLM-specific phenomena such as
reasoning capabilities or differences between decoding algorithms. Throughout
the paper, we describe potential risks and opportunities arising from
memorization in LLMs that we hope will motivate new research directions.
Proceedings of the First Workshop on Computing News Storylines (CNewsStory 2015)
This volume contains the proceedings of the 1st Workshop on Computing News Storylines (CNewsStory
2015) held in conjunction with the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP
2015) at the China National Convention Center in Beijing, on July 31st 2015.
Narratives are at the heart of information sharing. Ever since people began to share their experiences,
they have connected them to form narratives. The study of storytelling and the field of literary theory
called narratology have developed complex frameworks and models related to various aspects of
narrative such as plot structures, narrative embeddings, characters’ perspectives, reader response, point
of view, narrative voice, narrative goals, and many others. These notions from narratology have been
applied mainly in Artificial Intelligence and to model formal semantic approaches to narratives (e.g.
Plot Units developed by Lehnert (1981)). In recent years, computational narratology has qualified as an
autonomous field of study and research. Narrative has been the focus of a number of workshops and
conferences (AAAI Symposia, Interactive Storytelling Conference (ICIDS), Computational Models of
Narrative). Furthermore, reference annotation schemes for narratives have been proposed (NarrativeML
by Mani (2013)).
The workshop aimed at bringing together researchers from different communities working on
representing and extracting narrative structures in news, a text genre which is highly used in NLP
but which has received little attention with respect to narrative structure, representation and analysis.
Currently, advances in NLP technology have made it feasible to look beyond scenario-driven, atomic
extraction of events from single documents and work towards extracting story structures from multiple
documents, while these documents are published over time as news streams. Policy makers, NGOs,
information specialists (such as journalists and librarians) and others are increasingly in need of tools
that support them in finding salient stories in large amounts of information to more effectively implement
policies, monitor actions of “big players” in society, and check facts. Their tasks often revolve around
reconstructing cases either with respect to specific entities (e.g. person or organizations) or events (e.g.
hurricane Katrina). Storylines represent explanatory schemas that enable us to make better selections
of relevant information but also projections to the future. They form a valuable potential for exploiting
news data in an innovative way.

JRC.G.2-Global security and crisis management
PandeMedia: an annotated corpus of digital media for issue salience
Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências

The ubiquitous sharing of information via the Internet has shifted much of society’s communication and information-seeking to digital spaces, such as news websites and social networks. As the Web
represents a massive hub of information dissemination and discussion, it has also made possible the extraction of great amounts of highly detailed data to answer complex questions on human behaviour and
preferences.
This shift towards online life was exaggerated during the earlier phases of the COVID-19 pandemic,
when many countries were in lockdown and in-person contact was severely limited. Therefore, in addition
to the ongoing political, economic, and public health crisis, there were, on the one hand, new opportunities
to study human behaviour through digital data, including support for public health measures or trust in
science, while, on the other hand, the deluge of new data and the fast-changing nature of the pandemic
created new challenges to data science research, particularly the need to build quality pipelines for data
extraction, collection, and future analysis.
In this thesis, we focus on the important issue of salience of science and scientists during a health
crisis and ask how to build a pipeline to select, store, extract and analyse longitudinal digital media data,
that might allow for long-term study of media effects on salience. Therefore, this project has two main
components: first, we showcase a data pipeline that makes use of media and social media data, available
online, to build a media corpus of news and tweets with millions of documents, spanning billions of
tokens, corresponding to more than two years of coverage and multiple media sources and topics; second,
we show how this corpus can be leveraged to study the problem of salience, and use the visibility of
science during the earlier phases of the COVID-19 pandemic as a case-study, comparing between salience
in traditional versus social media. Overall, we present both a transparent and scalable pipeline and a
specific application of this approach, to tackle the question of how science visibility changed during this
massive crisis. We use different media types and sources to potentiate text mining and other analytical
purposes, offering a digital data-centric computational methodology to investigate questions in the social
sciences.

Today, data play a central role in the functioning of human societies. With the development of digital technologies, combined with ubiquitous connectivity to the Internet, in particular the World Wide Web (WWW), we live in the so-called “information age”. This societal paradigm rests on the phenomenon typically referred to as datafication: the now deeply rooted process, inherent to everyday life, through which our human activity and forms of participation in society are converted into data. This large-scale, real-time production of data serves as fuel for a broad range of applications across the most varied domains, from industry to scientific research to healthcare, among others. We thus witness a growing demand, and even need, for large data collections to feed the different sectors of activity.

The Web perhaps represents the largest volume of data widely available to the general public. It is on websites and in online applications that a large part of the population performs a daily set of tasks and actions, whether professional or recreational. Our information-consumption habits are served predominantly by these digital spaces, such as social networks or the digital platforms of traditional media. Likewise, our social interactions mediated by digital devices are increasingly frequent. The Web is therefore a reservoir of potential discoveries and of valuable information, which can be extracted by exploring the data it contains.

By its very nature, the Web raises great challenges regarding how to capture this value present in digital data. Enormous volumes of data can be quickly and easily identified and extracted. However, no value is added to these data unless they first pass through an organization phase. For knowledge to be extracted from the data obtained, the data must exhibit proper organization and quality. The greatest difficulties in digital data collection and management methodologies lie precisely in ensuring this quality. Web data are naturally very heterogeneous, since they result from the convergence of countless information sources. They are also, for the most part, unstructured, notably in textual formats that need to be interpreted computationally and compartmentalized to facilitate later analysis. Often, data are also missing, or present such low quality that they are unusable for the purposes in mind. Beyond these factors intrinsic to the data themselves, the issues surrounding them are also crucial to consider: the ability to detect and locate the desired data, the ability to access these data, and the degree of availability of these data when accessible. Ethical, privacy, and copyright questions associated with the data that can be collected must also be taken into account. ... automating collection processes for sources and types of data as diverse as those available on the Web.

The pandemic caused by SARS-CoV-2, the agent of COVID-19, represents a crisis of enormous proportions in the political, economic, and public-health spheres. With the world's population restricted in its behaviours and habits so as to prevent the spread of the virus from worsening, people turned to digital channels as a means of communication and of obtaining and disseminating information (and misinformation). Thus, the media and social networks became relevant points of convergence for a large part of the public's attention, raising important questions about the public perception of scientific experts and about the salience of certain topics of discussion.

In a broader context, we can view the pandemic crisis as a challenge in the domain of information technologies. As this public-health emergency unfolded, we have been confronted with several of the challenges present in data science: complex data, at the scale of entire populations, produced in real time by multiple sources, with different structures and formats, which become outdated quickly and require fast analysis, but also robust cleaning and improvement processes.

All these factors lead us to our main question: in a crisis that evolves as quickly as the COVID-19 pandemic, how can we build a pipeline that allows us to respond to the challenges of data collection and management, so as to create a digital media dataset for analysis?

To extract the necessary data, we drew on three distinct sources: the open-source platform Media Cloud, the Internet Archive database, and the API of the social network Twitter. We began by defining eighteen distinct topics, made up of keywords to be used in searching for media articles and posts. Some topics are related to the pandemic, while others serve as potential positive and negative controls. The semantic cohesion of each topic was ensured using the WordNet lexical database, which provides word meanings and relations. The metadata initially obtained were processed and used to identify the primary sources of the news data. Through Web scraping, we obtained raw data from United States media articles available online, from January 2019 to January 2021 (inclusive). These were subsequently transformed through a process of filtering, cleaning, and formatting, accompanied by exploratory data analysis and data visualization to diagnose the complete process. The social-network data were extracted through a dedicated API, specifying parameters to restrict results to the United States and to the previously defined time interval. The duly processed data were then stored in a database designed and built for the purpose. The database was conceived with four tables, containing the news data, the Twitter social-network data, the metadata of the original searches, and metadata about the news sources, and was implemented with the PostgreSQL database management system. To optimize query performance over our dataset, we built indexes for specific fields, namely text fields, which are our main interest. Using the available functionality, vector representations of the news text were built, and from these an index appropriate for searching textual data was constructed, reducing search time by a factor in the tens of thousands. We also demonstrate preliminary querying of the longitudinal data for studying the salience of different topics in the media. Different statistical time-series analysis methodologies were applied to answer the questions at hand. Using moving averages, the signals were clarified for better visualization. Stationarity tests served as a diagnostic for the transformations to apply to the data so as to guarantee the validity of later analyses. With Granger causality tests, it was possible to establish relations between time series based on predictive power and thus understand the dynamics of interaction between different media. Using change-point detection techniques, we were able to defend the idea that there were periods of change in the patterns observed in the media that coincide with the onset of the pandemic crisis.

Thus, powered by a customized, robust, and transparent pipeline, we were able to generate a media corpus containing millions of documents spanning billions of tokens, corresponding to a period of more than two years and multiple news sources and topics, thereby enabling text mining and other analytical purposes, and offering a digital data-centric computational methodology to investigate this kind of question in the social sciences.
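The thesis's time-series treatment of salience (moving averages to clarify signals, change-point detection around the onset of the pandemic) can be sketched minimally in pure Python. The window size and the data in the examples are illustrative assumptions; the thesis itself applies fuller statistical tooling, including stationarity and Granger causality tests.

```python
def moving_average(series, window=7):
    """Trailing moving average over a daily count series, e.g. the
    number of science-related articles per day; the first window-1
    points are averaged over the shorter available prefix."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        out.append(sum(series[lo:i + 1]) / (i + 1 - lo))
    return out

def detect_changepoint(series):
    """Single change-point detection by choosing the split index that
    maximizes the absolute difference between the two segment means."""
    best_i, best_gap = None, -1.0
    for i in range(1, len(series)):
        left = sum(series[:i]) / i
        right = sum(series[i:]) / (len(series) - i)
        gap = abs(right - left)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i
```

On a daily salience series, the smoothed curve reveals the trend while the change-point index marks where the level shifts, the kind of shift the thesis associates with the onset of the pandemic crisis.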
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Large language models (LLMs) have strong capabilities in solving diverse
natural language processing tasks. However, the safety and security issues of
LLM systems have become the major obstacle to their widespread application.
Many studies have extensively investigated risks in LLM systems and developed
the corresponding mitigation strategies. Leading-edge enterprises such as
OpenAI, Google, Meta, and Anthropic have also made substantial efforts toward
responsible LLMs. Therefore, there is a growing need to organize the existing
studies and establish comprehensive taxonomies for the community. In this
paper, we delve into four essential modules of an LLM system, including an
input module for receiving prompts, a language model trained on extensive
corpora, a toolchain module for development and deployment, and an output
module for exporting LLM-generated content. Based on this, we propose a
comprehensive taxonomy, which systematically analyzes potential risks
associated with each module of an LLM system and discusses the corresponding
mitigation strategies. Furthermore, we review prevalent benchmarks, aiming to
facilitate the risk assessment of LLM systems. We hope that this paper can help
LLM participants embrace a systematic perspective to build their responsible
LLM systems.