
    Doctor of Philosophy

    Get PDF
In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require efficient storage solutions. To improve space efficiency, compression and deduplication are widely used. Compression works by identifying repeated strings and replacing them with more compact encodings, while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate chunks. While these two approaches have brought great improvements in space efficiency, some limitations remain. First, traditional compressors are limited in their ability to detect redundancy across a large range because they search for redundant data at a fine-grained (string) level. For deduplication, metadata embedded in an input file changes more frequently than the underlying data, which introduces unnecessary unique chunks and degrades deduplication. In addition, cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to exploit similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data at a coarse-grained (block) level and then groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor. We find that metadata has a large impact in reducing the benefit of deduplication. To isolate this impact, we propose to separate metadata from data; three approaches are presented for use cases with different constraints. For the commonly used tar format, we propose Migratory Tar: a data transformation and a new tar format that deduplicates better. We also present a case study in which we use deduplication to reduce storage consumption for disk images while achieving high performance in image deployment. Finally, we apply the same principle of exploiting similarity to IO scheduling to prevent interference between random and sequential workloads, yielding efficient, consistent, and predictable performance for sequential workloads and high disk utilization.
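As a rough illustration of the coarse-grained similarity idea behind Migratory Compression, the sketch below (a simplified, assumption-laden example, not the dissertation's implementation) splits input into fixed-size blocks, fingerprints each block, reorders similar blocks next to each other, and hands the reordered stream to a standard compressor, keeping the permutation so the original byte order can be restored. The block size, fingerprint scheme, and use of zlib are all assumptions made for this sketch.

```python
# Sketch of "reorder similar blocks, then compress" as a preprocessing stage.
import zlib
import hashlib

BLOCK_SIZE = 4096

def block_fingerprint(block: bytes, window: int = 32) -> int:
    """Cheap similarity fingerprint: minimum hash over sliding windows."""
    if len(block) <= window:
        return int.from_bytes(hashlib.sha1(block).digest()[:8], "big")
    return min(
        int.from_bytes(hashlib.sha1(block[i:i + window]).digest()[:8], "big")
        for i in range(0, len(block) - window, window)
    )

def migratory_compress(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # Reorder blocks so blocks with similar fingerprints become neighbours.
    order = sorted(range(len(blocks)), key=lambda i: block_fingerprint(blocks[i]))
    reordered = b"".join(blocks[i] for i in order)
    # The permutation is returned with the payload so order can be restored.
    return zlib.compress(reordered, 9), order

def migratory_decompress(payload: bytes, order):
    raw = zlib.decompress(payload)
    n = len(order)
    # Every original block is BLOCK_SIZE bytes except possibly the last one.
    lengths = [BLOCK_SIZE] * (n - 1) + [len(raw) - BLOCK_SIZE * (n - 1)] if n else []
    restored = [b""] * n
    offset = 0
    for original_index in order:
        size = lengths[original_index]
        restored[original_index] = raw[offset:offset + size]
        offset += size
    return b"".join(restored)

if __name__ == "__main__":
    sample = (b"A" * 10000 + b"B" * 10000) * 4 + b"A" * 10000
    packed, perm = migratory_compress(sample)
    assert migratory_decompress(packed, perm) == sample
    print(len(packed), "bytes vs", len(zlib.compress(sample, 9)), "baseline")
```

A real implementation would use proper resemblance detection (e.g. super-features over chunks) and handle metadata carefully; this only shows the reorder-then-compress shape of the transformation.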

    A Big Data Architecture for Early Identification and Categorization of Dark Web Sites

    Full text link
The dark web has become notorious for its association with illicit activities, and there is a growing need for systems to automate the monitoring of this space. This paper proposes an end-to-end scalable architecture for the early identification of new Tor sites and the daily analysis of their content. The solution is built on an open-source Big Data stack for data serving with Kubernetes, Kafka, Kubeflow, and MinIO, continuously discovering onion addresses from different sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), downloading the HTML from Tor, deduplicating the content using MinHash LSH, and categorizing it with BERTopic topic modeling (SBERT embeddings, UMAP dimensionality reduction, HDBSCAN document clustering, and c-TF-IDF topic keywords). In 93 days, the system identified 80,049 onion services and characterized 90% of them, addressing the challenge of Tor volatility. A disproportionate amount of repeated content is found, with only 6.1% unique sites. From the HTML files of the dark sites, 31 low-level topics are extracted, manually labeled, and grouped into 11 high-level topics. The most popular included sexual and violent content, repositories, search engines, carding, cryptocurrencies, and marketplaces. During the experiments, we identified 14 sites with 13,946 clones that shared a suspiciously similar mirroring rate per day, suggesting an extensive common phishing network. Among related work, this study is the most representative topic-based characterization of onion services to date.
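To make the deduplication step concrete, here is a minimal, self-contained sketch of MinHash-based near-duplicate detection. The shingle size, number of hash functions, and 0.8 threshold are assumptions of this example, and a production pipeline such as the one described would add LSH banding so that candidate pairs are found without comparing every pair of signatures.

```python
# Toy MinHash near-duplicate detection over page texts.
import hashlib
from itertools import combinations

NUM_PERM = 64
SHINGLE_SIZE = 5

def shingles(text: str, k: int = SHINGLE_SIZE):
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(text: str):
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b) -> float:
    # Fraction of matching minimum hashes approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

if __name__ == "__main__":
    pages = {
        "site_a": "hidden marketplace selling digital goods and more digital goods",
        "site_b": "hidden marketplace selling digital goods and more digital items",
        "site_c": "personal blog about open source software and self hosting",
    }
    sigs = {name: minhash_signature(html) for name, html in pages.items()}
    for a, b in combinations(pages, 2):
        sim = estimated_jaccard(sigs[a], sigs[b])
        flag = "near-duplicate" if sim >= 0.8 else "distinct"
        print(f"{a} vs {b}: {sim:.2f} ({flag})")
```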

    Entities with quantities : extraction, search, and ranking

    Get PDF
Quantities are more than numeric values. They denote measures of the world's entities such as heights of buildings, running times of athletes, energy efficiency of car models, or energy production of power plants, all expressed in numbers with associated units. Entity-centric search and question answering (QA) are well supported by modern search engines. However, they do not work well when queries involve quantity filters, such as searching for athletes who ran 200m under 20 seconds or companies with quarterly revenue above $2 billion. State-of-the-art systems fail to understand the quantities, including the condition (less than, above, etc.), the unit of interest (seconds, dollars, etc.), and the context of the quantity (200m race, quarterly revenue, etc.). QA systems based on structured knowledge bases (KBs) also fail, as quantities are poorly covered by state-of-the-art KBs. In this dissertation, we developed new methods to advance the state of the art on quantity knowledge extraction and search. Our main contributions are the following:
• First, we present Qsearch [Ho et al., 2019, Ho et al., 2020] – a system that can handle advanced queries with quantity filters by using cues present both in the query and in the text sources. Qsearch comprises two main contributions: a deep neural network model designed to extract quantity-centric tuples from text sources, and a novel query-matching model for finding and ranking matching tuples.
• Second, to incorporate heterogeneous tables into the process, we introduce QuTE [Ho et al., 2021a, Ho et al., 2021b] – a system for extracting quantity information from web sources, in particular ad-hoc web tables in HTML pages. QuTE contributes a method for linking quantity and entity columns that leverages external text sources. For question answering, we contextualize the extracted entity-quantity pairs with informative cues from the table and present a new method for consolidating and re-ranking answer candidates via inter-fact consistency.
• Third, we present QL [Ho et al., 2022] – a recall-oriented method for enriching knowledge bases (KBs) with quantity facts. Modern KBs such as Wikidata or YAGO cover many entities and their relevant information but often miss important quantity properties. QL is query-driven and based on iterative learning, with two main contributions for improving KB coverage: a query-expansion method that captures a larger pool of fact candidates, and a self-consistency technique that takes the value distributions of quantities into account.
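As a purely hypothetical illustration of what a quantity filter must be decomposed into (comparator, value, unit, and remaining context), the sketch below uses a small regex and lookup tables. It is not Qsearch's neural extraction pipeline; the pattern, unit table, and class names are invented for this example only.

```python
# Toy parser turning a quantity-filter query into a structured filter.
import re
from dataclasses import dataclass
from typing import Optional

UNIT_NORMALIZATION = {"seconds": "second", "secs": "second"}
COMPARATORS = {"under": "<", "below": "<", "less than": "<",
               "over": ">", "above": ">", "more than": ">"}

@dataclass
class QuantityFilter:
    comparator: str
    value: float
    unit: str
    context: str

def parse_quantity_filter(query: str) -> Optional[QuantityFilter]:
    pattern = r"(?P<cmp>under|below|less than|over|above|more than)\s+\$?(?P<val>[\d.]+)\s*(?P<unit>\w+)"
    match = re.search(pattern, query, flags=re.IGNORECASE)
    if not match:
        return None
    # Everything outside the matched filter is kept as the query context.
    context = (query[:match.start()] + query[match.end():]).strip()
    unit = UNIT_NORMALIZATION.get(match.group("unit").lower(), match.group("unit").lower())
    return QuantityFilter(
        comparator=COMPARATORS[match.group("cmp").lower()],
        value=float(match.group("val")),
        unit=unit,
        context=context,
    )

if __name__ == "__main__":
    print(parse_quantity_filter("athletes who ran 200m under 20 seconds"))
    print(parse_quantity_filter("companies with quarterly revenue above $2 Billion"))
```

A real system additionally has to resolve currencies, scale words such as "billion", and the entity context; the regex here only makes the target structure concrete.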

    An Evaluation on Large Language Model Outputs: Discourse and Memorization

    Full text link
We present an empirical evaluation of various outputs generated by nine of the most widely available large language models (LLMs). Our analysis is done with off-the-shelf, readily available tools. We find a correlation between the percentage of memorized text, the percentage of unique text, and overall output quality, when measured with respect to output pathologies such as counterfactual and logically flawed statements, and general failures like not staying on topic. Overall, 80.0% of the outputs evaluated contained memorized data, but outputs containing the most memorized content were also more likely to be considered of high quality. We discuss and evaluate mitigation strategies, showing that, in the models evaluated, they reduce the rate at which memorized text is output. We conclude with a discussion of potential implications around what it means to learn, to memorize, and to evaluate quality text. (Comment: Preprint, under review.)
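One simple way to approximate a "percentage of memorized text" metric is sketched below, under the assumption that memorization is measured as verbatim word n-gram overlap with a reference corpus; the n-gram length and toy corpus are assumptions, and the paper's own tooling and thresholds may differ.

```python
# Fraction of output n-grams that appear verbatim in a reference corpus.
def ngrams(text: str, n: int = 8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorized_fraction(output: str, corpus, n: int = 8) -> float:
    corpus_ngrams = set().union(*(ngrams(doc, n) for doc in corpus)) if corpus else set()
    output_ngrams = ngrams(output, n)
    if not output_ngrams:
        return 0.0
    return len(output_ngrams & corpus_ngrams) / len(output_ngrams)

if __name__ == "__main__":
    corpus = ["it was the best of times it was the worst of times it was the age of wisdom"]
    output = "the model wrote that it was the best of times it was the worst of times indeed"
    print(f"memorized n-gram fraction: {memorized_fraction(output, corpus):.2f}")
```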

    Named Entity Resolution in Personal Knowledge Graphs

    Full text link
Entity Resolution (ER) is the problem of determining when two entities refer to the same underlying entity. The problem has been studied for over 50 years and has most recently taken on new importance in an era of large, heterogeneous 'knowledge graphs' published on the Web and used widely in domains as wide-ranging as social media, e-commerce, and search. This chapter discusses the specific problem of named ER in the context of personal knowledge graphs (PKGs). We begin with a formal definition of the problem and the components necessary for doing high-quality and efficient ER. We also discuss some challenges that are expected to arise for Web-scale data. Next, we provide a brief literature review, with a special focus on how existing techniques can potentially apply to PKGs. We conclude the chapter by covering some applications, as well as promising directions for future research. (Comment: To appear as a book chapter of the same name in the upcoming (Oct. 2023) book 'Personal Knowledge Graphs (PKGs): Methodology, tools and applications', edited by Tiwari et al.)
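To ground the components mentioned above (blocking for efficiency, pairwise matching for quality), here is a deliberately small sketch using only the standard library. The blocking key, the string-similarity matcher, and the 0.8 threshold are assumptions of this example rather than the chapter's method; real PKG-scale ER would add attribute evidence, learned matchers, and transitive clustering.

```python
# Tiny ER step: block name mentions by a cheap key, then link similar pairs.
from collections import defaultdict
from difflib import SequenceMatcher

def blocking_key(name: str) -> str:
    # Cheap blocking: first three letters of the last (surname-like) token.
    return name.lower().split()[-1][:3]

def resolve(mentions, threshold: float = 0.8):
    blocks = defaultdict(list)
    for m in mentions:
        blocks[blocking_key(m)].append(m)
    matches = []
    for block in blocks.values():
        # Pairwise comparison only within a block keeps the work tractable.
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                score = SequenceMatcher(None, block[i].lower(), block[j].lower()).ratio()
                if score >= threshold:
                    matches.append((block[i], block[j]))
    return matches

if __name__ == "__main__":
    mentions = ["Barack Obama", "Barack H. Obama", "Barak Obama",
                "B. Obama", "Michelle Obama"]
    for a, b in resolve(mentions):
        print(f"linked: {a!r} <-> {b!r}")
```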

    Using Semantic Linking to Understand Persons' Networks Extracted from Text

    Get PDF
In this work, we describe a methodology to interpret large persons' networks extracted from text by classifying cliques using the DBpedia ontology. The approach relies on a combination of NLP, Semantic Web technologies, and network analysis. The classification methodology, which starts from single nodes and then generalizes to cliques, is effective in terms of performance and is also able to deal with nodes that are not linked to Wikipedia. The gold standard manually developed for evaluation shows that groups of co-occurring entities share, in most cases, a category that can be automatically assigned. This holds for both languages considered in this study. The outcome of this work may be of interest for enhancing the readability of large networks and for providing an additional semantic layer on top of cliques. This would greatly help humanities scholars dealing with large amounts of textual data that need to be interpreted or categorized. Furthermore, it represents an unsupervised approach to automatically extending DBpedia starting from a corpus.
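A minimal sketch of the clique-classification idea, under the assumption that each linked node already carries a DBpedia category: enumerate maximal cliques in the co-occurrence graph and assign each clique the majority category of its linked members, with unlinked nodes simply not voting. The hard-coded category table stands in for actual DBpedia lookups; networkx (pip install networkx) is used for clique enumeration.

```python
# Majority-vote category per maximal clique in a person co-occurrence graph.
from collections import Counter
import networkx as nx

# Stand-in for per-node DBpedia categories (normally obtained via entity linking).
NODE_CATEGORY = {
    "Alice": "Politician", "Bob": "Politician", "Carol": "Politician",
    "Dave": "Scientist", "Erin": "Scientist", "Frank": None,  # not linked to Wikipedia
}

def classify_cliques(graph: nx.Graph):
    labelled = []
    for clique in nx.find_cliques(graph):  # maximal cliques
        votes = Counter(NODE_CATEGORY.get(n) for n in clique if NODE_CATEGORY.get(n))
        label = votes.most_common(1)[0][0] if votes else None
        labelled.append((sorted(clique), label))
    return labelled

if __name__ == "__main__":
    g = nx.Graph()
    g.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"), ("Alice", "Carol"),
                      ("Dave", "Erin"), ("Erin", "Frank"), ("Dave", "Frank")])
    for members, label in classify_cliques(g):
        print(members, "->", label or "unclassified")
```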

    SoK: Memorization in General-Purpose Large Language Models

    Full text link
Large Language Models (LLMs) are advancing at a remarkable pace, with myriad applications under development. Unlike most earlier machine learning models, they are no longer built for one specific application but are designed to excel at a wide range of tasks. A major part of this success is due to their huge training datasets and the unprecedented number of model parameters, which allow them to memorize large amounts of information contained in the training data. This memorization goes beyond mere language and encompasses information present in only a few documents. It is often desirable, since it is necessary for tasks such as question answering and is therefore an important part of learning, but it also brings a whole array of issues, from privacy and security to copyright and beyond. LLMs can memorize short secrets in the training data, but they can also memorize concepts such as facts or writing styles that can be expressed in text in many different ways. We propose a taxonomy for memorization in LLMs that covers verbatim text, facts, ideas and algorithms, writing styles, distributional properties, and alignment goals. We describe the implications of each type of memorization, both positive and negative, for model performance, privacy, security and confidentiality, copyright, and auditing, as well as ways to detect and prevent memorization. We further highlight the challenges that arise from the predominant way of defining memorization with respect to model behavior rather than model weights, due to LLM-specific phenomena such as reasoning capabilities or differences between decoding algorithms. Throughout the paper, we describe potential risks and opportunities arising from memorization in LLMs that we hope will motivate new research directions.
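The behavioural definition of verbatim memorization discussed above is easy to state as a test: prompt the model with a prefix of a training document and check whether it reproduces the following k tokens exactly. The sketch below illustrates this with a placeholder generation function standing in for a real LLM; the prefix length and k are arbitrary choices for the example.

```python
# Behavioural check for verbatim memorization with a stubbed "model".
TRAINING_DOC = "the quick brown fox jumps over the lazy dog near the river bank at dawn"

def toy_generate(prompt: str, max_tokens: int) -> str:
    # Placeholder model: pretends to have memorized TRAINING_DOC verbatim.
    tokens = TRAINING_DOC.split()
    prompt_len = len(prompt.split())
    return " ".join(tokens[prompt_len:prompt_len + max_tokens])

def is_verbatim_memorized(document: str, prefix_len: int = 6, k: int = 5,
                          generate=toy_generate) -> bool:
    tokens = document.split()
    prompt = " ".join(tokens[:prefix_len])
    expected = tokens[prefix_len:prefix_len + k]
    continuation = generate(prompt, max_tokens=k).split()[:k]
    return continuation == expected

if __name__ == "__main__":
    print("verbatim memorization detected:", is_verbatim_memorized(TRAINING_DOC))
```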

    Proceedings of the First Workshop on Computing News Storylines (CNewsStory 2015)

    Get PDF
This volume contains the proceedings of the 1st Workshop on Computing News Storylines (CNewsStory 2015), held in conjunction with the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015) at the China National Convention Center in Beijing, on July 31st 2015. Narratives are at the heart of information sharing. Ever since people began to share their experiences, they have connected them to form narratives. The study of storytelling and the field of literary theory called narratology have developed complex frameworks and models related to various aspects of narrative, such as plot structures, narrative embeddings, characters' perspectives, reader response, point of view, narrative voice, narrative goals, and many others. These notions from narratology have been applied mainly in Artificial Intelligence and in formal semantic approaches to modelling narratives (e.g. the Plot Units developed by Lehnert (1981)). In recent years, computational narratology has established itself as an autonomous field of study and research. Narrative has been the focus of a number of workshops and conferences (AAAI Symposia, the Interactive Storytelling Conference (ICIDS), Computational Models of Narrative), and reference annotation schemes for narratives have been proposed (NarrativeML by Mani (2013)). The workshop aimed at bringing together researchers from different communities working on representing and extracting narrative structures in news, a text genre that is widely used in NLP but has received little attention with respect to narrative structure, representation, and analysis. Advances in NLP technology have made it feasible to look beyond scenario-driven, atomic extraction of events from single documents and to work towards extracting story structures from multiple documents as they are published over time as news streams. Policy makers, NGOs, and information specialists (such as journalists and librarians) are increasingly in need of tools that help them find salient stories in large amounts of information in order to implement policies more effectively, monitor the actions of "big players" in society, and check facts. Their tasks often revolve around reconstructing cases with respect to either specific entities (e.g. persons or organizations) or events (e.g. hurricane Katrina). Storylines represent explanatory schemas that enable us to make better selections of relevant information as well as projections into the future. They hold valuable potential for exploiting news data in innovative ways. JRC.G.2-Global security and crisis management

    PandeMedia: an annotated corpus of digital media for issue salience

    Get PDF
Master's thesis in Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências. The ubiquitous sharing of information via the Internet has shifted much of society's communication and information-seeking to digital spaces, such as news websites and social networks. As the Web represents a massive hub of information dissemination and discussion, it has also made possible the extraction of great amounts of highly detailed data to answer complex questions on human behaviour and preferences. This shift towards online life was exaggerated during the earlier phases of the COVID-19 pandemic, when many countries were in lockdown and in-person contact was severely limited. Therefore, in addition to the ongoing political, economic, and public health crisis, there were, on the one hand, new opportunities to study human behaviour through digital data, including support for public health measures or trust in science, while, on the other hand, the deluge of new data and the fast-changing nature of the pandemic created new challenges for data science research, particularly the need to build quality pipelines for data extraction, collection, and future analysis. In this thesis, we focus on the important issue of the salience of science and scientists during a health crisis and ask how to build a pipeline to select, store, extract, and analyse longitudinal digital media data that might allow for long-term study of media effects on salience. This project therefore has two main components: first, we showcase a data pipeline that makes use of media and social media data available online to build a media corpus of news and tweets with millions of documents, spanning billions of tokens, corresponding to more than two years of coverage and multiple media sources and topics; second, we show how this corpus can be leveraged to study the problem of salience, using the visibility of science during the earlier phases of the COVID-19 pandemic as a case study and comparing salience in traditional versus social media. Overall, we present both a transparent and scalable pipeline and a specific application of this approach, to tackle the question of how science visibility changed during this massive crisis. We use different media types and sources to enable text mining and other analytical purposes, offering a digital data-centric computational methodology to investigate questions in the social sciences. Data now plays a central role in the functioning of human societies. With the development of digital technologies, combined with ubiquitous connectivity to the Internet, and in particular to the World Wide Web (WWW), we live in the so-called "information age". This societal paradigm rests on the phenomenon typically referred to as datafication: the process, now rooted in and inherent to everyday life, through which human activity and forms of participation in society are converted into data. This large-scale, real-time production of data fuels a wide range of applications across the most varied domains, from industry to scientific research to healthcare, among others. We therefore witness a growing demand, and even need, for large data collections to feed the different sectors of activity. The Web represents perhaps the largest volume of data widely available to the general public.
It is on websites and online applications that a large part of the population carries out daily tasks and activities, whether professional or recreational. Our information-consumption habits are served predominantly by these digital spaces, such as social networks or the digital platforms of traditional media. Likewise, our social interactions mediated by digital devices are increasingly frequent. The Web is therefore a reservoir of potential discoveries and valuable information that can be extracted by exploring the data it contains. By its very nature, the Web raises major challenges regarding how to capture the value present in digital data. Enormous volumes of data can be quickly and easily identified and extracted. However, no value can be added to these data without first passing through a stage of organization. To extract knowledge from the collected data, it must have adequate organization and quality. The greatest difficulties in digital data collection and management methodologies lie precisely in ensuring this quality. Web data are naturally very heterogeneous, since they result from the convergence of countless information sources. They are also, for the most part, unstructured, namely in textual formats that must be interpreted computationally and compartmentalized to facilitate later analysis. Often there are also missing data, or data of such low quality that they are unusable for the intended purposes. Beyond these factors intrinsic to the data themselves, the surrounding issues are also crucial to consider: the ability to detect and locate the desired data, the ability to access those data, and their degree of availability once accessible. The ethical, privacy, and copyright questions associated with the data that can be collected must also be taken into account. ... automate collection processes for sources and data types as diverse as those available on the Web. The pandemic caused by SARS-CoV-2, the agent of COVID-19, represents a crisis of enormous proportions in the political, economic, and public health spheres. With the world's population restricted in its behaviours and habits to prevent the spread of the virus from worsening, people turned to digital means for communication and for obtaining and disseminating information (and disinformation). The media and social networks thus became major points of convergence for a large share of public attention, raising important questions about the public perception of scientific experts and about the salience of certain topics of discussion. In a broader context, the pandemic crisis can be seen as a challenge in the field of information technology. Throughout this public health emergency, we have been confronted with many of the challenges present in data science: complex data, at the scale of entire populations, produced in real time by multiple sources, with different structures and formats, which become outdated quickly and require rapid analysis but also robust cleaning and enhancement processes.
All of these factors lead to our main question: in a crisis that evolves as rapidly as the COVID-19 pandemic, how can we build a pipeline that lets us meet the challenges of data collection and management in order to create a digital media dataset for analysis? To extract the necessary data, we drew on three distinct sources: the open-source Media Cloud platform, the Internet Archive database, and the Twitter API. We began by defining eighteen distinct topics, consisting of keywords used to search for media articles and posts. Some topics are related to the pandemic, while others serve as potential positive and negative controls. The semantic cohesion of each topic was ensured using the WordNet lexical database, which provides word meanings and relations. The metadata obtained initially were processed and used to identify the primary sources of the news data. Through web scraping, we obtained raw data from United States media articles available online, from January 2019 to January 2021 (inclusive). These were subsequently transformed through a process of filtering, cleaning, and formatting, accompanied by exploratory data analysis and data visualization to diagnose the complete process. The social network data were extracted through a dedicated API, with parameters restricting the results to the United States and to the previously defined time window. The duly processed data were then stored in a database designed and built for the purpose. The database was designed with four tables, holding the news data, the Twitter data, the metadata of the original searches, and metadata about the news sources, and was implemented with the PostgreSQL database management system. To optimize query performance over our dataset, we built indexes on specific fields, namely the text fields that are our main interest. Using the available functionality, vector representations of the news text were built, and from these an index suited to text search was constructed, reducing search time by a factor in the tens of thousands. We also demonstrate preliminary querying of the longitudinal data for studying the salience of different topics in the media. Different statistical time-series analysis methodologies were applied to address the questions at hand. Moving averages were used to clarify the signals for better visualization. Stationarity tests served as a diagnostic for the transformations to apply to the data so as to guarantee the validity of subsequent analyses. Granger causality tests made it possible to establish relations between time series based on predictive power and thus to understand the interaction dynamics of different media. Using breakpoint detection techniques, we were able to argue that there were periods of change in the patterns observed in the media that coincide with the onset of the pandemic crisis.
Thus, powered by a custom, robust, and transparent pipeline, we were able to generate a media corpus containing millions of documents holding billions of tokens, covering a period of more than two years and multiple news sources and topics, enabling text mining and other analytical purposes and offering a digital data-centric computational methodology to investigate this kind of question in the social sciences.
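To make the analysis steps concrete, the sketch below runs the same kind of workflow described above (7-day moving average, ADF stationarity test, Granger causality) on synthetic daily counts. The column names, window size, and lags are assumptions of this example, not the thesis's configuration; it requires pandas, numpy, and statsmodels.

```python
# Time-series workflow on synthetic daily media counts.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

rng = np.random.default_rng(0)
days = pd.date_range("2020-01-01", periods=200, freq="D")

# Synthetic daily counts: "news" coverage leads "tweets" by two days.
news = rng.poisson(50, size=200).astype(float)
tweets = np.roll(news, 2) * 3 + rng.normal(0, 5, size=200)
df = pd.DataFrame({"science_news": news, "science_tweets": tweets}, index=days)

# 1) Moving average to smooth the daily signal for visualization.
smoothed = df.rolling(window=7).mean()
print(smoothed.tail(3))

# 2) Stationarity diagnostics (ADF test); difference the series if needed.
for col in df:
    pval = adfuller(df[col].dropna())[1]
    print(f"ADF p-value for {col}: {pval:.3f}")

# 3) Does news coverage Granger-cause tweet volume? (tested on differenced data)
diffed = df.diff().dropna()
res = grangercausalitytests(diffed[["science_tweets", "science_news"]].values, maxlag=3)
print("lag-2 ssr F-test p-value:", res[2][0]["ssr_ftest"][1])
```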

    Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

    Full text link
Large language models (LLMs) have strong capabilities in solving diverse natural language processing tasks. However, the safety and security issues of LLM systems have become the major obstacle to their widespread application. Many studies have extensively investigated risks in LLM systems and developed the corresponding mitigation strategies. Leading-edge enterprises such as OpenAI, Google, Meta, and Anthropic have also invested substantial effort in responsible LLMs. There is therefore a growing need to organize the existing studies and establish comprehensive taxonomies for the community. In this paper, we delve into four essential modules of an LLM system: an input module for receiving prompts, a language model trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting LLM-generated content. Based on this, we propose a comprehensive taxonomy that systematically analyzes the potential risks associated with each module of an LLM system and discusses the corresponding mitigation strategies. Furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of LLM systems. We hope that this paper can help LLM participants adopt a systematic perspective when building their responsible LLM systems.