Structured information extraction from scientific text with large language models.
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
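As a rough illustration of the "list of JSON objects" output format mentioned above, the following Python sketch parses a model completion into dopant/host records. The field names `host` and `dopants` and the function `parse_records` are hypothetical choices made for this example; the abstract does not specify the actual schema.

```python
import json


def parse_records(completion):
    """Parse a model completion that is expected to hold a JSON list of objects.

    The "host"/"dopants" schema is an assumption made for illustration only.
    Malformed completions or entries are dropped rather than raising.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    records = []
    for item in data:
        # Keep only objects that carry both expected fields with the right types.
        if (
            isinstance(item, dict)
            and isinstance(item.get("host"), str)
            and isinstance(item.get("dopants"), list)
        ):
            records.append(
                {"host": item["host"], "dopants": [str(d) for d in item["dopants"]]}
            )
    return records
```

Validating each object before accepting it matters in practice because a generative model can return text that is not valid JSON or that omits a field.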
Applying data mining techniques over big data
Thesis (M.S.C.S.). The rapid development of information technology in recent decades means that data appear in a wide variety of formats: sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 petabytes stored in the world in 2000. Today’s internet has about 0.1 zettabytes of data (1 ZB is about 10^21 bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data, known in today’s parlance as Big Data.
In the present study, we present the basic concepts and design of Big Data tools, algorithms, and techniques. We compare classical data mining algorithms with their Big Data counterparts, using Hadoop/MapReduce as the core implementation for scalable algorithms. We implemented the K-means and A-priori algorithms with Hadoop/MapReduce on a 5-node Hadoop cluster. We explore NoSQL databases for semi-structured data at massive scale, using MongoDB as an example.
Finally, we compare the performance of HDFS (Hadoop Distributed File System) and MongoDB as data storage for these two algorithms.
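To make the MapReduce formulation of K-means concrete, here is a minimal single-process Python sketch of one iteration split into a map phase and a reduce phase. It is not the thesis's Hadoop code: the function names are hypothetical, and an in-memory dictionary stands in for Hadoop's shuffle step.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map phase: emit (index of the nearest centroid, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances)), point


def kmeans_reduce(points):
    """Reduce phase: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle step
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
```

In a distributed setting the map calls run independently on each data split, which is why the assignment step parallelizes well; only the per-centroid averages need to be brought together in the reduce phase.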
Enhancing Usability Of Malware Analysis Pipelines With Reverse Engineering
Much work has been done on analyzing software distributed in binary form. This is a challenging problem because of the relatively unstructured nature of binaries. Attempts to recover high-level structure have included static and dynamic analysis. However, human inspection is often required, as high-level structure is compiled away. Recent successes in this area include work on variable-name recovery, vulnerability discovery, and class recovery for object-oriented languages. We are interested in building a pipeline for users to analyze malware. In this thesis we tackle two problems central to malware analysis pipelines. The first contribution is D3RE, an interactive querying tool that allows users to analyze binaries by writing declarative rules and visualizing their results projected onto a binary. The second is Assemblage, a tool that automatically scrapes GitHub for C and C++ repositories and builds them using different compilation settings to produce a variety of configurations. Together, these two tools enable users both to gather enough data for analysis and to carry out that analysis interactively. Finally, we present future work demonstrating a possible visualization combining D3RE and Ghidra, along with some specific questions for future user studies.
Emerging approaches for data-driven innovation in Europe: Sandbox experiments on the governance of data and technology
Europe’s digital transformation of the economy and society is one of the priorities of the current Commission
and is framed by the European strategy for data. This strategy aims at creating a single market for data
through the establishment of a common European data space, based in turn on domain-specific data spaces
in strategic sectors such as environment, agriculture, industry, health and transportation. Acknowledging the
key role that emerging technologies and innovative approaches for data sharing and use can play to make
European data spaces a reality, this document presents a set of experiments that explore emerging
technologies and tools for data-driven innovation, and also delve deeper into the socio-technical factors and forces
that occur in data-driven innovation. The experimental results yield lessons learned and
practical recommendations towards the establishment of European data spaces.
Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and two Extensions to Incremental Summarization and k-Bisimulation for Long k-Chaining
We developed a flexible parallel algorithm for graph summarization based on
vertex-centric programming and parameterized message passing. The base
algorithm supports infinitely many structural graph summary models defined in a
formal language. An extension of the parallel base algorithm allows incremental
graph summarization. In this paper, we prove that the incremental algorithm is
correct and show that updates are performed in time , where is the number of additions, deletions, and modifications
to the input graph, the maximum degree, and is the maximum distance in
the subgraphs considered. Although the iterative algorithm supports values of
, it requires nested data structures for the message passing that are
memory-inefficient. Thus, we extended the base summarization algorithm by a
hash-based messaging mechanism to support a scalable iterative computation of
graph summarizations based on k-bisimulation for arbitrary k. We
empirically evaluate the performance of our algorithms using benchmark and
real-world datasets. The incremental algorithm almost always outperforms the
batch computation. We observe in our experiments that the incremental algorithm
is faster even in cases when of the graph database changes from one
version to the next. The incremental computation requires a three-layered hash
index, which has a low memory overhead of only (). Finally, the
incremental summarization algorithm outperforms the batch algorithm even with
fewer cores. The iterative parallel k-bisimulation algorithm computes
summaries on graphs with over M edges within seconds. We show that the
algorithm processes graphs of M edges within a few minutes while having
a moderate memory consumption of GB. For the largest BSBM1B dataset with
1 billion edges, it computes bisimulation in under an hour.
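For readers unfamiliar with bisimulation-based summaries, the following sequential Python sketch shows the general idea of iterative, signature-based partition refinement: after k rounds, two vertices share a block exactly when their outgoing labeled paths agree up to depth k. This is a textbook-style illustration under our own assumptions (forward bisimulation over edge labels, a hypothetical function name `k_bisimulation`), not the paper's parallel vertex-centric algorithm or its hash-based messaging.

```python
def k_bisimulation(edges, k):
    """Return {vertex: block id}; equal ids mean the vertices are k-bisimilar.

    edges: iterable of (source, label, target) triples.
    k: number of refinement rounds (maximum path depth considered).
    """
    edges = list(edges)
    vertices = {s for s, _, _ in edges} | {t for _, _, t in edges}
    out = {v: [] for v in vertices}
    for source, label, target in edges:
        out[source].append((label, target))

    block = {v: 0 for v in vertices}  # round 0: all vertices are equivalent
    for _ in range(k):
        # A vertex's signature combines its own previous block with the set of
        # (edge label, previous block of target) pairs over its outgoing edges.
        signature = {
            v: (block[v], frozenset((label, block[t]) for label, t in out[v]))
            for v in vertices
        }
        ids = {}  # maps each distinct signature to a fresh block id
        block = {v: ids.setdefault(signature[v], len(ids)) for v in vertices}
    return block
```

The block ids themselves are arbitrary; only equality between them is meaningful. Each round needs just the previous round's block ids, which is what makes this style of computation a natural fit for message passing between neighboring vertices.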
Understanding O-RAN: Architecture, Interfaces, Algorithms, Security, and Research Challenges
The Open Radio Access Network (RAN) and its embodiment through the O-RAN
Alliance specifications are poised to revolutionize the telecom ecosystem.
O-RAN promotes virtualized RANs where disaggregated components are connected
via open interfaces and optimized by intelligent controllers. The result is a
new paradigm for the RAN design, deployment, and operations: O-RAN networks can
be built with multi-vendor, interoperable components, and can be
programmatically optimized through a centralized abstraction layer and
data-driven closed-loop control. Therefore, understanding O-RAN, its
architecture, its interfaces, and workflows is key for researchers and
practitioners in the wireless community. In this article, we present the first
detailed tutorial on O-RAN. We also discuss the main research challenges and
review early research results. We provide a deep dive into the O-RAN
specifications, describing its architecture, design principles, and the O-RAN
interfaces. We then describe how the O-RAN RAN Intelligent Controllers (RICs)
can be used to effectively control and manage 3GPP-defined RANs. Based on this,
we discuss innovations and challenges of O-RAN networks, including the
Artificial Intelligence (AI) and Machine Learning (ML) workflows that the
architecture and interfaces enable, as well as security and standardization issues.
Finally, we review experimental research platforms that can be used to design
and test O-RAN networks, along with recent research results, and we outline
future directions for O-RAN development. Comment: 33 pages, 16 figures, 3 tables. Submitted for publication to the IEE