Search CORE

659 research outputs found

Structured information extraction from scientific text with large language models.

Author: Ceder Gerbrand
Ceder-Persson Kristin
Dagdelen John
Dunn Alexander
Jain Anubhav
Lee Sanghoon
Rosen Andrew
Walker Nicholas
Publication venue: eScholarship, University of California
Publication date: 01/02/2024

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
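As a rough illustration of the "list of JSON objects" output format mentioned in the abstract, the following minimal sketch parses a model completion into records. The function name `parse_records` and the keys `host` and `dopants` in the usage note are hypothetical and are not taken from the paper; the sketch assumes only that the model returns a JSON list and says nothing about how the authors post-process their output.

```python
import json


def parse_records(completion):
    """Parse a model completion that is expected to hold a JSON list of objects.

    Hypothetical helper: malformed or non-list output yields an empty list,
    and list items that are not JSON objects are dropped.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict)]
```

For example, calling `parse_records` on the made-up completion `'[{"host": "ZnO", "dopants": ["Al"]}]'` would return a one-element list containing that dictionary.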

Directory of Open Access Journals

eScholarship - University of California

Applying data mining techniques over big data

Author: Al-Hashemi Idrees Yousef
Publication venue: Boston University
Publication date: 01/01/2013

Thesis (M.S.C.S.)

PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at [email protected]. Thank you.

The rapid development of information technology in recent decades means that data appear in a wide variety of formats: sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 petabytes stored in the world in 2000. Today’s internet has about 0.1 zettabytes of data (1 ZB is about 10^21 bytes), and this number is projected to reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data, known in today’s parlance as Big Data. In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare the classical data mining algorithms to the Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and the A-priori algorithm with Hadoop/MapReduce on a five-node Hadoop cluster. We explore NoSQL databases for semi-structured data at massive scale, using MongoDB as an example. Finally, we compare the performance of HDFS (Hadoop Distributed File System) and MongoDB as data storage for these two algorithms.
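To make the K-means-on-MapReduce idea concrete, here is a minimal sketch of how one K-means iteration splits into a map step and a reduce step. It is plain single-process Python, not Hadoop code and not the thesis implementation; the function names `map_phase`, `reduce_phase`, and `kmeans` are illustrative assumptions.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map step: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce step: replace each centroid by the mean of its assigned points."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # a centroid with no points stays unchanged
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds (no convergence test)."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids
```

In a real MapReduce job the grouping by centroid index would be done by the framework's shuffle between the two phases; here a dictionary stands in for it.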

Boston University Institutional Repository (OpenBU)

Processing data close to its origin - edge computing on IoT devices to detect noise pollution

Author: Ostermann F.O.
Publication date: 01/01/2022

University of Twente Research Information

Enhancing Usability Of Malware Analysis Pipelines With Reverse Engineering

Author: Ching Jeffrey
Publication venue: SURFACE at Syracuse University
Publication date: 22/05/2021

Much work has been done on analyzing software distributed in binary form. This is a challenging problem because of the relatively unstructured nature of binaries. To recover high-level structure, various attempts have included static and dynamic analysis. However, human inspection is often required, as high-level structure is compiled away. Recent successes in this area include work on variable-name recovery, vulnerability discovery, and class recovery for object-oriented languages. We are interested in building a pipeline for users to analyze malware. In this thesis we tackle two problems central to malware analysis pipelines. The first is D3RE, an interactive querying tool that allows users to analyze binaries interactively by writing declarative rules and visualizing their results projected onto a binary. The second is Assemblage, a tool which automatically scrapes GitHub for C and C++ repositories and builds these repositories automatically using different compilation settings to produce a variety of configurations. Together, these two tools will enable users both to gather enough data for analysis and to perform that analysis interactively. Finally, we present future work demonstrating a possible visualization combining D3RE and Ghidra, along with some specific questions for future user studies.

Syracuse University Research Facility and Collaborative Environment

Emerging approaches for data-driven innovation in Europe: Sandbox experiments on the governance of data and technology

Author: Dencik Lina
Granell Carlos
Jirka Simon
Kotsev Alexander
Micheli Marina
Minghini Marco
Mooney Peter
Oost H.
Ostermann Frank
Rieke M.
Sarretta Alessandro
Schade Sven
Van Den Broecke J.
Verhulst Stefaan
Publication venue: Publications Office of the European Union
Publication date: 16/02/2022

Europe’s digital transformation of the economy and society is one of the priorities of the current Commission and is framed by the European strategy for data. This strategy aims at creating a single market for data through the establishment of a common European data space, based in turn on domain-specific data spaces in strategic sectors such as environment, agriculture, industry, health and transportation. Acknowledging the key role that emerging technologies and innovative approaches for data sharing and use can play to make European data spaces a reality, this document presents a set of experiments that explore emerging technologies and tools for data-driven innovation, and also examine in depth the socio-technical factors and forces that occur in data-driven innovation. The experimental results yield lessons learned and practical recommendations towards the establishment of European data spaces.

Repositori Institucional de la Universitat Jaume I

Enhancing Usability of Malware Analysis Pipelines With Reverse Engineering

Author: Ching Jeffrey
Publication venue: SURFACE at Syracuse University
Publication date: 23/05/2021

Syracuse University Research Facility and Collaborative Environment

Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and two Extensions to Incremental Summarization and $k$-Bisimulation for Long $k$-Chaining

Author: Blume Till
Rau Jannik
Richerby David
Scherp Ansgar
Publication date: 04/11/2022

We developed a flexible parallel algorithm for graph summarization based on vertex-centric programming and parameterized message passing. The base algorithm supports infinitely many structural graph summary models defined in a formal language. An extension of the parallel base algorithm allows incremental graph summarization. In this paper, we prove that the incremental algorithm is correct and show that updates are performed in time $\mathcal{O}(\Delta \cdot d^k)$, where $\Delta$ is the number of additions, deletions, and modifications to the input graph, $d$ the maximum degree, and $k$ is the maximum distance in the subgraphs considered. Although the iterative algorithm supports values of $k>1$, it requires nested data structures for the message passing that are memory-inefficient. Thus, we extended the base summarization algorithm by a hash-based messaging mechanism to support a scalable iterative computation of graph summarizations based on $k$-bisimulation for arbitrary $k$. We empirically evaluate the performance of our algorithms using benchmark and real-world datasets. The incremental algorithm almost always outperforms the batch computation. We observe in our experiments that the incremental algorithm is faster even in cases when $50\%$ of the graph database changes from one version to the next. The incremental computation requires a three-layered hash index, which has a low memory overhead of only $8\%$ ($\pm 1\%$). Finally, the incremental summarization algorithm outperforms the batch algorithm even with fewer cores. The iterative parallel $k$-bisimulation algorithm computes summaries on graphs with over $10$M edges within seconds. We show that the algorithm processes graphs of $100+$M edges within a few minutes while having a moderate memory consumption of $<150$ GB. For the largest BSBM1B dataset with 1 billion edges, it computes $k=10$ bisimulation in under an hour.
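The notion of $k$-bisimulation used in this abstract can be illustrated with a small sequential sketch: starting from a single block, each of $k$ rounds refines the partition by hashing a per-vertex signature built from the vertex's current block and the blocks of its out-neighbours. This is a textbook-style refinement written for illustration only. It is not the authors' parallel, vertex-centric algorithm or their hash-based messaging scheme. The function name `k_bisimulation` and the edge-triple input format are assumptions, and every edge target is assumed to appear in `vertices`.

```python
def k_bisimulation(edges, vertices, k):
    """Return a dict mapping each vertex to its block id after k refinement rounds.

    edges: iterable of (source, label, target) triples.
    vertices: iterable of all vertices (must include every edge endpoint).
    """
    vertices = list(vertices)
    out = {v: [] for v in vertices}
    for source, label, target in edges:
        out[source].append((label, target))

    block = {v: 0 for v in vertices}  # round 0: all vertices are equivalent
    for _ in range(k):
        # Signature: own block plus the set of (edge label, target block) pairs.
        signature = {
            v: (block[v], frozenset((label, block[t]) for label, t in out[v]))
            for v in vertices
        }
        ids = {}  # signature -> new block id, assigned in order of first use
        block = {v: ids.setdefault(signature[v], len(ids)) for v in vertices}
    return block
```

Two vertices end up in the same block exactly when their signatures agree in every round, so a larger `k` can only split blocks further, never merge them.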

arXiv.org e-Print Archive

Understanding O-RAN: Architecture, Interfaces, Algorithms, Security, and Research Challenges

Author: Basagni Stefano
Bonati Leonardo
D'Oro Salvatore
Melodia Tommaso
Polese Michele
Publication date: 01/08/2022

The Open Radio Access Network (RAN) and its embodiment through the O-RAN Alliance specifications are poised to revolutionize the telecom ecosystem. O-RAN promotes virtualized RANs where disaggregated components are connected via open interfaces and optimized by intelligent controllers. The result is a new paradigm for RAN design, deployment, and operations: O-RAN networks can be built with multi-vendor, interoperable components, and can be programmatically optimized through a centralized abstraction layer and data-driven closed-loop control. Therefore, understanding O-RAN, its architecture, its interfaces, and its workflows is key for researchers and practitioners in the wireless community. In this article, we present the first detailed tutorial on O-RAN. We also discuss the main research challenges and review early research results. We provide a deep dive into the O-RAN specifications, describing its architecture, design principles, and the O-RAN interfaces. We then describe how the O-RAN RAN Intelligent Controllers (RICs) can be used to effectively control and manage 3GPP-defined RANs. Based on this, we discuss innovations and challenges of O-RAN networks, including the Artificial Intelligence (AI) and Machine Learning (ML) workflows that the architecture and interfaces enable, as well as security and standardization issues. Finally, we review experimental research platforms that can be used to design and test O-RAN networks, along with recent research results, and we outline future directions for O-RAN development. Comment: 33 pages, 16 figures, 3 tables. Submitted for publication to the IEE

arXiv.org e-Print Archive

Emerging approaches for data-driven innovation in Europe

Author: Dencik Lina
Granell Carlos
Jirka Simon
Kotsev Alexander
Micheli Marina
Minghini Marco
Mooney Peter
Oost Hillen
Ostermann Frank
Rieke Matthes
Sarretta Alessandro
Schade Sven
Van Den Broecke Just
Verhulst Stefaan
Publication venue: Publications Office of the European Union

Europe's digital transformation of the economy and society is one of the priorities of the current Commission and is framed by the European strategy for data. This strategy aims at creating a single market for data through the establishment of a common European data space, based in turn on domain-specific data spaces in strategic sectors such as environment, agriculture, industry, health and transportation. Acknowledging the key role that emerging technologies and innovative approaches for data sharing and use can play to make European data spaces a reality, this document presents a set of experiments that explore emerging technologies and tools for data-driven innovation, and also examine in depth the socio-technical factors and forces that occur in data-driven innovation. The experimental results yield lessons learned and practical recommendations towards the establishment of European data spaces.

Goldsmiths Research Online