
    Unifying Large Language Models and Knowledge Graphs: A Roadmap

    Large language models (LLMs), such as ChatGPT and GPT-4, are making new waves in the fields of natural language processing and artificial intelligence owing to their emergent abilities and generalizability. However, LLMs are black-box models and often fall short of capturing and accessing factual knowledge. In contrast, knowledge graphs (KGs), such as Wikipedia and Huapu, are structured knowledge models that explicitly store rich factual knowledge. KGs can enhance LLMs by providing external knowledge for inference and interpretability. Meanwhile, KGs are difficult to construct and evolve by nature, which challenges existing KG methods to generate new facts and represent unseen knowledge. It is therefore complementary to unify LLMs and KGs and simultaneously leverage their advantages. In this article, we present a forward-looking roadmap for the unification of LLMs and KGs. Our roadmap consists of three general frameworks, namely, 1) KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or for the purpose of enhancing understanding of the knowledge learned by LLMs; 2) LLM-augmented KGs, which leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering; and 3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge. We review and summarize existing efforts within these three frameworks and pinpoint their future research directions. Comment: 29 pages, 25 figures
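    The first framework above, KG-enhanced LLMs, is the easiest to make concrete. Below is a minimal retrieve-then-prompt sketch in which facts about entities mentioned in a question are pulled from a toy triple store and prepended to the LLM prompt; the tiny graph, the prompt template, and the query_llm stub are hypothetical placeholders rather than any method from the article.

```python
# Minimal sketch of KG-enhanced LLM inference: look up triples for entities
# mentioned in the question and prepend them to the prompt so the model can
# ground its answer in explicit facts. The triple store, prompt template and
# query_llm stub are hypothetical placeholders.

KG = {
    "Marie Curie": [
        ("Marie Curie", "born_in", "Warsaw"),
        ("Marie Curie", "won", "Nobel Prize in Physics"),
    ],
    "Warsaw": [("Warsaw", "capital_of", "Poland")],
}

def retrieve_triples(question):
    """Return all triples whose subject entity is mentioned in the question."""
    return [t for entity, triples in KG.items() if entity in question for t in triples]

def build_prompt(question):
    facts = "\n".join(f"{s} {p.replace('_', ' ')} {o}." for s, p, o in retrieve_triples(question))
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

def query_llm(prompt):
    # Placeholder for a call to an actual LLM API.
    return "(model answer would go here)"

if __name__ == "__main__":
    prompt = build_prompt("Where was Marie Curie born?")
    print(prompt)
    print(query_llm(prompt))
```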

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of computational semantic processing research can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study the relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing from the published version due to the publication policies; please contact Prof. Erik Cambria for details.
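    Of the five surveyed tasks, word sense disambiguation is straightforward to illustrate. The sketch below uses NLTK's simplified Lesk implementation over WordNet; it is one classical knowledge-based approach among the many methods the survey covers, not the survey's own technique.

```python
# Simplified Lesk word sense disambiguation with NLTK and WordNet.
# Assumes: pip install nltk, plus nltk.download("punkt") and nltk.download("wordnet").
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "I went to the bank to deposit my paycheck"
# Pick the WordNet noun sense of "bank" whose gloss overlaps the context most.
sense = lesk(word_tokenize(sentence), "bank", pos="n")
print(sense, "->", sense.definition() if sense else "no sense found")
```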

    Construction of cardiovascular information extraction corpus based on electronic medical records

    Cardiovascular disease has a significant impact on both society and patients, making it necessary to conduct knowledge-based research such as research that utilizes knowledge graphs and automated question answering. However, existing research on corpus construction for cardiovascular disease is relatively limited, which has hindered further knowledge-based research on this disease. Electronic medical records contain patient data that span the entire diagnosis and treatment process and include a large amount of reliable medical information. Therefore, we collected electronic medical record data related to cardiovascular disease and, drawing on relevant work experience, developed a standard for labeling cardiovascular electronic medical record entities and entity relations. By building a sentence-level labeling result dictionary with a rule-based semi-automatic method, we constructed a cardiovascular electronic medical record entity and entity-relation labeling corpus (CVDEMRC). The CVDEMRC contains 7,691 entities and 11,185 entity-relation triples, and the consistency examination yielded 93.51% for entity annotations and 84.02% for entity-relation annotations, demonstrating good consistency. The CVDEMRC constructed in this study is expected to provide a database for information extraction research related to cardiovascular disease.
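    The abstract does not describe the CVDEMRC file format, but a sentence-level entity and relation annotation of the kind it mentions might be stored along these lines; the field names, labels, and example sentence below are purely illustrative, not the corpus schema.

```python
# Hypothetical sentence-level annotation record for a clinical corpus such as
# CVDEMRC: entities carry character offsets and types, relations link entity ids.
# All field names, labels and the example sentence are invented for illustration.
record = {
    "sentence": "The patient has a history of hypertension treated with amlodipine.",
    "entities": [
        {"id": "T1", "type": "Disease", "start": 29, "end": 41, "text": "hypertension"},
        {"id": "T2", "type": "Drug", "start": 55, "end": 65, "text": "amlodipine"},
    ],
    "relations": [
        {"head": "T2", "type": "treats", "tail": "T1"},
    ],
}

# Offsets can be sanity-checked against the sentence text.
for ent in record["entities"]:
    assert record["sentence"][ent["start"]:ent["end"]] == ent["text"]
```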

    STIXnet: entity and relation extraction from unstructured CTI reports

    The increased frequency of cyber attacks against organizations and their potentially devastating effects has raised awareness of the severity of these threats. In order to proactively harden their defences, organizations have started to invest in Cyber Threat Intelligence (CTI), the field of cybersecurity that deals with the collection, analysis and organization of intelligence on attackers and their techniques. By profiling the activity of a particular threat actor, and thus knowing the types of organizations it targets and the kinds of vulnerabilities it exploits, it is possible not only to mitigate its attacks but also to prevent them. Although the sharing of this type of intelligence is facilitated by several standards such as STIX (Structured Threat Information eXpression), most of the data still consists of reports written in natural language. This format can be highly time-consuming for Cyber Threat Intelligence analysts, who may need to read an entire report and label entities and relations in order to generate an interconnected graph from which the intel can be extracted. In this thesis, done in collaboration with Leonardo S.p.A., we provide a modular and extensible system called STIXnet for the extraction of entities and relations from natural-language CTI reports. The tool is embedded in a larger platform developed by Leonardo, the Cyber Threat Intelligence System (CTIS), and therefore inherits some of its features, such as an extensible knowledge base which also acts as a database for the entities to extract. STIXnet uses techniques from Natural Language Processing (NLP), the branch of computer science that studies the ability of a computer program to process and analyze natural language data; this field has recently been revolutionized by the increasing popularity of machine learning, which allows for more efficient algorithms and better results. After looking for known entities retrieved from the knowledge base, STIXnet analyzes the semantic structure of the sentences in order to extract new possible entities and predicts the Tactics, Techniques, and Procedures (TTPs) used by the attacker. Finally, an NLP model extracts relations between these entities and converts them to be compliant with the STIX 2.1 standard, thus generating an interconnected graph which can be exported and shared. STIXnet can also be continuously and automatically improved with feedback from a human analyst who, by highlighting false positives and false negatives in the processing of a report, can trigger a fine-tuning process that increases the tool's overall accuracy and precision. This framework helps defenders see at a glance all the gathered intelligence on a particular threat actor and thus deploy effective threat detection, perform attack simulations and strengthen their defenses; together with the Cyber Threat Intelligence System platform, organizations can stay one step ahead of the attacker and be secure against Advanced Persistent Threats (APTs).
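    The abstract does not show the conversion step itself, but as a rough sketch, extracted entities and relations could be serialized into a STIX 2.1 bundle with the stix2 Python library roughly as follows; the threat actor, malware and relationship below are invented examples, not STIXnet output.

```python
# Sketch of turning extracted (entity, relation, entity) triples into STIX 2.1
# objects with the stix2 library (pip install stix2). The objects below are
# invented examples, not actual STIXnet output.
from stix2 import Bundle, Malware, Relationship, ThreatActor

actor = ThreatActor(name="ExampleGroup", threat_actor_types=["crime-syndicate"])
malware = Malware(name="ExampleRAT", is_family=False)

# "uses" is a standard STIX 2.1 relationship type between threat-actor and malware.
uses = Relationship(source_ref=actor.id, relationship_type="uses", target_ref=malware.id)

bundle = Bundle(objects=[actor, malware, uses])
print(bundle.serialize(pretty=True))
```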

    Knowledge-based approaches to producing large-scale training data from scratch for Word Sense Disambiguation and Sense Distribution Learning

    Communicating and understanding each other is one of the most important human abilities. As humans, in fact, we can easily assign the correct meaning to ambiguous words in a text while, at the same time, being able to abstract, summarise and enrich its content with new information that we learned somewhere else. Machines, on the contrary, rely on formal languages which leave no space for ambiguity and are hence easy to parse and understand. Therefore, to fill the gap between humans and machines and enable the latter to better communicate with and comprehend its sentient counterpart, much effort has been put in the modern era of computer science into developing Natural Language Processing (NLP) approaches which aim at understanding and handling the ambiguity of human language. At the core of NLP lies the task of correctly interpreting the meaning of each word in a given text, hence disambiguating its content exactly as a human would do. Researchers in the Word Sense Disambiguation (WSD) field address exactly this issue by leveraging either knowledge bases, i.e. graphs where nodes are concepts and edges are semantic relations among them, or manually annotated datasets for training machine learning algorithms. One common obstacle is the knowledge acquisition bottleneck: retrieving or generating the semantically annotated data needed to build both semantic graphs and training sets is a complex task. This phenomenon is even more serious when considering languages other than English, where resources for generating human-annotated data are scarce and ready-made datasets are completely absent. With the advent of deep learning this issue became even more serious, as more complex models need larger datasets in order to learn meaningful patterns to solve the task. Another critical issue in WSD, as well as in other machine-learning-related fields, is the domain adaptation problem, that is, performing the same task in different application domains. This is particularly hard when dealing with word senses, since they follow a Zipfian distribution; hence, by slightly changing the application domain, a sense might become very frequent even though it is very rare in the general domain. For example, the geometric sense of plane is very frequent in a corpus made of math books, while it is very rare in a general-domain dataset. In this thesis we address both these problems. Inter alia, we focus on relieving the burden of human annotation in Word Sense Disambiguation, thus enabling the automatic construction of high-quality sense-annotated datasets not only for English, but especially for other languages where sense-annotated data are not available at all. Furthermore, recognising in word-sense distributions one of the main pitfalls for WSD approaches, we also alleviate the dependency on most-frequent-sense information by automatically inducing the word-sense distribution of a given text of raw sentences. In the following we propose a language-independent and automatic approach to generating semantic annotations given a collection of sentences, and then introduce two methods for the automatic inference of word-sense distributions. Finally, we combine the two kinds of approaches to build a semantically annotated dataset that reflects the sense distribution which we automatically infer from the target text.
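    The Zipfian shape of word-sense distributions mentioned above can be made concrete with a short sketch: it lists the WordNet noun senses of "plane" via NLTK (WordNet orders senses roughly by corpus frequency) and attaches a hypothetical Zipf-shaped prior. The exponent and the resulting probabilities are illustrative only and are unrelated to the distributions actually induced in the thesis.

```python
# Illustration of a Zipf-shaped distribution over the WordNet senses of a word.
# Assumes: pip install nltk and nltk.download("wordnet"). The exponent and the
# probabilities are hypothetical, not the sense distributions induced in the thesis.
from nltk.corpus import wordnet as wn

senses = wn.synsets("plane", pos="n")   # WordNet lists senses roughly by frequency
z = 1.5                                 # assumed Zipf exponent
weights = [1 / (rank + 1) ** z for rank in range(len(senses))]
total = sum(weights)

for synset, weight in zip(senses, weights):
    print(f"{weight / total:.3f}  {synset.name():18s} {synset.definition()}")
```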