
    Chatbot-based Culinary Tourism Recommender System Using Named Entity Recognition

    Over time, culinary tourism in several Indonesian cities has grown rapidly; culinary tourism in the city of Bandung is one example. The abundance of options makes it difficult for tourists to choose, so a recommender system is needed. In this study we developed a chatbot-based conversational recommender system to help users find culinary tourism recommendations. The chatbot was built on the Google Dialogflow platform and uses Named Entity Recognition, a Natural Language Processing technique, to extract entities such as the user's name and culinary preferences from the user's input. To generate recommendations, TF-IDF and cosine similarity were used to measure the similarity between culinary venues based on their reviews, and Telegram served as the medium for deploying the chatbot. The chatbot performs well in providing culinary recommendations, as indicated by the 85.7% score obtained in usability testing on the recommendation aspect.
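    The recommendation step described above can be illustrated with a minimal sketch using scikit-learn's TfidfVectorizer and cosine_similarity to rank venues by review similarity. The venue names and review texts below are invented for illustration and are not from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical review corpus: one aggregated review text per culinary venue.
venues = ["Warung A", "Cafe B", "Resto C"]
reviews = [
    "spicy noodles and sweet iced tea, cozy place",
    "strong coffee, quiet atmosphere, good pastries",
    "spicy grilled chicken with sambal, casual dining",
]

# Vectorize reviews with TF-IDF and compute pairwise cosine similarities.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(reviews)
similarity = cosine_similarity(tfidf)

# Recommend the venues whose reviews are most similar to the one mentioned.
def recommend(venue_index: int, top_k: int = 2) -> list[str]:
    scores = similarity[venue_index]
    ranked = scores.argsort()[::-1]  # highest similarity first
    return [venues[i] for i in ranked if i != venue_index][:top_k]

print(recommend(0))  # venues with reviews most similar to "Warung A"
```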

    Advancements in Natural Language Processing for Text Understanding

    Developments in natural language processing (NLP) have made it possible for machines to read and analyze human language with remarkable precision, revolutionizing the field of text understanding. This abstract provides an overview of recent advances in NLP techniques and their impact on text comprehension. It examines significant developments in areas such as named entity recognition, sentiment analysis, semantic analysis, and question answering, highlighting the difficulties encountered and the creative solutions proposed. In summary, recent developments in natural language processing have raised the bar for text comprehension: deep learning models and large-scale pre-training have transformed methods for semantic analysis, sentiment analysis, named entity recognition, and question answering, producing text comprehension systems that are increasingly precise and sophisticated. However, issues with bias, coreference resolution, and contextual comprehension still need to be resolved. With continuing research and innovation, NLP for text understanding holds considerable potential, opening the door to increasingly sophisticated applications in numerous sectors.

    Annotation of Pharmaceutical Assay Data Using Text Mining Techniques

    Legacy stores of experimental assay data in a pharmaceutical R&D organization are poorly structured and annotated, which hinders the integration of these data with data from more recent research programs and from other publicly available clinical, biological and chemical data sources. Being able to integrate and analyze these data in aggregate will help maximize their value, which in turn can inform and potentially improve the drug discovery process. In this study, text mining and information extraction tools and techniques were applied to improve the annotation of a subset of these data in an accurate and automated fashion. Experimental results show promise for classifying some features of the available assay data: initial classification using a Naïve Bayes classifier achieved high accuracy (up to 93%). This indicates that the methods described in this study can be extended to larger datasets to extract more annotation from the available data.
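    As a rough illustration of the classification step, the sketch below trains a multinomial Naïve Bayes classifier on short assay descriptions with scikit-learn. The labels and example texts are hypothetical stand-ins, not the study's actual data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical assay descriptions and annotation labels.
texts = [
    "IC50 determination for kinase inhibitor in enzymatic assay",
    "cell viability measured by ATP luminescence after compound treatment",
    "binding affinity of ligand to receptor by radioligand displacement",
    "cytotoxicity screen in HepG2 cells at 10 uM",
]
labels = ["biochemical", "cell-based", "biochemical", "cell-based"]

# Bag-of-words features feeding a multinomial Naïve Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["enzyme inhibition assay with purified kinase"]))
```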

    STIXnet: entity and relation extraction from unstructured CTI reports

    The increased frequency of cyber attacks against organizations and their potentially devastating effects has raised awareness of the severity of these threats. In order to proactively harden their defences, organizations have started to invest in Cyber Threat Intelligence (CTI), the field of cybersecurity that deals with the collection, analysis and organization of intelligence on attackers and their techniques. By profiling the activity of a particular threat actor, and thus knowing the types of organizations it targets and the kinds of vulnerabilities it exploits, it is possible not only to mitigate its attacks but also to prevent them. Although the sharing of this type of intelligence is facilitated by standards such as STIX (Structured Threat Information eXpression), most of the data still consists of reports written in natural language. This format can be highly time-consuming for CTI analysts, who may need to read an entire report and label entities and relations in order to generate an interconnected graph from which the intel can be extracted. In this thesis, done in collaboration with Leonardo S.p.A., we provide a modular and extensible system called STIXnet for the extraction of entities and relations from natural language CTI reports. The tool is embedded in a larger platform developed by Leonardo, called Cyber Threat Intelligence System (CTIS), and therefore inherits some of its features, such as an extensible knowledge base which also acts as a database for the entities to extract. STIXnet uses techniques from Natural Language Processing (NLP), the branch of computer science that studies the ability of a computer program to process and analyze natural language data; this field has recently been revolutionized by the rise of Machine Learning, which allows for more efficient algorithms and better results. After looking for known entities retrieved from the knowledge base, STIXnet analyzes the semantic structure of the sentences to extract new candidate entities, and predicts the Tactics, Techniques, and Procedures (TTPs) used by the attacker. Finally, an NLP model extracts relations between these entities and converts them to be compliant with the STIX 2.1 standard, generating an interconnected graph which can be exported and shared. STIXnet can also be continuously and automatically improved with feedback from a human analyst, who, by highlighting false positives and false negatives in the processing of a report, can trigger a fine-tuning process that increases the tool's overall accuracy and precision. This framework lets defenders see at a glance all the intelligence gathered on a particular threat actor and thus deploy effective threat detection, perform attack simulations and strengthen their defences; together with the Cyber Threat Intelligence System platform, organizations can stay one step ahead of attackers and be secure against Advanced Persistent Threats (APTs).
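    To make the output format concrete, here is a minimal sketch of how extracted entities and a predicted relation can be expressed as STIX 2.1 objects using the stix2 Python library. The actor and malware names are invented examples; this is not STIXnet's actual code.

```python
from stix2 import Bundle, Malware, Relationship, ThreatActor

# Hypothetical entities that an extractor might pull from a CTI report.
actor = ThreatActor(name="ExampleGroup")              # invented name
malware = Malware(name="ExampleRAT", is_family=True)  # invented name

# Relation predicted between the two entities ("uses").
rel = Relationship(actor, "uses", malware)

# Bundle the objects into a shareable STIX 2.1 document.
bundle = Bundle(actor, malware, rel)
print(bundle.serialize(pretty=True))
```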

    Named entity recognition using the Conditional Random Fields method for flood event detection in Gerbang Kertosusila based on Twitter data

    Every rainy season, a number of regions in Indonesia are urged to be on alert for flooding, including the national strategic area of Gerbang Kertosusila. One existing mitigation effort is to place flood sensors at several flood-prone points, but the very limited number of devices cannot cover the many areas that need them, so technology for disseminating flood information needs to be developed. Flood information can be obtained quickly from the social media platform Twitter; one approach is to use Twitter text data in a detection model based on Named Entity Recognition to help detect flood events and their locations. To achieve this goal, a Named Entity Recognition (NER) model was built using the Conditional Random Fields (CRFs) method. This research adds slang-word handling to the preprocessing stage to maximize model performance, and uses the BIO format in the labelling process and POS tagging in feature extraction. Evaluation with a k-fold = 5 scenario, 80% training data and 20% testing data, shows that the NER CRF model performs very well, with a precision of 0.981, recall of 0.926 and F-measure of 0.950. These results can help the public and government with information on flood distribution.
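    A minimal sketch of the CRF-based NER setup might look like the following, using the sklearn-crfsuite package (an assumption; the thesis does not name its implementation). Tokens are mapped to feature dictionaries that include a POS tag, and labelled in BIO format; the example tweet, tags, and features are invented.

```python
import sklearn_crfsuite

# One hypothetical tokenized tweet as (token, POS tag) pairs, with BIO labels.
sentence = [("banjir", "NN"), ("di", "IN"), ("Gerbang", "NNP"), ("Kertosusila", "NNP")]
labels = ["B-EVENT", "O", "B-LOC", "I-LOC"]

def token_features(sent, i):
    """Simple per-token features: the word, its POS tag, and its left neighbour."""
    word, pos = sent[i]
    feats = {"word.lower": word.lower(), "pos": pos,
             "is_title": word.istitle(), "bias": 1.0}
    if i > 0:
        feats["prev.word.lower"] = sent[i - 1][0].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    return feats

X_train = [[token_features(sentence, i) for i in range(len(sentence))]]
y_train = [labels]

# Train a linear-chain CRF with L-BFGS optimization and L1/L2 regularization.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```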

    A Comparative Analysis of NLP Algorithms for Implementing AI Conversational Assistants

    The rapid adoption of low-code/no-code software systems has reshaped the landscape of software development, but it also brings challenges in usability and accessibility, particularly for those unfamiliar with the specific components and templates of these platforms. This thesis aims to improve the developer experience in Nokia Corporation's low-code/no-code software system for network management by incorporating Natural Language Interfaces (NLIs) built on Natural Language Processing (NLP) algorithms. Focusing on the key NLP tasks of entity extraction and intent classification, we analyzed a variety of algorithms: for entity recognition, a MaxEnt classifier with NLTK, spaCy, and Conditional Random Fields with Stanford NER; for intent classification, SVM, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, and RASA DIET. Each algorithm was rigorously evaluated on a dataset generated from network-related utterances, using both predictive performance metrics and system metrics. Our research uncovers significant trade-offs in algorithm selection, elucidating the balance between computational cost and predictive accuracy. It reveals that while some models, like RASA DIET, excel in accuracy, they require extensive computational resources, making them less suitable for lightweight systems. In contrast, simpler models like spaCy and Stanford NER provide balanced performance but require careful consideration for specific entity types. While the study is limited by dataset size and focuses on simpler algorithms, it offers an empirically grounded framework for practitioners and decision-makers at Nokia and similar corporations. The findings point toward future research directions, including the exploration of ensemble methods, the fine-tuning of existing models, and the real-world implementation and scalability of these algorithms in low-code/no-code platforms.
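    To illustrate the kind of intent-classification comparison described, the sketch below trains two of the listed classifiers (SVM and Logistic Regression) on a few invented network-related utterances with scikit-learn. The utterances, intent labels, and test query are hypothetical, not the thesis dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical network-related utterances and intent labels.
utterances = [
    "show me the status of cell site 42",
    "restart the base station controller",
    "list all alarms raised in the last hour",
    "reboot node B in region north",
    "display current alarm count",
    "what is the health of site 7",
]
intents = ["status", "restart", "alarms", "restart", "alarms", "status"]

# Train each candidate classifier on the same TF-IDF features and compare.
for name, clf in [("SVM", LinearSVC()), ("LogReg", LogisticRegression())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(utterances, intents)
    print(name, model.predict(["please restart site 42"]))
```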

    Analysis and Application of Language Models to Human-Generated Textual Content

    Social networks are enormous sources of human-generated content. Users continuously create information that is useful but hard to detect, extract, and categorize. Language Models (LMs) have long been among the most useful and widely used approaches to processing textual data. First designed as simple unigram models, they improved over the years up to the recent release of BERT, a pre-trained Transformer-based model that reaches state-of-the-art performance on many heterogeneous benchmark tasks, such as text classification and tagging. In this thesis, I apply LMs to textual content publicly shared on social media. I selected Twitter as the principal source of data for the experiments, since its users mainly share short and noisy texts. My goal is to build models that generate meaningful representations of users, encoding their syntactic and semantic features. Once appropriate embeddings are defined, I compute similarities between users to perform higher-level analyses. The tested tasks include the extraction of emerging knowledge, represented by users similar to a given set of well-known accounts; controversy detection, producing controversy scores for topics discussed online; community detection and characterization, clustering similar users and detecting outliers; and stance classification of users and tweets (e.g., political inclination, position on COVID-19 vaccines). The obtained results suggest that publicly available data contains delicate information about users, and Language Models can now extract it, threatening users' privacy.
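    As a rough sketch of how user embeddings could be derived from tweets with a BERT-style model, the snippet below mean-pools token embeddings from Hugging Face's bert-base-uncased over each user's tweets and compares users by cosine similarity. The model choice, pooling strategy, and sample tweets are assumptions for illustration, not the thesis's exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def user_embedding(tweets: list[str]) -> torch.Tensor:
    """Embed each tweet by mean pooling, then average over the user's tweets."""
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (tweets, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding tokens
    tweet_vecs = (hidden * mask).sum(1) / mask.sum(1)
    return tweet_vecs.mean(0)

# Hypothetical tweet samples for two users; higher score = more similar users.
u1 = user_embedding(["vaccines save lives", "got my booster today"])
u2 = user_embedding(["new phone review out now", "unboxing video tonight"])
print(torch.cosine_similarity(u1, u2, dim=0).item())
```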