4,082 research outputs found

    Text classification supervised algorithms with term frequency inverse document frequency and global vectors for word representation: a comparative study

    Over the previous two decades, the quantity of digitally stored text documents has risen steadily. Text categorization is the automated organization and classification of those documents into a set of predefined categories so they may be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification remains a challenge for researchers, given the concept's significant impact on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to capture complex models and nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machine (SVM), and ensemble learning algorithms, assessed with various evaluation techniques. Thereafter, the constraints of each technique and its applicability to real-life situations are evaluated.
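    For illustration, the TF-IDF-plus-SVM combination examined in the study can be sketched in a few lines of scikit-learn; the toy corpus, labels, and test sentence below are invented for the example, not taken from the paper.

```python
# A minimal sketch of a TF-IDF + linear SVM text classifier; the toy
# documents and category labels are illustrative, not the study's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["great product, fast shipping", "terrible support, item broken",
        "meeting agenda attached", "please review the quarterly report"]
labels = ["review", "review", "work", "work"]

# TF-IDF turns each document into a weighted term vector; LinearSVC
# learns a separating hyperplane over those vectors.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["broken on arrival, awful support"]))  # likely ['review']
```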

    Analysis of Abstractive and Extractive Summarization Methods

    This paper reviews existing approaches to automatic text summarization. Summarization is part of the natural language processing (NLP) field and is applied to a source document to produce a compact version that preserves its aggregate meaning and key concepts. Broadly, text summarization approaches fall into two groups: abstractive and extractive. In abstractive summarization, the main contents of the input text are paraphrased, possibly using vocabulary that is not present in the source document, while in extractive summarization, the output summary is a subset of the input text, generated by ranking and selecting sentences. The paper broadly discusses the main ideas behind the existing abstractive and extractive methods and highlights a comparative study of them.
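    A minimal sketch of the extractive, sentence-ranking idea described above follows; the mean-TF-IDF scoring rule and the toy document are illustrative choices, not a specific method surveyed in the paper.

```python
# Extractive summarization by sentence ranking: score each sentence,
# keep the top-scoring ones in their original order. Mean TF-IDF weight
# is one common scoring scheme among many.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(sentences, k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)   # sentence-term matrix
    scores = np.asarray(tfidf.mean(axis=1)).ravel()      # mean weight per sentence
    top = sorted(np.argsort(scores)[-k:])                # top-k, original order
    return " ".join(sentences[i] for i in top)

doc = ["NLP studies how computers process language.",
       "Summarization condenses a document into a short version.",
       "Extractive methods select existing sentences from the source.",
       "The weather was pleasant that day."]
print(extractive_summary(doc, k=2))
```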

    Statistical analysis of grouped text documents

    The topic of this thesis is statistical models for the analysis of textual data, emphasizing contexts in which text samples are grouped. When dealing with text data, the first issue is to process it, making it computationally and methodologically compatible with the existing mathematical and statistical methods produced and continually developed by the scientific community. Therefore, the thesis first reviews existing methods for analytically representing and processing textual datasets, including Vector Space Models, distributed representations of words and documents, and contextualized embeddings. This review standardizes a notation that, even within the same representation approach, appears highly heterogeneous in the literature. Two domains of application are then explored: social media and cultural tourism. Regarding the former, a study is proposed on self-presentation among diverse groups of individuals on the StockTwits platform, where finance and stock markets are the dominant topics. The proposed methodology integrated various types of data, both textual and categorical. This study revealed insights into how people present themselves online and found recurring behavioral patterns within groups of users. Regarding the latter, the thesis delves into a study conducted as part of the "Data Science for Brescia - Arts and Cultural Places" project, in which a language model was trained to classify Italian-written online reviews into four distinct semantic areas related to cultural attractions in the Italian city of Brescia. The proposed model allows for the identification of attractions in text documents, even when they are not explicitly mentioned in document metadata, thus opening possibilities for expanding the database of these cultural attractions with new sources, such as social media platforms, forums, and other online spaces. Lastly, the thesis presents a methodological study examining the group-specificity of words, analyzing various group-specificity estimators proposed in the literature. The study considered grouped text documents with both outcome and group variables. Its contribution is a proposal to model the corpus of documents as a multivariate distribution, enabling the simulation of corpora of text documents with predefined characteristics. The simulation provided valuable insights into the relationship between groups of documents and words. Furthermore, all its results can be freely explored through a web application, whose components are also described in this manuscript. In conclusion, this thesis has been conceived as a collection of papers; it aims to contribute to the field with both applications and methodological proposals, and each study presented here suggests paths for future research to address the challenges in the analysis of grouped textual data.
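    The corpus-simulation idea from the final study can be pictured with a short sketch: each group's word counts are drawn from a multinomial distribution with known probabilities, and a smoothed log-odds ratio serves as one possible group-specificity estimator. The vocabulary, probabilities, and sample sizes below are invented, and the estimator is an illustrative stand-in for those analyzed in the thesis.

```python
# Simulate two groups' word counts from known multinomials, then score
# group-specificity with a smoothed log-odds ratio.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["price", "museum", "ticket", "food", "art"]
p_a = np.array([0.40, 0.05, 0.25, 0.25, 0.05])   # group A word probabilities
p_b = np.array([0.10, 0.35, 0.10, 0.10, 0.35])   # group B word probabilities

counts_a = rng.multinomial(n=2000, pvals=p_a)     # simulated group-A corpus
counts_b = rng.multinomial(n=2000, pvals=p_b)     # simulated group-B corpus

# Positive scores mark A-specific words, negative scores B-specific ones.
log_odds = np.log((counts_a + 1) / (counts_a.sum() + len(vocab))) \
         - np.log((counts_b + 1) / (counts_b.sum() + len(vocab)))
for word, score in sorted(zip(vocab, log_odds), key=lambda t: -t[1]):
    print(f"{word:>8}: {score:+.2f}")
```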

    Multidisciplinary perspectives on Artificial Intelligence and the law

    This open access book presents an interdisciplinary, multi-authored, edited collection of chapters on Artificial Intelligence (‘AI’) and the Law. AI technology has come to play a central role in the modern data economy. Through a combination of increased computing power, the growing availability of data and the advancement of algorithms, AI has now become an umbrella term for some of the most transformational technological breakthroughs of this age. The importance of AI stems from both the opportunities that it offers and the challenges that it entails. While AI applications hold the promise of economic growth and efficiency gains, they also create significant risks and uncertainty. The potential and perils of AI have thus come to dominate modern discussions of technology and ethics – and although AI was initially allowed to largely develop without guidelines or rules, few would deny that the law is set to play a fundamental role in shaping the future of AI. As the debate over AI is far from over, the need for rigorous analysis has never been greater. This book thus brings together contributors from different fields and backgrounds to explore how the law might provide answers to some of the most pressing questions raised by AI. An outcome of the Católica Research Centre for the Future of Law and its interdisciplinary working group on Law and Artificial Intelligence, it includes contributions by leading scholars in the fields of technology, ethics and the law.

    Unifying context with labeled property graph: A pipeline-based system for comprehensive text representation in NLP

    Extracting valuable insights from vast amounts of unstructured digital text presents significant challenges across diverse domains. This research addresses this challenge by proposing a novel pipeline-based system that generates domain-agnostic and task-agnostic text representations. The proposed approach leverages labeled property graphs (LPG) to encode contextual information, facilitating the integration of diverse linguistic elements into a unified representation. The proposed system enables efficient graph-based querying and manipulation by addressing the crucial aspect of comprehensive context modeling and fine-grained semantics. The effectiveness of the proposed system is demonstrated through the implementation of NLP components that operate on LPG-based representations. Additionally, the proposed approach introduces specialized patterns and algorithms to enhance specific NLP tasks, including nominal mention detection, named entity disambiguation, event enrichments, event participant detection, and temporal link detection. The evaluation of the proposed approach, using the MEANTIME corpus comprising manually annotated documents, provides encouraging results and valuable insights into the system's strengths. The proposed pipeline-based framework serves as a solid foundation for future research, aiming to refine and optimize LPG-based graph structures to generate comprehensive and semantically rich text representations, addressing the challenges associated with efficient information extraction and analysis in NLP.
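    The LPG encoding can be pictured with a small sketch using networkx as a stand-in for a property-graph store; the node labels, properties, and edge types below are invented for illustration and do not reproduce the paper's schema.

```python
# Labeled property graph: nodes and edges carry a label/type plus
# arbitrary key-value properties, linking linguistic layers.
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("tok1", label="Token", text="Rome", pos="PROPN")
g.add_node("ent1", label="Entity", kind="Location", canonical="Rome, Italy")
g.add_node("evt1", label="Event", predicate="visit", tense="past")
g.add_edge("tok1", "ent1", type="MENTIONS", confidence=0.93)
g.add_edge("evt1", "ent1", type="HAS_PARTICIPANT", role="destination")

# Graph-based querying: list each event's participants and roles.
for u, v, data in g.edges(data=True):
    if data["type"] == "HAS_PARTICIPANT":
        print(g.nodes[u]["predicate"], "->", g.nodes[v]["canonical"],
              f"({data['role']})")
```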

    GPT models in construction industry: Opportunities, limitations, and a use case validation

    Large Language Models (LLMs) trained on large datasets came into prominence in 2018 after Google introduced BERT. Subsequently, different LLMs such as OpenAI's GPT models have been released. These models perform well on diverse tasks and have been gaining widespread application in fields such as business and education. However, little is known about the opportunities and challenges of using LLMs in the construction industry. This study therefore assesses GPT models in the construction industry. A critical review, expert discussion and case study validation are employed to achieve the study's objectives. The findings reveal opportunities for GPT models throughout the project lifecycle. The challenges of leveraging GPT models are highlighted, and a use case prototype is developed for materials selection and optimization. The findings would benefit researchers, practitioners and stakeholders, as they present research vistas for LLMs in the construction industry.
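    As a hedged illustration of such a use case, the sketch below wires a materials-selection prompt to the OpenAI Python client; the model name, prompt wording, and optimization criteria are assumptions for the example, not the prototype described in the study.

```python
# Illustrative materials-selection query via the OpenAI Python client
# (openai>=1.0). Model choice and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Suggest three facade cladding materials for a coastal mid-rise office "
    "building. Optimize for corrosion resistance, embodied carbon, and cost; "
    "return a ranked list with a one-line justification per material."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```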

    Sustainable digital marketing under big data: an AI random forest model approach

    Digital marketing is the promotion, sale, and delivery of products or services through online platforms and channels, using the internet and electronic devices. Its aim is to attract and engage target audiences through various strategies and methods, driving brand promotion and sales growth. The primary objective of this study is to integrate big data analytics and artificial intelligence (AI) technology into digital marketing, thereby fostering the progression and optimization of sustainable digital marketing practices. First, the characteristics and applications of big data, which involves vast, diverse, and complex datasets, are analyzed; understanding their attributes and scope of application is essential. Next, AI-driven learning mechanisms are investigated, culminating in the development of an AI random forest model (RFM) tailored for sustainable digital marketing. Then, in a real-world case study of enterprise X, basic customer data is collected and analyzed, and the RFM model is deployed to predict the enterprise's expected number of prospective customers. The empirical findings show a pronounced prevalence of university-affiliated individuals across diverse age cohorts. In the occupational distribution of the customer base, workers and educators dominate, constituting 41% and 31% of the demographic, respectively. The price distribution of customers is skewed: the 0–150 bracket covers 17% of the population and the 150–300 range a notable 52%, so these bands together constitute a substantial proportion, whereas the range exceeding 450 accounts for less than 20%. Notably, the RFM model accurately projects customer volumes over a seven-day horizon, significantly surpassing the predictive capability of logistic regression. The AI-driven RFM model thus excels at precisely anticipating target customer counts, furnishing a pragmatic foundation for the intelligent evolution of sustainable digital marketing strategies.
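    The core model comparison reported above can be reproduced in outline with scikit-learn; the synthetic dataset below stands in for enterprise X's customer records, which are not public.

```python
# Random forest vs. logistic regression on synthetic customer features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              LogisticRegression(max_iter=1000)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__, accuracy_score(y_te, pred))
```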

    Using a Novel Hybrid Krill Herd and Bat based Recurrent Replica to Estimate the Sentiment Values of Twitter based Political Data

    Big data is an essential part of the modern world, with direct applications in many functions, and Twitter is a major social network whose streams reflect political information. However, big data sentiment analysis for opinion mining is challenging when the information is complex, and sentiment analysis of Twitter-based multilingual political datasets, such as Hindi and English, is especially difficult. Therefore, this paper introduces a novel Hybrid Krill Herd and Bat-based Recurrent Replica (HKHBRR) to evaluate the sentiment values of Twitter-based political data. Here, the fitness functions of the krill herd and bat optimization models are initialized in the dense layer to enhance accuracy and precision and to reduce the error rate. Twitter-based political datasets are taken as input and used to train the proposed approach, and the proposed deep learning technique is implemented in the Python framework. The outcomes of the developed model are compared with existing techniques and attain the best results: 98.68% accuracy and 0.5% error.
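    The krill herd and bat hybrid itself is not reproduced here, but the underlying pattern, a population of candidate hyperparameters perturbed and kept when a fitness function improves, can be sketched generically; the toy fitness function below stands in for training the recurrent model and measuring validation accuracy.

```python
# Generic population-based metaheuristic search over two hyperparameters.
import random

def fitness(params):
    # Stand-in for "train model, return validation score"; this toy
    # function peaks at lr=0.01, units=64.
    lr, units = params
    return -((lr - 0.01) ** 2 * 1e4 + ((units - 64) / 64) ** 2)

population = [(random.uniform(1e-4, 0.1), random.randint(8, 256))
              for _ in range(10)]
for _ in range(50):
    for i, (lr, units) in enumerate(population):
        candidate = (lr * random.uniform(0.5, 1.5),
                     max(8, units + random.randint(-16, 16)))
        if fitness(candidate) > fitness(population[i]):
            population[i] = candidate   # greedy move, echoing herd/bat updates

best = max(population, key=fitness)
print(f"best lr={best[0]:.4f}, units={best[1]}")
```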

    SeeChart: Enabling Accessible Visualizations Through Interactive Natural Language Interface For People with Visual Impairments

    Web-based data visualizations have become very popular for exploring data and communicating insights. Newspapers, journals, and reports regularly publish visualizations to tell compelling stories with data. Unfortunately, most visualizations are inaccessible to readers with visual impairments. For many charts on the web, there are no accompanying alternative (alt) texts, and even if such texts exist they do not adequately describe important insights from charts. To address the problem, we first interviewed 15 blind users to understand their challenges and requirements for reading data visualizations. Based on the insights from these interviews, we developed SeeChart, an interactive tool that automatically deconstructs charts from web pages and then converts them to accessible visualizations for blind people by enabling them to hear the chart summary as well as to interact with data points using the keyboard. Our evaluation with 14 blind participants suggests the efficacy of SeeChart in understanding key insights from charts and fulfilling their information needs while reducing their required time and cognitive burden.
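    The chart-summarization step can be pictured with a small sketch: given a deconstructed chart's data series, emit a short textual summary that a screen reader can speak. The summary template below is an assumption for illustration, not SeeChart's actual output format.

```python
# Turn a chart's data series into a spoken-friendly summary string.
def summarize_series(title, labels, values):
    hi, lo = max(values), min(values)
    trend = "rises" if values[-1] > values[0] else "falls"
    return (f"{title}: {len(values)} points; {trend} from {values[0]} to "
            f"{values[-1]}; peak {hi} at {labels[values.index(hi)]}, "
            f"low {lo} at {labels[values.index(lo)]}.")

print(summarize_series("Monthly visitors",
                       ["Jan", "Feb", "Mar", "Apr"],
                       [120, 150, 90, 180]))
```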