8,834 research outputs found
Statistical analysis of grouped text documents
L'argomento di questa tesi sono i modelli statistici per l'analisi dei dati testuali, con particolare attenzione ai contesti in cui i campioni di testo sono raggruppati.
Quando si ha a che fare con dati testuali, il primo problema è quello di elaborarli, per renderli compatibili dal punto di vista computazionale e metodologico con i metodi matematici e statistici prodotti e continuamente sviluppati dalla comunità scientifica. Per questo motivo, la tesi passa in rassegna i metodi esistenti per la rappresentazione analitica e l'elaborazione di campioni di dati testuali, compresi i "Vector Space Models", le "rappresentazioni distribuite" di parole e documenti e i "contextualized embeddings". Questa rassegna comporta la standardizzazione di una notazione che, anche all'interno dello stesso approccio di rappresentazione, appare molto eterogenea in letteratura.
Vengono poi esplorati due domini di applicazione: i social media e il turismo culturale. Per quanto riguarda il primo, viene proposto uno studio sull'autodescrizione di gruppi diversi di individui sulla piattaforma StockTwits, dove i mercati finanziari sono gli argomenti dominanti. La metodologia proposta ha integrato diversi tipi di dati, sia testuali che variabili categoriche. Questo studio ha agevolato la comprensione sul modo in cui le persone si presentano online e ha trovato stutture di comportamento ricorrenti all'interno di gruppi di utenti.
Per quanto riguarda il turismo culturale, la tesi approfondisce uno studio condotto nell'ambito del progetto "Data Science for Brescia - Arts and Cultural Places", in cui è stato addestrato un modello linguistico per classificare le recensioni online scritte in italiano in quattro aree semantiche distinte relative alle attrazioni culturali della città di Brescia. Il modello proposto permette di identificare le attrazioni nei documenti di testo, anche quando non sono esplicitamente menzionate nei metadati del documento, aprendo cosÏ la possibilità di espandere il database relativo a queste attrazioni culturali con nuove fonti, come piattaforme di social media, forum e altri spazi online.
Infine, la tesi presenta uno studio metodologico che esamina la specificitĂ di gruppo delle parole, analizzando diversi stimatori di specificitĂ di gruppo proposti in letteratura. Lo studio ha preso in considerazione documenti testuali raggruppati con variabile di "outcome" e variabile di gruppo. Il suo contributo consiste nella proposta di modellare il corpus di documenti come una distribuzione multivariata, consentendo la simulazione di corpora di documenti di testo con caratteristiche predefinite. La simulazione ha fornito preziose indicazioni sulla relazione tra gruppi di documenti e parole. Inoltre, tutti i risultati possono essere liberamente esplorati attraverso un'applicazione web, i cui componenti sono altresĂŹ descritti in questo manoscritto.
In conclusione, questa tesi è stata concepita come una raccolta di studi, ognuno dei quali suggerisce percorsi di ricerca futuri per affrontare le sfide dell'analisi dei dati testuali raggruppati.The topic of this thesis is statistical models for the analysis of textual data, emphasizing contexts in which text samples are grouped.
When dealing with text data, the first issue is to process it, making it computationally and methodologically compatible with the existing mathematical and statistical methods produced and continually developed by the scientific community. Therefore, the thesis firstly reviews existing methods for analytically representing and processing textual datasets, including Vector Space Models, distributed representations of words and documents, and contextualized embeddings. It realizes this review by standardizing a notation that, even within the same representation approach, appears highly heterogeneous in the literature.
Then, two domains of application are explored: social media and cultural tourism. About the former, a study is proposed about self-presentation among diverse groups of individuals on the StockTwits platform, where finance and stock markets are the dominant topics. The methodology proposed integrated various types of data, including textual and categorical data. This study revealed insights into how people present themselves online and found recurring patterns within groups of users.
About the latter, the thesis delves into a study conducted as part of the "Data Science for Brescia - Arts and Cultural Places" Project, where a language model was trained to classify Italian-written online reviews into four distinct semantic areas related to cultural attractions in the Italian city of Brescia. The model proposed allows for the identification of attractions in text documents, even when not explicitly mentioned in document metadata, thus opening possibilities for expanding the database related to these cultural attractions with new sources, such as social media platforms, forums, and other online spaces.
Lastly, the thesis presents a methodological study examining the group-specificity of words, analyzing various group-specificity estimators proposed in the literature. The study considered grouped text documents with both outcome and group variables. Its contribution consists of the proposal of modeling the corpus of documents as a multivariate distribution, enabling the simulation of corpora of text documents with predefined characteristics. The simulation provided valuable insights into the relationship between groups of documents and words. Furthermore, all its results can be freely explored through a web application, whose components are also described in this manuscript.
In conclusion, this thesis has been conceived as a collection of papers. It aimed to contribute to the field with both applications and methodological proposals, and each study presented here suggests paths for future research to address the challenges in the analysis of grouped textual data
UMSL Bulletin 2023-2024
The 2023-2024 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1088/thumbnail.jp
Are Trade Rules Undermining Taxation of the Digital Economy in Africa?
African countries are currently considering provisions in the AfCFTA and at the WTO to
liberalise digital trade. As they face mounting fiscal pressures, it is imperative that they
beware the implications of digital trade provisions for their ability to tax their digital economy.
In this paper, we develop a comprehensive framework for analysing the impact of trade rules
on tax regimes in the digital economy, with a focus on Kenya, Rwanda, and South Africa. We
explore how trade rules ostensibly shape tax policies and their implications for revenue
generation. By examining rules regulating trade in services and the imposition of customs
duties on electronic transmissions, we identify how these rules may directly impact tax
policies and limit revenue generation possibilities. Moreover, digital trade rules, such as
those related to data flows, localisation, and source code sharing, have the capacity to
produce both indirect and administrative effects on tax measures. These rules can alter tax
structures, taxation rights, data collection, and the capacity to monitor and implement tax
measures. Our findings shed light on the complex interplay between trade rules and tax
measures, highlighting potential challenges and opportunities for revenue generation from
the digital economy in African countries
Recommended from our members
Mitigating Data Scarcity for Neural Language Models
In recent years, pretrained neural language models (PNLMs) have taken the field of natural language processing by storm, achieving new benchmarks and state-of-theart performances. These models often rely heavily on annotated data, which may not always be available. Data scarcity are commonly found in specialized domains, such as medical, or in low-resource languages that are underexplored by AI research. In this dissertation, we focus on mitigating data scarcity using data augmentation and neural ensemble learning techniques for neural language models. In both research directions, we implement neural network algorithms and evaluate their impact on assisting neural language models in downstream NLP tasks. Specifically, for data augmentation, we explore two techniques: 1) creating positive training data by moving an answer span around its original context and 2) using text simplification techniques to introduce a variety of writing styles to the original training data. Our results indicate that these simple and effective solutions improve the performance of neural language models considerably in low-resource NLP domains and tasks. For neural ensemble learning, we use a multi-label neural classifier to select the best prediction outcome from a variety of individual pretrained neural language models trained for a low-resource medical text simplification task
UMSL Bulletin 2022-2023
The 2022-2023 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1087/thumbnail.jp
Low- and high-resource opinion summarization
Customer reviews play a vital role in the online purchasing decisions we make. The reviews
express user opinions that are useful for setting realistic expectations and uncovering important
details about products. However, some products receive hundreds or even thousands of
reviews, making them time-consuming to read. Moreover, many reviews contain uninformative
content, such as irrelevant personal experiences. Automatic summarization offers an
alternative â short text summaries capturing the essential information expressed in reviews.
Automatically produced summaries can reflect overall or particular opinions and be tailored to
user preferences. Besides being presented on major e-commerce platforms, home assistants
can also vocalize them. This approach can improve user satisfaction by assisting in making
faster and better decisions.
Modern summarization approaches are based on neural networks, often requiring thousands of
annotated samples for training. However, human-written summaries for products are expensive
to produce because annotators need to read many reviews. This has led to annotated data
scarcity where only a few datasets are available. Data scarcity is the central theme of our
works, and we propose a number of approaches to alleviate the problem. The thesis consists
of two parts where we discuss low- and high-resource data settings.
In the first part, we propose self-supervised learning methods applied to customer reviews
and few-shot methods for learning from small annotated datasets. Customer reviews without
summaries are available in large quantities, contain a breadth of in-domain specifics, and
provide a powerful training signal. We show that reviews can be used for learning summarizers
via a self-supervised objective. Further, we address two main challenges associated with
learning from small annotated datasets. First, large models rapidly overfit on small datasets
leading to poor generalization. Second, it is not possible to learn a wide range of in-domain
specifics (e.g., product aspects and usage) from a handful of gold samples. This leads to
subtle semantic mistakes in generated summaries, such as âgreat dead on arrival battery.â We
address the first challenge by explicitly modeling summary properties (e.g., content coverage
and sentiment alignment). Furthermore, we leverage small modules â adapters â that are
more robust to overfitting. As we show, despite their size, these modules can be used to
store in-domain knowledge to reduce semantic mistakes. Lastly, we propose a simple method
for learning personalized summarizers based on aspects, such as âprice,â âbattery life,â and
âresolution.â This task is harder to learn, and we present a few-shot method for training a
query-based summarizer on small annotated datasets.
In the second part, we focus on the high-resource setting and present a large dataset with
summaries collected from various online resources. The dataset has more than 33,000 humanwritten
summaries, where each is linked up to thousands of reviews. This, however, makes it
challenging to apply an âexpensiveâ deep encoder due to memory and computational costs. To
address this problem, we propose selecting small subsets of informative reviews. Only these
subsets are encoded by the deep encoder and subsequently summarized. We show that the
selector and summarizer can be trained end-to-end via amortized inference and policy gradient
methods
Ecological and confined domain ontology construction scheme using concept clustering for knowledge management
Knowledge management in a structured system is a complicated task that requires common, standardized methods that are acceptable to all actors in a system. Ontology, in this regard, is a primary element and plays a central role in knowledge management, interoperability between various departments, and better decision making. The ontology construction for structured systems comprises logical and structural complications. Researchers have already proposed a variety of domain ontology construction schemes. However, these schemes do not involve some important phases of ontology construction that make ontologies more collaborative. Furthermore, these schemes do not provide details of the activities and methods involved in the construction of an ontology, which may cause difficulty in implementing the ontology. The major objectives of this research were to provide a comparison between some existing ontology construction schemes and to propose an enhanced ecological and confined domain ontology construction (EC-DOC) scheme for structured knowledge management. The proposed scheme introduces five important phases to construct an ontology, with a major focus on the conceptualizing and clustering of domain concepts. In the conceptualization phase, a glossary of domain-related concepts and their properties is maintained, and a Fuzzy C-Mean soft clustering mechanism is used to form the clusters of these concepts. In addition, the localization of concepts is instantly performed after the conceptualization phase, and a translation file of localized concepts is created. The EC-DOC scheme can provide accurate concepts regarding the terms for a specific domain, and these concepts can be made available in a preferred local language
Food, Freedom, Fairness, and the Family Farm
The concept of the âfamily farmâ holds powerful sway within the American narrative, embodying both nostalgia for an imagined past and anxiety for a future perceived to be under threat. Since the founding of the United States, this cultural ideal has been invoked in support of a rosy vision of agrarian democracy while obscuring the ways in which the U.S. Department of Agricultureâs codified definition of âfamily farmâ has unfairly aggregated advantages for the benefit of a particular kind of family (nuclear) and farmer (white, male, straight). At the same time, consumers are misled by an under-interrogated conflation of family farming with âgoodâ farming practices. There exists a pervasive fear among Americans that the family farm is at risk of disappearing, and that something must be done to save it. This Essay analyzes the history of family farms in the United States and contends that reclaiming, not rescuing, is what needs to be done. As an alternative to preserving an institution whose benefits have always been constrained by gender, race, and wealth, we propose instead re-orienting efforts toward three concepts rooted in the family farm ideal but which we believe to possess greater transformative potential: fairnessâthe distribution of benefits along the agrifood chain to ensure adequate compensation and access; self-determinationâthe ability for communities to make their own decisions within the food system; and âgoodâ farmingâthe specific practices that could lead to a more just, humane, and sustainable food system
Relphormer: Relational Graph Transformer for Knowledge Graph Representations
Transformers have achieved remarkable performance in widespread fields,
including natural language processing, computer vision and graph mining.
However, vanilla Transformer architectures have not yielded promising
improvements in the Knowledge Graph (KG) representations, where the
translational distance paradigm dominates this area. Note that vanilla
Transformer architectures struggle to capture the intrinsically heterogeneous
structural and semantic information of knowledge graphs. To this end, we
propose a new variant of Transformer for knowledge graph representations dubbed
Relphormer. Specifically, we introduce Triple2Seq which can dynamically sample
contextualized sub-graph sequences as the input to alleviate the heterogeneity
issue. We propose a novel structure-enhanced self-attention mechanism to encode
the relational information and keep the semantic information within entities
and relations. Moreover, we utilize masked knowledge modeling for general
knowledge graph representation learning, which can be applied to various
KG-based tasks including knowledge graph completion, question answering, and
recommendation. Experimental results on six datasets show that Relphormer can
obtain better performance compared with baselines. Code is available in
https://github.com/zjunlp/Relphormer.Comment: Work in progres
- âŚ