1,525 research outputs found

    Statistical analysis of grouped text documents

    Get PDF
    L'argomento di questa tesi sono i modelli statistici per l'analisi dei dati testuali, con particolare attenzione ai contesti in cui i campioni di testo sono raggruppati. Quando si ha a che fare con dati testuali, il primo problema è quello di elaborarli, per renderli compatibili dal punto di vista computazionale e metodologico con i metodi matematici e statistici prodotti e continuamente sviluppati dalla comunità scientifica. Per questo motivo, la tesi passa in rassegna i metodi esistenti per la rappresentazione analitica e l'elaborazione di campioni di dati testuali, compresi i "Vector Space Models", le "rappresentazioni distribuite" di parole e documenti e i "contextualized embeddings". Questa rassegna comporta la standardizzazione di una notazione che, anche all'interno dello stesso approccio di rappresentazione, appare molto eterogenea in letteratura. Vengono poi esplorati due domini di applicazione: i social media e il turismo culturale. Per quanto riguarda il primo, viene proposto uno studio sull'autodescrizione di gruppi diversi di individui sulla piattaforma StockTwits, dove i mercati finanziari sono gli argomenti dominanti. La metodologia proposta ha integrato diversi tipi di dati, sia testuali che variabili categoriche. Questo studio ha agevolato la comprensione sul modo in cui le persone si presentano online e ha trovato stutture di comportamento ricorrenti all'interno di gruppi di utenti. Per quanto riguarda il turismo culturale, la tesi approfondisce uno studio condotto nell'ambito del progetto "Data Science for Brescia - Arts and Cultural Places", in cui è stato addestrato un modello linguistico per classificare le recensioni online scritte in italiano in quattro aree semantiche distinte relative alle attrazioni culturali della città di Brescia. Il modello proposto permette di identificare le attrazioni nei documenti di testo, anche quando non sono esplicitamente menzionate nei metadati del documento, aprendo così la possibilità di espandere il database relativo a queste attrazioni culturali con nuove fonti, come piattaforme di social media, forum e altri spazi online. Infine, la tesi presenta uno studio metodologico che esamina la specificità di gruppo delle parole, analizzando diversi stimatori di specificità di gruppo proposti in letteratura. Lo studio ha preso in considerazione documenti testuali raggruppati con variabile di "outcome" e variabile di gruppo. Il suo contributo consiste nella proposta di modellare il corpus di documenti come una distribuzione multivariata, consentendo la simulazione di corpora di documenti di testo con caratteristiche predefinite. La simulazione ha fornito preziose indicazioni sulla relazione tra gruppi di documenti e parole. Inoltre, tutti i risultati possono essere liberamente esplorati attraverso un'applicazione web, i cui componenti sono altresì descritti in questo manoscritto. In conclusione, questa tesi è stata concepita come una raccolta di studi, ognuno dei quali suggerisce percorsi di ricerca futuri per affrontare le sfide dell'analisi dei dati testuali raggruppati.The topic of this thesis is statistical models for the analysis of textual data, emphasizing contexts in which text samples are grouped. When dealing with text data, the first issue is to process it, making it computationally and methodologically compatible with the existing mathematical and statistical methods produced and continually developed by the scientific community. Therefore, the thesis firstly reviews existing methods for analytically representing and processing textual datasets, including Vector Space Models, distributed representations of words and documents, and contextualized embeddings. It realizes this review by standardizing a notation that, even within the same representation approach, appears highly heterogeneous in the literature. Then, two domains of application are explored: social media and cultural tourism. About the former, a study is proposed about self-presentation among diverse groups of individuals on the StockTwits platform, where finance and stock markets are the dominant topics. The methodology proposed integrated various types of data, including textual and categorical data. This study revealed insights into how people present themselves online and found recurring patterns within groups of users. About the latter, the thesis delves into a study conducted as part of the "Data Science for Brescia - Arts and Cultural Places" Project, where a language model was trained to classify Italian-written online reviews into four distinct semantic areas related to cultural attractions in the Italian city of Brescia. The model proposed allows for the identification of attractions in text documents, even when not explicitly mentioned in document metadata, thus opening possibilities for expanding the database related to these cultural attractions with new sources, such as social media platforms, forums, and other online spaces. Lastly, the thesis presents a methodological study examining the group-specificity of words, analyzing various group-specificity estimators proposed in the literature. The study considered grouped text documents with both outcome and group variables. Its contribution consists of the proposal of modeling the corpus of documents as a multivariate distribution, enabling the simulation of corpora of text documents with predefined characteristics. The simulation provided valuable insights into the relationship between groups of documents and words. Furthermore, all its results can be freely explored through a web application, whose components are also described in this manuscript. In conclusion, this thesis has been conceived as a collection of papers. It aimed to contribute to the field with both applications and methodological proposals, and each study presented here suggests paths for future research to address the challenges in the analysis of grouped textual data

    Redefining Disproportionate Arrest Rates: An Exploratory Quasi-Experiment that Reassesses the Role of Skin Tone

    Get PDF
    The New York Times reported that Black Lives Matter was the third most-read subject of 2020. These articles brought to the forefront the question of disparity in arrest rates for darker-skinned people. Questioning arrest disparity is understandable because virtually everything known about disproportionate arrest rates has been a guess, and virtually all prior research on disproportionate arrest rates is questionable because of improper benchmarking (the denominator effect). Current research has highlighted the need to switch from demographic data to skin tone data and start over on disproportionate arrest rate research; therefore, this study explored the relationship between skin tone and disproportionate arrest rates. This study also sought to determine which of the three theories surrounding disproportionate arrests is most predictive of disproportionate rates. The current theories are that disproportionate arrests increase as skin tone gets darker (stereotype threat theory), disproportionate rates are different for Black and Brown people (self-categorization theory), or disproportionate rates apply equally across all darker skin colors (social dominance theory). This study used a quantitative exploratory quasi-experimental design using linear spline regression to analyze arrest rates in Alachua County, Florida, before and after the county’s mandate to reduce arrests as much as possible during the COVID-19 pandemic to protect the prison population. The study was exploratory as no previous study has used skin tone analysis to examine arrest disparity. The findings of this study redefines the understanding of the existence and nature of disparities in arrest rates and offer a solid foundation for additional studies about the relationship between disproportionate arrest rates and skin color

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 251, ITCS 2023, Complete Volum

    2023-2024 Catalog

    Get PDF
    The 2023-2024 Governors State University Undergraduate and Graduate Catalog is a comprehensive listing of current information regarding:Degree RequirementsCourse OfferingsUndergraduate and Graduate Rules and Regulation

    Out-of-Distribution Generalization of Deep Learning to Illuminate Dark Protein Functional Space

    Full text link
    Dark protein illumination is a fundamental challenge in drug discovery where majority human proteins are understudied, i.e. with only known protein sequence but no known small molecule binder. It\u27s a major road block to enable drug discovery paradigm shift from single-targeted which looks to identify a single target and design drug to regulate the single target to multi-targeted in a Systems Pharmacology perspective. Diseases such as Alzheimer\u27s and Opioid-Use-Disorder plaguing millions of patients call for effective multi-targeted approach involving dark proteins. Using limited protein data to predict dark protein property requires deep learning systems with OOD generalization capacity. Out-of-Distribution (OOD) generalization is a problem hindering the application and adoption of deep learning in real world problems. Classic deep learning setting in contrast is assuming training and testing data are independent identically distributed (iid). A well trained model under iid setting with reported 98% accuracy could deteriorate to worse than random guess in deployment to OOD data significantly different from training data. Numerous techniques in the research field emerged but are only addressing some specific OOD scenario instead of a general one. Dark protein illumination has unique complexity comparing to common deep learning tasks. There are three OOD axes, protein-OOD, compound-OOD, interaction-OOD. Previous research have only focused on compound-OOD, where new compound design algorithms are developed but still for 500 common proteins, instead of whole human genome 20,000 proteins, and only for single-targeted paradigm instead of multi-targeted. Focusing on an instrumental problem in drug discovery, dark protein function illumination problem is introduced from the OOD perspective. A series of dark protein OOD algorithms are developed to predict dark protein ligand interaction where multiple instrumental deep learning techniques are adapted to the biology context. By proposing the dark protein illumination problem, highlighting the neglected axes, demonstrating possibilities, numerous diseases now embrace new hopes

    Toward Efficient and Robust Computer Vision for Large-Scale Edge Applications

    Get PDF
    The past decade has been witnessing remarkable advancements in computer vision and deep learning algorithms, ushering in a transformative wave of large-scale edge applications across various industries. These image processing methods, however, still encounter numerous challenges when it comes to meeting real-world demands, especially in terms of accuracy and latency at scale. Indeed, striking a balance among efficiency, robustness, and scalability remains a common obstacle. This dissertation investigates these issues in the context of different computer vision tasks, including image classification, semantic segmentation, depth estimation, and object detection. We introduce novel solutions, focusing on utilizing adjustable neural networks, joint multi-task architecture search, and generalized supervision interpolation. The first obstacle revolves around the ability to trade off between speed and accuracy in convolutional neural networks (CNNs) during inference on resource-constrained platforms. Despite their progress, CNNs are typically monolithic at runtime, which can present practical difficulties since computational budgets may vary over time. To address this, we introduce Any-Width Network, an adjustable-width CNN architecture that utilizes a novel Triangular Convolution module to enable fine-grained control over speed and accuracy during inference. The second challenge focuses on the computationally demanding nature of dense prediction tasks such as semantic segmentation and depth estimation. This issue becomes especially problematic for edge platforms with limited resources. To tackle this, we propose a novel and scalable framework named EDNAS. EDNAS leverages the synergistic relationship between Multi-Task Learning and hardware-aware Neural Architecture Search to significantly enhance on-device speed and accuracy of dense predictions. Finally, to improve the robustness of object detection, we introduce a novel data mixing augmentation. While mixing techniques such as Mixup have proven successful in image classification, their application to object detection is non-trivial due to spatial misalignment, foreground/background distinction, and instance multiplicity. To address these issues, we propose a generalized data mixing principle, Supervision Interpolation, and its simple yet effective implementation, LossMix. By addressing these challenges, this dissertation aims to facilitate better efficiency, accuracy, and scalability of computer vision and deep learning algorithms and contribute to the advancement of large-scale edge applications across different domains.Doctor of Philosoph

    Computational Approaches to Drug Profiling and Drug-Protein Interactions

    Get PDF
    Despite substantial increases in R&D spending within the pharmaceutical industry, denovo drug design has become a time-consuming endeavour. High attrition rates led to a long period of stagnation in drug approvals. Due to the extreme costs associated with introducing a drug to the market, locating and understanding the reasons for clinical failure is key to future productivity. As part of this PhD, three main contributions were made in this respect. First, the web platform, LigNFam enables users to interactively explore similarity relationships between ‘drug like’ molecules and the proteins they bind. Secondly, two deep-learning-based binding site comparison tools were developed, competing with the state-of-the-art over benchmark datasets. The models have the ability to predict offtarget interactions and potential candidates for target-based drug repurposing. Finally, the open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold relationships and has already been used in multiple projects, including integration into a virtual screening pipeline to increase the tractability of ultra-large screening experiments. Together, and with existing tools, the contributions made will aid in the understanding of drug-protein relationships, particularly in the fields of off-target prediction and drug repurposing, helping to design better drugs faster

    A Pairwise Dataset for GUI Conversion and Retrieval between Android Phones and Tablets

    Full text link
    With the popularity of smartphones and tablets, users have become accustomed to using different devices for different tasks, such as using their phones to play games and tablets to watch movies. To conquer the market, one app is often available on both smartphones and tablets. However, although one app has similar graphic user interfaces (GUIs) and functionalities on phone and tablet, current app developers typically start from scratch when developing a tablet-compatible version of their app, which drives up development costs and wastes existing design resources. Researchers are attempting to employ deep learning in automated GUIs development to enhance developers' productivity. Deep learning models rely heavily on high-quality datasets. There are currently several publicly accessible GUI page datasets for phones, but none for pairwise GUIs between phones and tablets. This poses a significant barrier to the employment of deep learning in automated GUI development. In this paper, we collect and make public the Papt dataset, which is a pairwise dataset for GUI conversion and retrieval between Android phones and tablets. The dataset contains 10,035 phone-tablet GUI page pairs from 5,593 phone-tablet app pairs. We illustrate the approaches of collecting pairwise data and statistical analysis of this dataset. We also illustrate the advantages of our dataset compared to other current datasets. Through preliminary experiments on this dataset, we analyse the present challenges of utilising deep learning in automated GUI development and find that our dataset can assist the application of some deep learning models to tasks involving automatic GUI development.Comment: 10 pages, 9 figure
    • …
    corecore