255 research outputs found

    Decoding the Real World: Tackling Virtual Ethnographic Challenges through Data-Driven Methods


    Text Mining Oral Histories in Historical Archaeology

    Advances in text mining and natural language processing have the potential to productively inform historical archaeology and oral history research. However, text mining methods are largely developed in the context of contemporary big data and publicly available texts, limiting the applicability of these tools to historical and archaeological interpretation. Given the ability of text analysis to efficiently process large volumes of data, the potential for such tools to meaningfully inform historical archaeological research is significant, particularly when working with digitized data repositories or lengthy texts. Using oral histories recorded about a half-century ago in the anthracite coal mining region of Pennsylvania, USA, we discuss recent methodological developments in text analysis. We suggest future pathways to bridge the gap between generalized text mining methods and the particular needs of working with historical and place-based texts.
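
    As a concrete illustration of the kind of text mining the abstract points to, here is a minimal sketch, in Python with scikit-learn, of extracting characteristic terms from digitized transcripts. This is not the authors' pipeline, and the transcripts/ folder is a hypothetical stand-in for a real repository.

```python
# A minimal sketch of TF-IDF term extraction from oral history
# transcripts; not the authors' pipeline. The folder is hypothetical.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume each transcript is a plain-text file in a local folder.
transcripts = [p.read_text(encoding="utf-8")
               for p in sorted(Path("transcripts").glob("*.txt"))]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(transcripts)
terms = vectorizer.get_feature_names_out()

# Print the ten highest-weighted terms per transcript, a crude proxy
# for the themes that distinguish one interview from another.
for i in range(tfidf.shape[0]):
    row = tfidf.getrow(i).toarray().ravel()
    top = row.argsort()[::-1][:10]
    print(f"transcript {i}:", ", ".join(terms[j] for j in top))
```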

    Towards data-based search engines for RDF graphs: a reproducibility study

    The RDF framework, thanks to its flexibility and versatility, is one of the most widely used formats for sharing data and knowledge on the Web. Many RDF datasets and knowledge repositories are now available in scientific and political domains and can easily be consulted and downloaded from numerous open data portals. However, these RDF datasets cannot be fully exploited and accessed because of the absence of advanced search engines that allow users to retrieve the datasets best suited to their needs. Such systems address the Ad-Hoc RDF Dataset Retrieval task: answering a user's keyword query with a ranking of ten datasets ordered by relevance. Current systems are not especially advanced and rely principally on dataset metadata, which may be incomplete or unavailable, rather than on dataset content. ACORDAR is the first open test collection created to evaluate systems developed for the Ad-Hoc RDF Dataset Retrieval task. This collection can drive the development and improvement of such systems and a possible shift from metadata-based to content-based search. The main focus of this thesis is a reproducibility study of the ACORDAR collection: we test how good, useful, and well suited the collection is for the task by reproducing the baseline systems developed by the ACORDAR creators and by discussing all of the reproducibility problems encountered during the development of the reproduced systems.
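
    To make the task concrete, below is a minimal sketch of a metadata-based retrieval baseline of the general kind the thesis reproduces: a keyword query is matched against dataset metadata and a ranked list is returned. The records, field names, and TF-IDF scoring here are illustrative assumptions, not the ACORDAR baselines themselves.

```python
# Minimal sketch of a metadata-based dataset retrieval baseline of
# the kind ACORDAR evaluates; the records below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

datasets = [
    {"id": "d1", "title": "European air quality measurements",
     "description": "Hourly pollutant readings published as RDF."},
    {"id": "d2", "title": "Parliamentary voting records",
     "description": "Roll-call votes published as linked data."},
]

# Index each dataset by its concatenated metadata fields.
docs = [d["title"] + " " + d["description"] for d in datasets]
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(docs)

def search(query, k=10):
    """Return up to k dataset ids ranked by metadata relevance."""
    scores = cosine_similarity(vectorizer.transform([query]), index).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [(datasets[i]["id"], float(scores[i])) for i in ranked]

print(search("air pollution rdf"))
```

    A content-based system would additionally index the triples inside each dataset rather than only its metadata, which is the shift the thesis discusses.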

    Endogenous measures for contextualising large-scale social phenomena: a corpus-based method for mediated public discourse

    This work presents an interdisciplinary methodology for developing endogenous measures of group membership through analysis of pervasive linguistic patterns in public discourse. Focusing on political discourse, it critiques the conventional approach to the study of political participation, which is premised on decontextualised, exogenous measures to characterise groups. Given the theoretical and empirical weaknesses of decontextualised approaches to large-scale social phenomena, it suggests that contextualisation using endogenous measures might provide a complementary perspective that mitigates those weaknesses. The work develops a sociomaterial perspective on political participation in mediated discourse as affiliatory action performed through language. While the affiliatory function of language is often performed consciously (as in statements of identity), this work is concerned with unconscious features (such as patterns in lexis and grammar). It argues that pervasive patterns in such features, which emerge through socialisation, are resistant to change and manipulation, and might therefore serve as endogenous measures of sociopolitical contexts, and hence of groups. In terms of method, the work takes a corpus-based approach to data from the Twitter messaging service, whereby patterns in users' speech are examined statistically in order to trace potential community membership. The method is applied to the US state of Michigan during the second half of 2018, 6 November having been the date of the midterm (i.e. non-Presidential) elections in the United States. The corpus is assembled from the original posts of 5,889 users, nominally geolocalised to 417 municipalities, who are clustered according to pervasive language features. Comparing the linguistic clusters according to the municipalities they represent reveals regular sociodemographic differentials across clusters. This is understood as an indication of social structure, suggesting that endogenous measures derived from pervasive patterns in language may indeed offer a complementary, contextualised perspective on large-scale social phenomena.
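
    As a rough illustration of the corpus-based clustering described, the sketch below groups users by surface language features. The user_posts data, the character n-gram features, and the use of KMeans are illustrative assumptions, not the author's actual feature set or clustering procedure.

```python
# Minimal sketch of clustering users by pervasive language features;
# not the author's method. user_posts is invented toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# One aggregated document per user, concatenating their posts.
user_posts = {
    "user_a": "i mean you know it is like that honestly",
    "user_b": "we should therefore consider whether this holds",
    "user_c": "yeah that is so true honestly you know",
}

# Character n-grams capture habitual patterns in lexis and grammar
# better than topic words, which vary with subject matter.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
features = vectorizer.fit_transform(user_posts.values())

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)
for user, label in zip(user_posts, labels):
    print(user, "-> cluster", label)
```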

    Analyzing the Impact of RDF Graph Structure on Dataset Search: A Case Study with ACORDAR

    RDF plays a central role in the era of the Semantic Web, enabling a structured representation of datasets and their relationships. The complex nature of RDF graph structures significantly influences dataset retrieval, presenting both challenges and opportunities. Examining the ACORDAR case study in depth, this work shows how graph structures affect dataset retrieval and the organization of data. It also introduces serialization methods within RDF, emphasizing the central role of URIs and the capabilities of SPARQL. Presenting a reproducibility study of ACORDAR, the research underscores the significance of metadata in dataset search. Finally, it outlines future directions for improving dataset search by enriching the analysis with information drawn from graph structures and by leveraging emerging Semantic Web technologies.
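
    For readers unfamiliar with the technologies mentioned, the following minimal sketch uses rdflib to parse an RDF file and a SPARQL query to summarize its graph structure, the kind of structural signal the work proposes for enriching dataset search. The file name is hypothetical.

```python
# Minimal sketch of inspecting an RDF dataset's graph structure with
# rdflib and SPARQL; the file name is a hypothetical placeholder.
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")

# Count how often each predicate is used: a simple structural
# feature a content-aware dataset search engine could index
# alongside the dataset's metadata.
query = """
    SELECT ?p (COUNT(*) AS ?uses)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?uses)
"""
for row in g.query(query):
    print(row.p, row.uses)
```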

    Comparison of Support Vector Machine (SVM) and Random Forest Algorithm for Detection of Negative Content on Websites

    The volume of negative content circulating on the internet can damage public morale and give rise to social conflicts that threaten national sovereignty. Detecting negative content can help identify and prevent harmful events before they occur, leading to a safer and more positive online environment. This study compares the Support Vector Machine (SVM) and Random Forest (RF) algorithms for detecting negative content on websites. Its contributions are 1) detecting negative content on the internet with Random Forest and SVM, 2) comparing the SVM and RF algorithms for detecting negative content on websites, and 3) detecting text-based negative content in the fraud, gambling, pornography, and whitelist categories. The stages of this research are preparing a labeled dataset of website text content; preprocessing (removing duplicated data, text cleansing, case folding, stopword removal, tokenization, label encoding, data splitting, and computing TF-IDF features); and finally performing classification with SVM and Random Forest. The dataset used in this study is a structured text dataset obtained from emails registered on the TrustPositive website as negative content; negative content includes fraud, pornography, and gambling. The results show that the SVM achieves 97% accuracy, 90% precision, and 91% recall, while Random Forest achieves 92% accuracy, 71% precision, and 86% recall, measured on a test set of 526 website URLs. These results indicate that the Support Vector Machine outperforms Random Forest in this study.
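
    The described pipeline maps naturally onto scikit-learn. Below is a minimal sketch of the TF-IDF plus SVM/Random Forest comparison, with toy texts and labels standing in for the TrustPositive-derived dataset and default hyperparameters in place of whatever the authors tuned.

```python
# Minimal sketch of the SVM vs. Random Forest comparison described
# above; texts and labels are invented stand-ins for the real data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["win big at our casino tonight",
         "wire money now for cheap prescription meds",
         "official city council meeting notes",
         "explicit adult content site"]
labels = ["gambling", "fraud", "whitelist", "pornography"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0)

for name, clf in [("SVM", LinearSVC()),
                  ("Random Forest", RandomForestClassifier(random_state=0))]:
    # TF-IDF features feed each classifier, mirroring the paper's
    # preprocessing-then-classify structure.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test),
                                zero_division=0))
```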
