3,049 research outputs found

    Mapping Big Data into Knowledge Space with Cognitive Cyber-Infrastructure

    Big data research has attracted great attention in science, technology, industry, and society. It is developing alongside the evolving scientific paradigm, the fourth industrial revolution, and transformational innovation in technologies. However, its nature and fundamental challenges have not yet been recognized, and it has not formed a methodology of its own. This paper explores and answers the following questions: What is big data? What are the basic methods for representing, managing, and analyzing big data? What is the relationship between big data and knowledge? Can we find a mapping from big data into knowledge space? What kind of infrastructure is required to support not only big data management and analysis but also knowledge discovery, sharing, and management? What is the relationship between big data and the scientific paradigm? What is the nature, and what is the fundamental challenge, of big data computing? A multi-dimensional perspective is presented toward a methodology of big data computing.

    Transformer Models: From Model Inspection to Applications in Patents

    Natural Language Processing is used to address several tasks, both linguistic ones, e.g. part-of-speech tagging and dependency parsing, and downstream tasks, e.g. machine translation and sentiment analysis. To tackle these tasks, dedicated approaches have been developed over time. A methodology that increases performance on all tasks in a unified manner is language modeling: a model is pre-trained to replace masked tokens in large amounts of text, either randomly within chunks of text or sequentially one after the other, in order to develop general-purpose representations that can be used to improve performance on many downstream tasks at once. The neural network architecture that currently performs this task best is the transformer; moreover, model size and data scale are essential to the development of information-rich representations. The availability of large-scale datasets and the use of models with billions of parameters are currently the most effective path towards better representations of text.

    However, with large models comes the difficulty of interpreting the output they provide. Therefore, several studies have been carried out to investigate the representations provided by transformer models trained on large-scale datasets. In this thesis I investigate these models from several perspectives. I study the linguistic properties of the representations provided by BERT, a language model mostly trained on the English Wikipedia, to understand whether the information it encodes is localized within specific entries of the vector representation. In doing so, I identify special weights that show high relevance to several distinct linguistic probing tasks. Subsequently, I investigate the cause of these special weights and link them to token distribution and special tokens.

    To complement this general-purpose analysis and extend it to more specific use cases, given the wide range of applications for language models, I study their effectiveness on technical documentation, specifically patents. I use both general-purpose and dedicated models to identify domain-specific entities, such as users of the inventions and technologies, or to segment patent text. I always complement performance analysis with careful measurements of data and model properties, to understand whether the conclusions drawn for general-purpose models hold in this context as well.
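    The masked-language-modelling setup and the per-entry probing of representations described above can be illustrated with a short sketch. The snippet below is an assumption-laden example, not the thesis's actual pipeline: it uses the Hugging Face transformers library and the public bert-base-uncased checkpoint to predict a masked token and to expose the hidden-state vectors whose individual entries a probing analysis would inspect.

```python
# Minimal sketch of masked-token prediction with BERT (assumes the Hugging Face
# `transformers` library and the public `bert-base-uncased` checkpoint; the
# thesis may use different models and tooling).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained(
    "bert-base-uncased", output_hidden_states=True
)

text = "The invention relates to a [MASK] for charging electric vehicles."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Most probable replacement for the [MASK] position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))

# Per-token hidden states: one vector per token and layer. Probing studies ask
# whether linguistic information is concentrated in specific entries of these
# vectors rather than spread across all dimensions.
last_layer = outputs.hidden_states[-1]  # shape: (1, seq_len, 768)
print(last_layer.shape)
```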

    A Bibliometric Analysis in Industry 4.0 and Advanced Manufacturing: What about the Sustainable Supply Chain?

    During the last decade, different concepts, methodologies, and technologies have appeared, evolving industry toward what we know today as the fourth industrial revolution, or Industry 4.0 (I4.0), and Advanced Manufacturing (AM). Within both, the Supply Chain (SC) is presented as the process that determines the sustainability of manufacturing and is therefore defined as a key term in a sustainable approach to I4.0. However, there are no studies that analyze the evolution of science in the fields of I4.0 and AM together. In order to fill this gap, the aim of this research work is to analyze the tendencies of scientific research related to I4.0 and AM by conducting a bibliometric and network analysis, and also to make a new contribution by analyzing scientific trends related to the SC and the Sustainable Supply Chain (SSC) within this scientific context, for the time span 2010–2019. The results show that the number of publications is growing exponentially and that the most active countries are Germany and the U.S., with Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen University being the most productive organization and Tecnologico de Monterrey the most collaborative. The analysis of the scientific terms allows us to conclude that the research field is in a growth phase, generating almost 4,500 new terms in 2019.
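    A bibliometric and network analysis of this kind can be sketched in a few lines. The example below is an illustration of the general approach, not the authors' actual pipeline: it assumes a hypothetical CSV export of publication records (columns `year`, `country`, `keywords` are invented here) and uses pandas and networkx to count publications per year, rank countries, and build a keyword co-occurrence network.

```python
# Illustrative sketch of a bibliometric and keyword co-occurrence analysis.
# The file name and column names (`year`, `country`, `keywords`) are
# hypothetical; the paper's own data source and tooling may differ.
from itertools import combinations

import pandas as pd
import networkx as nx

records = pd.read_csv("i40_am_publications_2010_2019.csv")

# Publications per year: the growth trend discussed in the abstract.
print(records.groupby("year").size())

# Most active countries.
print(records["country"].value_counts().head(10))

# Keyword co-occurrence network: nodes are terms, edge weights count how often
# two terms appear in the same publication.
G = nx.Graph()
for kw_field in records["keywords"].dropna():
    terms = sorted({t.strip().lower() for t in kw_field.split(";")})
    for a, b in combinations(terms, 2):
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

# Terms most central to the network (e.g. "supply chain", "sustainability").
centrality = nx.degree_centrality(G)
for term, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{term}: {score:.3f}")
```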

    Surfacing the deep data of taxonomy

    Taxonomic databases are perpetuating approaches to citing literature that may have been appropriate before the Internet, often being little more than digitised 5 × 3 index cards. Typically the original taxonomic literature is either not cited at all, or is represented as a (typically abbreviated) text string. Hence much of the “deep data” of taxonomy, such as the original descriptions, revisions, and nomenclatural actions, is largely hidden from all but the most resourceful users. At the same time there are burgeoning efforts to digitise the scientific literature, and much of this newly available content has been assigned globally unique identifiers such as Digital Object Identifiers (DOIs), which are also the identifier of choice for most modern publications. This represents an opportunity for taxonomic databases to engage with digitisation efforts. Mapping the taxonomic literature onto globally unique identifiers can be time consuming, but needs to be done only once. Furthermore, if we reuse existing identifiers, rather than mint our own, we can start to build the links between the diverse data that are needed to support the kinds of inference that biodiversity informatics aspires to support. Until this practice becomes widespread, the taxonomic literature will remain balkanized, and much of the knowledge that it contains will linger in obscurity.
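    Mapping an abbreviated citation string onto a persistent identifier can be done programmatically. The sketch below is one possible approach under stated assumptions: it queries the public CrossRef REST API for a candidate DOI matching a free-text citation (taxonomic projects may instead rely on other identifier sources such as BHL or ZooBank), and the example citation string is hypothetical.

```python
# Minimal sketch: resolve an abbreviated citation string to a candidate DOI via
# the public CrossRef REST API (one possible service among several; taxonomic
# databases may use other identifier sources).
from typing import Optional

import requests

def candidate_doi(citation: str) -> Optional[str]:
    """Return the best-scoring DOI CrossRef finds for a free-text citation."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": 1},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None

# Hypothetical abbreviated citation string, of the kind stored in many
# taxonomic databases.
print(candidate_doi("Zootaxa 2896 (2011) 1-34 new species of Drosophila"))
```

    Because the mapping only needs to be done once, results like this can be stored alongside the taxonomic record and reused, rather than re-parsing the abbreviated string every time.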

    Contextual impacts on industrial processes brought by the digital transformation of manufacturing: a systematic review

    The digital transformation of manufacturing (a phenomenon also known as "Industry 4.0" or "Smart Manufacturing") is attracting growing interest at both practitioner and academic levels, but is still in its infancy and needs deeper investigation. Even though the current and potential advantages of digital manufacturing are remarkable, in terms of improved efficiency, sustainability, customization, and flexibility, only a limited number of companies have developed the ad hoc strategies necessary to achieve superior performance. Through a systematic review, this study aims to assess the current state of the art of the academic literature regarding the paradigm shift occurring in manufacturing settings, in order to provide definitions as well as point out recurring patterns and gaps to be addressed by future research. For the literature search, the most representative keywords, strict criteria, and classification schemes based on authoritative reference studies were used. The final sample of 156 primary publications was analyzed through a systematic coding process to identify theoretical and methodological approaches, together with other significant elements. This analysis allowed a mapping of the literature based on clusters of critical themes, synthesizing the developments of different research streams and providing the most representative picture of their current state. The research areas, insights, and gaps resulting from this analysis contributed to creating a schematic research agenda, which clearly indicates the space for future evolution of the state of knowledge in this field.
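    The review describes a manual, systematic coding process that groups the 156 publications into clusters of critical themes. Purely as an illustration of how such a grouping could be approximated computationally (a swapped-in technique, not the authors' method), the sketch below clusters a hypothetical list of abstracts with TF-IDF vectors and k-means, then prints the top terms per cluster.

```python
# Illustrative sketch only: an automated analogue (TF-IDF + k-means) of the
# manual thematic coding described in the review. The abstracts below are
# invented placeholders standing in for the sample of primary publications.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "Smart manufacturing and cyber-physical systems for flexible production",
    "Digital transformation strategies and performance in manufacturing firms",
    "Sustainability and efficiency gains from Industry 4.0 technologies",
    # ... one entry per primary publication in the sample
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
X = vectorizer.fit_transform(abstracts)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Top terms per cluster approximate the "critical themes" of the mapping.
terms = vectorizer.get_feature_names_out()
for c in range(kmeans.n_clusters):
    top = kmeans.cluster_centers_[c].argsort()[::-1][:5]
    print(f"cluster {c}:", ", ".join(terms[i] for i in top))
```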

    DARIAH and the Benelux
