1,430 research outputs found

    Using skipgrams and POS-based feature selection for patent classiïŹcation

    Get PDF
    Contains fulltext : 116289.pdf (publisher's version ) (Open Access)19 p

    Transfomer Models: From Model Inspection to Applications in Patents

    Get PDF
    L'elaborazione del linguaggio naturale viene utilizzata per affrontare diversi compiti, sia di tipo linguistico, come ad esempio l'etichettatura della parte del discorso, il parsing delle dipendenze, sia più specifiche, come ad esempio la traduzione automatica e l'analisi del sentimento. Per affrontare questi compiti, nel tempo sono stati sviluppati approcci dedicati.Una metodologia che aumenta le prestazioni in tutti questi casi in modo unificato è la modellazione linguistica, che consiste nel preaddestrare un modello per sostituire i token mascherati in grandi quantità di testo, in modo casuale all'interno di pezzi di testo o in modo sequenziale uno dopo l'altro, per sviluppare rappresentazioni di uso generale che possono essere utilizzate per migliorare le prestazioni in molti compiti contemporaneamente.L'architettura di rete neurale che attualmente svolge al meglio questo compito è il transformer, inoltre, le dimensioni del modello e la quantità dei dati sono essenziali per lo sviluppo di rappresentazioni ricche di informazioni. La disponibilità di insiemi di dati su larga scala e l'uso di modelli con miliardi di parametri sono attualmente il percorso più efficace verso una migliore rappresentazione del testo.Tuttavia, i modelli di grandi dimensioni comportano una maggiore difficoltà nell'interpretazione dell'output che forniscono. Per questo motivo, sono stati condotti diversi studi per indagare le rappresentazioni fornite da modelli di transformers.In questa tesi indago questi modelli da diversi punti di vista, studiando le proprietà linguistiche delle rappresentazioni fornite da BERT, per capire se le informazioni che codifica sono localizzate all'interno di specifiche elementi della rappresentazione vettoriale. A tal fine, identifico pesi speciali che mostrano un'elevata rilevanza per diversi compiti di sondaggio linguistico. In seguito, analizzo la causa di questi particolari pesi e li collego alla distribuzione dei token e ai token speciali.Per completare questa analisi generale ed estenderla a casi d'uso più specifici, studio l'efficacia di questi modelli sui brevetti. Utilizzo modelli dedicati, per identificare entità specifiche del dominio, come le tecnologie o per segmentare il testo dei brevetti. Studio sempre l'analisi delle prestazioni integrandola con accurate misurazioni dei dati e delle proprietà del modello per capire se le conclusioni tratte per i modelli generici valgono anche in questo contesto.Natural Language Processing is used to address several tasks, linguistic related ones, e.g. part of speech tagging, dependency parsing, and downstream tasks, e.g. machine translation, sentiment analysis. To tackle these tasks, dedicated approaches have been developed over time.A methodology that increases performance on all tasks in a unified manner is language modeling, this is done by pre-training a model to replace masked tokens in large amounts of text, either randomly within chunks of text or sequentially one after the other, to develop general purpose representations that can be used to improve performance in many downstream tasks at once.The neural network architecture currently best performing this task is the transformer, moreover, model size and data scale are essential to the development of information-rich representations. The availability of large scale datasets and the use of models with billions of parameters is currently the most effective path towards better representations of text.However, with large models, comes the difficulty in interpreting the output they provide. Therefore, several studies have been carried out to investigate the representations provided by transformers models trained on large scale datasets.In this thesis I investigate these models from several perspectives, I study the linguistic properties of the representations provided by BERT, a language model mostly trained on the English Wikipedia, to understand if the information it codifies is localized within specific entries of the vector representation. Doing this I identify special weights that show high relevance to several distinct linguistic probing tasks. Subsequently, I investigate the cause of these special weights, and link them to token distribution and special tokens.To complement this general purpose analysis and extend it to more specific use cases, given the wide range of applications for language models, I study their effectiveness on technical documentation, specifically, patents. I use both general purpose and dedicated models, to identify domain-specific entities such as users of the inventions and technologies or to segment patents text. I always study performance analysis complementing it with careful measurements of data and model properties to understand if the conclusions drawn for general purpose models hold in this context as well

    Natural Language Processing in-and-for Design Research

    Full text link
    We review the scholarly contributions that utilise Natural Language Processing (NLP) methods to support the design process. Using a heuristic approach, we collected 223 articles published in 32 journals and within the period 1991-present. We present state-of-the-art NLP in-and-for design research by reviewing these articles according to the type of natural language text sources: internal reports, design concepts, discourse transcripts, technical publications, consumer opinions, and others. Upon summarizing and identifying the gaps in these contributions, we utilise an existing design innovation framework to identify the applications that are currently being supported by NLP. We then propose a few methodological and theoretical directions for future NLP in-and-for design research

    Translating Common English and Chinese Verb-Noun Pairs in Technical Documents with Collocational and Bilingual Information

    Get PDF

    Information retrieval and text mining technologies for chemistry

    Get PDF
    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo Garciá -Yoldi for useful feedback and discussions during the preparation of the manuscript.info:eu-repo/semantics/publishedVersio

    Building an Environmental Sustainability Dictionary for the IT Industry

    Get PDF
    Content analysis is a commonly utilized methodology in corporate sustainability research. However, because most corporate sustainability research using content analysis is based on human coding, the research capability and the scope of the research design has limitations. The relatively recent text mining technique addresses some of the limitations of manual content analysis but its usage is often dependent upon the development of a domain specific dictionary. This paper develops an environmental sustainability dictionary in the context of corporate sustainability reports for the IT industry. In support of building said dictionary, we develop a standardized dictionary building process model that can be applied across many domains

    A Survey on Semantic Processing Techniques

    Full text link
    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, e.g., word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missed in the published version due to the publication policies. Please contact Prof. Erik Cambria for detail

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Get PDF

    A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents

    Get PDF
    The drastic development in the information accessible on the World Wide Web has made the employment of automated tools to locate the information resources of interest, and for tracking and analyzing the same a certainty. Web Mining is the branch of data mining that deals with the analysis of World Wide Web. The concepts from various areas such as Data Mining, Internet technology and World Wide Web, and recently, Semantic Web can be said as the origin of web mining. Web mining can be defined as the procedure of determining hidden yet potentially beneficial knowledge from the data accessible in the web. Web mining comprise the sub areas: web content mining, web structure mining, and web usage mining. Web content mining is the process of mining knowledge from the web pages besides other web objects. The process of mining knowledge about the link structure linking web pages and some other web objects is defined as Web structure mining. Web usage mining is defined as the process of mining the usage patterns created by the users accessing the web pages. The search engine technology has led to the development of World Wide. The search engines are the chief gateways for access of information in the web. The ability to locate contents of particular interest amidst a huge heap has turned businesses beneficial and productive. The search engines respond to the queries by employing the process of web crawling that populates an indexed repository of web pages. The programs construct a confined repository of the segment of the web that they visit by navigating the web graph and retrieving pages. There are two main types of crawling, namely, Generic and Focused crawling. Generic crawlers crawls documents and links of diverse topics. Focused crawlers limit the number of pages with the aid of some prior obtained specialized knowledge. The systems that index, mine, and otherwise analyze pages (such as, the search engines) are provided with inputs from the repositories of web pages built by the web crawlers. The drastic development of the Internet and the growing necessity to incorporate heterogeneous data is accompanied by the issue of the existence of near duplicate data. Even if the near duplicate data don’t exhibit bit wise identical nature they are remarkably similar. The duplicate and near duplicate web pages either increase the index storage space or slow down or increase the serving costs which annoy the users, thus causing huge problems for the web search engines. Hence it is inevitable to design algorithms to detect such pages
    • 

    corecore