
    LEDGAR: a large-scale multi-label corpus for text classification of legal provisions in contracts

    We present LEDGAR, a multi-label corpus of legal provisions in contracts. The corpus was crawled and scraped from the public domain (SEC filings) and is, to the best of our knowledge, the first freely available corpus of its kind. Since the corpus was constructed semi-automatically, we apply and discuss various approaches to noise removal. Given the rather large label set of over 12,000 labels annotated in almost 100,000 provisions in over 60,000 contracts, we believe the corpus to be of interest for research in Legal NLP and (large-scale or extreme) text classification, as well as for legal studies. We discuss several methods to sample subcorpora from the corpus, and implement and evaluate different automatic classification approaches. Finally, we perform transfer experiments to evaluate how well the classifiers perform on contracts stemming from outside the corpus.
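The abstract mentions that the semi-automatic construction required noise removal over a label set of thousands of provision headings. A minimal sketch of the kind of cleanup such a corpus needs, assuming (hypothetically) that labels are derived from provision headings: normalize heading strings into canonical labels and drop labels too rare to train on. The headings and the frequency threshold below are invented for illustration.

```python
# Hypothetical label-cleanup sketch for a semi-automatically crawled
# corpus: normalize provision headings into canonical labels, then
# filter out labels that occur too rarely. Thresholds are invented.
import re
from collections import Counter

def normalize_label(heading: str) -> str:
    # "Governing Law." -> "governing_law"
    heading = re.sub(r"[^a-z ]", "", heading.lower()).strip()
    return "_".join(heading.split())

headings = ["Governing Law.", "GOVERNING LAW", "Termination", "Force Majeure"]
labels = [normalize_label(h) for h in headings]

counts = Counter(labels)
# Keep only labels seen at least twice (toy threshold).
frequent = [label for label in labels if counts[label] >= 2]
```

With casing and punctuation normalized away, the two "Governing Law" variants collapse into one label, which is exactly the kind of merge that shrinks a noisy heading-derived label set.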

    Analysing Finnish Multi-Word Expressions with Word Embeddings

    Multi-word expressions are combinations of several words that are in some way fixed and/or idiomatic. This study examines Finnish verbal idioms using a word embedding method (word2vec). The data consists of Finnish-language books retrieved from Project Gutenberg. The study focuses mainly on idioms containing the Finnish word 'silmä' ('eye'). Their idiomaticity is measured via compositionality (how well the meaning of the expression corresponds to the combination of its components' meanings) and their fixedness via a lexical substitution test. The same tests are also run with the fastText algorithm, which takes the internal structure of words into account. A smallish labelled set of sentences was also created from the Gutenberg corpus and sorted with a neural-network-based classifier. In addition, the study probes the effect of various features, such as grammatical case, on an idiom's meaning. The results of the measurement methods are, on the whole, rather mixed. fastText generally performs slightly better than the baseline method, and the quality of its word embeddings is also higher. The lexical substitution test gives the best results when only the nearest neighbour is considered. Grammatical case was found to be quite important for determining an idiom's meaning. The weak measurement results may stem from several factors, such as the varying degree of semantic transparency among idioms. The word embedding method also does not normally account for the fact that multi-word expressions, too, can have several meanings (literal and idiomatic/figurative). The rich morphology of Finnish poses additional challenges for the method. In conclusion, the word embedding method is somewhat useful for studying Finnish idioms. The tested measurement methods are of limited use on their own, but they might work better as part of a larger research apparatus.
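The compositionality measure described above compares an expression's own embedding to a combination of its components' embeddings. A toy illustration of that idea, with invented three-dimensional vectors standing in for the word2vec/fastText embeddings the thesis actually trains on Gutenberg books:

```python
# Toy compositionality check: cosine similarity between an expression's
# vector and the average of its components' vectors. A low score
# suggests a non-compositional (idiomatic) reading. All vectors here
# are invented; in the study they come from word2vec/fastText.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vec = {
    "silmä":      [0.9, 0.1, 0.0],  # "eye"
    "tikku":      [0.1, 0.9, 0.0],  # "stick"
    "silmätikku": [0.0, 0.1, 0.9],  # idiomatic: "object of scrutiny"
}

# Compose the components by averaging, then compare to the expression.
composed = [(a + b) / 2 for a, b in zip(vec["silmä"], vec["tikku"])]
score = cosine(vec["silmätikku"], composed)  # low -> non-compositional
```

Averaging is only one of several plausible composition functions; the measure's behaviour depends heavily on that choice and on embedding quality, which is consistent with the mixed results the thesis reports.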

    Natural language processing for similar languages, varieties, and dialects: A survey

    There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
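Language and dialect identification, one of the applications surveyed above, is classically approached with character n-gram profiles. A minimal sketch under invented data, discriminating two Portuguese varieties by trigram overlap (real systems train on far larger corpora and richer features):

```python
# Minimal character-trigram variety identifier, in the spirit of
# discriminating-similar-languages systems. Training sentences and
# variety labels are invented placeholders.
from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    text = f" {text.lower()} "  # pad so word boundaries form n-grams
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# One tiny "profile" per variety; real profiles use millions of tokens.
profiles = {
    "pt-BR": ngrams("voce vai pegar o onibus para o trabalho"),
    "pt-PT": ngrams("tu vais apanhar o autocarro para o emprego"),
}

def identify(sentence: str) -> str:
    grams = ngrams(sentence)
    # Score each variety by multiset n-gram overlap with its profile.
    return max(profiles, key=lambda v: sum((grams & profiles[v]).values()))
```

Closely related varieties share most of their n-grams, which is precisely why this task is hard: the decision rests on a small number of variety-specific lexical and orthographic cues.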

    Exploiting Latent Features of Text and Graphs

    As the size and scope of online data continue to grow, new machine learning techniques become necessary to best capitalize on the wealth of available information. However, the models that help convert data into knowledge require nontrivial processes to make sense of large collections of text and massive online graphs. In both scenarios, modern machine learning pipelines produce embeddings --- semantically rich vectors of latent features --- to convert human constructs for machine understanding. In this dissertation we focus on information available within biomedical science, including human-written abstracts of scientific papers, as well as machine-generated graphs of biomedical entity relationships. We present the Moliere system, and our method for identifying new discoveries through the use of natural language processing and graph mining algorithms. We propose heuristically-based ranking criteria to augment Moliere, and leverage this ranking to identify a new gene-treatment target for HIV-associated Neurodegenerative Disorders. We additionally focus on the latent features of graphs, and propose a new bipartite graph embedding technique. Using our graph embedding, we advance the state of the art in hypergraph partitioning quality. Having newfound intuition of graph embeddings, we present Agatha, a deep-learning approach to hypothesis generation. This system learns a data-driven ranking criterion derived from the embeddings of our large proposed biomedical semantic graph. To produce human-readable results, we additionally propose CBAG, a technique for conditional biomedical abstract generation.

    Evolving linguistic divergence on polarizing social media

    Language change is influenced by many factors, but often starts from synchronic variation, where multiple linguistic patterns or forms coexist, or where different speech communities use language in increasingly different ways. Besides regional or economic reasons, communities may form and segregate based on political alignment. The latter, referred to as political polarization, is of growing societal concern across the world. Here we map and quantify linguistic divergence across the partisan left-right divide in the United States, using social media data. We develop a general methodology to delineate (social) media users by their political preference, based on which (potentially biased) news media accounts they do and do not follow on a given platform. Our data consists of 1.5M short posts by 10k users (about 20M words) from the social media platform Twitter (now "X"). Delineating this sample involved mining the platform for the lists of followers (n=422M) of 72 large news media accounts. We quantify divergence in topics of conversation and word frequencies, messaging sentiment, and lexical semantics of words and emoji. We find signs of linguistic divergence across all these aspects, especially in topics and themes of conversation, in line with previous research. While US American English remains largely intelligible within its large speech community, our findings point to areas where miscommunication may eventually arise given ongoing polarization and therefore potential linguistic divergence. Our methodology - combining data mining, lexicostatistics, machine learning, large language models and a systematic human annotation approach - is largely language and platform agnostic. In other words, while we focus here on US political divides and US English, the same approach is applicable to other countries, languages, and social media platforms.
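The delineation step described above assigns each user a likely political leaning from which news media accounts they follow. A hedged stdlib sketch of that idea, with invented placeholder account names (the paper mines real follower lists of 72 news accounts; this only shows the decision rule's shape):

```python
# Sketch of follower-based user delineation: infer a user's likely
# leaning from which (potentially biased) news accounts they follow.
# Account names below are invented placeholders, not real handles.
LEFT_MEDIA = {"left_outlet_a", "left_outlet_b"}
RIGHT_MEDIA = {"right_outlet_a", "right_outlet_b"}

def delineate(followed: set[str]) -> str:
    left = len(followed & LEFT_MEDIA)    # left-leaning outlets followed
    right = len(followed & RIGHT_MEDIA)  # right-leaning outlets followed
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undetermined"  # mixed or no signal: excluded from the sample
```

Users whose follow lists give no clear signal fall into an "undetermined" bucket; dropping such users is one simple way to keep the two partisan subsamples clean, at the cost of sample size.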