The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing
This paper introduces the NLP4NLP corpus, which contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references, and representing roughly 270 million words. Most of these publications are in English; some are in French, German, or Russian. Some are open access, while others have been provided by the publishers. In order to constitute and analyze this corpus, several tools have been used or developed. Many of them use Natural Language Processing methods that have themselves been published in the corpus, hence its name. The paper presents the corpus and some findings regarding its content (the evolution over time of the number of articles and authors, collaborations between authors, and citations between papers and authors), in the context of a global or comparative analysis across sources. Numerous manual corrections were necessary, which demonstrates the importance of establishing standards for uniquely identifying authors, articles, and publications.
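The collaboration analysis the abstract mentions (co-authorship links between authors) can be sketched minimally as pair counting over paper records. The records and field names below are hypothetical toy data, not the NLP4NLP tooling or its actual format:

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy records; the real corpus holds ~65,000 documents.
papers = [
    {"title": "Paper A", "authors": ["Alice", "Bob"]},
    {"title": "Paper B", "authors": ["Alice", "Bob", "Carol"]},
    {"title": "Paper C", "authors": ["Carol", "Dan"]},
]

# Count how many papers each unordered author pair co-signed.
collab = Counter()
for p in papers:
    for pair in combinations(sorted(set(p["authors"])), 2):
        collab[pair] += 1

print(collab[("Alice", "Bob")])  # 2
```

The resulting counter is the weighted edge list of a co-authorship graph, on which evolution-over-time and centrality analyses can then be run.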
NLP4NLP+5: The Deep (R)evolution in Speech and Language Processing
This paper analyzes the changes in the fields of speech and natural language processing over the past 5 years (2016–2020). It continues a series of two papers that we published in 2019 on the analysis of the NLP4NLP corpus, which contained articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), analyzed with methods developed in the field of NLP, hence its name. The extended NLP4NLP+5 corpus now covers 55 years, comprising close to 90,000 documents [+30% compared with NLP4NLP: as many articles were published in the single year 2020 as over the first 25 years (1965–1989)], 67,000 authors (+40%), 590,000 references (+80%), and approximately 380 million words (+40%). These analyses are conducted globally or comparatively among sources, and also against the general scientific literature, with a focus on the past 5 years. The paper concludes by identifying profound changes in research topics, as well as the emergence of a new generation of authors and the appearance of new publications centered on artificial intelligence, neural networks, machine learning, and word embeddings.
Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference
No abstract available
A Survey of Pre-trained Language Models for Processing Scientific Text
The number of Language Models (LMs) dedicated to processing scientific text is on the rise. Keeping pace with the rapid growth of scientific LMs (SciLMs) has become a daunting task for researchers. To date, no comprehensive survey of SciLMs has been undertaken, leaving this issue unaddressed. Given the constant stream of new SciLMs, the state of the art and how the models compare to each other remain largely unknown. This work fills that gap and provides a comprehensive review of SciLMs, including an extensive analysis of their effectiveness across different domains, tasks, and datasets, and a discussion of the challenges that lie ahead. Comment: Resources are available at https://github.com/Alab-NII/Awesome-SciL
A Methodology for Solving Semantic Problems in the Processing of Short Texts Written in Resource-Limited Natural Languages
Statistical approaches to natural language processing typically require considerable
amounts of labeled data, and often various auxiliary language tools as well, limiting their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. In these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts due to their prevalence in digital communication, as well as the greater complexity of their semantic processing.
The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, to data annotation, to the formulation, training, and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks: sentiment analysis and semantic textual similarity. The Serbian language is used as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this thesis include the development of a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, and several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.
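To illustrate the semantic textual similarity task the thesis addresses, here is a minimal baseline sketch: bag-of-words vectors compared by cosine similarity. This is a generic baseline for exposition only, not the thesis's actual models, which are not described here:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two short texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)          # shared-word overlap
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two of three tokens overlap, so similarity is 2/3.
print(round(cosine_sim("the cat sat", "the cat ran"), 3))  # 0.667
```

For a morphologically rich language such as Serbian, tokens would typically be normalized (stemmed or lemmatized) first, which is exactly why the comparative evaluation of morphological normalization tools matters for this task.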