CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling
The mixing of two or more languages is called Code-Mixing (CM), and it is a
social norm in multilingual societies. Neural Language Models (NLMs) such as
transformers have been very effective on many NLP tasks, yet NLMs for CM remain
an under-explored area. Although transformers are capable and powerful, they
are non-recurrent and therefore cannot encode positional/sequential information
on their own; positional encodings are added to inject word-order information.
We hypothesize that Switching Points (SPs), i.e., junctions in the text where
the language switches (L1 -> L2 or L2 -> L1), pose a challenge for CM Language
Models (LMs), and we therefore give special emphasis to switching points in the
modeling process. We experiment with several positional encoding mechanisms and
show that rotatory positional encodings combined with switching-point
information yield the best results.
We introduce CONFLATOR: a neural language modeling approach for code-mixed
languages. CONFLATOR learns to emphasize switching points through smarter
positional encoding at both the unigram and bigram levels. CONFLATOR
outperforms the state of the art on two tasks based on code-mixed Hindi and
English (Hinglish): (i) sentiment analysis and (ii) machine translation.
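The rotary scheme the abstract refers to can be illustrated with a minimal sketch. The `rotary_position_encoding` function below is a standard rotary positional encoding (rotating pairs of feature dimensions by a position-dependent angle), and `positions_from_switch_points` is a purely illustrative assumption of how switching points might influence positions (restarting the position index at each switch); it is not the paper's exact mechanism.

```python
import numpy as np

def rotary_position_encoding(x, positions, base=10000.0):
    """Apply rotary positional encoding to vectors x.

    x: array of shape (seq_len, d) with d even.
    positions: integer array of shape (seq_len,).
    Each feature pair (x1_i, x2_i) is rotated by angle
    positions * base ** (-i / (d/2)).
    """
    seq_len, d = x.shape
    assert d % 2 == 0, "feature dimension must be even"
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each (x1, x2) pair; norms are preserved
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def positions_from_switch_points(n, switch_idxs):
    """Hypothetical position indices that restart at each switching point.

    Illustrative assumption only: one simple way to make switching
    points salient is to count positions relative to the most recent SP.
    """
    positions = np.zeros(n, dtype=int)
    last = 0
    for i in range(1, n):
        if i in switch_idxs:
            last = i
        positions[i] = i - last
    return positions
```

Because the rotation is norm-preserving and position 0 maps to the identity rotation, tokens right after a switching point are encoded as if they start a fresh segment under this illustrative position scheme.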
Counter Turing Test CT^2: AI-Generated Text Detection is Not as Easy as You May Think -- Introducing AI Detectability Index
With the rise of the prolific ChatGPT, the risks and consequences of
AI-generated text have increased alarmingly. To address the inevitable question
of ownership
attribution for AI-generated artifacts, the US Copyright Office released a
statement stating that 'If a work's traditional elements of authorship were
produced by a machine, the work lacks human authorship and the Office will not
register it'. Furthermore, both the US and the EU governments have recently
drafted their initial proposals regarding the regulatory framework for AI.
Given this cynosural spotlight on generative AI, AI-generated text detection
(AGTD) has emerged as a topic that has already received immediate attention in
research, with some initial methods having been proposed, soon followed by the
emergence of techniques to bypass detection. This paper introduces the Counter
Turing Test (CT^2), a benchmark consisting of techniques aiming to offer a
comprehensive evaluation of the robustness of existing AGTD techniques. Our
empirical findings unequivocally highlight the fragility of the proposed AGTD
methods under scrutiny. Amidst the extensive deliberations on policy-making for
regulating AI development, it is of utmost importance to assess the
detectability of content generated by LLMs. Thus, to establish a quantifiable
spectrum facilitating the evaluation and ranking of LLMs according to their
detectability levels, we propose the AI Detectability Index (ADI). We conduct a
thorough examination of 15 contemporary LLMs, empirically demonstrating that
larger LLMs tend to have a higher ADI, indicating they are less detectable
compared to smaller LLMs. We firmly believe that ADI holds significant value as
a tool for the wider NLP community, with the potential to serve as a rubric in
AI-related policy-making.

Comment: EMNLP 2023 Mai
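The ranking idea behind the ADI can be sketched in miniature. The helpers below are assumptions for illustration only: `perplexity` is the standard perplexity computation that many AGTD methods build on, and `detectability_rank` simply orders models by a single detection-accuracy number, whereas the actual ADI aggregates richer signals across detection techniques.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Low perplexity under a reference LM is a common (if fragile)
    signal that text is machine-generated.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def detectability_rank(model_scores):
    """Rank model names from most to least detectable.

    model_scores: hypothetical dict mapping model name -> detection
    accuracy in [0, 1] achieved by some AGTD method on that model's
    outputs. Not the paper's ADI formula; a toy stand-in for the
    'rank LLMs by detectability' idea.
    """
    return sorted(model_scores, key=model_scores.get, reverse=True)
```

Under this toy scheme, a larger model whose outputs evade detection more often would receive a lower accuracy score and land later in the ranking, mirroring the abstract's finding that larger LLMs tend to be less detectable.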