Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval
Semantic similarity based retrieval is playing an increasingly important role
in many IR systems such as modern web search, question-answering, similar
document retrieval etc. Improvements in retrieval of semantically similar
content are very significant to applications like Quora, Stack Overflow, Siri
etc. We propose a novel unsupervised model for semantic similarity based
content retrieval, where we construct semantic flow graphs for each query, and
introduce the concept of "soft seeding" in graph based semi-supervised learning
(SSL) to convert this into an unsupervised model.
We demonstrate the effectiveness of our model on an equivalent question
retrieval problem on the Stack Exchange QA dataset, where our unsupervised
approach significantly outperforms the state-of-the-art unsupervised models,
and produces comparable results to the best supervised models. Our research
provides a method to tackle semantic similarity based retrieval without any
training data, and allows seamless extension to different domain QA
communities, as well as to other semantic equivalence tasks.
Comment: Published in Proceedings of the 2017 ACM Conference on Information
and Knowledge Management (CIKM '17
Sequence to Sequence Learning for Query Expansion
Sequence-to-sequence algorithms have not yet been explored for query expansion
in the Information Retrieval literature, nor in the Question-Answering literature.
We tried to fill this gap with a custom Query Expansion
engine trained and tested on open datasets. Starting from open datasets, we
built a Query Expansion training set using sentence-embedding-based keyword
extraction. We then assessed the ability of sequence-to-sequence
neural networks to capture expanding relations in the word embedding space.
Comment: 8 pages, 2 figures, AAAI-19 Student Abstract and Poster Progra
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
The rise in malicious usage of large language models, such as fake content
creation and academic plagiarism, has motivated the development of approaches
that identify AI-generated text, including those based on watermarking or
outlier detection. However, the robustness of these detection algorithms to
paraphrases of AI-generated text remains unclear. To stress test these
detectors, we build an 11B-parameter paraphrase generation model (DIPPER) that
can paraphrase paragraphs, condition on surrounding context, and control
lexical diversity and content reordering. Using DIPPER to paraphrase text
generated by three large language models (including GPT3.5-davinci-003)
successfully evades several detectors, including watermarking, GPTZero,
DetectGPT, and OpenAI's text classifier. For example, DIPPER drops detection
accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of
1%), without appreciably modifying the input semantics.
To increase the robustness of AI-generated text detection to paraphrase
attacks, we introduce a simple defense that relies on retrieving
semantically-similar generations and must be maintained by a language model API
provider. Given a candidate text, our algorithm searches a database of
sequences previously generated by the API, looking for sequences that match the
candidate text within a certain threshold. We empirically verify our defense
using a database of 15M generations from a fine-tuned T5-XXL model and find
that it can detect 80% to 97% of paraphrased generations across different
settings while only classifying 1% of human-written sequences as AI-generated.
We open-source our models, code and data.
Comment: NeurIPS 2023 camera ready (32 pages). Code, models, and data available at
https://github.com/martiansideofthemoon/ai-detection-paraphrase
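The retrieval defense described in this abstract (store past generations, then flag a candidate whose best match in the store exceeds a threshold) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `GenerationStore` class, the bag-of-words cosine similarity, and the 0.75 threshold are hypothetical and are not taken from the paper, whose retriever the abstract does not specify. Lexical overlap is only a crude stand-in for semantic similarity.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts (a stand-in for a semantic encoder)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class GenerationStore:
    """Hypothetical database of sequences previously generated by an API."""

    def __init__(self):
        self._entries = []  # list of (text, vector) pairs

    def add(self, text):
        """Record a generation at the time the API produces it."""
        self._entries.append((text, bow_vector(text)))

    def best_match(self, candidate):
        """Return the stored text most similar to the candidate, with its score."""
        query = bow_vector(candidate)
        best_text, best_score = None, 0.0
        for text, vec in self._entries:
            score = cosine(query, vec)
            if score > best_score:
                best_text, best_score = text, score
        return best_text, best_score

    def is_flagged(self, candidate, threshold=0.75):
        """Flag the candidate if any stored generation is similar enough.

        The threshold value is an arbitrary illustrative choice.
        """
        _, score = self.best_match(candidate)
        return score >= threshold


store = GenerationStore()
store.add("the quick brown fox jumps over the lazy dog")
print(store.is_flagged("the lazy dog is jumped over by the quick brown fox"))
print(store.is_flagged("completely unrelated sentence about cooking pasta"))
```

A linear scan like this would not scale to a database of millions of generations; a real deployment would presumably rely on an indexed retriever, which this sketch does not attempt to model.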
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.
Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
Mining question-answer pairs from web forum: a survey of challenges and resolutions
Internet forums, also known as discussion boards, are popular web applications. Members of a board discuss issues and share ideas, forming a community within the board, and as a result generate a huge amount of content on different topics on a daily basis. Interest in information extraction and knowledge discovery from such sources has been increasing in the research community. However, a number of factors limit the potential of mining knowledge from forums. Problems that need to be addressed include the lexical chasm (or lexical gap), which renders some Natural Language Processing (NLP) techniques less effective; informal tone, which creates noisy data; topic drift, which prevents focused mining; and the asynchronous nature of discussions, which makes it difficult to establish post-reply relationships. This survey introduces these challenges within the framework of question answering. It describes the problems, cites and explores useful publications for further examination, and provides an overview of resolution strategies and findings relevant to the challenges.