2 research outputs found
Improving Classification Performance With Human Feedback: Label a few, we label the rest
In the realm of artificial intelligence, where a vast majority of data is
unstructured, obtaining substantial amounts of labeled data to train supervised
machine learning models poses a significant challenge. To address this, we
delve into few-shot and active learning, where are goal is to improve AI models
with human feedback on a few labeled examples. This paper focuses on
understanding how a continuous feedback loop can refine models, thereby
enhancing their accuracy, recall, and precision through incremental human
input. By employing Large Language Models (LLMs) such as GPT-3.5, BERT, and
SetFit, we aim to analyze the efficacy of using a limited number of labeled
examples to substantially improve model accuracy. We benchmark this approach on
the Financial Phrasebank, Banking, Craigslist, Trec, Amazon Reviews datasets to
prove that with just a few labeled examples, we are able to surpass the
accuracy of zero shot large language models to provide enhanced text
classification performance. We demonstrate that rather than needing to manually
label millions of rows of data, we just need to label a few and the model can
effectively predict the rest
Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately
Large Language Models (LLMs) generate responses to questions; however, their
effectiveness is often hindered by sub-optimal quality of answers and
occasional failures to provide accurate responses to questions. To address
these challenges, a fine-tuning process is employed, involving feedback and
examples to refine models. The objective is to enhance AI models through
continuous feedback loops, utilizing metrics such as cosine similarity, LLM
evaluation and Rouge-L scores to evaluate the models. Leveraging LLMs like
GPT-3.5, GPT4ALL, and LLaMA2, and Claude, this approach is benchmarked on
financial datasets, including the FinanceBench and RAG Instruct Benchmark
Tester Dataset, illustrating the necessity of fine-tuning. The results showcase
the capability of fine-tuned models to surpass the accuracy of zero-shot LLMs,
providing superior question and answering capabilities. Notably, the
combination of fine-tuning the LLM with a process known as Retrieval Augmented
Generation (RAG) proves to generate responses with improved accuracy