KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Zhuohao Yu1, Chang Gao1, Wenjin Yao1, Yidong Wang1, Wei Ye1*, Jindong Wang2, Xing Xie2, Yue Zhang3, Shikun Zhang1
Peking University1, Microsoft Research2, Westlake University3
ACL 2024
*Corresponding Author

As the field of artificial intelligence is captivated by the remarkable performance of large language models such as GPT-4 and Claude 3, a crucial question emerges: are we truly evaluating these models' capabilities objectively? What capabilities do we actually measure by asking LLMs to answer multiple-choice questions? In practice, current evaluations of large models face the challenge of data contamination.

Comparison of KIEval and traditional methods

Data Contamination

Data contamination occurs when a model is exposed to a benchmark's test data during training, leading to inflated performance metrics. It can be hard to detect: LLMs are commonly trained on large, heterogeneous data sources, which makes leaks of test data difficult to rule out. Some models are even trained directly on test sets to achieve higher scores, artificially boosting their reported performance and potentially misleading research.
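
To make the problem concrete, one simple heuristic for spotting contamination is to measure n-gram overlap between a benchmark's test items and the training corpus. The sketch below is purely illustrative (it is not the detection method studied in the paper), and the function and variable names are hypothetical:

import re

def ngrams(text, n=13):
    # Word-level n-grams over lowercased, punctuation-stripped text.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_items, training_corpus, n=13, threshold=0.8):
    # Fraction of test items whose n-grams mostly appear somewhere in the training corpus.
    corpus_ngrams = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    flagged = 0
    for item in test_items:
        item_ngrams = ngrams(item, n)
        if item_ngrams and len(item_ngrams & corpus_ngrams) / len(item_ngrams) >= threshold:
            flagged += 1
    return flagged / max(len(test_items), 1)

Surface-level checks like this require access to the training data and are easily defeated by paraphrased or reformatted leaks, which is part of why contamination is so hard to rule out in practice.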

Existing Methods vs KIEval

Detecting data contamination is currently quite hard. While some methods work relatively well for detecting contamination introduced during the pre-training phase, in our experiments they perform poorly at detecting leaks introduced during supervised fine-tuning (SFT). KIEval is proposed as a contamination-resilient and effective way to evaluate LLMs' generative capabilities, and it also helps detect data contamination. In short, KIEval is a knowledge-grounded, dynamic, interactive evaluation framework designed to assess a model's knowledge generalization and application abilities through multi-round dialogue.
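
As one concrete example of the pre-training-phase detectors mentioned above, Min-K% Prob-style checks score a text by the average log-probability the model assigns to its least likely tokens; an unusually high score suggests the text was seen during training. The sketch below assumes a Hugging Face causal LM and is only illustrative of this family of methods, not necessarily the exact baselines used in our experiments:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob_score(text, model, tokenizer, k=0.2):
    # Average log-probability of the k% least likely tokens in the text.
    # Texts seen during training tend to contain fewer low-probability "surprise"
    # tokens, so a higher (less negative) score hints at memorization.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    k_count = max(1, int(len(token_lp) * k))
    lowest = torch.topk(token_lp, k_count, largest=False).values
    return lowest.mean().item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(min_k_prob_score("An example benchmark question and its reference answer.", model, tokenizer))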

KIEval Pipeline

How KIEval Works

  1. Dynamic Interactions: KIEval introduces an "interactor" model that engages in multi-round dialogues with the evaluated model. Each round generates new, deeper questions based on previous responses, testing the model's knowledge application and coherence.
  2. Evaluation Process: An initial question drawn from a high-quality dataset starts the dialogue. The "interactor" then generates follow-up questions that probe deeper, while an "evaluator" model assesses each response for relevance, coherence, and logic. The process repeats iteratively, simulating an "interview" of the LLM (see the sketch after this list).
  3. Advantages: KIEval reduces the impact of data contamination and comprehensively evaluates the model's abilities beyond simple pattern matching.
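
To make the loop above concrete, here is a minimal sketch of one interactive evaluation session. It assumes a generic chat-completion client behind the hypothetical call_llm helper; the prompts and the JSON scoring format are simplified placeholders rather than the actual KIEval implementation (see the released code for that):

import json

def call_llm(model, messages):
    # Placeholder for a chat-completion API call; returns the model's reply as text.
    raise NotImplementedError

def kieval_session(seed_question, candidate, interactor, evaluator, num_rounds=3):
    # One interactive "interview": the candidate answers, the evaluator scores,
    # and the interactor asks a deeper follow-up grounded in the dialogue so far.
    history = []
    question = seed_question  # round 1 starts from a question in a benchmark dataset
    scores = []
    for _ in range(num_rounds):
        answer = call_llm(candidate, history + [{"role": "user", "content": question}])
        history += [{"role": "user", "content": question},
                    {"role": "assistant", "content": answer}]

        rubric = ("Score the last answer from 1 to 5 for accuracy, logic, relevance, "
                  "coherence, and conciseness. Reply with a JSON object only.")
        verdict = call_llm(evaluator, history + [{"role": "user", "content": rubric}])
        scores.append(json.loads(verdict))  # assumes the evaluator follows the JSON format

        probe = "Ask one follow-up question that digs deeper into the topic discussed above."
        question = call_llm(interactor, history + [{"role": "user", "content": probe}])
    return scores

How the per-round scores are aggregated into the final per-dimension and overall metrics is omitted here; the paper and the released code define the exact prompts, stopping criteria, and weighting.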

Experimental Results

Insights from KIEval

Here we provide some high-level takeaways from the Experiments section in our paper.

  1. Performance Gaps: Traditional benchmarks often underestimate performance differences between models. KIEval's dynamic dialogues reveal significant disparities in knowledge application and logical reasoning that static datasets miss.
  2. Impact of Data Contamination: Experiments with "cheating" models showed that data contamination leads to high test scores but poor performance in dynamic dialogues. This indicates that contaminated models memorize answers rather than understanding them.
  3. Evaluating Knowledge Depth: Comparing KIEval scores with static-dataset accuracy helps infer data leakage: high accuracy on a static benchmark combined with poor performance in dynamic dialogue suggests memorization rather than true understanding.
  4. Correlation with Human Evaluation: KIEval scores correlate strongly with human evaluations, demonstrating the framework's effectiveness and alignment with human judgment of dialogue quality. The table below reports the correlation of each metric with human ratings (a sketch of how these correlation coefficients are computed follows this list).

     Metric       Pearson  Spearman  Kendall-Tau  Variance
     METEOR       0.016    0.023     0.021        0.012
     ROUGE-1      0.259    0.316     0.226        0.016
     ROUGE-2      0.280    0.303     0.223        0.007
     ROUGE-L      0.209    0.268     0.200        0.007
     BERTScore    0.189    0.336     0.250        0.001
     MT-Bench     0.520    0.494     0.405        9.360
     Ours:
     Accuracy     0.761    0.727     0.653        2.010
     Logic        0.768    0.735     0.661        1.842
     Relevance    0.633    0.643     0.543        1.152
     Coherence    0.750    0.740     0.644        1.365
     Conciseness  0.611    0.604     0.504        0.833
     Overall      0.814    0.789     0.721        1.512
  5. Model Biases: While biases exist at the sample level, the overall evaluation scores produced by different evaluator models are strongly correlated, indicating that the choice of evaluator does not significantly affect the conclusions. The table below reports results for each evaluated model under two different evaluators (GPT-4 and Claude-3) across the scoring dimensions.

     Model        Evaluator  Accuracy  Logic  Relevance  Coherence  Conciseness  Overall
     GPT-3.5      GPT-4      94.6      94.7   98.5       96.1       97.3         95.5
     GPT-3.5      Claude-3   98.6      98.8   99.8       99.4       99.0         98.7
     LLaMA-2 70B  GPT-4      81.9      82.8   92.2       85.3       75.6         84.1
     LLaMA-2 70B  Claude-3   98.3      98.7   98.2       96.9       84.6         96.4
     LLaMA-2 7B   GPT-4      70.6      71.6   90.4       77.9       71.7         74.4
     LLaMA-2 7B   Claude-3   90.9      91.8   98.0       95.0       85.2         91.0
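
As referenced in takeaway 4, the Pearson, Spearman, and Kendall-Tau coefficients can be computed directly from paired lists of metric scores and human ratings. A minimal sketch using scipy; the example numbers are made up for illustration and are not values from the paper:

from scipy import stats

def correlation_with_humans(metric_scores, human_ratings):
    # Both inputs are per-sample score lists aligned by index.
    pearson, _ = stats.pearsonr(metric_scores, human_ratings)
    spearman, _ = stats.spearmanr(metric_scores, human_ratings)
    kendall, _ = stats.kendalltau(metric_scores, human_ratings)
    return {"Pearson": pearson, "Spearman": spearman, "Kendall-Tau": kendall}

# Made-up example values, for illustration only:
print(correlation_with_humans([0.8, 0.6, 0.9, 0.3], [5, 3, 4, 2]))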

Conclusion

KIEval provides a robust framework for dynamic, interactive evaluation of large language models, reducing the impact of data contamination and offering deeper insights into a model's true capabilities. It shifts the focus from static evaluation to a more comprehensive assessment of knowledge understanding and application.

Citation

If you find our work helpful, please consider citing us with:


@article{yu2024kieval,
    title={KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models}, 
    author={Zhuohao Yu and Chang Gao and Wenjin Yao and Yidong Wang and Wei Ye and Jindong Wang and Xing Xie and Yue Zhang and Shikun Zhang},
    journal={ArXiv},
    year={2024},
    volume={abs/2402.15043},
}