As the field of artificial intelligence marvels at the remarkable performance of large language models like GPT-4 and Claude 3, a crucial question emerges: are we truly evaluating these models objectively? What capabilities do we actually measure by asking LLMs to answer multiple-choice questions? In practice, current evaluations of large models face the challenge of data contamination.
Data Contamination
Data contamination occurs when a model is exposed to benchmark test data during training, which inflates its performance metrics. Contamination is hard to detect because LLMs are trained on large, heterogeneous data sources, making it difficult to guarantee that test data has not leaked in. Some models are even trained directly on test sets to achieve higher scores, artificially boosting their performance and potentially misleading research.
Existing Methods vs KIEval
Detecting data contamination is currently quite hard. While existing methods work reasonably well for contamination introduced during the pre-training phase, in our experiments they perform poorly at detecting leaks introduced during supervised fine-tuning (SFT). KIEval is proposed as a contamination-free and effective method for evaluating LLMs' generative capabilities; it also helps detect data contamination. In short, KIEval is a knowledge-grounded, dynamic, interactive evaluation framework designed to assess a model's knowledge generalization and application abilities through multi-round dialogue.
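For intuition, here is a minimal sketch of one common contamination heuristic: checking n-gram overlap between training documents and benchmark test items. This is only an illustrative baseline under assumed inputs (`train_docs` and `test_items` are hypothetical placeholders), not one of the detection methods studied in the paper.

```python
from typing import Iterable, Set

def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_item: str, train_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training corpus."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)

# Hypothetical usage: flag test items sharing many long n-grams with training data.
# train_docs, test_items = load_corpora()  # placeholder loader, not a real API
# suspicious = [t for t in test_items if overlap_ratio(t, train_docs) > 0.5]
```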
How KIEval Works
- Dynamic Interactions: KIEval introduces an "interactor" model that engages in multi-round dialogues with the evaluated model. Each round generates new, deeper questions based on previous responses, testing the model's knowledge application and coherence.
- Evaluation Process: The dialogue starts with an initial question drawn from a high-quality dataset. The "interactor" then generates follow-up questions that probe deeper, while an "evaluator" model assesses each response for relevance, coherence, and logic. The process repeats iteratively, simulating an "interview" of the evaluated LLM (see the sketch after this list).
- Advantages: KIEval reduces the impact of data contamination and comprehensively evaluates the model's abilities beyond simple pattern matching.
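To make the loop concrete, below is a minimal sketch of the interactive evaluation flow. It assumes `candidate`, `interactor`, and `evaluator` are placeholder callables wrapping whatever LLM backends you use, and the scoring aspects follow the ones listed above; it is not the reference implementation.

```python
from typing import Callable, Dict, List

# Placeholder type: a function mapping a prompt string to a model response string.
LLM = Callable[[str], str]

def kieval_interview(seed_question: str,
                     candidate: LLM,
                     interactor: LLM,
                     evaluator: LLM,
                     max_rounds: int = 5) -> List[Dict]:
    """Run a multi-round 'interview': the interactor deepens the topic each round,
    the candidate answers, and the evaluator scores each answer."""
    transcript, results = [], []
    question = seed_question
    for round_idx in range(max_rounds):
        answer = candidate(question)
        transcript.append({"question": question, "answer": answer})

        # Ask the evaluator to judge the answer on the aspects described above.
        judgement = evaluator(
            "Score the answer on accuracy, logic, relevance, coherence and "
            f"conciseness, given the dialogue so far:\n{transcript}"
        )
        results.append({"round": round_idx, "answer": answer, "judgement": judgement})

        # The interactor generates a deeper follow-up grounded in the dialogue so far.
        question = interactor(
            f"Given this dialogue, ask a deeper follow-up question:\n{transcript}"
        )
    return results
```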
Insights from KIEval
Here we provide some high-level takeaways from the Experiments section in our paper.
- Performance Gaps: Traditional benchmarks often underestimate performance differences between models. KIEval's dynamic dialogues reveal significant disparities in knowledge application and logical reasoning that static datasets miss.
- Impact of Data Contamination: Experiments with "cheating" models showed that data contamination leads to high test scores but poor performance in dynamic dialogues. This indicates that contaminated models memorize answers rather than understanding them.
- Evaluating Knowledge Depth: Comparing KIEval scores with static-dataset accuracy helps infer data leakage: high static accuracy combined with poor performance in dynamic dialogue suggests memorization rather than true understanding (a toy heuristic is sketched after this list).
- Correlation with Human Evaluation: KIEval scores correlate strongly with human evaluations, demonstrating its effectiveness and alignment with human judgments of dialogue quality (see the first table below).
- Model Biases: Experiments show that while evaluator biases exist at the sample level, overall scores from different evaluator models are strongly correlated, indicating that such biases do not significantly affect the conclusions (see the second table below).
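As referenced above, here is a toy illustration of the memorization signal: a model whose static-benchmark accuracy is far above its KIEval score deserves a closer look. The function name and threshold are hypothetical, chosen only for illustration.

```python
def looks_contaminated(static_accuracy: float, kieval_overall: float,
                       gap_threshold: float = 20.0) -> bool:
    """Flag a suspicious gap between static benchmark accuracy and KIEval overall score.
    Both scores are assumed to be on a 0-100 scale; the threshold is illustrative."""
    return (static_accuracy - kieval_overall) > gap_threshold

# Example: 95% static accuracy but a KIEval overall score of 60 suggests memorization.
print(looks_contaminated(95.0, 60.0))  # True
```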
| Metric | Pearson | Spearman | Kendall-Tau | Variance |
|---|---|---|---|---|
| METEOR | 0.016 | 0.023 | 0.021 | 0.012 |
| ROUGE-1 | 0.259 | 0.316 | 0.226 | 0.016 |
| ROUGE-2 | 0.280 | 0.303 | 0.223 | 0.007 |
| ROUGE-L | 0.209 | 0.268 | 0.200 | 0.007 |
| BERTScore | 0.189 | 0.336 | 0.250 | 0.001 |
| MT-Bench | 0.520 | 0.494 | 0.405 | 9.360 |
| Ours | | | | |
| Accuracy | 0.761 | 0.727 | 0.653 | 2.010 |
| Logic | 0.768 | 0.735 | 0.661 | 1.842 |
| Relevance | 0.633 | 0.643 | 0.543 | 1.152 |
| Coherence | 0.750 | 0.740 | 0.644 | 1.365 |
| Conciseness | 0.611 | 0.604 | 0.504 | 0.833 |
| Overall | 0.814 | 0.789 | 0.721 | 1.512 |
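For readers who want to reproduce this kind of analysis, the statistics in the table above can be computed with SciPy. The sketch below uses placeholder score lists (`metric_scores` and `human_scores` are made-up illustrations, not data from the paper).

```python
from scipy.stats import pearsonr, spearmanr, kendalltau
import numpy as np

# Hypothetical paired scores for the same set of dialogue samples.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]   # e.g. a KIEval sub-score per sample
human_scores  = [0.60, 0.75, 0.50, 0.85, 0.70]   # human judgments for the same samples

pearson, _ = pearsonr(metric_scores, human_scores)
spearman, _ = spearmanr(metric_scores, human_scores)
kendall, _ = kendalltau(metric_scores, human_scores)
variance = np.var(metric_scores)

print(f"Pearson={pearson:.3f}, Spearman={spearman:.3f}, "
      f"Kendall-Tau={kendall:.3f}, Variance={variance:.3f}")
```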
| Model | Evaluator | Accuracy | Logic | Relevance | Coherence | Conciseness | Overall |
|---|---|---|---|---|---|---|---|
| GPT-3.5 | GPT-4 | 94.6 | 94.7 | 98.5 | 96.1 | 97.3 | 95.5 |
| GPT-3.5 | Claude-3 | 98.6 | 98.8 | 99.8 | 99.4 | 99.0 | 98.7 |
| LLaMA-2 70B | GPT-4 | 81.9 | 82.8 | 92.2 | 85.3 | 75.6 | 84.1 |
| LLaMA-2 70B | Claude-3 | 98.3 | 98.7 | 98.2 | 96.9 | 84.6 | 96.4 |
| LLaMA-2 7B | GPT-4 | 70.6 | 71.6 | 90.4 | 77.9 | 71.7 | 74.4 |
| LLaMA-2 7B | Claude-3 | 90.9 | 91.8 | 98.0 | 95.0 | 85.2 | 91.0 |
Conclusion
KIEval provides a robust framework for dynamic, interactive evaluation of large language models, reducing the impact of data contamination and offering deeper insights into a model's true capabilities. It shifts the focus from static evaluation to a more comprehensive assessment of knowledge understanding and application.
Citation
If you find our work helpful, please consider citing us with:
```bibtex
@article{yu2024kieval,
  title={KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models},
  author={Zhuohao Yu and Chang Gao and Wenjin Yao and Yidong Wang and Wei Ye and Jindong Wang and Xing Xie and Yue Zhang and Shikun Zhang},
  journal={ArXiv},
  year={2024},
  volume={abs/2402.15043},
}
```