KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Zhuohao Yu1, Chang Gao1, Wenjin Yao1, Yidong Wang1, Wei Ye1*, Jindong Wang2, Xing Xie2, Yue Zhang3, Shikun Zhang1
Peking University1, Microsoft Research2, Westlake University3
ACL 2024
*Corresponding Author

As the field of artificial intelligence is captivated by the remarkable performance of large language models such as GPT-4 and Claude 3, a crucial question emerges: are we truly evaluating these models' capabilities objectively? What capabilities do we actually measure by asking LLMs to answer multiple-choice questions? In practice, current evaluations of large models face the challenge of data contamination.

Comparison of KIEval and traditional methods

Data Contamination

Data contamination occurs when a model is exposed to a benchmark's test data during training, leading to inflated performance metrics. It can be hard to detect: LLMs are commonly trained on large, heterogeneous data sources, which makes leaks of test data difficult to rule out. Some models are even trained directly on test sets to achieve higher scores, artificially boosting their reported performance and potentially misleading research.
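
To make the problem concrete, one simple heuristic for spotting contamination is to measure n-gram overlap between a benchmark's test items and the training corpus. The sketch below is purely illustrative (it is not the detection method studied in the paper), and the function and variable names are hypothetical:

import re

def ngrams(text, n=13):
    # Word-level n-grams over lowercased, punctuation-stripped text.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_items, training_corpus, n=13, threshold=0.8):
    # Fraction of test items whose n-grams mostly appear somewhere in the training corpus.
    corpus_ngrams = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    flagged = 0
    for item in test_items:
        item_ngrams = ngrams(item, n)
        if item_ngrams and len(item_ngrams & corpus_ngrams) / len(item_ngrams) >= threshold:
            flagged += 1
    return flagged / max(len(test_items), 1)

Surface-level checks like this require access to the training data and are easily defeated by paraphrased or reformatted leaks, which is part of why contamination is so hard to rule out in practice.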

Existing Methods vs KIEval

Detecting data contamination is currently quite hard. While some methods work relatively well for detecting contamination introduced during the pre-training phase, in our experiments they perform poorly at detecting leaks introduced during supervised fine-tuning (SFT). KIEval is proposed as a contamination-resilient and effective way to evaluate LLMs' generative capabilities, and it also helps detect data contamination. In short, KIEval is a knowledge-grounded, dynamic, interactive evaluation framework designed to assess a model's knowledge generalization and application abilities through multi-round dialogue.
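
As one concrete example of the pre-training-phase detectors mentioned above, Min-K% Prob-style checks score a text by the average log-probability the model assigns to its least likely tokens; an unusually high score suggests the text was seen during training. The sketch below assumes a Hugging Face causal LM and is only illustrative of this family of methods, not necessarily the exact baselines used in our experiments:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob_score(text, model, tokenizer, k=0.2):
    # Average log-probability of the k% least likely tokens in the text.
    # Texts seen during training tend to contain fewer low-probability "surprise"
    # tokens, so a higher (less negative) score hints at memorization.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    k_count = max(1, int(len(token_lp) * k))
    lowest = torch.topk(token_lp, k_count, largest=False).values
    return lowest.mean().item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(min_k_prob_score("An example benchmark question and its reference answer.", model, tokenizer))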

KIEval Pipeline

How KIEval Works

  1. Dynamic Interactions: KIEval introduces an "interactor" model that engages in multi-round dialogues with the evaluated model. Each round generates new, deeper questions based on previous responses, testing the model's knowledge application and coherence.
  2. Evaluation Process: An initial question drawn from a high-quality dataset starts the dialogue. The "interactor" then generates follow-up questions that probe deeper, while an "evaluator" model assesses each response for relevance, coherence, and logic. The process repeats iteratively, simulating an "interview" of the LLM (see the sketch after this list).
  3. Advantages: KIEval reduces the impact of data contamination and comprehensively evaluates the model's abilities beyond simple pattern matching.
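
To make the loop above concrete, here is a minimal sketch of one interactive evaluation session. It assumes a generic chat-completion client behind the hypothetical call_llm helper; the prompts and the JSON scoring format are simplified placeholders rather than the actual KIEval implementation (see the released code for that):

import json

def call_llm(model, messages):
    # Placeholder for a chat-completion API call; returns the model's reply as text.
    raise NotImplementedError

def kieval_session(seed_question, candidate, interactor, evaluator, num_rounds=3):
    # One interactive "interview": the candidate answers, the evaluator scores,
    # and the interactor asks a deeper follow-up grounded in the dialogue so far.
    history = []
    question = seed_question  # round 1 starts from a question in a benchmark dataset
    scores = []
    for _ in range(num_rounds):
        answer = call_llm(candidate, history + [{"role": "user", "content": question}])
        history += [{"role": "user", "content": question},
                    {"role": "assistant", "content": answer}]

        rubric = ("Score the last answer from 1 to 5 for accuracy, logic, relevance, "
                  "coherence, and conciseness. Reply with a JSON object only.")
        verdict = call_llm(evaluator, history + [{"role": "user", "content": rubric}])
        scores.append(json.loads(verdict))  # assumes the evaluator follows the JSON format

        probe = "Ask one follow-up question that digs deeper into the topic discussed above."
        question = call_llm(interactor, history + [{"role": "user", "content": probe}])
    return scores

How the per-round scores are aggregated into the final per-dimension and overall metrics is omitted here; the paper and the released code define the exact prompts, stopping criteria, and weighting.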

Experimental Results

Insights from KIEval

Here we provide some high-level takeaways from the Experiments section in our paper.

  1. Performance Gaps: Traditional benchmarks often underestimate performance differences between models. KIEval's dynamic dialogues reveal significant disparities in knowledge application and logical reasoning that static datasets miss.
  2. Impact of Data Contamination: Experiments with "cheating" models showed that data contamination leads to high test scores but poor performance in dynamic dialogues. This indicates that contaminated models memorize answers rather than understanding them.
  3. Evaluating Knowledge Depth: Comparing KIEval scores with static-dataset accuracy helps infer data leakage: high accuracy on a static benchmark combined with poor performance in dynamic dialogue suggests memorization rather than true understanding.
  4. Correlation with Human Evaluation: KIEval scores correlate strongly with human evaluations, demonstrating the framework's effectiveness and alignment with human judgment of dialogue quality. The table below reports the correlation of each metric with human ratings (a sketch of how these correlation coefficients are computed follows this list).

     Metric       Pearson  Spearman  Kendall-Tau  Variance
     METEOR       0.016    0.023     0.021        0.012
     ROUGE-1      0.259    0.316     0.226        0.016
     ROUGE-2      0.280    0.303     0.223        0.007
     ROUGE-L      0.209    0.268     0.200        0.007
     BERTScore    0.189    0.336     0.250        0.001
     MT-Bench     0.520    0.494     0.405        9.360
     Ours:
     Accuracy     0.761    0.727     0.653        2.010
     Logic        0.768    0.735     0.661        1.842
     Relevance    0.633    0.643     0.543        1.152
     Coherence    0.750    0.740     0.644        1.365
     Conciseness  0.611    0.604     0.504        0.833
     Overall      0.814    0.789     0.721        1.512
  5. Model Biases: While biases exist at the sample level, the overall evaluation scores produced by different evaluator models are strongly correlated, indicating that the choice of evaluator does not significantly affect the conclusions. The table below reports results for each evaluated model under two different evaluators (GPT-4 and Claude-3) across the scoring dimensions.

     Model        Evaluator  Accuracy  Logic  Relevance  Coherence  Conciseness  Overall
     GPT-3.5      GPT-4      94.6      94.7   98.5       96.1       97.3         95.5
     GPT-3.5      Claude-3   98.6      98.8   99.8       99.4       99.0         98.7
     LLaMA-2 70B  GPT-4      81.9      82.8   92.2       85.3       75.6         84.1
     LLaMA-2 70B  Claude-3   98.3      98.7   98.2       96.9       84.6         96.4
     LLaMA-2 7B   GPT-4      70.6      71.6   90.4       77.9       71.7         74.4
     LLaMA-2 7B   Claude-3   90.9      91.8   98.0       95.0       85.2         91.0
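
As referenced in takeaway 4, the Pearson, Spearman, and Kendall-Tau coefficients can be computed directly from paired lists of metric scores and human ratings. A minimal sketch using scipy; the example numbers are made up for illustration and are not values from the paper:

from scipy import stats

def correlation_with_humans(metric_scores, human_ratings):
    # Both inputs are per-sample score lists aligned by index.
    pearson, _ = stats.pearsonr(metric_scores, human_ratings)
    spearman, _ = stats.spearmanr(metric_scores, human_ratings)
    kendall, _ = stats.kendalltau(metric_scores, human_ratings)
    return {"Pearson": pearson, "Spearman": spearman, "Kendall-Tau": kendall}

# Made-up example values, for illustration only:
print(correlation_with_humans([0.8, 0.6, 0.9, 0.3], [5, 3, 4, 2]))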

Conclusion

KIEval provides a robust framework for dynamic, interactive evaluation of large language models, reducing the impact of data contamination and offering deeper insights into a model's true capabilities. It shifts the focus from static evaluation to a more comprehensive assessment of knowledge understanding and application.

Citation

If you find our work helpful, please consider citing us with:


@article{yu2024kieval,
    title={KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models}, 
    author={Zhuohao Yu and Chang Gao and Wenjin Yao and Yidong Wang and Wei Ye and Jindong Wang and Xing Xie and Yue Zhang and Shikun Zhang},
    journal={ArXiv},
    year={2024},
    volume={abs/2402.15043},
}