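Evaluation

Many companies face the challenge of evaluating whether a Large Language Model (LLM) is suitable for their specific use cases and determining which LLMs offer the best performance. It is essential to assess whether one model can effectively replace another and to ensure that the chosen LLM meets the company's unique requirements. However, the process of evaluating LLMs can be complex and confusing. This guide provides a comprehensive approach to evaluating LLMs for various applications. We start by discussing the goal of the evaluation and the creation of evaluation datasets, then dive into three primary evaluation methods, with detailed Python code walkthroughs in notebooks.

What exact task do you want to solve?
How to create an evaluation dataset?
How to evaluate?
- Metrics-based evaluation
- LLM-based evaluation
- Human-based evaluation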

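What is your task?

Before starting the evaluation process, think about what exact task you want to solve with the LLM. It is crucial that the task is precisely defined.

Some bad examples

Be a helpful assistant (What does "helpful" mean in your context?)
Convert the text into code (What type of text and code?)
Improve the quality of the documents (What separates high quality from low quality?)

Some good examples

Be an assistant for customer complaints. Make sure to answer complaints in a polite and helpful way and give concise answers of no more than 3 sentences.
Make sure to notify the user when you don't know the answer.
Convert pseudocode into Python code
Rephrase the documents by correcting all spelling and grammatical errors and give the text a more professional tone
Extract all relevant information from medical records (some use cases will be subject to specific applicable regulations)
Summarize various types of documents, such as legal agreements, news articles, or scientific papers

Different goals and use cases may require different evaluation strategies. Some use cases may prioritize accuracy, while others may emphasize brevity and helpfulness. It is therefore crucial to know exactly what task you want to solve before starting the evaluation process.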

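How to create an evaluation dataset?

There are numerous public benchmarks available for evaluating LLMs, such as MMLU, GSM8K, and others. The first question to consider is whether these existing benchmarks can be used for your specific use case. If so, start with the existing benchmark dataset.

However, we often observe that existing benchmarks are academic in nature and may not cover real-world customer use cases, or only address a limited scope. It is therefore often preferable to create your own evaluation dataset.

Once you have established your goals and determined whether existing benchmarks are suitable, you can proceed to create a custom evaluation dataset tailored to your specific use case.

Labeled data

In many cases, it is important to create labeled data. For example, if your task is to use an LLM to extract information from medical records that are subject to specific applicable regulations, you can use human annotation to label the data and obtain a golden answer that includes all the information you would like to extract, e.g., {"age": 60, "gender": "male", "diagnosis": "diabetes", "weight": 210, "smoking": "yes"}.

How much data annotation do you need? It depends on the task. Always prioritize quality over quantity. If you have high-quality data, 100 test cases can be enough to provide a meaningful signal.

It is also recommended to have a discriminative benchmark, meaning one that is neither too easy nor too hard. If every model being tested achieves over 90% accuracy, the differences between them might not be meaningful, which indicates the need for a better benchmark. The goal is to create data that can differentiate models effectively. Creating discriminative data can be challenging, so focus on your specific task. Multiple models may well be good enough for your needs. A good evaluation dataset should produce results that agree with model size: larger models should perform on par with or better than smaller ones (unless the smaller models have been specifically trained for your purposes).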
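One rough way to check the point above is to look at how far apart the tested models score on your benchmark. The sketch below is illustrative only: the function name `is_discriminative` and the 5-point minimum spread are assumptions, and the 90% ceiling simply mirrors the example in the preceding paragraph rather than being a prescribed threshold.

```python
def is_discriminative(scores, ceiling=90.0, min_spread=5.0):
    """Return True if the benchmark separates the models.

    scores: dict mapping a model name to its accuracy in percent.
    ceiling and min_spread are illustrative thresholds, not official values.
    """
    values = list(scores.values())
    if len(values) < 2:
        return False  # need at least two models to compare
    all_saturated = min(values) > ceiling  # every model is near the top
    spread = max(values) - min(values)
    return (not all_saturated) and spread >= min_spread


# Hypothetical scores, for illustration only
saturated = {"model-a": 96.0, "model-b": 97.5, "model-c": 95.0}
useful = {"model-a": 62.0, "model-b": 78.5, "model-c": 85.0}
print(is_discriminative(saturated))
print(is_discriminative(useful))
```

If the check fails, the usual remedy is to add harder or more task-specific test cases rather than to tune the thresholds.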

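Here are some examples of prompts and golden answers across different capabilities

Task	Prompt/Context	Golden Answer/Criteria
Information extraction	A 60-year-old male patient, Mr. Johnson, presented with symptoms of increased thirst, frequent urination, fatigue, and unexplained weight loss. Upon evaluation, he was diagnosed with diabetes, confirmed by elevated blood sugar levels. Mr. Johnson's weight is 210 lbs. He has been prescribed Metformin to be taken twice daily with meals. It was noted during the consultation that the patient is a current smoker.	`{"age": 60, "gender": "male", "diagnosis": "diabetes", "weight": 210, "smoking": "yes"}`
Python code generation	Write a Python function to check whether the given number is odd or not using a bitwise operator.	`assert is_odd(5) == True`
Summarization	Summarize the following into a short paragraph for 3rd graders: "Water is a chemical compound composed of two hydrogen atoms bonded to one oxygen atom, with the chemical formula H2O. The hydrogen atoms are bonded to the oxygen atom via polar covalent bonds, which means that the electrons in the bond are not shared equally between the atoms, resulting in a slight negative charge on the oxygen atom and a slight positive charge on the hydrogen atoms. This gives water molecules a bent, V-shaped structure and allows them to form hydrogen bonds with other water molecules, which are responsible for many of water's unique properties, such as its high boiling point and surface tension."	Water is a special substance made up of two tiny parts called atoms, specifically two hydrogen atoms and one oxygen atom. These atoms are joined together in a bent shape, which gives water some amazing properties. For example, water can stick together, which helps it stay a liquid instead of turning into a gas. This also gives water a high boiling point, which means it needs a lot of heat to turn into a gas. Water's unique properties make it an important part of our lives and our planet.
Multilingual	Translate this college homework assignment into advanced Spanish for an advanced writing course: "Write a personal essay of 500 words about a crucial moment in your family, describing the event, its impact on family relationships, and the emotions and thoughts that surrounded it. Your essay should have a clear structure, with an introduction, body, and conclusion, and use expressive and precise language. It will be evaluated based on clarity, coherence, depth, and originality."	Escriba un ensayo personal de 500 palabras sobre un momento crucial en su familia, describiendo el evento, su impacto en las relaciones familiares, y las emociones y pensamientos que lo rodearon. Su ensayo debe tener una estructura clara, con una introducción, un cuerpo y una conclusión, y utilizar un lenguaje expresivo y preciso. Será evaluado sobre la base de claridad, coherencia, profundidad y originalidad.
Chat	You are an 8th grade tutor for a Latin class, and your job is to help students memorize their weekly word list test. You don't give the answers to your students but instead help them get to the answer themselves. Begin your tutoring session by asking your student which word list they need help memorizing. Remember to be encouraging and ask students to "show their work", i.e., to articulate how they arrived at the answer.	Hello there! I hope you're having a great day. As your Latin tutor, I'm here to help you with your weekly word list test. So, let's get started! Could you please tell me which word list you need help memorizing this week? Remember, the goal is not just to memorize the words, but also to understand their meanings and how to use them in context. Once you've chosen a word, I'd like you to try to recall its meaning or use it in a sentence. Then, I'd love for you to "show your work" by explaining how you arrived at your answer. This will help reinforce your understanding and make it easier to remember. Does that sound good? Let's get started!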

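Unlabeled data

In some cases, you may not be able to create labeled data easily. You can use an LLM to generate the necessary data. For example, you can ask an LLM to generate questions and answers based on a given text. Here are two prompt templates for generating questions and answers

Prompt template for generating questions based on the context
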

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge. Generate {num_questions_per_chunk} questions based on the context. The questions should be diverse in nature across the document. Restrict the questions to the context information provided.

Prompt template for generating answers based on the context and the questions generated by the previous prompt template

Context information is below
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge,
answer the query.
Query: {generated_query_str}
Answer: 

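However, LLM-generated data usually requires further refinement and filtering. It is preferable to have questions derived from real-world sources, especially from experts. Nevertheless, LLM-generated data can still be helpful when real data is difficult to obtain or scarce.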
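As a minimal sketch of the refinement step mentioned above, the code below fills the question-generation template and applies a few simple filters to generated question/answer pairs. The helper names `build_question_prompt` and `filter_qa_pairs`, the shortened template wording, and the 10-character answer threshold are all assumptions made for illustration; a real pipeline would likely add checks that each answer is actually supported by the context.

```python
QUESTION_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "generate {num_questions_per_chunk} questions based on the context."
)


def build_question_prompt(context_str, num_questions_per_chunk=3):
    """Fill the question-generation template for one chunk of text."""
    return QUESTION_TEMPLATE.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )


def filter_qa_pairs(pairs, min_answer_chars=10):
    """Drop empty, too-short, and duplicate question/answer pairs.

    pairs: list of (question, answer) tuples produced by an LLM.
    min_answer_chars is an illustrative threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for question, answer in pairs:
        q, a = question.strip(), answer.strip()
        key = q.lower()  # case-insensitive duplicate detection
        if not q or len(a) < min_answer_chars or key in seen:
            continue
        seen.add(key)
        kept.append((q, a))
    return kept


# Hypothetical generated pairs, for illustration only
raw_pairs = [
    ("What is H2O?", "Water, a compound of hydrogen and oxygen."),
    ("what is h2o?", "Water is made of hydrogen and oxygen."),
    ("Why is water bent?", "N/A"),
    ("", "An answer without a question."),
]
clean_pairs = filter_qa_pairs(raw_pairs)
```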

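How to evaluate?

Metrics-based evaluation

Metrics-based evaluation is the standard approach in many public benchmark evaluations. There are a number of metrics that can be used to compare model-generated answers with golden answers. The most common ones are exact match accuracy, recall, precision, and F1 score.

Exact match accuracy: measures the proportion of model-generated answers that match the golden answers exactly. It is a strict evaluation of the model's ability to generate precise responses.
Recall: calculates the fraction of relevant information in the golden answers that the model-generated answers successfully retrieve. A higher recall indicates that the model captures more of the relevant information.
Precision: assesses the fraction of information in the model-generated answers that is relevant. A higher precision means that the model generates more accurate and relevant responses.
F1 score: the harmonic mean of recall and precision, providing a balanced evaluation of the model's performance on both metrics.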
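The four metrics above can be sketched for free-text answers as follows. This is a minimal illustration that assumes lowercased whitespace tokens as the unit of comparison, which is one common convention but not the only one; the function names are hypothetical.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def precision_recall_f1(prediction, reference):
    """Token-overlap precision, recall, and F1 on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: a shared token is counted at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)  # share of the prediction that is relevant
    recall = overlap / len(ref_tokens)      # share of the reference that was recovered
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


p, r, f1 = precision_recall_f1(
    "the patient has diabetes", "patient diagnosed with diabetes"
)
print(p, r, f1)
```

For structured outputs such as JSON, comparing field by field is usually more informative than token overlap.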

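Besides these commonly used metrics, there are other NLP evaluation metrics such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation). BLEU is primarily used for machine translation tasks, while ROUGE is commonly applied to text summarization. However, we do not recommend these metrics because of concerns about their reliability and their potential to produce misleading results.

Example 1: evaluate information extraction with accuracy rate

Evaluation data

The first example involves extracting patient information from medical notes. To perform the evaluation, the evaluation data must contain both the medical notes and the correct/expected answer (referred to as the "golden answer"). Here are two example test cases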

prompts = {
    "Johnson": {
        "medical_notes": "A 60-year-old male patient, Mr. Johnson, presented with symptoms of increased thirst, frequent urination, fatigue, and unexplained weight loss. Upon evaluation, he was diagnosed with diabetes, confirmed by elevated blood sugar levels. Mr. Johnson's weight is 210 lbs. He has been prescribed Metformin to be taken twice daily with meals. It was noted during the consultation that the patient is a current smoker. ",
        "golden_answer": {
            "age": 60,
            "gender": "male",
            "diagnosis": "diabetes",
            "weight": 210,
            "smoking": "yes",
        },
    },
    "Smith": {
        "medical_notes": "Mr. Smith, a 55-year-old male patient, presented with severe joint pain and stiffness in his knees and hands, along with swelling and limited range of motion. After a thorough examination and diagnostic tests, he was diagnosed with arthritis. It is important for Mr. Smith to maintain a healthy weight (currently at 150 lbs) and quit smoking, as these factors can exacerbate symptoms of arthritis and contribute to joint damage.",
        "golden_answer": {
            "age": 55,
            "gender": "male",
            "diagnosis": "arthritis",
            "weight": 150,
            "smoking": "yes",
        },
    },
}

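How to evaluate?

Step 1: Define prompt template

We design a prompt that takes the medical notes as context. We also ask the model to return JSON that follows a predefined JSON schema, so that the output contains exactly the fields we want to compare. Note that when calling the model, we set the response format to {"type": "json_object"} to ensure consistent JSON output.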

import os
from mistralai import Mistral

def run_mistral(user_message, model="mistral-large-latest"):
    client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))
    messages = [{"role": "user", "content": user_message}]
    chat_response = client.chat.complete(
        model=model,
        messages=messages,
        response_format={"type": "json_object"},
    )
    return chat_response.choices[0].message.content


# define prompt template
prompt_template = """
Extract information from the following medical notes:
{medical_notes}

Return json format with the following JSON schema: 

{{
        "age": {{
            "type": "integer"
        }},
        "gender": {{
            "type": "string",
            "enum": ["male", "female", "other"]
        }},
        "diagnosis": {{
            "type": "string",
            "enum": ["migraine", "diabetes", "arthritis", "acne", "common cold"]
        }},
        "weight": {{
            "type": "integer"
        }},
        "smoking": {{
            "type": "string",
            "enum": ["yes", "no"]
        }},
        
}}
"""

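Step 2: Define how we compare the model response with the golden answer

In step 2, we write a function to compare two JSON objects, one being the model response and the other the golden answer. In this example, we calculate the percentage of fields whose values match in order to assess the accuracy of the JSON output.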

import json

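def compare_json_objects(obj1, obj2):
    """Percentage of fields in obj2 (golden answer) that obj1 (model response) matches."""
    if not obj2:
        return 0.0
    # A field counts as identical only if the key exists in both objects and
    # the values are equal; keys missing from the response count as misses.
    identical_fields = sum(
        1 for key, value in obj2.items() if key in obj1 and obj1[key] == value
    )
    # Divide by the number of golden fields so that an incomplete response
    # cannot score 100%.
    return identical_fields / len(obj2) * 100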

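Step 3: Calculate accuracy rate across test cases

Now, we can go through each test case individually. For each case, we create a user message from the prompt template defined above. We then retrieve the response from the LLM and compare it with the golden answer. After calculating the accuracy rate for each test case, we can compute the overall average accuracy rate across all cases.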

accuracy_rates = []

# for each test case
for name in prompts:

    # define user message
    user_message = prompt_template.format(medical_notes=prompts[name]["medical_notes"])

    # run LLM
    response = json.loads(run_mistral(user_message))

    # calculate accuracy rate for this test case
    accuracy_rates.append(
        compare_json_objects(response, prompts[name]["golden_answer"])
    )

# calculate accuracy rate across test cases
sum(accuracy_rates) / len(accuracy_rates)

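Example 2: evaluate code generation

Evaluation data

The second example involves generating Python code and assessing the generated code. To perform the evaluation, the evaluation data must contain both the Python instructions and the corresponding unit tests. Here are two examples of such evaluation data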

python_prompts = {
    "sort_string": {
        "prompt": "Write a python function to sort the given string.", 
        "test": "assert sort_string(\"data\") == \"aadt\""
    },
    "is_odd": {
        "prompt": "Write a python function to check whether the given number is odd or not using bitwise operator.", 
        "test": "assert is_odd(5) == True"
    }
}

How to evaluate?

Step 1: Define prompt template

We design a prompt that generates a Python code snippet from a description of a specific task.

import os
from mistralai import Mistral

def run_mistral(user_message, model="mistral-large-latest"):
    client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))
    messages = [{"role": "user", "content": user_message}]
    # No JSON response format here: the model should return plain Python code
    chat_response = client.chat.complete(
        model=model,
        messages=messages,
    )
    return chat_response.choices[0].message.content


# define prompt template
prompt_template = """Write a Python function to execute the following task: {task} 
Return only valid Python code. Do not give any explanation.
Never start with ```python.
Always start with def {name}(. 
"""

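Step 2: Decide how we evaluate the code generation

Next, we can use code_eval.compute to evaluate whether the generated code passes the predefined unit tests. predictions is a list of code outputs generated by the LLM, and references is a list of unit tests, one for each prediction.

Warning

This code is designed to run model-generated code, which may not be entirely reliable. While it is unlikely that model-generated code will intentionally cause harm, it is strongly recommended to run this evaluation suite in a sandbox environment, isolated from your main system and network.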

from evaluate import load

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
code_eval = load("code_eval")

# example using code_eval:
pass_at_1, results = code_eval.compute(
    references=['assert is_odd(5) == True'], 
    predictions=[['def is_odd(n):\n    return n & 1 != 0']], 
    k=[1])

# example output: 
# ({'pass@1': 1.0},
#  defaultdict(list,
#              {0: [(0,
#                 {'task_id': 0,
#                  'passed': True,
#                  'result': 'passed',
#                  'completion_id': 0})]}))

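Step 3: Calculate accuracy rate across test cases

Now, we can go through all the test cases, create a user message from the prompt template, use the LLM to generate Python code, and evaluate the generated code for each test case.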
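Models occasionally wrap their answer in a Markdown code fence even when the prompt forbids it, which would make a unit test fail for formatting reasons rather than because the logic is wrong. Below is a minimal sketch of a cleanup step that could be applied to each response before it is added to the predictions. The helper name `strip_code_fence` is hypothetical and is not part of the `evaluate` or `mistralai` libraries.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid writing a literal fence here

# Matches a single fence around the whole text, with an optional language tag
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(text):
    """Return the code inside one surrounding fence, or the stripped text."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


# Hypothetical model output that ignored the "no fence" instruction
fenced = FENCE + "python\ndef is_odd(n):\n    return n & 1 != 0\n" + FENCE
cleaned = strip_code_fence(fenced)
```

In the loop, this would amount to appending `[strip_code_fence(response)]` instead of `[response]`.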

refs = []
preds = []

for name in python_prompts:

    # define user message
    user_message = prompt_template.format(
        task=python_prompts[name]["prompt"], name=name
    )

    # run LLM
    response = run_mistral(user_message)

    refs.append(python_prompts[name]["test"])
    preds.append([response])

# evaluate code generation
pass_at_1, results = code_eval.compute(references=refs, predictions=preds)

pass_at_1

# example output
# {'pass@1': 1.0}
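With one generated sample per task, pass@1 is simply the fraction of tasks whose code passes its unit test. When several samples are generated per task, pass@k is usually estimated with the unbiased formula 1 - C(n - c, k) / C(n, k) introduced with the HumanEval benchmark, which, to our understanding, is what `code_eval` computes internally. The standalone sketch below is for illustration; the function name `pass_at_k` is our own.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for a task
    c: number of those samples that pass the unit tests
    k: number of samples that may be drawn
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 of which pass: chance that a single random draw passes
print(pass_at_k(10, 3, 1))
```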

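LLM-based evaluation

Using an LLM to evaluate or judge the output of another LLM is a common practice when labeled data is unavailable or insufficient. The MT Bench paper explored the effectiveness of LLM-as-a-judge and found that strong LLMs can perform similarly to humans. Compared with human evaluation, which typically requires considerable time and resources, LLMs can process and evaluate large amounts of data in a relatively short time, making the approach highly scalable and efficient.

There are several approaches to using an LLM as a judge, including single-point grading, reference-based grading, and pairwise grading.

Single-point grading: the LLM assigns a single score to a generated output based on its quality or accuracy, typically following specific grading instructions. Single-point grading is straightforward and efficient, but it may not capture the nuances of complex outputs.
Reference-based grading: the LLM compares a generated output with one or more reference outputs and assigns a score based on their similarity. This approach is often used in machine translation, where a given input may have multiple valid translations. However, reference-based grading requires a golden answer, which is not always available.
Pairwise grading: the LLM compares two generated outputs and assigns a score based on their relative quality or accuracy. This approach is often used in tasks such as dialogue generation, where a given query may have multiple valid responses. By comparing pairs of responses, the LLM can determine which one is more relevant or coherent and score accordingly.

It is also essential to recognize the potential limitations and challenges. For example, LLMs may exhibit inherent biases: an LLM developed by one company tends to favor answers generated by models from the same company, which makes it difficult to ensure a fair and accurate evaluation. In our experience, Mistral Large exhibits relatively little favoritism.

Example 3: evaluate summary generation with LLM

Evaluation data

In this example, we generate news summaries and use LLM single-point grading to evaluate the quality of the summary. To carry out the evaluation, let's use a sample news article to summarize.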

news = (
    "BRUSSELS (Reuters) - Theresa May looked despondent , with deep rings under her eyes, EU chief executive Jean-Claude Juncker told aides after dining with the British prime minister last week, a German newspaper said on Sunday. The report by a Frankfurter Allgemeine Zeitung correspondent whose leaked account of a Juncker-May dinner in April caused upset in London, said Juncker thought her marked by battles over Brexit with her own Conservative ministers as she asked for EU help to create more room for maneuver at home. No immediate comment was available from Juncker s office, which has a policy of not commenting on reports of meetings. The FAZ said May, who flew in for a hastily announced dinner in Brussels with the European Commission president last Monday ahead of an EU summit, seemed to Juncker anxious, despondent and disheartened , a woman who trusts hardly anyone but is also not ready for a clear-out to free herself . As she later did over dinner on Thursday with fellow EU leaders, May asked for help to overcome British divisions. She indicated that back home friend and foe are at her back plotting to bring her down, the paper said. May said she had no room left to maneuver. The Europeans have to create it for her. May s face and appearance spoke volumes, Juncker later told his colleagues, the FAZ added. She has deep rings under her eyes. She looks like someone who can t sleep a wink. She smiles for the cameras, it went on, but it looks forced , unlike in the past, when she could shake with laughter. Now she needs all her strength not to lose her poise. As with the April dinner at 10 Downing Street, when the FAZ reported that Juncker thought May in another galaxy in terms of Brexit expectations, both sides issued statements after last week s meeting saying talks were constructive and friendly . They said they agreed negotiations should be accelerated . "
    "May dismissed the dinner leak six months ago as Brussels gossip , though officials on both sides said the report in the FAZ did little to foster an atmosphere of trust which they agree will be important to reach a deal. German Chancellor Angela Merkel was also reported to have been irritated by that leak. Although the summit on Thursday and Friday rejected May s call for an immediate start to talks on the future relationship, leaders made a gesture to speed up the process and voiced hopes of opening a new phase in December. Some said they understood May s difficulties in forging consensus in London."
)

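How to evaluate?

Step 1: Generate summary for the given news

First, let's use a smaller model, open-mistral-7b, to generate a summary of the news article provided. If you have additional news articles to summarize, generate a summary for each one. For simplicity, this example demonstrates only one news sample.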

import os
from mistralai import Mistral

def run_mistral(user_message, model="open-mistral-7b", is_json=False):
    client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))
    messages = [{"role":"user", "content":user_message}]

    if is_json:
        chat_response = client.chat.complete(
            model=model, messages=messages, response_format={"type": "json_object"}
        )
    else:
        chat_response = client.chat.complete(model=model, messages=messages)

    return chat_response.choices[0].message.content

summary_prompt = f"""
Summarize the following news. Write the summary based on the following criteria: relevancy and readability. Consider the sources cited, the quality of evidence provided, and any potential biases or misinformation. 

## News: 
{news}
"""

summary = run_mistral(summary_prompt)

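Step 2: Define evaluation metrics and rubrics

To accurately evaluate the quality of the generated summaries, we need to establish clear and well-defined evaluation metrics and rubrics. These guidelines play a pivotal role in directing the LLM during its evaluation. Feel free to incorporate different metrics and create rubrics tailored to your specific needs.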

eval_rubrics = [
    {
        "metric": "relevancy", 
        "rubrics": """
        Score 1: The summary is not relevant to the original text. 
        Score 2: The summary is somewhat relevant to the original text, but has significant flaws.
        Score 3: The summary is mostly relevant to the original text, and effectively conveys its main ideas and arguments.
        Score 4: The summary is highly relevant to the original text, and provides additional value or insight.
        """
    },
    {
        "metric": "readability", 
        "rubrics": """
        Score 1: The summary is difficult to read and understand.
        Score 2: The summary is somewhat readable, but has significant flaws.
        Score 3: The summary is mostly readable and easy to understand.
        Score 4: The summary is highly readable and engaging.
        """
    },
    
]

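Step 3: Employ a more powerful LLM (e.g., Mistral Large) as a judge

Using a more powerful LLM such as Mistral Large as the judge helps ensure a more accurate and comprehensive evaluation of the generated summaries. In the prompt, we provide the specific evaluation metric, the associated rubrics, the original news article, and the generated summary. This information lets the LLM assess the summary systematically against the predefined criteria. In this example, we evaluate each metric separately to better understand the summarization model's performance in different respects. However, you can also choose to combine all metrics for a more general evaluation.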

scoring_prompt = """
Please read the provided news article and its corresponding summary. Based on the specified evaluation metric and rubrics, assign an integer score between 1 and 4 to the summary. Then, return a JSON object with the metric as the key and the evaluation score as the value.

# Evaluation metric:
{metric}

# Evaluation rubrics:
{rubrics}

# News article
{news}

# Summary
{summary}

"""
for i in eval_rubrics:
    eval_output = run_mistral(
        scoring_prompt.format(
            news=news, summary=summary, metric=i["metric"], rubrics=i["rubrics"]
        ),
        model="mistral-large-latest",
        is_json=True,
    )
    print(eval_output)

# example output:
# {"relevancy": 2}
# {"readability": 3}
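When there are many summaries to grade, the judge's JSON outputs need to be collected and averaged per metric. A minimal sketch follows, assuming each output is a JSON string of the form shown in the example output above. The function name `aggregate_scores` is hypothetical, and silently skipping malformed or out-of-range outputs is a design choice made here for brevity; you may prefer to log or retry them.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_scores(raw_outputs, min_score=1, max_score=4):
    """Average judge scores per metric, skipping malformed outputs.

    raw_outputs: list of JSON strings such as '{"relevancy": 2}'.
    """
    by_metric = defaultdict(list)
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # the judge returned something that is not JSON
        if not isinstance(parsed, dict):
            continue
        for metric, score in parsed.items():
            # Keep only integer scores inside the rubric range
            is_int = isinstance(score, int) and not isinstance(score, bool)
            if is_int and min_score <= score <= max_score:
                by_metric[metric].append(score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}


# Hypothetical judge outputs for two summaries, plus two invalid ones
outputs = [
    '{"relevancy": 2}',
    '{"relevancy": 4}',
    '{"readability": 3}',
    "not json",
    '{"readability": 7}',
]
averages = aggregate_scores(outputs)
```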

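Human-based evaluation

Human-based evaluation is likely to provide the most accurate and reliable results. However, it is difficult and costly to scale. Despite these challenges, integrating human evaluation into the development of better training data and into the fine-tuning process can still be highly beneficial because of its effectiveness.

One common approach to human evaluation is crowdsourcing. For example, the well-known LMSYS leaderboard uses crowdsourcing: voters ask two anonymous models questions and vote for the better one. LMSYS has collected over 300,000 human votes to create an Elo-based LLM leaderboard. Many companies also use human evaluation for their own internal benchmarks, often relying on crowdsourcing platforms to facilitate the process.

When implementing crowdsourcing for human evaluation, you can opt for a simple approach and ask voters to select the better model. Alternatively, if your use case has more complex requirements, you can give voters specific criteria to consider, such as empathy, fluency, and other relevant factors.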
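To illustrate how pairwise votes can be turned into a ranking like the Elo-based leaderboard mentioned above, here is a minimal sketch of the standard Elo update. The K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration; the actual leaderboard computation may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Update ratings in place after one vote in which `winner` was preferred."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # larger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta
    return ratings


# Hypothetical votes: each tuple is (preferred model, other model)
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating across models stays constant.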
