LMFlow Benchmark Guide#

We support two ways to add evaluation settings in our repo: the NLL Task Setting and the LM-Evaluation Task Setting. The details of each are below:

1. NLL Task Setting#

Users can easily create new tasks and evaluate their datasets with the provided nll (negative log-likelihood) metric.
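
For intuition, the short sketch below (not LMFlow's actual implementation, just an illustration using Hugging Face transformers with a gpt2 checkpoint) shows what the metric measures: the average negative log-likelihood of a text's tokens under a causal language model.

# Illustrative sketch of the nll metric, not LMFlow's exact code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "###Human: What is the capital of France?###Assistant: Paris."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss,
    # i.e. the average negative log-likelihood per token.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"nll per token: {loss.item():.4f}")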

Setup#

Fork the main repo, clone it, create a new branch named after your task, and install the dependencies:

# After forking...
git clone https://github.com/<YOUR-USERNAME>/LMFlow.git
cd LMFlow
git checkout -b <TASK-NAME>
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .

Create Your Task Dataset File#

We provide several example datasets under data after running:

cd data && ./download.sh && cd -

You can refer to the provided evaluation dataset files when creating your own. You may also refer to our guide on DATASET.

In this step, you will need to decide on your answer type, such as text2text or text_only (note that the current nll implementation only supports these two answer types). We will denote the chosen answer type as <ANSWER_TYPE>.
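
For instance, a minimal text2text evaluation file could be produced by the sketch below, assuming the JSON layout ("type" plus "instances") described in the DATASET guide; the file name my_eval.json and its contents are placeholders.

# Hypothetical example: write a tiny text2text evaluation file.
# Assumes the JSON layout ("type" + "instances") from the DATASET guide.
import json

dataset = {
    "type": "text2text",  # the <ANSWER_TYPE>
    "instances": [
        {"input": "What is the capital of France?", "output": "Paris."},
        {"input": "How many days are in a week?", "output": "Seven."},
    ],
}

with open("my_eval.json", "w") as f:
    json.dump(dataset, f, indent=4, ensure_ascii=False)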

After preparing your own DATASET file, put it under the data directory inside a new TASK directory:

cd data
mkdir <TASK>
mv <DATASET> <TASK>
cd -

Task Registration#

Note the path of your dataset, data/<TASK>/<DATASET>.

Open the file examples/benchmarking.py and add your task's info to LOCAL_DATSET_GROUP_MAP, LOCAL_DATSET_MAP, and LOCAL_DATSET_ANSWERTYPE_MAP.

In LOCAL_DATSET_MAP, you will need to specify your DATASET file's path:

LOCAL_DATSET_MAP = {
    "...": "...",
    "<TASK>": "data/<TASK>/<DATASET>",
}

In LOCAL_DATSET_ANSWERTYPE_MAP, you will need to specify your task’s <ANSWER_TYPE>:

LOCAL_DATSET_ANSWERTYPE_MAP = {
    "...": "...",
    "<TASK>": "<ANSWER_TYPE>",
}

If you only have one task, you can add a key-value pair like "<TASK>": "<TASK>" in LOCAL_DATSET_GROUP_MAP:

LOCAL_DATSET_GROUP_MAP = {
    "...": "...",
    "<TASK>": "<TASK>",
}

If you want to combine several tasks, first specify a combination name <TASK_COMBINATION> and add a key-value pair like "<TASK_COMBINATION>": "<TASK_1>,<TASK_2>,.." in LOCAL_DATSET_GROUP_MAP.

Remember to separate the tasks with commas (,):

LOCAL_DATSET_GROUP_MAP = {
    "...": "...",
    "<TASK_COMBINATION>": "<TASK_1>,<TASK_2>,..",
}
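
Putting the three maps together, a hypothetical registration of a text2text task named my_task, a text_only task named my_other_task, and a combination of the two might look like the sketch below (all names and paths are placeholders):

# Hypothetical entries in examples/benchmarking.py; names and paths are placeholders.
LOCAL_DATSET_MAP = {
    "my_task": "data/my_task/my_eval.json",
    "my_other_task": "data/my_other_task/other_eval.json",
}

LOCAL_DATSET_ANSWERTYPE_MAP = {
    "my_task": "text2text",
    "my_other_task": "text_only",
}

LOCAL_DATSET_GROUP_MAP = {
    "my_task": "my_task",
    "my_other_task": "my_other_task",
    "my_combination": "my_task,my_other_task",
}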

After changing these items, you can run your own <TASK> like this:

# Set model_name and log_dir before running, e.g. model_name=gpt2 and log_dir=output_dir.
deepspeed examples/benchmarking.py \
  --answer_type <ANSWER_TYPE> \
  --use_ram_optimized_load False \
  --model_name_or_path ${model_name} \
  --dataset_name data/<TASK>/<DATASET> \
  --deepspeed examples/ds_config.json \
  --metric nll \
  --prompt_structure "###Human: {input}###Assistant:" \
  | tee ${log_dir}/train.log \
  2> ${log_dir}/train.err

2. LM-Evaluation Task Setting#

We integrate EleutherAI/lm-evaluation-harness into examples/benchmarking.py by directly executing its evaluation commands. Users can also run their own evaluation by simply changing two items in its LM_EVAL_DATASET_MAP.

Please refer to Eleuther's task-table to get the exact <TASK> name.

Similarly, to combine several tasks, first specify a combination name <TASK_COMBINATION> and add a key-value pair like "<TASK_COMBINATION>": "<TASK_1>,<TASK_2>,.." in LM_EVAL_DATASET_MAP.

Again, remember to separate the tasks with commas (,):

LM_EVAL_DATASET_MAP = {
    "...": "...",
    "<TASK_COMBINATION>": "<TASK_1>,<TASK_2>,..",
}
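
For example, a hypothetical combination entry covering three harness tasks could look like the following; hellaswag, arc_easy, and openbookqa are task names from the harness task-table, while the combination name is a placeholder.

LM_EVAL_DATASET_MAP = {
    "...": "...",
    # Hypothetical combination; the task names come from the harness task-table.
    "commonsense_suite": "hellaswag,arc_easy,openbookqa",
}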