LMFlow Benchmark Guide#
We support two ways to add evaluation settings in our repo: the NLL Task Setting and the LM-Evaluation Task Setting. Below are the details of each:
1. NLL Task Setting#
Users can easily create new tasks and evaluate their datasets with the provided nll (Negative Log Likelihood) metric.
Setup#
Fork the main repo, clone it, create a new branch named after your task, and install the package (an optional sanity check follows the commands below):
# After forking...
git clone https://github.com/<YOUR-USERNAME>/LMFlow.git
cd LMFlow
git checkout -b <TASK-NAME>
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
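As an optional sanity check (assuming the package installs under the import name lmflow), you can verify the editable install from Python:
# Optional sanity check (a minimal sketch): the editable install should make
# the lmflow package importable from inside the conda environment.
import lmflow
print(lmflow.__file__)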
Create Your Task Dataset File#
We provide several datasets under data after running
cd data && ./download.sh && cd -
You can refer to the provided evaluation dataset files when creating your own. You may also refer to our guide on DATASET.
In this step, you will need to decide your answer type, either text2text or text_only (note that the current nll implementation only supports these two answer types). We will refer to the chosen answer type as <ANSWER_TYPE>.
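For illustration, here is a minimal sketch of a text2text dataset file written in LMFlow's JSON format; the filename and question/answer content are hypothetical, and the DATASET guide remains the authoritative reference:
import json

# Minimal sketch of a text2text evaluation dataset in LMFlow's JSON format
# (hypothetical content and filename). A text_only dataset would instead use
# {"type": "text_only", "instances": [{"text": "..."}]}.
dataset = {
    "type": "text2text",  # the <ANSWER_TYPE> of this dataset
    "instances": [
        {"input": "Question: What is 2 + 2?\nAnswer:", "output": "4"},
        {"input": "Question: What is the capital of France?\nAnswer:", "output": "Paris"},
    ],
}

with open("my_eval.json", "w") as f:
    json.dump(dataset, f, indent=2)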
After preparing your own DATASET file, put it under the data directory by creating a TASK directory and moving the dataset into it (run the following from inside the data directory):
mkdir <TASK>
mv <DATASET> <TASK>
Task Registration#
Note the path of your dataset, data/<TASK>/<DATASET>.
Open the file examples/benchmarking.py and add your task’s info into LOCAL_DATSET_GROUP_MAP, LOCAL_DATSET_MAP, and LOCAL_DATSET_ANSWERTYPE_MAP.
In LOCAL_DATSET_MAP, you will need to specify your DATASET file’s path:
LOCAL_DATSET_MAP ={
"...":"...",
"<TASK>":"data/<TASK>/<DATASET>",
}
In LOCAL_DATSET_ANSWERTYPE_MAP, you will need to specify your task’s <ANSWER_TYPE>:
LOCAL_DATSET_ANSWERTYPE_MAP ={
"...":"...",
"<TASK>":"<ANSWER_TYPE>,
}
If you only have one task, you can add a key-value pair like "<TASK>":"<TASK>" in LOCAL_DATSET_GROUP_MAP:
LOCAL_DATSET_GROUP_MAP ={
"...":"...",
"<TASK>":"<TASK>",
}
If you want to combine several tasks, first specify a combination name <TASK_COMBINATION> and add a key-value pair like "<TASK_COMBINATION>":"<TASK_1>,<TASK_2>,.." in LOCAL_DATSET_GROUP_MAP.
Remember to separate the tasks with commas (,); a filled-in example of all three maps is shown after the snippet below:
LOCAL_DATSET_GROUP_MAP ={
"...":"...",
"<TASK_COMBINATION>":"<TASK_1>,<TASK_2>,..",
}
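Putting the three maps together, a hypothetical task named my_task with dataset data/my_task/my_eval.json and answer type text2text would be registered roughly as follows (all names are illustrative):
# Illustrative registration of a hypothetical task "my_task" (and a combination
# "my_combo"); the "..." entries stand for whatever is already in the maps.
LOCAL_DATSET_GROUP_MAP = {
    "...": "...",
    "my_task": "my_task",
    "my_combo": "my_task,another_task",
}
LOCAL_DATSET_MAP = {
    "...": "...",
    "my_task": "data/my_task/my_eval.json",
}
LOCAL_DATSET_ANSWERTYPE_MAP = {
    "...": "...",
    "my_task": "text2text",
}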
After finishing these changes, you can run your own <TASK> like:
deepspeed examples/benchmarking.py \
--answer_type <ANSWER_TYPE> \
--use_ram_optimized_load False \
--model_name_or_path ${model_name} \
--dataset_name data/<TASK>/<DATASET> \
--deepspeed examples/ds_config.json \
--metric nll \
--prompt_structure "###Human: {input}###Assistant:" \
| tee ${log_dir}/train.log \
2> ${log_dir}/train.err
2. LM-Evaluation Task Setting#
We integrate EleutherAI/lm-evaluation-harness into benchmarking.py by directly executing its evaluation commands. Users can also run their own evaluation by simply changing two items in LM_EVAL_DATASET_MAP of examples/benchmarking.py.
Please refer to Eleuther’s task-table to get the exact <TASK> name.
Similarly, you can combine several tasks: first specify a combination name <TASK_COMBINATION> and add a key-value pair like "<TASK_COMBINATION>":"<TASK_1>,<TASK_2>,.." in LM_EVAL_DATASET_MAP.
Again, remember to separate the tasks with commas (,); a filled-in example follows the snippet below:
LM_EVAL_DATASET_MAP ={
"...":"...",
"<TASK_COMBINATION>":"<TASK_1>,<TASK_2>,..",
}
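For example, a hypothetical combination labeled commonsense_suite covering two harness tasks could be added like this (task names must match Eleuther's task table exactly):
# Hypothetical combination entry: "commonsense_suite" is an arbitrary label,
# while "hellaswag" and "arc_easy" are task names from Eleuther's task table.
LM_EVAL_DATASET_MAP = {
    "...": "...",
    "commonsense_suite": "hellaswag,arc_easy",
}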