# LMFlow Benchmark Guide

We support two ways to add evaluation settings in our repo, the `NLL Task Setting` and the `LM-Evaluation Task Setting`. The details of each are below:

# 1. NLL Task Setting

Users can easily create new tasks and evaluate their datasets with the provided `nll (Negative Log Likelihood)` metric.

## Setup

Fork the main repo, clone it, create a new branch named after your task, and install the package:

```bash
# After forking...
git clone https://github.com/<your-username>/LMFlow.git
cd LMFlow
git checkout -b <TASK>
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
pip install -e .
```

## Create Your Task Dataset File

We provide several available datasets under `data` after running

```sh
cd data && ./download.sh && cd -
```

You can refer to the provided evaluation dataset files and create your own. You may also refer to our guide on [DATASETS](https://optimalscale.github.io/LMFlow/examples/DATASETS.html). In this step, you will need to decide your answer type, such as `text2text` or `text_only` (note that the current `nll` implementation only supports these two answer types). We will refer to the chosen answer type as `<ANSWERTYPE>`; a worked sketch of a dataset file appears at the end of this guide.

After preparing your own `<DATASET>` file, put it under the `data` dir inside a new `<TASK>` dir:

```bash
mkdir data/<TASK>
mv <DATASET> data/<TASK>
```

## Task Registration

Note the path of your dataset, `data/<TASK>/<DATASET>`. Open the file `examples/benchmarking.py` and add your task's info into `LOCAL_DATSET_GROUP_MAP`, `LOCAL_DATSET_MAP`, and `LOCAL_DATSET_ANSWERTYPE_MAP` (a filled-in sketch of all three maps appears at the end of this guide).

In `LOCAL_DATSET_MAP`, you will need to specify your `<DATASET>` file's path:

```python
LOCAL_DATSET_MAP = {
    "...": "...",
    "<TASK>": "data/<TASK>/<DATASET>",
}
```

In `LOCAL_DATSET_ANSWERTYPE_MAP`, you will need to specify your task's `<ANSWERTYPE>`:

```python
LOCAL_DATSET_ANSWERTYPE_MAP = {
    "...": "...",
    "<TASK>": "<ANSWERTYPE>",
}
```

If you only have one task, add a key-value pair like `"<TASK>":"<TASK>"` in `LOCAL_DATSET_GROUP_MAP`:

```python
LOCAL_DATSET_GROUP_MAP = {
    "...": "...",
    "<TASK>": "<TASK>",
}
```

If you want to combine several tasks, first specify a combination name `<TASK_GROUP>` and add a key-value pair like `"<TASK_GROUP>":"<TASK-1>,<TASK-2>,..."` in `LOCAL_DATSET_GROUP_MAP`. Remember to separate the tasks with `,`:

```python
LOCAL_DATSET_GROUP_MAP = {
    "...": "...",
    "<TASK_GROUP>": "<TASK-1>,<TASK-2>,...",
}
```

After changing these items, you can run your own `<TASK>` like:

```bash
deepspeed examples/benchmarking.py \
    --answer_type <ANSWERTYPE> \
    --use_ram_optimized_load False \
    --model_name_or_path ${model_name} \
    --dataset_name data/<TASK>/<DATASET> \
    --deepspeed examples/ds_config.json \
    --metric nll \
    --prompt_structure "###Human: {input}###Assistant:" \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
```

# 2. LM-Evaluation Task Setting

We integrate [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) into `benchmarking.py` by directly executing its evaluate commands. Users can also run their own evaluation by simply changing two items in `LM_EVAL_DATASET_MAP` of `examples/benchmarking.py`. Please refer to Eleuther's [task-table](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) to get the exact `<LM_TASK>` name.

Similarly, you can combine several tasks: first specify a combination name `<LM_TASK_GROUP>` and add a key-value pair like `"<LM_TASK_GROUP>":"<LM_TASK-1>,<LM_TASK-2>,..."` in `LM_EVAL_DATASET_MAP`. Again, remember to separate the tasks with `,`:

```python
LM_EVAL_DATASET_MAP = {
    "...": "...",
    "<LM_TASK_GROUP>": "<LM_TASK-1>,<LM_TASK-2>,...",
}
```
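Once a task or task group is registered in `LM_EVAL_DATASET_MAP`, it runs through the same entry point as the NLL setting. The sketch below is an assumption based on the NLL command above, not a confirmed invocation; the exact flags for lm-evaluation tasks may differ, so check `examples/benchmarking.py` for what your version expects:

```bash
# Hypothetical invocation -- mirrors the NLL example above; verify the
# exact flags against examples/benchmarking.py before running.
deepspeed examples/benchmarking.py \
    --model_name_or_path ${model_name} \
    --dataset_name <LM_TASK_GROUP> \
    --deepspeed examples/ds_config.json \
    | tee ${log_dir}/eval.log \
    2> ${log_dir}/eval.err
```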
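As promised above, here is a worked sketch of a minimal NLL dataset file. All names are hypothetical: the task is `my_task`, the file is `data/my_task/my_task.json`, and the answer type is `text2text`. The field layout follows the LMFlow dataset format described in the [DATASETS](https://optimalscale.github.io/LMFlow/examples/DATASETS.html) guide:

```json
{
  "type": "text2text",
  "instances": [
    {
      "input": "Question: What is the capital of France?\nAnswer:",
      "output": "Paris"
    },
    {
      "input": "Question: Who wrote Hamlet?\nAnswer:",
      "output": "William Shakespeare"
    }
  ]
}
```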
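The corresponding registration in `examples/benchmarking.py` for that hypothetical `my_task` would then look like this (again, every name here is made up for illustration; substitute your own):

```python
# Hypothetical entries for a task named "my_task"; replace with your own names.
LOCAL_DATSET_MAP = {
    "...": "...",
    "my_task": "data/my_task/my_task.json",
}

LOCAL_DATSET_ANSWERTYPE_MAP = {
    "...": "...",
    "my_task": "text2text",
}

LOCAL_DATSET_GROUP_MAP = {
    "...": "...",
    "my_task": "my_task",
}
```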