lmflow.args#

This module defines dataclasses, such as ModelArguments and DatasetArguments, that contain the arguments for the model and dataset used in training.

It imports several names, including dataclass and field from dataclasses, Optional from typing, require_version from transformers.utils.versions, and MODEL_FOR_CAUSAL_LM_MAPPING and TrainingArguments from transformers.

MODEL_CONFIG_CLASSES is assigned a list of the model config classes from MODEL_FOR_CAUSAL_LM_MAPPING. MODEL_TYPES is assigned a tuple of the model types extracted from the MODEL_CONFIG_CLASSES.
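
In practice these dataclasses are consumed through transformers.HfArgumentParser. The following is a minimal sketch of that pattern; the checkpoint name and dataset path are hypothetical placeholders:

    from transformers import HfArgumentParser

    from lmflow.args import DatasetArguments, ModelArguments

    # Parse the model and dataset argument groups from an explicit argument list;
    # in a training script the list would normally come from the command line.
    parser = HfArgumentParser((ModelArguments, DatasetArguments))
    model_args, data_args = parser.parse_args_into_dataclasses(
        args=[
            "--model_name_or_path", "gpt2",         # hypothetical checkpoint
            "--dataset_path", "data/alpaca/train",  # hypothetical dataset directory
        ]
    )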

Attributes#

MODEL_CONFIG_CLASSES

MODEL_TYPES

logger

PIPELINE_ARGUMENT_MAPPING

Classes#

OptimizerNames

Names of the customized optimizers supported by this module.

ModelArguments

Define a class ModelArguments using the dataclass decorator.

VisModelArguments

Define a class VisModelArguments using the dataclass decorator.

DatasetArguments

Define a class DatasetArguments using the dataclass decorator.

MultiModalDatasetArguments

Define a class MultiModalDatasetArguments using the dataclass decorator.

FinetunerArguments

Adapt transformers.TrainingArguments

RewardModelTunerArguments

Arguments for reward modeling.

EvaluatorArguments

Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an evaluator.

InferencerArguments

Define a class InferencerArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an inferencer.

RaftAlignerArguments

Define a class RaftAlignerArguments to configure the RAFT aligner.

BenchmarkingArguments

DPOAlignerArguments

The arguments for the DPO training script.

DPOv2AlignerArguments

The arguments for the DPOv2 training script.

IterativeAlignerArguments

Arguments for iterative aligners.

IterativeDPOAlignerArguments

Arguments for iterative DPO aligners.

AutoArguments

Automatically choose the pipeline argument class (e.g., FinetunerArguments or EvaluatorArguments) according to the pipeline name.

Module Contents#

lmflow.args.MODEL_CONFIG_CLASSES[source]#
lmflow.args.MODEL_TYPES[source]#
lmflow.args.logger[source]#
class lmflow.args.OptimizerNames[source]#
DUMMY = 'dummy'[source]#
ADABELIEF = 'adabelief'[source]#
ADABOUND = 'adabound'[source]#
LARS = 'lars'[source]#
LAMB = 'lamb'[source]#
ADAMAX = 'adamax'[source]#
NADAM = 'nadam'[source]#
RADAM = 'radam'[source]#
ADAMP = 'adamp'[source]#
SGDP = 'sgdp'[source]#
YOGI = 'yogi'[source]#
SOPHIA = 'sophia'[source]#
ADAN = 'adan'[source]#
ADAM = 'adam'[source]#
NOVOGRAD = 'novograd'[source]#
ADADELTA = 'adadelta'[source]#
ADAGRAD = 'adagrad'[source]#
ADAMW_SCHEDULE_FREE = 'adamw_schedule_free'[source]#
SGD_SCHEDULE_FREE = 'sgd_schedule_free'[source]#
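
The members are plain string constants. The sketch below shows how such a name might be selected; the link to FinetunerArguments.customized_optim is an assumption about how the finetuning pipeline consumes these values:

    from lmflow.args import OptimizerNames

    # Each member is just a string, e.g. 'adan' or 'adamw_schedule_free'.
    assert OptimizerNames.ADAN == "adan"
    assert OptimizerNames.ADAMW_SCHEDULE_FREE == "adamw_schedule_free"

    # Assumption: such a value would typically be passed as `customized_optim`
    # (together with `use_customized_optim=True`) in FinetunerArguments.
    chosen_optimizer = OptimizerNames.SOPHIA
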
class lmflow.args.ModelArguments[source]#

Define a class ModelArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a model.

model_name_or_path : str

a string representing the path or name of a pretrained model checkpoint for weights initialization. If None, a model will be trained from scratch.

model_type : str

a string representing the type of model to use if training from scratch. If not provided, a pretrained model will be used.

config_overrides : str

a string representing the default config settings to override when training a model from scratch.

config_name : str

a string representing the name or path of the pretrained config to use, if different from the model_name_or_path.

tokenizer_name : str

a string representing the name or path of the pretrained tokenizer to use, if different from the model_name_or_path.

cache_dir : str

a string representing the path to the directory where pretrained models downloaded from huggingface.co will be stored.

use_fast_tokenizer : bool

a boolean indicating whether to use a fast tokenizer (backed by the tokenizers library) or not.

model_revision : str

a string representing the specific model version to use (can be a branch name, tag name, or commit id).

token : Optional[str]

Necessary when accessing a private model/dataset.

torch_dtype : str

a string representing the dtype to load the model under. If auto is passed, the dtype will be automatically derived from the model’s weights.

use_ram_optimized_load : bool

a boolean indicating whether to use disk mapping when memory is not enough.

use_int8 : bool

a boolean indicating whether to load int8 quantization for inference.

load_in_4bit : bool

whether to load the model in 4 bit.

model_max_length : int

The maximum length of the model.

truncation_side : str

The side on which the model should have truncation applied.

arch_type : str

Model architecture type.

padding_side : str

The side on which the tokenizer should have padding applied.

eos_padding : bool

whether to pad with eos token instead of pad token.

ignore_bias_buffers : bool

fix for DDP issues with LM bias/mask buffers: invalid scalar type, inplace operation.

model_name_or_path: str | None = None[source]#
lora_model_path: str | None = None[source]#
model_type: str | None = None[source]#
config_overrides: str | None = None[source]#
arch_type: str | None = 'decoder_only'[source]#
config_name: str | None = None[source]#
tokenizer_name: str | None = None[source]#
cache_dir: str | None = None[source]#
use_fast_tokenizer: bool = True[source]#
model_revision: str = 'main'[source]#
token: str | None = None[source]#
trust_remote_code: bool = False[source]#
torch_dtype: str | None = None[source]#
use_dora: bool = False[source]#
use_lora: bool = False[source]#
use_qlora: bool = False[source]#
bits: int = 4[source]#
quant_type: str = 'nf4'[source]#
double_quant: bool = True[source]#
lora_r: int = 8[source]#
lora_alpha: int = 32[source]#
lora_target_modules: List[str] = None[source]#
lora_dropout: float = 0.1[source]#
save_aggregated_lora: bool = False[source]#
use_ram_optimized_load: bool = True[source]#
use_flash_attention: bool = False[source]#
truncate_to_model_max_length: bool = True[source]#
do_rope_scaling: bool = False[source]#
rope_pi_ratio: int = 1[source]#
rope_ntk_ratio: int = 1[source]#
use_int8: bool = False[source]#
load_in_4bit: bool | None = True[source]#
model_max_length: int | None = None[source]#
truncation_side: str = None[source]#
padding_side: str = 'right'[source]#
eos_padding: bool | None = False[source]#
ignore_bias_buffers: bool | None = False[source]#
__post_init__()[source]#
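
For illustration, a ModelArguments instance for LoRA-style tuning might be constructed as in the following sketch; the checkpoint name and target modules are hypothetical, not recommended defaults:

    from lmflow.args import ModelArguments

    # Hypothetical LoRA configuration for a decoder-only model.
    model_args = ModelArguments(
        model_name_or_path="gpt2",    # hypothetical checkpoint
        torch_dtype="bfloat16",
        use_lora=True,
        lora_r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        lora_target_modules=["q_proj", "v_proj"],  # hypothetical target modules
    )
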
class lmflow.args.VisModelArguments[source]#

Bases: ModelArguments

Define a class VisModelArguments using the dataclass decorator, extending ModelArguments with vision-related parameters. The inherited parameters are documented under ModelArguments above.

low_resource: bool | None = False[source]#
custom_model: bool = False[source]#
pretrained_language_projection_path: str = None[source]#
custom_vision_model: bool = False[source]#
image_encoder_name_or_path: str | None = None[source]#
qformer_name_or_path: str | None = None[source]#
llm_model_name_or_path: str | None = None[source]#
use_prompt_cache: bool = False[source]#
prompt_cache_path: str | None = None[source]#
llava_loading: bool | None = False[source]#
with_qformer: bool | None = False[source]#
vision_select_layer: int | None = -2[source]#
llava_pretrain_model_path: str | None = None[source]#
save_pretrain_model_path: str | None = None[source]#
class lmflow.args.DatasetArguments[source]#

Define a class DatasetArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a dataset for a language model.

dataset_path : str

a string representing the path of the dataset to use.

dataset_name : str

a string representing the name of the dataset to use. The default value is “customized”.

is_custom_dataset : bool

a boolean indicating whether to use custom data. The default value is False.

customized_cache_dir : str

a string representing the path to the directory where customized dataset caches will be stored.

dataset_config_name : str

a string representing the configuration name of the dataset to use (via the datasets library).

train_file : str

a string representing the path to the input training data file (a text file).

validation_file : str

a string representing the path to the input evaluation data file to evaluate the perplexity on (a text file).

max_train_samples : int

an integer indicating the maximum number of training examples to use for debugging or quicker training. If set, the training dataset will be truncated to this number.

max_eval_samples : int

an integer indicating the maximum number of evaluation examples to use for debugging or quicker training. If set, the evaluation dataset will be truncated to this number.

streaming : bool

a boolean indicating whether to enable streaming mode.

block_size : int

an integer indicating the optional input sequence length after tokenization. The training dataset will be truncated in blocks of this size for training.

train_on_prompt : bool

a boolean indicating whether to train on prompt for conversation datasets such as ShareGPT.

conversation_template : str

a string representing the template for conversation datasets.

The class also includes some additional parameters that can be used to configure the dataset further, such as overwrite_cache, validation_split_percentage, preprocessing_num_workers, group_texts_batch_size, disable_group_texts, keep_linebreaks, and test_file.

The field function is used to set default values and provide help messages for each parameter. The Optional type hint is used to indicate that a parameter is optional. The metadata argument is used to provide additional information about each parameter, such as a help message.

dataset_path: str | None = None[source]#
dataset_name: str | None = 'customized'[source]#
is_custom_dataset: bool | None = False[source]#
customized_cache_dir: str | None = '.cache/llm-ft/datasets'[source]#
dataset_config_name: str | None = None[source]#
train_file: str | None = None[source]#
validation_file: str | None = None[source]#
max_train_samples: int | None = None[source]#
max_eval_samples: int | None = 10000000000.0[source]#
streaming: bool = False[source]#
block_size: int | None = None[source]#
overwrite_cache: bool = False[source]#
validation_split_percentage: int | None = 5[source]#
preprocessing_num_workers: int | None = None[source]#
group_texts_batch_size: int = 1000[source]#
disable_group_texts: bool = True[source]#
keep_linebreaks: bool = True[source]#
test_file: str | None = None[source]#
train_on_prompt: bool = False[source]#
conversation_template: str | None = None[source]#
__post_init__()[source]#
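
For illustration, a DatasetArguments instance for a conversation dataset might look like the following sketch; the dataset path and template name are hypothetical:

    from lmflow.args import DatasetArguments

    # Hypothetical conversation-dataset configuration.
    data_args = DatasetArguments(
        dataset_path="data/alpaca/train_conversation",  # hypothetical directory
        block_size=512,
        conversation_template="llama2",                 # hypothetical template name
        train_on_prompt=False,
    )
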
class lmflow.args.MultiModalDatasetArguments[source]#

Bases: DatasetArguments

Define a class MultiModalDatasetArguments using the dataclass decorator, extending DatasetArguments with multimodal-specific parameters. The inherited parameters are documented under DatasetArguments above.

image_folder: str | None = None[source]#
image_aspect_ratio: str | None = 'pad'[source]#
is_multimodal: bool | None = True[source]#
use_image_start_end: bool | None = True[source]#
sep_style: str | None = 'plain'[source]#
class lmflow.args.FinetunerArguments[source]#

Bases: transformers.TrainingArguments

Adapt transformers.TrainingArguments

eval_dataset_path: str | None = None[source]#
remove_unused_columns: bool | None = False[source]#
finetune_part: str | None = 'language_projection'[source]#
save_language_projection: str | None = False[source]#
use_lisa: bool = False[source]#
lisa_activated_layers: int = 2[source]#
lisa_interval_steps: int = 20[source]#
lisa_layers_attribute: str = 'model.model.layers'[source]#
use_customized_optim: bool = False[source]#
customized_optim: str = 'sign_sgd'[source]#
customized_optim_args: str = None[source]#
optim_dummy_beta1: float = 0.9[source]#
optim_dummy_beta2: float = 0.999[source]#
optim_adam_beta1: float = 0.9[source]#
optim_adam_beta2: float = 0.999[source]#
optim_beta1: float = 0.9[source]#
optim_beta2: float = 0.999[source]#
optim_beta3: float = 0.9[source]#
optim_momentum: float = 0.999[source]#
optim_weight_decay: float = 0[source]#
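
As a sketch, a LISA-style configuration could be expressed as follows; since FinetunerArguments extends transformers.TrainingArguments, output_dir and the usual training hyperparameters are also available. The values below are purely illustrative:

    from lmflow.args import FinetunerArguments

    # Hypothetical LISA finetuning configuration (illustrative values only).
    finetuner_args = FinetunerArguments(
        output_dir="output_models/lisa_finetune",   # TrainingArguments field
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=2e-5,
        use_lisa=True,
        lisa_activated_layers=2,
        lisa_interval_steps=20,
        lisa_layers_attribute="model.model.layers",
    )
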
class lmflow.args.RewardModelTunerArguments[source]#

Bases: FinetunerArguments

Arguments for reward modeling.

class lmflow.args.EvaluatorArguments[source]#

Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an evaluator.

local_rank : int

For distributed training: local_rank

random_shuffle : bool

use_wandb : bool

random_seed : int, default = 1

output_dir : str, default = './output_dir'

mixed_precision : str, choice from ["bf16", "fp16"]

mixed precision mode, whether to use bf16 or fp16

deepspeed :

Enable deepspeed and pass the path to the deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict

temperature : float

An argument of model.generate in huggingface to control the diversity of generation.

repetition_penalty : float

An argument of model.generate in huggingface to penalize repetitions.

local_rank: int = -1[source]#
random_shuffle: bool | None = False[source]#
use_wandb: bool | None = False[source]#
random_seed: int | None = 1[source]#
output_dir: str | None = './output_dir'[source]#
mixed_precision: str | None = 'bf16'[source]#
deepspeed: str | None = None[source]#
answer_type: str | None = 'text'[source]#
prompt_structure: str | None = '{input}'[source]#
evaluate_block_size: int | None = 512[source]#
metric: str | None = 'accuracy'[source]#
inference_batch_size_per_device: int | None = 1[source]#
use_accelerator_for_evaluator: bool = False[source]#
temperature: float = 0[source]#
repetition_penalty: float = 1[source]#
max_new_tokens: int = 100[source]#
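
A minimal sketch of an evaluation configuration (values are illustrative):

    from lmflow.args import EvaluatorArguments

    # Hypothetical evaluation settings for an accuracy-style benchmark.
    evaluator_args = EvaluatorArguments(
        output_dir="output_dir/eval",
        answer_type="text",
        metric="accuracy",
        prompt_structure="{input}",
        evaluate_block_size=512,
        inference_batch_size_per_device=1,
        use_wandb=False,
    )
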
class lmflow.args.InferencerArguments[source]#

Define a class InferencerArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an inferencer.

local_rank : int

For distributed training: local_rank

random_seed : int, default = 1

inference_batch_size : int, default = 1

deepspeed :

Enable deepspeed and pass the path to the deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict

mixed_precision : str, choice from ["bf16", "fp16"]

mixed precision mode, whether to use bf16 or fp16

temperature : float

An argument of model.generate in huggingface to control the diversity of generation.

repetition_penalty : float

An argument of model.generate in huggingface to penalize repetitions.

use_beam_search : Optional[bool]

Whether to use beam search during inference. By default False.

num_output_sequences : Optional[int]

Number of output sequences to return for the given prompt, currently only used in vllm inference. By default 8.

top_p : Optional[float]

top_p for sampling. By default 1.0.

top_k : Optional[int]

top_k for sampling. By default -1 (no top_k).

additional_stop_token_ids : Optional[List[int]]

The ids of the end-of-sentence tokens. By default [].

apply_chat_template : Optional[bool]

Whether to apply the chat template. By default True.

save_results : Optional[bool]

Whether to save inference results. By default False.

results_path : Optional[str]

The json file path of inference results. By default None.

enable_decode_inference_result : Optional[bool]

Whether to detokenize the inference results.

NOTE: For iterative align pipelines, whether to detokenize depends on the homogeneity of the policy model and the reward model (i.e., if they have the same tokenizer).

use_vllm : bool, optional

Whether to use VLLM for inference. By default False.

vllm_tensor_parallel_size : int, optional

The tensor parallel size for VLLM inference.

vllm_gpu_memory_utilization : float, optional

The GPU memory utilization for VLLM inference, i.e., the proportion of GPU memory (per GPU) to use for VLLM inference.

device: str = 'gpu'[source]#
local_rank: int = -1[source]#
inference_batch_size: int = 1[source]#
vllm_inference_batch_size: int = 1[source]#
temperature: float = 0.0[source]#
repetition_penalty: float = 1[source]#
max_new_tokens: int = 100[source]#
random_seed: int | None = 1[source]#
deepspeed: str | None = None[source]#
mixed_precision: str | None = 'bf16'[source]#
do_sample: bool | None = False[source]#
use_accelerator: bool = False[source]#
num_output_sequences: int | None = 8[source]#
top_p: float | None = 1.0[source]#
top_k: int | None = -1[source]#
additional_stop_token_ids: List[int] | None = [][source]#
apply_chat_template: bool | None = True[source]#
enable_decode_inference_result: bool | None = False[source]#
tensor_parallel_size: int | None = 1[source]#
enable_distributed_inference: bool | None = False[source]#
distributed_inference_num_instances: int | None = 1[source]#
use_vllm: bool = False[source]#
vllm_tensor_parallel_size: int | None = 1[source]#
vllm_gpu_memory_utilization: float | None = 0.95[source]#
save_results: bool | None = False[source]#
results_path: str | None = None[source]#
__post_init__()[source]#
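
A minimal sketch of a vLLM-backed inference configuration; the results path and sampling values are hypothetical:

    from lmflow.args import InferencerArguments

    # Hypothetical vLLM inference settings.
    inferencer_args = InferencerArguments(
        use_vllm=True,
        vllm_tensor_parallel_size=1,
        vllm_gpu_memory_utilization=0.95,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=256,
        apply_chat_template=True,
        save_results=True,
        results_path="output_dir/inference_results.json",  # hypothetical path
    )
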
class lmflow.args.RaftAlignerArguments[source]#

Bases: transformers.TrainingArguments

Define a class RaftAlignerArguments to configure the RAFT aligner.

output_reward_path: str | None = 'tmp/raft_aligner/'[source]#
output_min_length: int | None = 64[source]#
output_max_length: int | None = 128[source]#
num_raft_iteration: int | None = 20[source]#
raft_batch_size: int | None = 1024[source]#
top_reward_percentage: float | None = 0.2[source]#
inference_batch_size_per_device: int | None = 1[source]#
collection_strategy: str | None = 'top'[source]#
class lmflow.args.BenchmarkingArguments[source]#
dataset_name: str | None = None[source]#
lm_evaluation_metric: str | None = 'accuracy'[source]#
class lmflow.args.DPOAlignerArguments[source]#

The arguments for the DPO training script.

local_rank: int = -1[source]#
beta: float | None = 0.1[source]#
learning_rate: float | None = 0.0005[source]#
lr_scheduler_type: str | None = 'cosine'[source]#
warmup_steps: int | None = 100[source]#
weight_decay: float | None = 0.05[source]#
optimizer_type: str | None = 'paged_adamw_32bit'[source]#
per_device_train_batch_size: int | None = 4[source]#
per_device_eval_batch_size: int | None = 1[source]#
gradient_accumulation_steps: int | None = 4[source]#
gradient_checkpointing: bool | None = True[source]#
gradient_checkpointing_use_reentrant: bool | None = False[source]#
max_prompt_length: int | None = 512[source]#
max_length: int | None = 1024[source]#
max_steps: int | None = 1000[source]#
logging_steps: int | None = 10[source]#
save_steps: int | None = 100[source]#
eval_steps: int | None = 100[source]#
output_dir: str | None = './results'[source]#
log_freq: int | None = 1[source]#
sanity_check: bool | None = False[source]#
report_to: str | None = 'wandb'[source]#
seed: int | None = 0[source]#
run_name: str | None = 'dpo'[source]#
eval_dataset_path: str | None = None[source]#
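
A minimal sketch of a DPO alignment configuration (illustrative values only):

    from lmflow.args import DPOAlignerArguments

    # Hypothetical DPO training settings.
    dpo_args = DPOAlignerArguments(
        beta=0.1,
        learning_rate=5e-4,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        max_prompt_length=512,
        max_length=1024,
        max_steps=1000,
        output_dir="./results/dpo",
        run_name="dpo",
    )
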
class lmflow.args.DPOv2AlignerArguments[source]#

Bases: FinetunerArguments

The arguments for the DPOv2 training script.

random_seed: int | None = 42[source]#
accelerate_config_file: str | None = None[source]#
margin_scale: float | None = 1.0[source]#
sampling_paired_method: str | None = 'max_random'[source]#
length_penalty: float | None = 0[source]#
max_length: int | None = 2048[source]#
max_prompt_length: int | None = 1000[source]#
mask_prompt: bool | None = False[source]#
beta: float | None = 0.1[source]#
loss_type: str | None = 'sigmoid'[source]#
class lmflow.args.IterativeAlignerArguments[source]#

Bases: InferencerArguments

Arguments for iterative aligners.

dataset_path_list: List[str] = [][source]#
initial_iter_idx: int = 0[source]#
class lmflow.args.IterativeDPOAlignerArguments[source]#

Bases: IterativeAlignerArguments, DPOv2AlignerArguments

Arguments for iterative DPO aligners.

output_dir: str | None = './runs'[source]#
reward_model_inference_batch_size: int = 1[source]#
reward_model_inference_block_size: int = 2048[source]#
do_response_generation: bool = True[source]#
do_scoring: bool = True[source]#
do_dpo_align: bool = True[source]#
lmflow.args.PIPELINE_ARGUMENT_MAPPING[source]#
class lmflow.args.AutoArguments[source]#

Automatically choose the pipeline argument class (e.g., FinetunerArguments or EvaluatorArguments) according to the pipeline name.

get_pipeline_args_class()[source]#
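
A minimal sketch of the intended usage, assuming get_pipeline_args_class accepts a pipeline name that is a key of PIPELINE_ARGUMENT_MAPPING (e.g. "finetuner"):

    from transformers import HfArgumentParser

    from lmflow.args import AutoArguments

    # Assumption: "finetuner" is a valid pipeline name; the returned class is
    # then FinetunerArguments.
    PipelineArguments = AutoArguments.get_pipeline_args_class("finetuner")

    parser = HfArgumentParser(PipelineArguments)
    (pipeline_args,) = parser.parse_args_into_dataclasses(
        args=["--output_dir", "output_models/finetune"]  # hypothetical output directory
    )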