lmflow.args#

This script defines dataclasses: ModelArguments and DatasetArguments, that contain the arguments for the model and dataset used in training.

It imports several modules, including dataclasses, field from typing, Optional from typing, require_version from transformers.utils.versions, MODEL_FOR_CAUSAL_LM_MAPPING, and TrainingArguments from transformers.

MODEL_CONFIG_CLASSES is assigned a list of the model config classes from MODEL_FOR_CAUSAL_LM_MAPPING. MODEL_TYPES is assigned a tuple of the model types extracted from the MODEL_CONFIG_CLASSES.

Attributes#

`MODEL_CONFIG_CLASSES`
`MODEL_TYPES`
`logger`
`PIPELINE_ARGUMENT_MAPPING`

Classes#

`OptimizerNames`
`ModelArguments`	Define a class ModelArguments using the dataclass decorator.
`VisModelArguments`	Define a class ModelArguments using the dataclass decorator.
`DatasetArguments`	Define a class DatasetArguments using the dataclass decorator.
`MultiModalDatasetArguments`	Define a class DatasetArguments using the dataclass decorator.
`FinetunerArguments`	Adapt transformers.TrainingArguments
`RewardModelTunerArguments`	Arguments for reward modeling.
`EvaluatorArguments`	Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional
`InferencerArguments`	Define a class InferencerArguments using the dataclass decorator. The class contains several optional
`RaftAlignerArguments`	Define a class RaftAlignerArguments to configure raft aligner.
`BenchmarkingArguments`
`DPOAlignerArguments`	The arguments for the DPO training script.
`DPOv2AlignerArguments`	The arguments for the DPOv2 training script.
`IterativeAlignerArguments`	Arguments for iterative aligners.
`IterativeDPOAlignerArguments`	Arguments for iterative DPO aligners.
`AutoArguments`	Automatically choose arguments from FinetunerArguments or EvaluatorArguments.

Functions#

split_args(args)

Module Contents#

lmflow.args.MODEL_CONFIG_CLASSES[source]#

lmflow.args.MODEL_TYPES[source]#

lmflow.args.logger[source]#

class lmflow.args.OptimizerNames[source]#

DUMMY = 'dummy'[source]#

ADABELIEF = 'adabelief'[source]#

ADABOUND = 'adabound'[source]#

LARS = 'lars'[source]#

LAMB = 'lamb'[source]#

ADAMAX = 'adamax'[source]#

NADAM = 'nadam'[source]#

RADAM = 'radam'[source]#

ADAMP = 'adamp'[source]#

SGDP = 'sgdp'[source]#

YOGI = 'yogi'[source]#

SOPHIA = 'sophia'[source]#

ADAN = 'adan'[source]#

ADAM = 'adam'[source]#

NOVOGRAD = 'novograd'[source]#

ADADELTA = 'adadelta'[source]#

ADAGRAD = 'adagrad'[source]#

MUON = 'muon'[source]#

ADAMW_SCHEDULE_FREE = 'adamw_schedule_free'[source]#

SGD_SCHEDULE_FREE = 'sgd_schedule_free'[source]#

class lmflow.args.ModelArguments[source]#

Define a class ModelArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a model.

model_name_or_pathstr: a string representing the path or name of a pretrained model checkpoint for weights initialization. If None, a model will be trained from scratch.
model_typestr: a string representing the type of model to use if training from scratch. If not provided, a pretrained model will be used.
config_overridesstr: a string representing the default config settings to override when training a model from scratch.
config_namestr: a string representing the name or path of the pretrained config to use, if different from the model_name_or_path.
tokenizer_namestr: a string representing the name or path of the pretrained tokenizer to use, if different from the model_name_or_path.
cache_dirstr: a string representing the path to the directory where pretrained models downloaded from huggingface.co will be stored.
use_fast_tokenizerbool: a boolean indicating whether to use a fast tokenizer (backed by the tokenizers library) or not.
model_revisionstr: a string representing the specific model version to use (can be a branch name, tag name, or commit id).
tokenOptional[str]: Necessary when accessing a private model/dataset.
torch_dtypestr: a string representing the dtype to load the model under. If auto is passed, the dtype will be automatically derived from the model’s weights.
use_ram_optimized_loadbool: a boolean indicating whether to use disk mapping when memory is not enough.
use_int8bool: a boolean indicating whether to load int8 quantization for inference.
load_in_4bitbool: whether to load the model in 4bit
model_max_lengthint: The maximum length of the model.
truncation_sidestr: The side on which the model should have truncation applied.
arch_typestr: Model architecture type.
padding_sidestr: The side on which the tokenizer should have padding applied.
eos_paddingbool: whether to pad with eos token instead of pad token.
ignore_bias_buffersbool: fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation.

model_name_or_path: str | None = None[source]#

lora_model_path: str | None = None[source]#

model_type: str | None = None[source]#

config_overrides: str | None = None[source]#

arch_type: str | None = 'decoder_only'[source]#

config_name: str | None = None[source]#

tokenizer_name: str | None = None[source]#

cache_dir: str | None = None[source]#

use_fast_tokenizer: bool = True[source]#

model_revision: str = 'main'[source]#

token: str | None = None[source]#

trust_remote_code: bool = False[source]#

torch_dtype: str | None = None[source]#

use_dora: bool = False[source]#

use_lora: bool = False[source]#

use_qlora: bool = False[source]#

bits: int | None = None[source]#

quant_bit: int = 4[source]#

quant_type: str = 'nf4'[source]#

double_quant: bool = True[source]#

lora_r: int = 8[source]#

lora_alpha: int = 32[source]#

lora_target_modules: str = None[source]#

lora_dropout: float = 0.1[source]#

save_aggregated_lora: bool = False[source]#

use_ram_optimized_load: bool = True[source]#

use_flash_attention: bool = False[source]#

truncate_to_model_max_length: bool = True[source]#

do_rope_scaling: bool = False[source]#

rope_pi_ratio: int = 1[source]#

rope_ntk_ratio: int = 1[source]#

use_int8: bool = False[source]#

load_in_4bit: bool | None = True[source]#

model_max_length: int | None = None[source]#

truncation_side: str = None[source]#

padding_side: str = 'right'[source]#

eos_padding: bool | None = False[source]#

ignore_bias_buffers: bool | None = False[source]#

__post_init__()[source]#

class lmflow.args.VisModelArguments[source]#

Bases: ModelArguments

Define a class ModelArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a model.

model_name_or_pathstr: a string representing the path or name of a pretrained model checkpoint for weights initialization. If None, a model will be trained from scratch.
model_typestr: a string representing the type of model to use if training from scratch. If not provided, a pretrained model will be used.
config_overridesstr: a string representing the default config settings to override when training a model from scratch.
config_namestr: a string representing the name or path of the pretrained config to use, if different from the model_name_or_path.
tokenizer_namestr: a string representing the name or path of the pretrained tokenizer to use, if different from the model_name_or_path.
cache_dirstr: a string representing the path to the directory where pretrained models downloaded from huggingface.co will be stored.
use_fast_tokenizerbool: a boolean indicating whether to use a fast tokenizer (backed by the tokenizers library) or not.
model_revisionstr: a string representing the specific model version to use (can be a branch name, tag name, or commit id).
tokenOptional[str]: Necessary when accessing a private model/dataset.
torch_dtypestr: a string representing the dtype to load the model under. If auto is passed, the dtype will be automatically derived from the model’s weights.
use_ram_optimized_loadbool: a boolean indicating whether to use disk mapping when memory is not enough.
use_int8bool: a boolean indicating whether to load int8 quantization for inference.
load_in_4bitbool: whether to load the model in 4bit
model_max_lengthint: The maximum length of the model.
truncation_sidestr: The side on which the model should have truncation applied.
arch_typestr: Model architecture type.
padding_sidestr: The side on which the tokenizer should have padding applied.
eos_paddingbool: whether to pad with eos token instead of pad token.
ignore_bias_buffersbool: fix for DDP issues with LM bias/mask buffers - invalid scalar type,`inplace operation.

low_resource: bool | None = False[source]#

custom_model: bool = False[source]#

pretrained_language_projection_path: str = None[source]#

custom_vision_model: bool = False[source]#

image_encoder_name_or_path: str | None = None[source]#

qformer_name_or_path: str | None = None[source]#

llm_model_name_or_path: str | None = None[source]#

use_prompt_cache: bool = False[source]#

prompt_cache_path: str | None = None[source]#

llava_loading: bool | None = False[source]#

with_qformer: bool | None = False[source]#

vision_select_layer: int | None = -2[source]#

llava_pretrain_model_path: str | None = None[source]#

save_pretrain_model_path: str | None = None[source]#

class lmflow.args.DatasetArguments[source]#

Define a class DatasetArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a dataset for a language model.

dataset_pathstr: a string representing the path of the dataset to use.
dataset_namestr: a string representing the name of the dataset to use. The default value is “customized”.
is_custom_datasetbool: a boolean indicating whether to use custom data. The default value is False.
customized_cache_dirstr: a string representing the path to the directory where customized dataset caches will be stored.
dataset_config_namestr: a string representing the configuration name of the dataset to use (via the datasets library).
train_filestr: a string representing the path to the input training data file (a text file).
validation_filestr: a string representing the path to the input evaluation data file to evaluate the perplexity on (a text file).
max_train_samplesint: an integer indicating the maximum number of training examples to use for debugging or quicker training. If set, the training dataset will be truncated to this number.
max_eval_samples: int: an integer indicating the maximum number of evaluation examples to use for debugging or quicker training. If set, the evaluation dataset will be truncated to this number.
streamingbool: a boolean indicating whether to enable streaming mode.
block_size: int: an integer indicating the optional input sequence length after tokenization. The training dataset will be truncated in blocks of this size for training.
train_on_prompt: bool: a boolean indicating whether to train on prompt for conversation datasets such as ShareGPT.
conversation_template: str: a string representing the template for conversation datasets.
dataset_cache_dir: str: a string representing the path to the dataset cache directory. Useful when the default cache dir (~/.cache/huggingface/datasets) has limited space.

The class also includes some additional parameters that can be used to configure the dataset further, such as overwrite_cache, validation_split_percentage, preprocessing_num_workers, disable_group_texts, demo_example_in_prompt, explanation_in_prompt, keep_linebreaks, and prompt_structure.

The field function is used to set default values and provide help messages for each parameter. The Optional type hint is used to indicate that a parameter is optional. The metadata argument is used to provide additional information about each parameter, such as a help message.

dataset_path: str | None = None[source]#

dataset_name: str | None = 'customized'[source]#

is_custom_dataset: bool | None = False[source]#

customized_cache_dir: str | None = '.cache/llm-ft/datasets'[source]#

dataset_config_name: str | None = None[source]#

train_file: str | None = None[source]#

validation_file: str | None = None[source]#

max_train_samples: int | None = None[source]#

max_eval_samples: int | None = 10000000000.0[source]#

streaming: bool = False[source]#

block_size: int | None = None[source]#

overwrite_cache: bool = False[source]#

validation_split_percentage: int | None = 5[source]#

preprocessing_num_workers: int | None = None[source]#

group_texts_batch_size: int = 1000[source]#

disable_group_texts: bool = True[source]#

keep_linebreaks: bool = True[source]#

test_file: str | None = None[source]#

train_on_prompt: bool = False[source]#

conversation_template: str | None = None[source]#

dataset_cache_dir: str | None = None[source]#

calculate_dataset_stats: bool = False[source]#

__post_init__()[source]#

class lmflow.args.MultiModalDatasetArguments[source]#

Bases: DatasetArguments

Define a class DatasetArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a dataset for a language model.

dataset_pathstr: a string representing the path of the dataset to use.
dataset_namestr: a string representing the name of the dataset to use. The default value is “customized”.
is_custom_datasetbool: a boolean indicating whether to use custom data. The default value is False.
customized_cache_dirstr: a string representing the path to the directory where customized dataset caches will be stored.
dataset_config_namestr: a string representing the configuration name of the dataset to use (via the datasets library).
train_filestr: a string representing the path to the input training data file (a text file).
validation_filestr: a string representing the path to the input evaluation data file to evaluate the perplexity on (a text file).
max_train_samplesint: an integer indicating the maximum number of training examples to use for debugging or quicker training. If set, the training dataset will be truncated to this number.
max_eval_samples: int: an integer indicating the maximum number of evaluation examples to use for debugging or quicker training. If set, the evaluation dataset will be truncated to this number.
streamingbool: a boolean indicating whether to enable streaming mode.
block_size: int: an integer indicating the optional input sequence length after tokenization. The training dataset will be truncated in blocks of this size for training.
train_on_prompt: bool: a boolean indicating whether to train on prompt for conversation datasets such as ShareGPT.
conversation_template: str: a string representing the template for conversation datasets.
dataset_cache_dir: str: a string representing the path to the dataset cache directory. Useful when the default cache dir (~/.cache/huggingface/datasets) has limited space.

The class also includes some additional parameters that can be used to configure the dataset further, such as overwrite_cache, validation_split_percentage, preprocessing_num_workers, disable_group_texts, demo_example_in_prompt, explanation_in_prompt, keep_linebreaks, and prompt_structure.

The field function is used to set default values and provide help messages for each parameter. The Optional type hint is used to indicate that a parameter is optional. The metadata argument is used to provide additional information about each parameter, such as a help message.

image_folder: str | None = None[source]#

image_aspect_ratio: str | None = 'pad'[source]#

is_multimodal: bool | None = True[source]#

use_image_start_end: bool | None = True[source]#

sep_style: str | None = 'plain'[source]#

class lmflow.args.FinetunerArguments[source]#

Bases: transformers.TrainingArguments

Adapt transformers.TrainingArguments

eval_dataset_path: str | None = None[source]#

remove_unused_columns: bool | None = False[source]#

finetune_part: str | None = 'language_projection'[source]#

save_language_projection: str | None = False[source]#

use_lisa: bool = False[source]#

lisa_activated_layers: int = 2[source]#

lisa_interval_steps: int = 20[source]#

lisa_layers_attribute: str = 'model.model.layers'[source]#

use_customized_optim: bool = False[source]#

customized_optim: str = 'sign_sgd'[source]#

customized_optim_args: str = None[source]#

optim_dummy_beta1: float = 0.9[source]#

optim_dummy_beta2: float = 0.999[source]#

optim_adam_beta1: float = 0.9[source]#

optim_adam_beta2: float = 0.999[source]#

optim_beta1: float = 0.9[source]#

optim_beta2: float = 0.999[source]#

optim_beta3: float = 0.9[source]#

optim_momentum: float = 0.999[source]#

optim_weight_decay: float = 0[source]#

class lmflow.args.RewardModelTunerArguments[source]#

Bases: FinetunerArguments

Arguments for reward modeling.

class lmflow.args.EvaluatorArguments[source]#

Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a evaluator.

local_rankstr: For distributed training: local_rank

random_shuffle : bool

use_wandb : bool

random_seed : int, default = 1

output_dir : str, default = ‘./output_dir’,

mixed_precisionstr, choice from [“bf16”,”fp16”].: mixed precision mode, whether to use bf16 or fp16
deepspeed :: Enable deepspeed and pass the path to deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict
temperaturefloat: An argument of model.generate in huggingface to control the diversity of generation.
repetition_penaltyfloat: An argument of model.generate in huggingface to penalize repetitions.

local_rank: int = -1[source]#

random_shuffle: bool | None = False[source]#

use_wandb: bool | None = False[source]#

random_seed: int | None = 1[source]#

output_dir: str | None = './output_dir'[source]#

mixed_precision: str | None = 'bf16'[source]#

deepspeed: str | None = None[source]#

answer_type: str | None = 'text'[source]#

prompt_structure: str | None = '{input}'[source]#

evaluate_block_size: int | None = 512[source]#

metric: str | None = 'accuracy'[source]#

inference_batch_size_per_device: int | None = 1[source]#

use_accelerator_for_evaluator: bool | None = None[source]#

temperature: float = 0[source]#

repetition_penalty: float = 1[source]#

max_new_tokens: int = 100[source]#

minibatch_size: int = 1[source]#

__post_init__()[source]#

class lmflow.args.InferencerArguments[source]#

Define a class InferencerArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a inferencer.

local_rankstr: For distributed training: local_rank

random_seed : int, default = 1 inference_batch_size : int, default = 1 deepspeed :

Enable deepspeed and pass the path to deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict

mixed_precisionstr, choice from [“bf16”,”fp16”].

mixed precision mode, whether to use bf16 or fp16

temperaturefloat

An argument of model.generate in huggingface to control the diversity of generation.

repetition_penaltyfloat

An argument of model.generate in huggingface to penalize repetitions.

use_beam_searchOptional[bool]

Whether to use beam search during inference, By default False.

num_output_sequencesOptional[int]

Number of output sequences to return for the given prompt, currently only used in vllm inference, By default 8.

top_pOptional[float]

top_p for sampling, By default 1.0.

top_kOptional[int]

top_k for sampling, By default -1 (no top_k).

additional_stop_token_idsOptional[list[int]]

the ids of the end of sentence tokens, By default [].

apply_chat_templateOptional[bool]

Whether to apply chat template, By default True.

save_resultsOptional[bool]

Whether to save inference results, By default False.

results_pathOptional[str]

The json file path of inference results, By default None.

enable_decode_inference_resultOptional[bool]

Whether to detokenize the inference results.

NOTE: For iterative align pipelines, whether to detokenize depends on the homogeneity of the policy model and the reward model (i.e., if they have the same tokenizer).

use_vllm: bool, optional

Whether to use VLLM for inference, By default False.

vllm_tensor_parallel_size: int, optional

The tensor parallel size for VLLM inference.

vllm_gpu_memory_utilization: float, optional

The GPU memory utilization for VLLM inference. The proportion of GPU memory (per GPU) to use for VLLM inference.

device: str = 'gpu'[source]#

local_rank: int = -1[source]#

inference_batch_size: int = 1[source]#

vllm_inference_batch_size: int = 1[source]#

temperature: float = 0.0[source]#

repetition_penalty: float = 1[source]#

max_new_tokens: int = 100[source]#

random_seed: int | None = 1[source]#

deepspeed: str | None = None[source]#

mixed_precision: str | None = 'bf16'[source]#

do_sample: bool | None = False[source]#

use_accelerator: bool | None = None[source]#

use_beam_search: bool | None = False[source]#

num_output_sequences: int | None = 8[source]#

top_p: float | None = 1.0[source]#

top_k: int | None = -1[source]#

additional_stop_token_ids: list[int] | None = [][source]#

apply_chat_template: bool | None = True[source]#

enable_decode_inference_result: bool | None = False[source]#

tensor_parallel_size: int | None = 1[source]#

enable_distributed_inference: bool | None = False[source]#

distributed_inference_num_instances: int | None = 1[source]#

use_vllm: bool = False[source]#

vllm_tensor_parallel_size: int | None = 1[source]#

vllm_gpu_memory_utilization: float | None = 0.95[source]#

save_results: bool | None = False[source]#

results_path: str | None = None[source]#

__post_init__()[source]#

class lmflow.args.RaftAlignerArguments[source]#

Bases: transformers.TrainingArguments

Define a class RaftAlignerArguments to configure raft aligner.

output_reward_path: str | None = 'tmp/raft_aligner/'[source]#

output_min_length: int | None = 64[source]#

output_max_length: int | None = 128[source]#

num_raft_iteration: int | None = 20[source]#

raft_batch_size: int | None = 1024[source]#

top_reward_percentage: float | None = 0.2[source]#

inference_batch_size_per_device: int | None = 1[source]#

collection_strategy: str | None = 'top'[source]#

class lmflow.args.BenchmarkingArguments[source]#

dataset_name: str | None = None[source]#

lm_evaluation_metric: str | None = 'accuracy'[source]#

class lmflow.args.DPOAlignerArguments[source]#

The arguments for the DPO training script.

local_rank: int = -1[source]#

beta: float | None = 0.1[source]#

learning_rate: float | None = 0.0005[source]#

lr_scheduler_type: str | None = 'cosine'[source]#

warmup_steps: int | None = 100[source]#

weight_decay: float | None = 0.05[source]#

optimizer_type: str | None = 'paged_adamw_32bit'[source]#

per_device_train_batch_size: int | None = 4[source]#

per_device_eval_batch_size: int | None = 1[source]#

gradient_accumulation_steps: int | None = 4[source]#

gradient_checkpointing: bool | None = True[source]#

gradient_checkpointing_use_reentrant: bool | None = False[source]#

max_prompt_length: int | None = 512[source]#

max_length: int | None = 1024[source]#

max_steps: int | None = 1000[source]#

logging_steps: int | None = 10[source]#

save_steps: int | None = 100[source]#

eval_steps: int | None = 100[source]#

output_dir: str | None = './results'[source]#

log_freq: int | None = 1[source]#

sanity_check: bool | None = False[source]#

report_to: str | None = 'wandb'[source]#

seed: int | None = 0[source]#

run_name: str | None = 'dpo'[source]#

eval_dataset_path: str | None = None[source]#

class lmflow.args.DPOv2AlignerArguments[source]#

Bases: FinetunerArguments

The arguments for the DPOv2 training script.

random_seed: int | None = 42[source]#

accelerate_config_file: str | None = None[source]#

margin_scale: float | None = 1.0[source]#

sampling_paired_method: str | None = 'max_random'[source]#

length_penalty: float | None = 0[source]#

max_length: int | None = 2048[source]#

max_prompt_length: int | None = 1000[source]#

mask_prompt: bool | None = False[source]#

beta: float | None = 0.1[source]#

loss_type: str | None = 'sigmoid'[source]#

class lmflow.args.IterativeAlignerArguments[source]#

Bases: InferencerArguments

Arguments for iterative aligners.

dataset_path_list: list[str] = [][source]#

initial_iter_idx: int = 0[source]#

class lmflow.args.IterativeDPOAlignerArguments[source]#

Bases: IterativeAlignerArguments, DPOv2AlignerArguments

Arguments for iterative DPO aligners.

output_dir: str | None = './runs'[source]#

reward_model_inference_batch_size: int = 1[source]#

reward_model_inference_block_size: int = 2048[source]#

do_response_generation: bool = True[source]#

do_scoring: bool = True[source]#

do_dpo_align: bool = True[source]#

lmflow.args.PIPELINE_ARGUMENT_MAPPING[source]#

class lmflow.args.AutoArguments[source]#

Automatically choose arguments from FinetunerArguments or EvaluatorArguments.

get_pipeline_args_class()[source]#

lmflow.args.split_args(args)[source]#

lmflow.args#

Attributes#

Classes#

Functions#

Module Contents#

This Page