
lmflow.args#

This module defines dataclasses such as ModelArguments and DatasetArguments, which contain the arguments for the model and dataset used in training.

It imports several names, including dataclass and field from dataclasses, Optional from typing, require_version from transformers.utils.versions, and MODEL_FOR_CAUSAL_LM_MAPPING and TrainingArguments from transformers.

MODEL_CONFIG_CLASSES is assigned the list of model config classes from MODEL_FOR_CAUSAL_LM_MAPPING, and MODEL_TYPES is assigned a tuple of the model types extracted from MODEL_CONFIG_CLASSES.
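As a rough illustration, these two constants are typically built from the transformers mapping along the following lines (a minimal sketch mirroring the description above, not a verbatim copy of the source):

    from transformers import MODEL_FOR_CAUSAL_LM_MAPPING

    # The mapping's keys are the config classes for causal LM models.
    MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
    # Each config class exposes its model_type string.
    MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)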

Module Contents#

Classes#

ModelArguments

Define a class ModelArguments using the dataclass decorator.

VisModelArguments

Define a class VisModelArguments using the dataclass decorator.

DatasetArguments

Define a class DatasetArguments using the dataclass decorator.

MultiModalDatasetArguments

Define a class MultiModalDatasetArguments using the dataclass decorator.

FinetunerArguments

Adapt transformers.TrainingArguments

RewardModelingArguments

Arguments for reward modeling.

EvaluatorArguments

Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an evaluator.

InferencerArguments

Define a class InferencerArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an inferencer.

RaftAlignerArguments

Define a class RaftAlignerArguments to configure the RAFT aligner.

BenchmarkingArguments

DPOAlignerArguments

The arguments for the DPO training script.

AutoArguments

Automatically choose arguments from FinetunerArguments or EvaluatorArguments.

Attributes#

MODEL_CONFIG_CLASSES

MODEL_TYPES

PIPELINE_ARGUMENT_MAPPING

lmflow.args.MODEL_CONFIG_CLASSES[source]#
lmflow.args.MODEL_TYPES[source]#
class lmflow.args.ModelArguments[source]#

Define a class ModelArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a model.

model_name_or_path: str

a string representing the path or name of a pretrained model checkpoint for weights initialization. If None, a model will be trained from scratch.

model_type: str

a string representing the type of model to use if training from scratch. If not provided, a pretrained model will be used.

config_overrides: str

a string representing the default config settings to override when training a model from scratch.

config_name: str

a string representing the name or path of the pretrained config to use, if different from model_name_or_path.

tokenizer_name: str

a string representing the name or path of the pretrained tokenizer to use, if different from model_name_or_path.

cache_dir: str

a string representing the path to the directory where pretrained models downloaded from huggingface.co will be stored.

use_fast_tokenizer: bool

a boolean indicating whether to use a fast tokenizer (backed by the tokenizers library).

model_revision: str

a string representing the specific model version to use (can be a branch name, tag name, or commit id).

use_auth_token: bool

a boolean indicating whether to use the token generated when running huggingface-cli login (necessary to use this script with private models).

torch_dtype: str

a string representing the dtype under which to load the model. If "auto" is passed, the dtype is derived automatically from the model's weights.

use_ram_optimized_load: bool

a boolean indicating whether to fall back to disk mapping when memory is insufficient.

use_int8: bool

a boolean indicating whether to load the model with int8 quantization for inference.

load_in_4bit: bool

a boolean indicating whether to load the model with 4-bit quantization.

model_max_length: int

an integer representing the maximum sequence length of the model.

truncation_side: str

a string representing the side on which the model should have truncation applied.

model_name_or_path: str | None[source]#
lora_model_path: str | None[source]#
model_type: str | None[source]#
arch_type: str | None[source]#
config_overrides: str | None[source]#
config_name: str | None[source]#
tokenizer_name: str | None[source]#
cache_dir: str | None[source]#
use_fast_tokenizer: bool[source]#
model_revision: str[source]#
use_auth_token: bool[source]#
trust_remote_code: bool[source]#
torch_dtype: str | None[source]#
use_lora: bool[source]#
use_qlora: bool[source]#
bits: int[source]#
quant_type: str[source]#
double_quant: bool[source]#
lora_r: int[source]#
lora_alpha: int[source]#
lora_target_modules: List[str][source]#
lora_dropout: float[source]#
save_aggregated_lora: bool[source]#
use_ram_optimized_load: bool[source]#
use_flash_attention: bool[source]#
truncate_to_model_max_length: bool[source]#
do_rope_scaling: bool[source]#
rope_pi_ratio: int[source]#
rope_ntk_ratio: int[source]#
use_int8: bool[source]#
load_in_4bit: bool | None[source]#
model_max_length: int | None[source]#
truncation_side: str[source]#
__post_init__()[source]#
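A minimal usage sketch, assuming the standard HfArgumentParser workflow from transformers; the checkpoint name and LoRA values below are illustrative, not defaults of the class:

    from transformers import HfArgumentParser

    from lmflow.args import ModelArguments

    parser = HfArgumentParser(ModelArguments)
    # Parse command-line style arguments into the dataclass; the field names
    # match the attribute list above.
    (model_args,) = parser.parse_args_into_dataclasses(
        args=["--model_name_or_path", "gpt2", "--use_lora", "True", "--lora_r", "8"]
    )
    print(model_args.model_name_or_path, model_args.use_lora, model_args.lora_r)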
class lmflow.args.VisModelArguments[source]#

Bases: ModelArguments

Define a class VisModelArguments using the dataclass decorator. It inherits all parameters of ModelArguments (see above) and adds optional parameters for configuring vision-language models.

low_resource: bool | None[source]#
custom_model: bool[source]#
pretrained_language_projection_path: str[source]#
custom_vision_model: bool[source]#
image_encoder_name_or_path: str | None[source]#
qformer_name_or_path: str | None[source]#
llm_model_name_or_path: str | None[source]#
use_prompt_cache: bool[source]#
prompt_cache_path: str | None[source]#
llava_loading: bool | None[source]#
with_qformer: bool | None[source]#
vision_select_layer: int | None[source]#
llava_pretrain_model_path: str | None[source]#
save_pretrain_model_path: str | None[source]#
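A hedged construction sketch; the checkpoint identifiers below are placeholders chosen for illustration, not defaults of the class:

    from lmflow.args import VisModelArguments

    vis_model_args = VisModelArguments(
        llm_model_name_or_path="lmsys/vicuna-7b-v1.5",                # placeholder LLM checkpoint
        image_encoder_name_or_path="openai/clip-vit-large-patch14",  # placeholder vision encoder
        with_qformer=False,
        custom_vision_model=False,
    )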
class lmflow.args.DatasetArguments[source]#

Define a class DatasetArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a dataset for a language model.

dataset_path: str

a string representing the path of the dataset to use.

dataset_name: str

a string representing the name of the dataset to use. The default value is "customized".

is_custom_dataset: bool

a boolean indicating whether to use custom data. The default value is False.

customized_cache_dir: str

a string representing the path to the directory where customized dataset caches will be stored.

dataset_config_name: str

a string representing the configuration name of the dataset to use (via the datasets library).

train_file: str

a string representing the path to the input training data file (a text file).

validation_file: str

a string representing the path to the input evaluation data file to evaluate the perplexity on (a text file).

max_train_samples: int

an integer indicating the maximum number of training examples to use for debugging or quicker training. If set, the training dataset will be truncated to this number.

max_eval_samples: int

an integer indicating the maximum number of evaluation examples to use for debugging or quicker evaluation. If set, the evaluation dataset will be truncated to this number.

streaming: bool

a boolean indicating whether to enable streaming mode.

block_size: int

an integer indicating the optional input sequence length after tokenization. The training dataset will be truncated in blocks of this size for training.

train_on_prompt: bool

a boolean indicating whether to train on prompt for conversation datasets such as ShareGPT.

conversation_template: str

a string representing the template for conversation datasets.

The class also includes some additional parameters that can be used to configure the dataset further, such as overwrite_cache, validation_split_percentage, preprocessing_num_workers, group_texts_batch_size, disable_group_texts, keep_linebreaks, and test_file.

The field function is used to set default values and provide help messages for each parameter. The Optional type hint is used to indicate that a parameter is optional. The metadata argument is used to provide additional information about each parameter, such as a help message.

dataset_path: str | None[source]#
dataset_name: str | None[source]#
is_custom_dataset: bool | None[source]#
customized_cache_dir: str | None[source]#
dataset_config_name: str | None[source]#
train_file: str | None[source]#
validation_file: str | None[source]#
max_train_samples: int | None[source]#
max_eval_samples: int | None[source]#
streaming: bool[source]#
block_size: int | None[source]#
overwrite_cache: bool[source]#
validation_split_percentage: int | None[source]#
preprocessing_num_workers: int | None[source]#
group_texts_batch_size: int[source]#
disable_group_texts: bool[source]#
keep_linebreaks: bool[source]#
test_file: str | None[source]#
train_on_prompt: bool[source]#
conversation_template: str | None[source]#
__post_init__()[source]#
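A minimal construction sketch; the dataset path is hypothetical and the numeric values are illustrative:

    from lmflow.args import DatasetArguments

    data_args = DatasetArguments(
        dataset_path="data/alpaca/train",   # hypothetical path in LMFlow's dataset format
        block_size=512,                     # input sequence length after tokenization
        preprocessing_num_workers=4,
    )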
class lmflow.args.MultiModalDatasetArguments[source]#

Bases: DatasetArguments

Define a class MultiModalDatasetArguments using the dataclass decorator. It inherits all parameters of DatasetArguments (see above) and adds optional parameters for configuring multi-modal (image-text) datasets.

image_folder: str | None[source]#
image_aspect_ratio: str | None[source]#
is_multimodal: bool | None[source]#
use_image_start_end: bool | None[source]#
sep_style: str | None[source]#
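A hedged sketch of the additional image-related fields; the paths and the aspect-ratio value are assumptions for illustration:

    from lmflow.args import MultiModalDatasetArguments

    mm_data_args = MultiModalDatasetArguments(
        dataset_path="data/llava_instruct",   # hypothetical dataset path
        image_folder="data/coco/train2017",   # hypothetical image folder
        image_aspect_ratio="pad",             # assumed value
        is_multimodal=True,
    )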
class lmflow.args.FinetunerArguments[source]#

Bases: transformers.TrainingArguments

Adapt transformers.TrainingArguments

eval_dataset_path: str | None[source]#
remove_unused_columns: bool | None[source]#
finetune_part: str | None[source]#
save_language_projection: str | None[source]#
use_lisa: bool[source]#
lisa_activated_layers: int[source]#
lisa_interval_steps: int[source]#
lisa_layers_attribute: str[source]#
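Because FinetunerArguments subclasses transformers.TrainingArguments, the usual Trainer options are accepted alongside the LMFlow-specific fields above. A hedged sketch with illustrative values (the LISA settings are examples, not recommendations):

    from lmflow.args import FinetunerArguments

    training_args = FinetunerArguments(
        output_dir="output_models/finetune",   # standard TrainingArguments field
        per_device_train_batch_size=1,
        num_train_epochs=1,
        use_lisa=True,                         # enable layerwise importance sampling (LISA)
        lisa_activated_layers=2,
        lisa_interval_steps=20,
    )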
class lmflow.args.RewardModelingArguments[source]#

Bases: FinetunerArguments

Arguments for reward modeling.

class lmflow.args.EvaluatorArguments[source]#

Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an evaluator.

local_rank: int

For distributed training: local_rank.

random_shuffle: bool

use_wandb: bool

random_seed: int, default = 1

output_dir: str, default = './output_dir'

mixed_precision: str, choice from ["bf16", "fp16"]

mixed precision mode, whether to use bf16 or fp16.

deepspeed: str

Enable deepspeed and pass the path to a deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict.

temperature: float

An argument of model.generate in huggingface to control the diversity of generation.

repetition_penalty: float

An argument of model.generate in huggingface to penalize repetitions.

local_rank: int[source]#
random_shuffle: bool | None[source]#
use_wandb: bool | None[source]#
random_seed: int | None[source]#
output_dir: str | None[source]#
mixed_precision: str | None[source]#
deepspeed: str | None[source]#
answer_type: str | None[source]#
prompt_structure: str | None[source]#
evaluate_block_size: int | None[source]#
metric: str | None[source]#
inference_batch_size_per_device: int | None[source]#
use_accelerator_for_evaluator: bool[source]#
temperature: float[source]#
repetition_penalty: float[source]#
max_new_tokens: int[source]#
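A hedged configuration sketch; the metric name and deepspeed config path are assumptions for illustration:

    from lmflow.args import EvaluatorArguments

    evaluator_args = EvaluatorArguments(
        metric="accuracy",                      # assumed metric name
        inference_batch_size_per_device=1,
        mixed_precision="bf16",
        deepspeed="examples/ds_config.json",    # hypothetical config path
    )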
class lmflow.args.InferencerArguments[source]#

Define a class InferencerArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an inferencer.

local_rank: int

For distributed training: local_rank.

random_seed: int, default = 1

deepspeed: str

Enable deepspeed and pass the path to a deepspeed json config file (e.g. ds_config.json) or an already loaded json file as a dict.

mixed_precision: str, choice from ["bf16", "fp16"]

mixed precision mode, whether to use bf16 or fp16.

temperature: float

An argument of model.generate in huggingface to control the diversity of generation.

repetition_penalty: float

An argument of model.generate in huggingface to penalize repetitions.

device: str[source]#
local_rank: int[source]#
temperature: float[source]#
repetition_penalty: float[source]#
max_new_tokens: int[source]#
random_seed: int | None[source]#
deepspeed: str | None[source]#
mixed_precision: str | None[source]#
do_sample: bool | None[source]#
use_accelerator: bool[source]#
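A hedged sketch of the decoding-related options; values are illustrative:

    from lmflow.args import InferencerArguments

    inferencer_args = InferencerArguments(
        do_sample=True,
        temperature=0.7,          # passed through to model.generate
        repetition_penalty=1.1,
        max_new_tokens=256,
    )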
class lmflow.args.RaftAlignerArguments[source]#

Bases: transformers.TrainingArguments

Define a class RaftAlignerArguments to configure the RAFT aligner.

output_reward_path: str | None[source]#
output_min_length: int | None[source]#
output_max_length: int | None[source]#
num_raft_iteration: int | None[source]#
raft_batch_size: int | None[source]#
top_reward_percentage: float | None[source]#
inference_batch_size_per_device: int | None[source]#
collection_strategy: str | None[source]#
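RaftAlignerArguments also subclasses transformers.TrainingArguments. A hedged sketch with illustrative values; the collection_strategy value is an assumption:

    from lmflow.args import RaftAlignerArguments

    raft_args = RaftAlignerArguments(
        output_dir="output_models/raft_align",  # standard TrainingArguments field
        num_raft_iteration=20,
        raft_batch_size=320,
        top_reward_percentage=0.2,
        collection_strategy="top",              # assumed strategy name
    )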
class lmflow.args.BenchmarkingArguments[source]#
dataset_name: str | None[source]#
lm_evaluation_metric: str | None[source]#
class lmflow.args.DPOAlignerArguments[source]#

The arguments for the DPO training script.

local_rank: int[source]#
beta: float | None[source]#
learning_rate: float | None[source]#
lr_scheduler_type: str | None[source]#
warmup_steps: int | None[source]#
weight_decay: float | None[source]#
optimizer_type: str | None[source]#
per_device_train_batch_size: int | None[source]#
per_device_eval_batch_size: int | None[source]#
gradient_accumulation_steps: int | None[source]#
gradient_checkpointing: bool | None[source]#
gradient_checkpointing_use_reentrant: bool | None[source]#
max_prompt_length: int | None[source]#
max_length: int | None[source]#
max_steps: int | None[source]#
logging_steps: int | None[source]#
save_steps: int | None[source]#
eval_steps: int | None[source]#
output_dir: str | None[source]#
log_freq: int | None[source]#
sanity_check: bool | None[source]#
report_to: str | None[source]#
seed: int | None[source]#
run_name: str | None[source]#
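A hedged sketch of a DPO configuration; all values are illustrative, not recommended defaults:

    from lmflow.args import DPOAlignerArguments

    dpo_args = DPOAlignerArguments(
        beta=0.1,                         # strength of the DPO regularization term
        learning_rate=5e-4,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        max_prompt_length=512,
        max_length=1024,
        output_dir="output_models/dpo",
    )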
lmflow.args.PIPELINE_ARGUMENT_MAPPING[source]#
class lmflow.args.AutoArguments[source]#

Automatically choose arguments from FinetunerArguments or EvaluatorArguments.

get_pipeline_args_class()[source]#
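A hedged sketch of the dispatch pattern that AutoArguments describes; the exact mapping keys and the signature of get_pipeline_args_class may differ, so this only illustrates selecting an argument dataclass by pipeline name via PIPELINE_ARGUMENT_MAPPING:

    from lmflow.args import PIPELINE_ARGUMENT_MAPPING

    pipeline_name = "finetuner"                           # assumed mapping key
    PipelineArguments = PIPELINE_ARGUMENT_MAPPING[pipeline_name]
    pipeline_args = PipelineArguments(output_dir="output_models/finetune")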