lmflow.args#
This module defines dataclasses such as ModelArguments and DatasetArguments, which contain the arguments for the model and the dataset used in training.
It imports several names, including dataclass and field from dataclasses, Optional from typing, require_version from transformers.utils.versions, and MODEL_FOR_CAUSAL_LM_MAPPING and TrainingArguments from transformers.
MODEL_CONFIG_CLASSES is assigned a list of the model config classes from MODEL_FOR_CAUSAL_LM_MAPPING. MODEL_TYPES is assigned a tuple of the model types extracted from the MODEL_CONFIG_CLASSES.
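As a rough sketch of how these module-level attributes are typically derived (the exact expressions in lmflow.args may differ slightly):

```python
# Sketch only: build the config-class list and the model-type tuple from the
# causal-LM mapping, following the common transformers example-script pattern.
from transformers import MODEL_FOR_CAUSAL_LM_MAPPING

MODEL_CONFIG_CLASSES = list(MODEL_FOR_CAUSAL_LM_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
```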
Attributes#

| Attribute | Description |
|---|---|
| MODEL_CONFIG_CLASSES | The list of model config classes from MODEL_FOR_CAUSAL_LM_MAPPING. |
| MODEL_TYPES | The tuple of model types extracted from MODEL_CONFIG_CLASSES. |

Classes#

| Class | Description |
|---|---|
| ModelArguments | Define a class ModelArguments using the dataclass decorator. |
| VisModelArguments | Define a class ModelArguments using the dataclass decorator. |
| DatasetArguments | Define a class DatasetArguments using the dataclass decorator. |
| MultiModalDatasetArguments | Define a class DatasetArguments using the dataclass decorator. |
| FinetunerArguments | Adapt transformers.TrainingArguments. |
| RewardModelTunerArguments | Arguments for reward modeling. |
| EvaluatorArguments | Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional parameters. |
| InferencerArguments | Define a class InferencerArguments using the dataclass decorator. The class contains several optional parameters. |
| RaftAlignerArguments | Define a class RaftAlignerArguments to configure the RAFT aligner. |
| DPOAlignerArguments | The arguments for the DPO training script. |
| DPOv2AlignerArguments | The arguments for the DPOv2 training script. |
| IterativeAlignerArguments | Arguments for iterative aligners. |
| IterativeDPOAlignerArguments | Arguments for iterative DPO aligners. |
| AutoArguments | Automatically choose arguments from FinetunerArguments or EvaluatorArguments. |
Module Contents#
- class lmflow.args.ModelArguments[source]#
Define a class ModelArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a model.
- model_name_or_path : str
a string representing the path or name of a pretrained model checkpoint for weights initialization. If None, a model will be trained from scratch.
- model_type : str
a string representing the type of model to use if training from scratch. If not provided, a pretrained model will be used.
- config_overrides : str
a string representing the default config settings to override when training a model from scratch.
- config_name : str
a string representing the name or path of the pretrained config to use, if different from the model_name_or_path.
- tokenizer_name : str
a string representing the name or path of the pretrained tokenizer to use, if different from the model_name_or_path.
- cache_dir : str
a string representing the path to the directory where pretrained models downloaded from huggingface.co will be stored.
- use_fast_tokenizer : bool
a boolean indicating whether to use a fast tokenizer (backed by the tokenizers library) or not.
- model_revision : str
a string representing the specific model version to use (can be a branch name, tag name, or commit id).
- token : Optional[str]
Necessary when accessing a private model/dataset.
- torch_dtype : str
a string representing the dtype to load the model under. If auto is passed, the dtype will be automatically derived from the model's weights.
- use_ram_optimized_load : bool
a boolean indicating whether to use disk mapping when memory is not enough.
- use_int8 : bool
a boolean indicating whether to load int8 quantization for inference.
- load_in_4bit : bool
whether to load the model in 4-bit.
- model_max_length : int
The maximum length of the model.
- truncation_side : str
The side on which the model should have truncation applied.
- arch_type : str
Model architecture type.
- padding_side : str
The side on which the tokenizer should have padding applied.
- eos_padding : bool
whether to pad with the eos token instead of the pad token.
- ignore_bias_buffers : bool
a fix for DDP issues with LM bias/mask buffers (invalid scalar type, inplace operation).
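For illustration, these dataclasses are usually populated from the command line via transformers.HfArgumentParser. The sketch below assumes the flag names mirror the field names listed above; the checkpoint name is only a placeholder:

```python
from transformers import HfArgumentParser

from lmflow.args import ModelArguments

# Build a parser from the dataclass and parse illustrative flags;
# "gpt2" is a placeholder checkpoint name.
parser = HfArgumentParser(ModelArguments)
(model_args,) = parser.parse_args_into_dataclasses(
    args=["--model_name_or_path", "gpt2", "--torch_dtype", "auto"]
)
print(model_args.model_name_or_path)  # gpt2
```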
- class lmflow.args.VisModelArguments[source]#
Bases:
ModelArguments
Define a class ModelArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a model.
- model_name_or_path : str
a string representing the path or name of a pretrained model checkpoint for weights initialization. If None, a model will be trained from scratch.
- model_type : str
a string representing the type of model to use if training from scratch. If not provided, a pretrained model will be used.
- config_overrides : str
a string representing the default config settings to override when training a model from scratch.
- config_name : str
a string representing the name or path of the pretrained config to use, if different from the model_name_or_path.
- tokenizer_name : str
a string representing the name or path of the pretrained tokenizer to use, if different from the model_name_or_path.
- cache_dir : str
a string representing the path to the directory where pretrained models downloaded from huggingface.co will be stored.
- use_fast_tokenizer : bool
a boolean indicating whether to use a fast tokenizer (backed by the tokenizers library) or not.
- model_revision : str
a string representing the specific model version to use (can be a branch name, tag name, or commit id).
- token : Optional[str]
Necessary when accessing a private model/dataset.
- torch_dtype : str
a string representing the dtype to load the model under. If auto is passed, the dtype will be automatically derived from the model's weights.
- use_ram_optimized_load : bool
a boolean indicating whether to use disk mapping when memory is not enough.
- use_int8 : bool
a boolean indicating whether to load int8 quantization for inference.
- load_in_4bit : bool
whether to load the model in 4-bit.
- model_max_length : int
The maximum length of the model.
- truncation_side : str
The side on which the model should have truncation applied.
- arch_type : str
Model architecture type.
- padding_side : str
The side on which the tokenizer should have padding applied.
- eos_padding : bool
whether to pad with the eos token instead of the pad token.
- ignore_bias_buffers : bool
a fix for DDP issues with LM bias/mask buffers (invalid scalar type, inplace operation).
- class lmflow.args.DatasetArguments[source]#
Define a class DatasetArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a dataset for a language model.
- dataset_path : str
a string representing the path of the dataset to use.
- dataset_name : str
a string representing the name of the dataset to use. The default value is “customized”.
- is_custom_dataset : bool
a boolean indicating whether to use custom data. The default value is False.
- customized_cache_dir : str
a string representing the path to the directory where customized dataset caches will be stored.
- dataset_config_name : str
a string representing the configuration name of the dataset to use (via the datasets library).
- train_file : str
a string representing the path to the input training data file (a text file).
- validation_file : str
a string representing the path to the input evaluation data file to evaluate the perplexity on (a text file).
- max_train_samples : int
an integer indicating the maximum number of training examples to use for debugging or quicker training. If set, the training dataset will be truncated to this number.
- max_eval_samples : int
an integer indicating the maximum number of evaluation examples to use for debugging or quicker training. If set, the evaluation dataset will be truncated to this number.
- streaming : bool
a boolean indicating whether to enable streaming mode.
- block_size : int
an integer indicating the optional input sequence length after tokenization. The training dataset will be truncated in blocks of this size for training.
- train_on_prompt : bool
a boolean indicating whether to train on prompt for conversation datasets such as ShareGPT.
- conversation_template : str
a string representing the template for conversation datasets.
The class also includes some additional parameters that can be used to configure the dataset further, such as overwrite_cache, validation_split_percentage, preprocessing_num_workers, disable_group_texts, demo_example_in_prompt, explanation_in_prompt, keep_linebreaks, and prompt_structure.
The field function is used to set default values and provide help messages for each parameter. The Optional type hint is used to indicate that a parameter is optional. The metadata argument is used to provide additional information about each parameter, such as a help message.
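As a small usage sketch, a DatasetArguments instance can be constructed directly; the path below is a placeholder and only two of the fields above are set, with the rest keeping their defaults:

```python
from lmflow.args import DatasetArguments

# Point at a local dataset directory and cap the tokenized block size;
# all other fields fall back to their defaults.
data_args = DatasetArguments(
    dataset_path="data/my_dataset",  # placeholder path
    block_size=512,
)
```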
- class lmflow.args.MultiModalDatasetArguments[source]#
Bases:
DatasetArguments
Define a class DatasetArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure a dataset for a language model.
- dataset_path : str
a string representing the path of the dataset to use.
- dataset_name : str
a string representing the name of the dataset to use. The default value is “customized”.
- is_custom_dataset : bool
a boolean indicating whether to use custom data. The default value is False.
- customized_cache_dir : str
a string representing the path to the directory where customized dataset caches will be stored.
- dataset_config_name : str
a string representing the configuration name of the dataset to use (via the datasets library).
- train_file : str
a string representing the path to the input training data file (a text file).
- validation_file : str
a string representing the path to the input evaluation data file to evaluate the perplexity on (a text file).
- max_train_samples : int
an integer indicating the maximum number of training examples to use for debugging or quicker training. If set, the training dataset will be truncated to this number.
- max_eval_samples : int
an integer indicating the maximum number of evaluation examples to use for debugging or quicker training. If set, the evaluation dataset will be truncated to this number.
- streaming : bool
a boolean indicating whether to enable streaming mode.
- block_size : int
an integer indicating the optional input sequence length after tokenization. The training dataset will be truncated in blocks of this size for training.
- train_on_prompt : bool
a boolean indicating whether to train on prompt for conversation datasets such as ShareGPT.
- conversation_template : str
a string representing the template for conversation datasets.
The class also includes some additional parameters that can be used to configure the dataset further, such as overwrite_cache, validation_split_percentage, preprocessing_num_workers, disable_group_texts, demo_example_in_prompt, explanation_in_prompt, keep_linebreaks, and prompt_structure.
The field function is used to set default values and provide help messages for each parameter. The Optional type hint is used to indicate that a parameter is optional. The metadata argument is used to provide additional information about each parameter, such as a help message.
- class lmflow.args.FinetunerArguments[source]#
Bases:
transformers.TrainingArguments
Adapt transformers.TrainingArguments
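Because FinetunerArguments subclasses transformers.TrainingArguments, the familiar Trainer knobs are available on it. A minimal sketch with illustrative values (the output directory is a placeholder):

```python
from lmflow.args import FinetunerArguments

# Standard TrainingArguments fields are inherited; all values are illustrative.
finetuner_args = FinetunerArguments(
    output_dir="output_models/finetuned",  # placeholder output directory
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
)
```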
- class lmflow.args.RewardModelTunerArguments[source]#
Bases:
FinetunerArguments
Arguments for reward modeling.
- class lmflow.args.EvaluatorArguments[source]#
Define a class EvaluatorArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an evaluator.
- local_rank : str
For distributed training: local_rank
- random_shuffle : bool
- use_wandb : bool
- random_seed : int, default = 1
- output_dir : str, default = './output_dir'
- mixed_precision : str, choice from ["bf16", "fp16"]
mixed precision mode, whether to use bf16 or fp16
- deepspeed :
Enable deepspeed and pass the path to the DeepSpeed JSON config file (e.g. ds_config.json) or an already loaded JSON file as a dict
- temperature : float
An argument of model.generate in huggingface to control the diversity of generation.
- repetition_penalty : float
An argument of model.generate in huggingface to penalize repetitions.
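A minimal sketch of constructing EvaluatorArguments directly, assuming the fields listed above all have defaults; the DeepSpeed config path is a placeholder:

```python
from lmflow.args import EvaluatorArguments

# deepspeed accepts a path to a JSON config (or an already-loaded dict),
# as described above; the path here is only a placeholder.
evaluator_args = EvaluatorArguments(
    deepspeed="examples/ds_config.json",
    mixed_precision="bf16",
    temperature=0.0,
)
```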
- class lmflow.args.InferencerArguments[source]#
Define a class InferencerArguments using the dataclass decorator. The class contains several optional parameters that can be used to configure an inferencer.
- local_rank : str
For distributed training: local_rank
- random_seed : int, default = 1
- inference_batch_size : int, default = 1
- deepspeed :
Enable deepspeed and pass the path to the DeepSpeed JSON config file (e.g. ds_config.json) or an already loaded JSON file as a dict
- mixed_precision : str, choice from ["bf16", "fp16"]
mixed precision mode, whether to use bf16 or fp16
- temperature : float
An argument of model.generate in huggingface to control the diversity of generation.
- repetition_penalty : float
An argument of model.generate in huggingface to penalize repetitions.
- use_beam_search : Optional[bool]
Whether to use beam search during inference. By default False.
- num_output_sequences : Optional[int]
Number of output sequences to return for the given prompt, currently only used in vLLM inference. By default 8.
- top_p : Optional[float]
top_p for sampling. By default 1.0.
- top_k : Optional[int]
top_k for sampling. By default -1 (no top_k).
- additional_stop_token_ids : Optional[List[int]]
the ids of additional end-of-sentence tokens. By default [].
- apply_chat_template : Optional[bool]
Whether to apply the chat template. By default True.
- save_results : Optional[bool]
Whether to save inference results. By default False.
- results_path : Optional[str]
The JSON file path of inference results. By default None.
- enable_decode_inference_result : Optional[bool]
Whether to detokenize the inference results.
NOTE: For iterative align pipelines, whether to detokenize depends on the homogeneity of the policy model and the reward model (i.e., whether they share the same tokenizer).
- use_vllm : bool, optional
Whether to use vLLM for inference. By default False.
- vllm_tensor_parallel_size : int, optional
The tensor parallel size for vLLM inference.
- vllm_gpu_memory_utilization : float, optional
The proportion of GPU memory (per GPU) to use for vLLM inference.
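Several argument groups are commonly parsed together from a single command line with transformers.HfArgumentParser; a sketch with placeholder values only:

```python
from transformers import HfArgumentParser

from lmflow.args import DatasetArguments, InferencerArguments, ModelArguments

# Parse three argument groups at once; every value below is illustrative.
parser = HfArgumentParser((ModelArguments, DatasetArguments, InferencerArguments))
model_args, data_args, inferencer_args = parser.parse_args_into_dataclasses(
    args=[
        "--model_name_or_path", "gpt2",
        "--dataset_path", "data/my_dataset",
        "--temperature", "0.7",
        "--repetition_penalty", "1.1",
    ]
)
```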
- class lmflow.args.RaftAlignerArguments[source]#
Bases:
transformers.TrainingArguments
Define a class RaftAlignerArguments to configure the RAFT aligner.
- class lmflow.args.DPOv2AlignerArguments[source]#
Bases:
FinetunerArguments
The arguments for the DPOv2 training script.
- class lmflow.args.IterativeAlignerArguments[source]#
Bases:
InferencerArguments
Arguments for iterative aligners.
- class lmflow.args.IterativeDPOAlignerArguments[source]#
Bases:
IterativeAlignerArguments, DPOv2AlignerArguments
Arguments for iterative DPO aligners.