lmflow.models.hf_model_mixin
Attributes
Classes
Module Contents
- class lmflow.models.hf_model_mixin.HFModelMixin(model_args: lmflow.args.ModelArguments, do_train: bool, device: str | None = 'gpu', hf_auto_model_additional_args: dict | None = None, *args, **kwargs)
Bases: lmflow.models.base_model.BaseModel
- __prepare_tokenizer(model_args: lmflow.args.ModelArguments) → transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast
- __prepare_dtype(model_args: lmflow.args.ModelArguments) → torch.dtype
- __prepare_model_config(model_args: lmflow.args.ModelArguments, hf_auto_model_additional_args: dict | None = None)
Prepare the model configuration for the HF auto register.
Parameters
- model_args : ModelArguments
LMFlow model arguments.
- hf_auto_model_additional_args : Optional[dict], optional
Special configurations, such as num_labels in AutoModelForSequenceClassification (commonly used in reward modeling), are not preset in __prepare_model_config, so they should be passed in hf_auto_model_additional_args.
Returns
- config : ModelConfig
HF model config.
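As a minimal sketch of the idea described above, the snippet below shows how task-specific settings such as num_labels might be layered on top of preset configuration values. The helper `merge_config_kwargs` and the preset keys are hypothetical and written only for illustration; they are not taken from LMFlow's implementation.

```python
from typing import Optional


def merge_config_kwargs(preset: dict, additional: Optional[dict] = None) -> dict:
    """Hypothetical helper: layer caller-supplied settings over preset ones."""
    merged = dict(preset)  # copy so the preset dict is not mutated
    if additional:
        merged.update(additional)  # caller-supplied keys win on conflict
    return merged


# Assumed preset values, for illustration only.
preset_kwargs = {"trust_remote_code": False, "use_cache": True}

# A reward model usually emits one scalar score, hence num_labels=1.
reward_model_kwargs = merge_config_kwargs(preset_kwargs, {"num_labels": 1})
```

Under these assumptions, a caller building a reward model would pass `{"num_labels": 1}` as hf_auto_model_additional_args, while a plain causal language model would leave it as None.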
- __prepare_quant_config(model_args: lmflow.args.ModelArguments)
- __prepare_peft_config(model_args: lmflow.args.ModelArguments)
- __model_module_inject(model_args: lmflow.args.ModelArguments) → None
Override some model modules with custom implementations.
Current implementations:
- Position interpolation (model_args.do_rope_scaling): replace llama embeddings with condense embeddings.
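The general idea behind position interpolation can be sketched in a few lines: position indices are divided by a scale factor before the rotary angles are computed, so a longer sequence maps back into the position range seen during training. The function names, the `scale` parameter, and the default `base` below are assumptions for illustration; LMFlow's actual replacement module is not shown here.

```python
from typing import List


def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """Rotary angles for one position: position * base^(-2i/dim) per channel pair i."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def interpolated_rope_angles(position: int, dim: int, scale: float) -> List[float]:
    """Position interpolation: shrink the position index by `scale` first."""
    return rope_angles(position / scale, dim)
```

With `scale=4.0`, position 8 receives the same angles that position 2 would receive without scaling, which is how a 4x longer context is condensed into the original range.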
- __prepare_model_for_training(model_args: lmflow.args.ModelArguments, hf_auto_model: HF_AUTOMODEL_TYPE)
- __prepare_model_for_inference(model_args: lmflow.args.ModelArguments, hf_auto_model: HF_AUTOMODEL_TYPE)
- __prepare_model_for_vllm_inference(model_args: lmflow.args.ModelArguments, gpu_memory_utilization: float, tensor_parallel_size: int, data_parallel_size: int = 1, max_model_len: int | None = None)
- __prepare_model_for_sglang_inference(model_args: lmflow.args.ModelArguments, gpu_memory_utilization: float | None = None, tensor_parallel_size: int | None = None, enable_deterministic_inference: bool = False, attention_backend: str | None = None)
- activate_model_for_inference(inference_engine: Literal['huggingface', 'vllm', 'sglang'] = 'huggingface', gpu_memory_utilization: float | None = None, tensor_parallel_size: int | None = None, data_parallel_size: int = 1, max_model_len: int | None = None, enable_deterministic_inference: bool = False, attention_backend: str | None = None)
- deactivate_model_for_inference(inference_engine: Literal['huggingface', 'vllm', 'sglang'] = 'huggingface')
Deactivate the model and release the resources.
NOTE: For vllm (>=0.8), the best-effort release below works for most single-GPU, inference-only use cases. It remains unreliable when tensor_parallel_size > 1, CUDA graphs are enabled, or the same process also holds an HF training model; in those cases use MemorySafeVLLMInferencer, which isolates inference in a subprocess. vllm still has no official in-process shutdown API (RFC vllm-project/vllm#24885); MemorySafeVLLMInferencer is kept for backward compatibility and will be migrated to vllm sleep mode in a follow-up.
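The subprocess isolation mentioned in the note can be illustrated with a generic pattern: the work runs in a child process, and the operating system reclaims everything the child allocated once it exits. The sketch below is not the MemorySafeVLLMInferencer implementation; `WORKER_CODE` and `run_isolated` are hypothetical names, and the worker only upper-cases its input as a stand-in for real generation.

```python
import json
import subprocess
import sys

# Stand-in worker script; a real worker would load the engine and generate here.
WORKER_CODE = """
import json, sys
prompts = json.loads(sys.stdin.read())
print(json.dumps([p.upper() for p in prompts]))
"""


def run_isolated(prompts):
    """Run the worker in a child process and return its JSON-decoded result."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_CODE],
        input=json.dumps(prompts),  # send prompts to the child over stdin
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    # The child has exited by now, so its memory has been returned to the OS.
    return json.loads(completed.stdout)
```

The trade-off of this pattern is startup cost: every call pays for a fresh process (and, in a real setting, a fresh model load), in exchange for a release that does not depend on in-process cleanup.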