lmflow.pipeline.utils.dpov2_trainer
===================================

.. py:module:: lmflow.pipeline.utils.dpov2_trainer


Attributes
----------

.. autoapisummary::

   lmflow.pipeline.utils.dpov2_trainer.logger


Classes
-------

.. autoapisummary::

   lmflow.pipeline.utils.dpov2_trainer.DPOv2Trainer


Module Contents
---------------

.. py:data:: logger

.. py:class:: DPOv2Trainer(model: Union[transformers.PreTrainedModel, torch.nn.Module] = None, ref_model: Optional[Union[transformers.PreTrainedModel, torch.nn.Module]] = None, beta: float = 0.1, loss_type: Literal['sigmoid', 'hinge', 'cross_entropy', 'kl', 'rev_kl', 'raft'] = 'rev_kl', args: transformers.TrainingArguments = None, data_collator: Optional[transformers.DataCollator] = None, label_pad_token_id: int = -100, padding_value: int = 0, truncation_mode: str = 'keep_end', train_dataset: Optional[datasets.Dataset] = None, eval_dataset: Optional[Union[datasets.Dataset, Dict[str, datasets.Dataset]]] = None, tokenizer: Optional[transformers.PreTrainedTokenizerBase] = None, model_init: Optional[Callable[[], transformers.PreTrainedModel]] = None, callbacks: Optional[List[transformers.trainer_callback.TrainerCallback]] = None, optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None), preprocess_logits_for_metrics: Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, max_length: Optional[int] = None, max_prompt_length: Optional[int] = None, max_target_length: Optional[int] = None, peft_config: Optional[Dict] = None, is_encoder_decoder: Optional[bool] = None, disable_dropout: bool = True, generate_during_eval: bool = False, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalLoopOutput], Dict]] = None, mask_prompt: Optional[bool] = False, len_penalty: float = 0, preprocessing_num_workers: int = 1)

   Bases: :py:obj:`trl.DPOTrainer`

   .. py:attribute:: use_dpo_data_collator
      :value: True

   .. py:attribute:: len_penalty

   .. py:method:: dpo_loss(policy_chosen_logps: torch.FloatTensor, policy_rejected_logps: torch.FloatTensor, reference_chosen_logps: torch.FloatTensor, reference_rejected_logps: torch.FloatTensor, reference_free: bool = False, margin: Optional[torch.FloatTensor] = None, len_penalty: float = 0) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]

      Compute the DPO loss for a batch of policy and reference model log probabilities.

      Args:
          policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
          policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
          reference_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
          reference_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
          beta: Temperature parameter for the DPO loss, typically in the range of 0.1 to 0.5. The reference model is ignored as beta -> 0.
          reference_free: If True, ignore the _provided_ reference model and implicitly use a reference model that assigns equal probability to all responses.

      Returns:
          A tuple of three tensors: (losses, chosen_rewards, rejected_rewards). The losses tensor contains the DPO loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.

      .. !! processed by numpydoc !!
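   The different ``loss_type`` options transform the gap between the policy and reference log-ratios in different ways. As a point of reference, below is a minimal sketch of the standard sigmoid-style DPO objective, assuming the formulation from the original DPO paper; the function name ``sigmoid_dpo_loss`` is illustrative and not part of this module.

   .. code-block:: python

      import torch
      import torch.nn.functional as F


      def sigmoid_dpo_loss(
          policy_chosen_logps: torch.FloatTensor,
          policy_rejected_logps: torch.FloatTensor,
          reference_chosen_logps: torch.FloatTensor,
          reference_rejected_logps: torch.FloatTensor,
          beta: float = 0.1,
          reference_free: bool = False,
      ):
          # Log-ratio of chosen over rejected responses under each model.
          pi_logratios = policy_chosen_logps - policy_rejected_logps
          ref_logratios = reference_chosen_logps - reference_rejected_logps
          if reference_free:
              # Treat the reference model as uniform over responses.
              ref_logratios = torch.zeros_like(pi_logratios)

          logits = pi_logratios - ref_logratios
          # Sigmoid variant; other loss_type values ('hinge', 'rev_kl', ...)
          # apply a different transform to `logits`.
          losses = -F.logsigmoid(beta * logits)

          # Implicit rewards, detached so they are logged but not backpropagated.
          chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
          rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
          return losses, chosen_rewards, rejected_rewards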
   .. py:method:: get_batch_loss_metrics(model, batch: Dict[str, Union[List, torch.LongTensor]], train_eval: Literal['train', 'eval'] = 'train')

   .. py:method:: get_batch_metrics(model, batch: Dict[str, Union[List, torch.LongTensor]], train_eval: Literal['train', 'eval'] = 'train')

      Compute the DPO loss and other metrics for the given batch of inputs, for train or eval.

      .. !! processed by numpydoc !!
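   A minimal usage sketch follows. It assumes a TRL-style preference dataset with ``prompt``/``chosen``/``rejected`` columns and relies on the inherited :py:obj:`trl.DPOTrainer` machinery for tokenization and collation; the model name, toy data, and output directory are placeholders rather than values required by this module.

   .. code-block:: python

      from datasets import Dataset
      from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

      from lmflow.pipeline.utils.dpov2_trainer import DPOv2Trainer

      model_name = "gpt2"  # placeholder model for illustration
      model = AutoModelForCausalLM.from_pretrained(model_name)
      ref_model = AutoModelForCausalLM.from_pretrained(model_name)
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      tokenizer.pad_token = tokenizer.eos_token

      # Toy preference data in the assumed prompt/chosen/rejected format.
      train_dataset = Dataset.from_dict({
          "prompt": ["What is 2 + 2?"],
          "chosen": ["2 + 2 equals 4."],
          "rejected": ["2 + 2 equals 5."],
      })

      training_args = TrainingArguments(
          output_dir="./dpov2_output",
          per_device_train_batch_size=1,
          remove_unused_columns=False,  # keep the preference columns
      )

      trainer = DPOv2Trainer(
          model=model,
          ref_model=ref_model,
          beta=0.1,
          loss_type="rev_kl",
          args=training_args,
          train_dataset=train_dataset,
          tokenizer=tokenizer,
          max_length=512,
          max_prompt_length=128,
          len_penalty=0,
      )
      trainer.train()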