lmflow.pipeline.dpov2_aligner#
Attributes#
Classes#
- DPOv2Aligner
- MemorySafeDPOv2Aligner
Module Contents#
- class lmflow.pipeline.dpov2_aligner.DPOv2Aligner(model_args: lmflow.args.ModelArguments, data_args: lmflow.args.DatasetArguments, aligner_args: lmflow.args.DPOv2AlignerArguments, ref_model_args: lmflow.args.ModelArguments)[source]#
Bases: lmflow.pipeline.base_aligner.BaseAligner
- align(model: lmflow.models.hf_decoder_model.HFDecoderModel, ref_model: lmflow.models.hf_decoder_model.HFDecoderModel, train_dataset: lmflow.datasets.dataset.Dataset, eval_dataset: lmflow.datasets.dataset.Dataset, transform_dataset_in_place: bool = True)[source]#
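A minimal usage sketch for align(), assuming the standard LMFlow argument-parsing flow with transformers.HfArgumentParser; reusing the policy model's arguments for the reference model and the same data arguments for both datasets are simplifications for illustration, not the required setup.

```python
from transformers import HfArgumentParser

from lmflow.args import DatasetArguments, DPOv2AlignerArguments, ModelArguments
from lmflow.datasets.dataset import Dataset
from lmflow.models.hf_decoder_model import HFDecoderModel
from lmflow.pipeline.dpov2_aligner import DPOv2Aligner

# Parse the three argument dataclasses from the command line.
parser = HfArgumentParser((ModelArguments, DatasetArguments, DPOv2AlignerArguments))
model_args, data_args, aligner_args = parser.parse_args_into_dataclasses()

# Reuse the policy checkpoint for the reference model (an assumption for
# brevity; a separate checkpoint can be supplied instead).
ref_model_args = ModelArguments(model_name_or_path=model_args.model_name_or_path)

aligner = DPOv2Aligner(model_args, data_args, aligner_args, ref_model_args)
aligner.align(
    model=HFDecoderModel(model_args),
    ref_model=HFDecoderModel(ref_model_args),
    train_dataset=Dataset(data_args),
    eval_dataset=Dataset(data_args),
)
```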
- __prepare_training_args(args: lmflow.args.DPOv2AlignerArguments) transformers.TrainingArguments [source]#
- convert_to_paired_dataset(source_dataset: lmflow.datasets.dataset.Dataset, sampling_paired_method: str = 'random', length_penalty: float = 0.0, margin_scale: float = 1.0, use_fast: bool = False) lmflow.datasets.dataset.Dataset [source]#
Convert a scored one-to-multiple dataset (text_to_scored_textlist) into a paired dataset by rejection sampling.
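A rough sketch of this conversion over plain dict instances rather than lmflow.datasets.dataset.Dataset; the field names ("input", "output", "text", "score") and the paired output layout are assumptions about the dataset schema, and only the max_min pairing strategy is shown.

```python
from typing import Dict, List


def to_paired_instances(
    scored_instances: List[Dict],
    length_penalty: float = 0.0,
    margin_scale: float = 1.0,
) -> List[Dict]:
    # One chosen/rejected pair per scored instance, using the max_min
    # strategy (best vs. worst) for simplicity.
    paired = []
    for instance in scored_instances:
        outputs = instance["output"]
        rewards = [o["score"] - length_penalty * len(o["text"]) for o in outputs]
        chosen = max(range(len(rewards)), key=rewards.__getitem__)
        rejected = min(range(len(rewards)), key=rewards.__getitem__)
        paired.append({
            "prompt": instance["input"],
            "chosen": outputs[chosen]["text"],
            "rejected": outputs[rejected]["text"],
            "margin": (rewards[chosen] - rewards[rejected]) * margin_scale,
        })
    return paired
```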
- _calc_reward_with_length_penalty(rewards: List[float], lengths: List[int], length_penalty: float) List[float] [source]#
When length_penalty > 0, longer sequences are penalized by subtracting length_penalty * length from the reward; when length_penalty < 0, longer sequences are favored instead.
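The adjustment described above reduces to a single elementwise formula; a standalone sketch (not the class method itself):

```python
from typing import List


def calc_reward_with_length_penalty(
    rewards: List[float], lengths: List[int], length_penalty: float
) -> List[float]:
    # A positive penalty discounts longer responses; a negative one favors them.
    return [r - length_penalty * l for r, l in zip(rewards, lengths)]
```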
- sampling_paired_idx_from_rewards(rewards: List[float], sampling_paired_method: str = 'random', use_fast: bool = False) Tuple[int, int] [source]#
Prepare the dataset for DPO training by rejection sampling. We implement different strategies to select pairs:
- random: randomly select two instances
- max_min: best vs. worst
- max_max: best vs. second best
- max_random: best vs. a random instance from the remaining
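A standalone sketch of the four strategies listed above, assuming the returned tuple is (chosen_idx, rejected_idx) and omitting the use_fast path:

```python
import random
from typing import List, Tuple


def sampling_paired_idx_from_rewards(
    rewards: List[float], sampling_paired_method: str = "random"
) -> Tuple[int, int]:
    # Indices sorted by reward, best first.
    by_reward = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    if sampling_paired_method == "random":
        chosen, rejected = random.sample(range(len(rewards)), 2)
        # Keep the higher-reward index on the "chosen" side.
        if rewards[rejected] > rewards[chosen]:
            chosen, rejected = rejected, chosen
    elif sampling_paired_method == "max_min":
        chosen, rejected = by_reward[0], by_reward[-1]
    elif sampling_paired_method == "max_max":
        chosen, rejected = by_reward[0], by_reward[1]
    elif sampling_paired_method == "max_random":
        chosen, rejected = by_reward[0], random.choice(by_reward[1:])
    else:
        raise ValueError(f"Unknown sampling_paired_method: {sampling_paired_method}")
    return chosen, rejected
```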
- class lmflow.pipeline.dpov2_aligner.MemorySafeDPOv2Aligner(model_args: lmflow.args.ModelArguments, data_args: lmflow.args.DatasetArguments, aligner_args: lmflow.args.DPOv2AlignerArguments, ref_model_args: lmflow.args.ModelArguments)[source]#