lmflow.pipeline.dpov2_aligner#

Attributes#

lmflow.pipeline.dpov2_aligner.logger

lmflow.pipeline.dpov2_aligner.ReferenceModelArguments

Classes#

lmflow.pipeline.dpov2_aligner.DPOv2Aligner

lmflow.pipeline.dpov2_aligner.MemorySafeDPOv2Aligner

Module Contents#

lmflow.pipeline.dpov2_aligner.logger[source]#
lmflow.pipeline.dpov2_aligner.ReferenceModelArguments[source]#
class lmflow.pipeline.dpov2_aligner.DPOv2Aligner(model_args: lmflow.args.ModelArguments, data_args: lmflow.args.DatasetArguments, aligner_args: lmflow.args.DPOv2AlignerArguments, ref_model_args: lmflow.args.ModelArguments)[source]#

Bases: lmflow.pipeline.base_aligner.BaseAligner

model_args[source]#
ref_model_args[source]#
data_args[source]#
aligner_args[source]#
align(model: lmflow.models.hf_decoder_model.HFDecoderModel, ref_model: lmflow.models.hf_decoder_model.HFDecoderModel, train_dataset: lmflow.datasets.dataset.Dataset, eval_dataset: lmflow.datasets.dataset.Dataset, transform_dataset_in_place: bool = True)[source]#
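
A schematic usage sketch built from the constructor and align() signatures documented on this page; the dataclass field names and paths (model_name_or_path, dataset_path, output_dir) are illustrative assumptions about lmflow.args, not part of this reference.

```python
from lmflow.args import DatasetArguments, DPOv2AlignerArguments, ModelArguments
from lmflow.datasets.dataset import Dataset
from lmflow.models.hf_decoder_model import HFDecoderModel
from lmflow.pipeline.dpov2_aligner import DPOv2Aligner

# Field names below are assumptions; in practice these dataclasses are usually
# populated from the command line (e.g. via HfArgumentParser).
model_args = ModelArguments(model_name_or_path="path/to/policy_model")
ref_model_args = ModelArguments(model_name_or_path="path/to/reference_model")
data_args = DatasetArguments(dataset_path="path/to/scored_dataset")
aligner_args = DPOv2AlignerArguments(output_dir="output/dpov2")

aligner = DPOv2Aligner(
    model_args=model_args,
    data_args=data_args,
    aligner_args=aligner_args,
    ref_model_args=ref_model_args,
)
aligner.align(
    model=HFDecoderModel(model_args),
    ref_model=HFDecoderModel(ref_model_args),
    train_dataset=Dataset(data_args),
    eval_dataset=Dataset(data_args),
)
```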
__prepare_training_args(args: lmflow.args.DPOv2AlignerArguments) → transformers.TrainingArguments[source]#
convert_to_paired_dataset(source_dataset: lmflow.datasets.dataset.Dataset, sampling_paired_method: str = 'random', length_penalty: float = 0.0, margin_scale: float = 1.0, use_fast: bool = False) → lmflow.datasets.dataset.Dataset[source]#

Convert a scored one-to-multiple dataset (text_to_scored_textlist) into a paired dataset by rejection sampling.
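
A plain-Python sketch of the transformation this method performs, assuming a simplified, hypothetical record layout (a prompt plus a list of scored responses) rather than lmflow's actual Dataset backends.

```python
from typing import Dict, List


def to_paired_records(records: List[Dict], margin_scale: float = 1.0) -> List[Dict]:
    """Turn {"prompt", "responses": [{"text", "score"}, ...]} records into
    (chosen, rejected) pairs by keeping the best- and worst-scored responses."""
    paired = []
    for record in records:
        responses = record["responses"]
        if len(responses) < 2:
            continue  # need at least two scored responses to form a pair
        ranked = sorted(responses, key=lambda r: r["score"], reverse=True)
        chosen, rejected = ranked[0], ranked[-1]  # "max_min"-style pairing
        paired.append({
            "prompt": record["prompt"],
            "chosen": chosen["text"],
            "rejected": rejected["text"],
            # reward gap, scaled like the margin_scale parameter above
            "margin": (chosen["score"] - rejected["score"]) * margin_scale,
        })
    return paired
```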

_calc_response_lengths(outputs: List[str | Dict[str, str]], dataset_type: str) → List[int][source]#
_calc_reward_with_length_penalty(rewards: List[float], lengths: List[int], length_penalty: float) → List[float][source]#

When length_penalty > 0, longer responses are penalized by subtracting length_penalty * length from the reward; when length_penalty < 0, longer responses are favored instead.
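
A minimal sketch of the adjustment described above, assuming the penalty is applied to every response's reward so that longer responses end up relatively penalized (or favored, for a negative penalty).

```python
from typing import List


def apply_length_penalty(rewards: List[float], lengths: List[int], length_penalty: float) -> List[float]:
    # Subtract length_penalty * length from every reward:
    #   length_penalty > 0 -> longer responses are penalized relative to shorter ones
    #   length_penalty < 0 -> longer responses are favored
    return [r - length_penalty * n for r, n in zip(rewards, lengths)]
```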

sampling_paired_idx_from_rewards(rewards: List[float], sampling_paired_method: str = 'random', use_fast: bool = False) → Tuple[int, int][source]#

Prepare the dataset for DPO training by rejection sampling. Several strategies are implemented for selecting a pair (see the sketch below):

random: randomly select two instances

max_min: best vs. worst

max_max: best vs. second best

max_random: best vs. a random instance from the remaining
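
A self-contained sketch of the four pairing strategies named above, returning a (chosen_idx, rejected_idx) pair; this is illustrative, not lmflow's exact implementation.

```python
import random
from typing import List, Tuple


def sample_pair(rewards: List[float], method: str = "random") -> Tuple[int, int]:
    """Return a (chosen_idx, rejected_idx) pair under the named strategy."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    if method == "random":
        i, j = random.sample(range(len(rewards)), 2)
        return (i, j) if rewards[i] >= rewards[j] else (j, i)  # higher reward is "chosen"
    if method == "max_min":
        return order[0], order[-1]                  # best vs. worst
    if method == "max_max":
        return order[0], order[1]                   # best vs. second best
    if method == "max_random":
        return order[0], random.choice(order[1:])   # best vs. a random remaining instance
    raise ValueError(f"Unknown sampling_paired_method: {method}")
```

For example, sample_pair([0.1, 0.9, 0.5], method="max_min") returns (1, 0): index 1 has the highest reward and index 0 the lowest.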

_sampling_paired_idx_from_rewards(rewards: List[float], sampling_paired_method: str = 'random') → Tuple[int, int][source]#
_sampling_paired_idx_from_rewards_fast(rewards: List[float], sampling_paired_method: str = 'random') → Tuple[int, int][source]#
class lmflow.pipeline.dpov2_aligner.MemorySafeDPOv2Aligner(model_args: lmflow.args.ModelArguments, data_args: lmflow.args.DatasetArguments, aligner_args: lmflow.args.DPOv2AlignerArguments, ref_model_args: lmflow.args.ModelArguments)[source]#
model_args[source]#
ref_model_args[source]#
data_args[source]#
aligner_args[source]#
aligner_file_path[source]#
align()[source]#