lmflow.pipeline.utils.dpov2_dataprocessor#

Attributes#

logger

Classes#

PreferenceDataCollatorWithPadding

Module Contents#

lmflow.pipeline.utils.dpov2_dataprocessor.logger[source]#
class lmflow.pipeline.utils.dpov2_dataprocessor.PreferenceDataCollatorWithPadding[source]#
tokenizer: transformers.PreTrainedTokenizerBase[source]#
model: transformers.PreTrainedModel | None = None[source]#
padding: bool | str = True[source]#
max_length: int | None = None[source]#
max_prompt_length: int | None = None[source]#
label_pad_token_id: int = -100[source]#
padding_value: int = 0[source]#
truncation_mode: str = 'keep_end'[source]#
is_encoder_decoder: bool | None = False[source]#
max_target_length: int | None = None[source]#
mask_prompt: bool | None = False[source]#
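
The snippet below is a minimal sketch of how this collator might be constructed, assuming the class accepts its fields as keyword arguments (it is declared dataclass-style above). The "gpt2" checkpoint and the non-default length limits are illustrative assumptions, not values taken from this module:

    # Hedged sketch: constructing the collator with illustrative settings.
    # The "gpt2" checkpoint and the length limits below are assumptions.
    from transformers import AutoTokenizer

    from lmflow.pipeline.utils.dpov2_dataprocessor import PreferenceDataCollatorWithPadding

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

    collator = PreferenceDataCollatorWithPadding(
        tokenizer=tokenizer,
        max_length=512,             # budget for prompt + response tokens (assumed)
        max_prompt_length=128,      # budget for the prompt alone (assumed)
        label_pad_token_id=-100,    # class default, shown for clarity
        padding_value=0,            # class default, shown for clarity
        truncation_mode="keep_end",
        is_encoder_decoder=False,
        mask_prompt=False,
    )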
tokenize_batch_element(prompt: str, chosen: str, rejected: str) → Dict[source]#

Tokenize a single batch element.

At this stage, we don’t convert to PyTorch tensors yet; we just handle the truncation in case the prompt + chosen or prompt + rejected responses is/are too long. First we truncate the prompt; if we’re still too long, we truncate the chosen/rejected.

We also create the labels for the chosen/rejected responses, which are of length equal to the sum of the length of the prompt and the chosen/rejected response, with label_pad_token_id for the prompt tokens.
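
The following is a hedged illustration of the truncation and label-masking rules described in this docstring for a single (prompt, response) pair. It is not the module's implementation; the helper name build_labels and the exact truncation details are assumptions made for illustration:

    # Hedged illustration of the docstring's rules; not the library's code.
    # All names here are hypothetical.
    def build_labels(prompt_ids, response_ids, max_prompt_length, max_length,
                     label_pad_token_id=-100, truncation_mode="keep_end"):
        # 1) If prompt + response is too long, truncate the prompt first.
        if len(prompt_ids) + len(response_ids) > max_length:
            if truncation_mode == "keep_end":
                prompt_ids = prompt_ids[-max_prompt_length:]  # keep the end of the prompt
            else:
                prompt_ids = prompt_ids[:max_prompt_length]   # keep the start of the prompt
        # 2) If still too long, truncate the response.
        if len(prompt_ids) + len(response_ids) > max_length:
            response_ids = response_ids[: max_length - len(prompt_ids)]
        # 3) Labels span prompt + response; prompt positions get label_pad_token_id.
        input_ids = list(prompt_ids) + list(response_ids)
        labels = [label_pad_token_id] * len(prompt_ids) + list(response_ids)
        return input_ids, labels

This mirrors the two-stage truncation order (prompt first, then response) and the prompt masking described above.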

collate(batch)[source]#
__call__(features: List[Dict[str, Any]]) → Dict[str, Any][source]#
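
A hedged usage sketch, using the collator constructed earlier: based on the method layout above (tokenize_batch_element followed by collate), __call__ presumably accepts raw prompt/chosen/rejected strings per example and returns a padded batch. The feature keys and the structure of the returned dict are assumptions, not a documented contract:

    # Hedged usage sketch; keys and return structure are assumed from the
    # tokenize_batch_element signature.
    features = [
        {
            "prompt": "What is the capital of France?",
            "chosen": " The capital of France is Paris.",
            "rejected": " I am not sure.",
        },
    ]
    batch = collator(features)   # tokenizes each element, then pads and collates
    print(sorted(batch.keys()))  # inspect the collated fields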