lmflow.tokenization.hf_text_regression_model#

Attributes#

Functions#

blocking_paired(→ Dict)

blocking(→ Dict)

blocking_text_to_textlist(→ Dict)

paired_conversation_tokenize_function(→ Dict)

conversation_tokenize_function(→ Dict)

Handles tokenization of conversation datasets

tokenize_function(→ Dict)

Handles tokenization of text_only and text2text datasets

text_to_textlist_tokenize_function(→ Dict)

For reward model (rm) inference; attention mask and labels are not needed.

Module Contents#

lmflow.tokenization.hf_text_regression_model.logger[source]#
lmflow.tokenization.hf_text_regression_model.tok_logger[source]#
lmflow.tokenization.hf_text_regression_model.blocking_paired(token_dict: Dict, column_names: List, block_size: int, model_max_length: int, pad_token_id: int, padding_side: str, truncation_side: str = 'right') Dict[source]#
lmflow.tokenization.hf_text_regression_model.blocking(token_dict: Dict, block_size: int, model_max_length: int, pad_token_id: int, padding_side: str, truncation_side: str = 'right') Dict[source]#
lmflow.tokenization.hf_text_regression_model.blocking_text_to_textlist(token_dict: Dict, block_size: int, model_max_length: int, pad_token_id: int, padding_side: str, truncation_side: str = 'right') Dict[source]#
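The blocking functions above pad or truncate already-tokenized sequences to a fixed block size. A minimal self-contained sketch of the technique for the single-sequence case, assuming the same parameter semantics as the signatures above (the function body is illustrative, not LMFlow's actual implementation):

```python
from typing import Dict, List

def blocking(
    token_dict: Dict[str, List[List[int]]],
    block_size: int,
    model_max_length: int,
    pad_token_id: int,
    padding_side: str = "right",
    truncation_side: str = "right",
) -> Dict[str, List[List[int]]]:
    """Pad or truncate each input_ids sequence to a fixed block size."""
    max_length = min(block_size, model_max_length)
    out: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for ids in token_dict["input_ids"]:
        if len(ids) > max_length:
            # Truncate from the configured side.
            ids = ids[:max_length] if truncation_side == "right" else ids[-max_length:]
        pad = [pad_token_id] * (max_length - len(ids))
        mask = [1] * len(ids)
        if padding_side == "right":
            out["input_ids"].append(ids + pad)
            out["attention_mask"].append(mask + [0] * len(pad))
        else:
            out["input_ids"].append(pad + ids)
            out["attention_mask"].append([0] * len(pad) + mask)
    return out
```

`blocking_paired` and `blocking_text_to_textlist` apply the same pad-or-truncate logic to the paired and text-to-textlist layouts, respectively.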
lmflow.tokenization.hf_text_regression_model.paired_conversation_tokenize_function(examples, data_args: lmflow.args.DatasetArguments, tokenizer: transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast, column_names, conversation_template: lmflow.utils.conversation_template.ConversationTemplate) Dict[source]#
lmflow.tokenization.hf_text_regression_model.conversation_tokenize_function(examples, data_args: lmflow.args.DatasetArguments, tokenizer: transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast, column_names, conversation_template: lmflow.utils.conversation_template.ConversationTemplate) Dict[source]#

Handles tokenization of conversation datasets

lmflow.tokenization.hf_text_regression_model.tokenize_function(examples, data_args: lmflow.args.DatasetArguments, tokenizer: transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast, column_names, label_columns, tokenized_column_order, add_special_tokens, use_truncation) Dict[source]#

Handles tokenization of text_only and text2text datasets
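For text2text data, a common convention is to concatenate the tokenized source and target and mask the source positions in labels with -100 so the loss is computed only on the target tokens. A hedged sketch of that convention (the toy whitespace tokenizer and column names `input`/`output` are hypothetical stand-ins, not LMFlow's API):

```python
from typing import Dict, List

VOCAB: Dict[str, int] = {}

def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: assigns one integer id per whitespace-separated word."""
    return [VOCAB.setdefault(w, len(VOCAB) + 1) for w in text.split()]

def tokenize_text2text(
    examples: Dict[str, List[str]],
    input_col: str = "input",
    output_col: str = "output",
) -> Dict[str, List[List[int]]]:
    """Concatenate source and target token ids; mask source positions in labels."""
    result: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for src, tgt in zip(examples[input_col], examples[output_col]):
        src_ids, tgt_ids = toy_tokenize(src), toy_tokenize(tgt)
        ids = src_ids + tgt_ids
        result["input_ids"].append(ids)
        result["attention_mask"].append([1] * len(ids))
        # -100 is ignored by the cross-entropy loss, so only target tokens train.
        result["labels"].append([-100] * len(src_ids) + tgt_ids)
    return result
```

For text_only data the labels would instead mirror input_ids, since every token contributes to the loss.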

lmflow.tokenization.hf_text_regression_model.text_to_textlist_tokenize_function(examples, data_args: lmflow.args.DatasetArguments, tokenizer: transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast, column_names, add_special_tokens, use_truncation) Dict[source]#

For reward model (rm) inference; attention mask and labels are not needed. NOTE: input_ids here refers to the tokenized input_ids of both the input and the output.
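The text-to-textlist case pairs one input with a list of candidate outputs and, per the note above, keeps only input_ids. A minimal sketch under those assumptions (the toy tokenizer and column names are hypothetical, for illustration only):

```python
from typing import Dict, List

def tokenize_text_to_textlist(
    examples: Dict[str, List],
    input_col: str = "input",
    output_col: str = "output",
) -> Dict[str, List]:
    """Tokenize input + each candidate output; reward-model inference
    needs no attention mask or labels, so only input_ids are returned."""
    def toy_tokenize(text: str) -> List[int]:
        # Stand-in tokenizer: one id (the word length) per word.
        return [len(w) for w in text.split()]

    result: Dict[str, List] = {"input_ids": []}
    for src, candidates in zip(examples[input_col], examples[output_col]):
        # One list of token ids per candidate: tokenized input followed by output.
        result["input_ids"].append(
            [toy_tokenize(src) + toy_tokenize(c) for c in candidates]
        )
    return result
```

A reward model then scores each candidate sequence, so the per-example structure (one entry per candidate) is preserved.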