lmflow.utils.conversation_template.base#

Attributes#

Classes#

TemplateComponent

The minimal unit of a template, which can be a token, a string, or a list of tools.

Formatter

Helper class that provides a standard way to create an ABC using inheritance.

EmptyFormatter

Helper class that provides a standard way to create an ABC using inheritance.

StringFormatter

Helper class that provides a standard way to create an ABC using inheritance.

ConversationTemplate

ConversationTemplateForTool

Module Contents#

lmflow.utils.conversation_template.base.logger[source]#
class lmflow.utils.conversation_template.base.TemplateComponent[source]#

The minimal unit of a template, which can be a token, a string, or a list of tools.

Parameters:
type : Literal['token', 'token_id', 'string', 'tools']

  • Type of the component.

  • When the component is a token or a string, the content should be a string. The difference between the two is that a token is converted to token ids by the tokenizer.convert_tokens_to_ids() method, while a string is encoded directly by the tokenizer.encode() method. In particular, since the bos and eos tokens are used frequently across different templates, the special contents 'bos_token' and 'eos_token' can be used to represent the actual bos and eos tokens when the type of the TemplateComponent is token. For example:

    ```python
    TemplateComponent(type='token', content='bos_token')
    ```

    After encoding, the content is replaced by the actual token id of the bos token. Do remember that if you set the type to string, the tokenizer will try to encode the literal string 'bos_token' instead of producing the actual bos token. (A standalone sketch of the three component types follows this class entry.)

  • When the component is a token_id, the content should be an int or List[int], and is appended directly to the encoded token ids.

  • Tools are not supported yet.

content : Union[str, int, List[str], List[int]]

  Content of the component.

type: Literal['token', 'token_id', 'string', 'tools'][source]#
content: str | int | List[str] | List[int][source]#
mask: bool | None = True[source]#
__post_init__()[source]#
__repr__() str[source]#
__str__() str[source]#
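For illustration, a minimal sketch (not part of the module) of how the three supported component types behave; the "gpt2" model name is a placeholder and any Hugging Face tokenizer would do:

```python
from transformers import AutoTokenizer

from lmflow.utils.conversation_template.base import TemplateComponent

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model only

# type='token' with content='bos_token' resolves to the actual bos token id
# when the template is encoded.
bos = TemplateComponent(type='token', content='bos_token')

# type='string' is encoded with tokenizer.encode(); the literal text 'bos_token'
# would be tokenized as ordinary characters, not as the bos token.
literal = TemplateComponent(type='string', content='bos_token')

# type='token_id' is appended to the encoded ids as-is.
raw_id = TemplateComponent(type='token_id', content=tokenizer.eos_token_id)
```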
class lmflow.utils.conversation_template.base.Formatter[source]#

Bases: abc.ABC

Helper class that provides a standard way to create an ABC using inheritance.

template: List[TemplateComponent] = [][source]#
abstract format(**kwargs) List[TemplateComponent][source]#
has_placeholder()[source]#
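To make the contract concrete, here is a hypothetical subclass (a sketch, not part of LMFlow): it only has to implement format() and return a list of TemplateComponent objects. The dataclass decorator mirrors the built-in formatters, which define __post_init__:

```python
from dataclasses import dataclass, field
from typing import List

from lmflow.utils.conversation_template.base import Formatter, TemplateComponent


@dataclass
class UppercaseFormatter(Formatter):
    """Hypothetical formatter that upper-cases every string component."""
    template: List[TemplateComponent] = field(default_factory=list)

    def format(self, **kwargs) -> List[TemplateComponent]:
        formatted = []
        for component in self.template:
            if component.type == 'string':
                formatted.append(
                    TemplateComponent(type='string', content=component.content.upper())
                )
            else:
                # Tokens and token ids are passed through untouched.
                formatted.append(component)
        return formatted
```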
class lmflow.utils.conversation_template.base.EmptyFormatter[source]#

Bases: Formatter

Helper class that provides a standard way to create an ABC using inheritance.

__post_init__()[source]#
format(**kwargs) list[source]#

Empty formatter for when no formatting is needed. This is useful when the user has already applied formatting to the dataset.

Returns:
list

Original template.
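A small sketch of the pass-through behavior described above; the component content is illustrative:

```python
from lmflow.utils.conversation_template.base import EmptyFormatter, TemplateComponent

# The dataset text is assumed to be formatted already, so the formatter simply
# returns its template unchanged.
formatter = EmptyFormatter(
    template=[TemplateComponent(type='string', content='Already-formatted text')]
)
assert formatter.format() == formatter.template
```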

class lmflow.utils.conversation_template.base.StringFormatter[source]#

Bases: Formatter

Helper class that provides a standard way to create an ABC using inheritance.

__post_init__()[source]#
format(**kwargs) list[source]#

Format the string components with the provided keyword arguments. Mostly used for formatting the system prompt and the user and assistant messages.

Parameters:
**kwargs : dict

Keyword arguments containing values to replace in the template components.

Returns:
list

Formatted template.
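For example, a sketch of a user-message formatter; the [INST] wrapper is illustrative, and the {{content}} placeholder convention is assumed to match LMFlow's built-in templates:

```python
from lmflow.utils.conversation_template.base import StringFormatter, TemplateComponent

user_formatter = StringFormatter(
    template=[
        TemplateComponent(type='string', content='[INST] {{content}} [/INST]'),
    ]
)

# The keyword argument fills the matching placeholder in each string component.
formatted = user_formatter.format(content='How do I bake bread?')
```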

class lmflow.utils.conversation_template.base.ConversationTemplate[source]#
user_formatter: Formatter[source]#
assistant_formatter: Formatter[source]#
function_formatter: Formatter | None = None[source]#
observation_formatter: Formatter | None = None[source]#
system_formatter: Formatter | None = None[source]#
force_system: bool = False[source]#
tools_formatter: Formatter | None = None[source]#
separator: TemplateComponent | None = None[source]#
remove_last_sep: bool = False[source]#
special_starter: TemplateComponent | None = None[source]#
special_stopper: TemplateComponent | None = None[source]#
template_name: str | None = None[source]#
system_default: str | None = None[source]#
__post_init__()[source]#
encode_conversation(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: List[str] | None = None, **kwargs) Sequence[Tuple[List[int], List[int]]][source]#

Messages here should be guaranteed to be in pairs, with the first message of each pair being the user message and the second being the assistant message. Data example:

```json
{
    "conversation_id": 2,
    "system": "sysinfo1",
    "tools": ["tool_1_desc"],
    "messages": [
        {
            "role": "user",
            "content": "hi"
        },
        {
            "role": "assistant",
            "content": "Hello!"
        }
    ]
}
```
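A usage sketch under stated assumptions: the formatters and prompt strings below are hypothetical rather than any built-in template, the {{content}} placeholder convention is assumed, and "gpt2" stands in for whichever tokenizer you actually use:

```python
from transformers import AutoTokenizer

from lmflow.utils.conversation_template.base import (
    ConversationTemplate,
    StringFormatter,
    TemplateComponent,
)

template = ConversationTemplate(
    user_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='User: {{content}}\n')]
    ),
    assistant_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='Assistant: {{content}}\n')]
    ),
    system_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='System: {{content}}\n')]
    ),
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded_pairs = template.encode_conversation(
    tokenizer=tokenizer,
    messages=[
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
    system="sysinfo1",
)
# One tuple of token-id lists per user/assistant round, matching the
# Sequence[Tuple[List[int], List[int]]] return annotation above.
```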

_encode(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: str | None = None, **kwargs) Sequence[Tuple[List[int], List[int]]][source]#
_encode_template(template: List[TemplateComponent], tokenizer: transformers.PreTrainedTokenizer, **kwargs) List[int][source]#

Encode template components into token ids.

Parameters:
template : List[TemplateComponent]

Formatted template components.

tokenizer : PreTrainedTokenizer

Tokenizer to convert tokens into token ids.

Returns:
List[int]

Encoded token ids.
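As a rough mental model only (a sketch, not the module's actual implementation), each component type maps onto a different tokenizer call; the add_special_tokens=False flag and the bos/eos resolution below are assumptions:

```python
from typing import List

from transformers import PreTrainedTokenizer

from lmflow.utils.conversation_template.base import TemplateComponent


def encode_template_sketch(
    template: List[TemplateComponent],
    tokenizer: PreTrainedTokenizer,
) -> List[int]:
    """Illustrative re-implementation sketch of the per-component dispatch."""
    encoded_ids: List[int] = []
    for component in template:
        if component.type == 'token':
            # 'bos_token' / 'eos_token' stand for the tokenizer's special tokens.
            if component.content == 'bos_token':
                encoded_ids.append(tokenizer.bos_token_id)
            elif component.content == 'eos_token':
                encoded_ids.append(tokenizer.eos_token_id)
            else:
                encoded_ids.append(tokenizer.convert_tokens_to_ids(component.content))
        elif component.type == 'string':
            # Plain text goes straight through tokenizer.encode(); disabling
            # special tokens here is an assumption, so bos/eos stay explicit.
            encoded_ids.extend(tokenizer.encode(component.content, add_special_tokens=False))
        elif component.type == 'token_id':
            # Raw ids are appended as-is, whether given as int or List[int].
            content = component.content
            encoded_ids.extend([content] if isinstance(content, int) else content)
    return encoded_ids
```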

post_process_pairs(encoded_pairs, tokenizer)[source]#
remove_last_separator(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) Sequence[Tuple[List[int], List[int]]][source]#
add_special_starter(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) Sequence[Tuple[List[int], List[int]]][source]#
add_special_stopper(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) Sequence[Tuple[List[int], List[int]]][source]#
_ensure_id_list(obj: int | List[int]) List[int][source]#

Make sure the object is a list of integers. Useful for handling token ids.

class lmflow.utils.conversation_template.base.ConversationTemplateForTool[source]#

Bases: ConversationTemplate

encode_conversation(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: List[str] | None = None, **kwargs) Sequence[Tuple[List[int], List[int]]][source]#

Messages here should be guaranteed to be in pairs, with the first message of each pair being the user message and the second being the assistant message. Data example:

```json
{
    "conversation_id": 2,
    "system": "sysinfo1",
    "tools": ["tool_1_desc"],
    "messages": [
        {
            "role": "user",
            "content": "hi"
        },
        {
            "role": "assistant",
            "content": "Hello!"
        }
    ]
}
```
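A usage sketch mirroring the example above for the tool-aware subclass; the formatters, prompt strings, and model name are hypothetical, and the tool description comes from the data example:

```python
from transformers import AutoTokenizer

from lmflow.utils.conversation_template.base import (
    ConversationTemplateForTool,
    StringFormatter,
    TemplateComponent,
)

template = ConversationTemplateForTool(
    user_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='User: {{content}}\n')]
    ),
    assistant_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='Assistant: {{content}}\n')]
    ),
    system_formatter=StringFormatter(
        template=[TemplateComponent(type='string', content='System: {{content}}\n')]
    ),
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded_pairs = template.encode_conversation(
    tokenizer=tokenizer,
    messages=[
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
    system="sysinfo1",
    tools=["tool_1_desc"],  # tool descriptions; _handle_tools() turns the list into a string
)
```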

_encode(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: str | None = None, **kwargs) Sequence[Tuple[List[int], List[int]]][source]#
_encode_template(template: List[TemplateComponent], tokenizer: transformers.PreTrainedTokenizer, **kwargs) List[int][source]#

Encode template components into token ids.

Parameters:
template : List[TemplateComponent]

Formatted template components.

tokenizer : PreTrainedTokenizer

Tokenizer to convert tokens into token ids.

Returns:
List[int]

Encoded token ids.

_handle_tools(tools: List[str] | None) str[source]#
lmflow.utils.conversation_template.base.EMPTY_TEMPLATE[source]#
lmflow.utils.conversation_template.base.EMPTY_NO_SPECIAL_TOKENS_TEMPLATE[source]#