lmflow.utils.conversation_template.base#
Attributes#
Classes#
The minimal unit of a template, which can be a token, a string, or a list of tools. |
|
Helper class that provides a standard way to create an ABC using |
|
Helper class that provides a standard way to create an ABC using |
|
Helper class that provides a standard way to create an ABC using |
|
Helper class that provides a standard way to create an ABC using |
|
Module Contents#
- class lmflow.utils.conversation_template.base.TemplateComponent[source]#
The minimal unit of a template, which can be a token, a string, or a list of tools.
- Parameters:
- typeLiteral[‘token’, ‘token_id’, ‘string’, ‘tools’]
Type of the component.
When the component is a token or a string, the content should be string.
The difference between the two is that token will be converted to token ids by the tokenizer.convert_tokens_to_ids() method, while string will be directly encoded by the tokenizer.encode() method. Specially, since the bos token and eos token are frequently used across different templates, we provide the convenience to use ‘bos_token’ and ‘eos_token’ to represent the actual bos and eos tokens when type of the TemplateComponent is token. For example:
`python TemplateComponent(type='token', content='bos_token') `
After encoding, the content will be replaced by the actual token id of the bos token. Please do remember that if you set the type to string, the tokenizer will try to encode the string ‘bos_token’ instead of providing the actual bos token.
When the component is token_id, the content should be int or List[int], and
will be directly appended to the encoded token ids.
Tools are not supported yet.
- contentUnion[str, int, List[str], List[int]]
Content of the component.
- class lmflow.utils.conversation_template.base.Formatter[source]#
Bases:
abc.ABC
Helper class that provides a standard way to create an ABC using inheritance.
- template: List[TemplateComponent][source]#
- abstract format(**kwargs) List[TemplateComponent] [source]#
- class lmflow.utils.conversation_template.base.EmptyFormatter[source]#
Bases:
Formatter
Helper class that provides a standard way to create an ABC using inheritance.
- class lmflow.utils.conversation_template.base.StringFormatter[source]#
Bases:
Formatter
Helper class that provides a standard way to create an ABC using inheritance.
- format(**kwargs) list [source]#
Format the string components with the provided keyword arguments. Mostly used for formatting system prompt, user and assistant messages.
- Parameters:
- **kwargsdict
Keyword arguments containing values to replace in the template components.
- Returns:
- list
Formatted template.
- class lmflow.utils.conversation_template.base.ListFormatter[source]#
Bases:
Formatter
Helper class that provides a standard way to create an ABC using inheritance.
- class lmflow.utils.conversation_template.base.ConversationTemplate[source]#
-
- separator: TemplateComponent | None = None[source]#
- special_starter: TemplateComponent | None = None[source]#
- special_stopper: TemplateComponent | None = None[source]#
- encode_conversation(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: List[str] | None = None, remove_last_sep: bool = False, **kwargs) Sequence[Tuple[List[int], List[int]]] [source]#
Messages here should be guaranteed to be in pairs, with the first message being the user message and the second message being the system message. Data example: ```json {
“conversation_id”: 2, “system”: “sysinfo1”, “tools”: [“tool_1_desc”], “messages”: [
- {
“role”: “user”, “content”: “hi”
}, {
“role”: “assistant”, “content”: “Hello!”
}
]
}#
- _encode(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: str | None = None, **kwargs) Sequence[Tuple[List[int], List[int]]] [source]#
- _encode_template(template: List[TemplateComponent], tokenizer: transformers.PreTrainedTokenizer, **kwargs) List[int] [source]#
Encode template components into token ids.
- Parameters:
- templateList[TemplateComponent]
Formatted template components.
- tokenizerPreTrainedTokenizer
Tokenizer to convert tokens into token ids.
- Returns:
- List[int]
Encoded token ids.
- remove_last_separator(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) Sequence[Tuple[List[int], List[int]]] [source]#
- add_special_starter(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) Sequence[Tuple[List[int], List[int]]] [source]#
- class lmflow.utils.conversation_template.base.ConversationTemplateForTool[source]#
Bases:
ConversationTemplate
- encode_conversation(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: List[str] | None = None, remove_last_sep: bool = False, **kwargs) Sequence[Tuple[List[int], List[int]]] [source]#
Messages here should be guaranteed to be in pairs, with the first message being the user message and the second message being the system message. Data example: ```json {
“conversation_id”: 2, “system”: “sysinfo1”, “tools”: [“tool_1_desc”], “messages”: [
- {
“role”: “user”, “content”: “hi”
}, {
“role”: “assistant”, “content”: “Hello!”
}
]
}#
- _encode(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: str | None = None, tools: str | None = None, **kwargs) Sequence[Tuple[List[int], List[int]]] [source]#
- _encode_template(template: List[TemplateComponent], tokenizer: transformers.PreTrainedTokenizer, **kwargs) List[int] [source]#
Encode template components into token ids.
- Parameters:
- templateList[TemplateComponent]
Formatted template components.
- tokenizerPreTrainedTokenizer
Tokenizer to convert tokens into token ids.
- Returns:
- List[int]
Encoded token ids.