lmflow.utils.conversation_template.base
=======================================

.. py:module:: lmflow.utils.conversation_template.base


Attributes
----------

.. autoapisummary::

   lmflow.utils.conversation_template.base.logger
   lmflow.utils.conversation_template.base.EMPTY_TEMPLATE
   lmflow.utils.conversation_template.base.EMPTY_NO_SPECIAL_TOKENS_TEMPLATE


Classes
-------

.. autoapisummary::

   lmflow.utils.conversation_template.base.TemplateComponent
   lmflow.utils.conversation_template.base.Formatter
   lmflow.utils.conversation_template.base.EmptyFormatter
   lmflow.utils.conversation_template.base.StringFormatter
   lmflow.utils.conversation_template.base.ListFormatter
   lmflow.utils.conversation_template.base.ConversationTemplate
   lmflow.utils.conversation_template.base.ConversationTemplateForTool


Module Contents
---------------

.. py:data:: logger

.. py:class:: TemplateComponent

   The minimal unit of a template, which can be a token, a string, or a list of tools.

   :Parameters:

       **type** : Literal['token', 'token_id', 'string', 'tools']
           Type of the component.

           - When the component is a token or a string, the content should be a
             string. The difference between the two is that a token is converted
             to token ids by the ``tokenizer.convert_tokens_to_ids()`` method,
             while a string is encoded directly by the ``tokenizer.encode()``
             method. Since the bos and eos tokens are used frequently across
             different templates, ``'bos_token'`` and ``'eos_token'`` may be used
             as a convenience to represent the actual bos and eos tokens when the
             ``type`` of the ``TemplateComponent`` is ``token``. For example:

             ```python
             TemplateComponent(type='token', content='bos_token')
             ```

             After encoding, the content is replaced by the actual token id of
             the bos token. Note that if you set the ``type`` to ``string``, the
             tokenizer will try to encode the literal string ``'bos_token'``
             instead of providing the actual bos token.
           - When the component is ``token_id``, the content should be ``int`` or
             ``List[int]``, and is appended directly to the encoded token ids.
           - Tools are not supported yet.

       **content** : Union[str, int, List[str], List[int]]
           Content of the component.

   .. !! processed by numpydoc !!

   .. py:attribute:: type
      :type: Literal['token', 'token_id', 'string', 'tools']

   .. py:attribute:: content
      :type: Union[str, int, List[str], List[int]]

   .. py:attribute:: mask
      :type: Optional[bool]
      :value: True

   .. py:method:: __post_init__()

   .. py:method:: __repr__() -> str

   .. py:method:: __str__() -> str


.. py:class:: Formatter

   Bases: :py:obj:`abc.ABC`

   Abstract base class for formatters that turn a list of template components
   into a formatted template.

   .. !! processed by numpydoc !!

   .. py:attribute:: template
      :type: List[TemplateComponent]

   .. py:method:: format(**kwargs) -> List[TemplateComponent]
      :abstractmethod:

   .. py:method:: has_placeholder()


.. py:class:: EmptyFormatter

   Bases: :py:obj:`Formatter`

   Formatter that performs no formatting and returns the template unchanged.

   .. !! processed by numpydoc !!

   .. py:method:: __post_init__()

   .. py:method:: format(**kwargs) -> list

      Empty formatter for when no formatting is needed. This is useful when the
      user has already applied formatting to the dataset.

      :Returns:

          list
              Original template.

      .. !! processed by numpydoc !!


.. py:class:: StringFormatter

   Bases: :py:obj:`Formatter`

   Formatter that substitutes keyword arguments into the string components of
   the template.

   .. !! processed by numpydoc !!

   .. py:method:: __post_init__()

   .. py:method:: format(**kwargs) -> list

      Format the string components with the provided keyword arguments. Mostly
      used for formatting the system prompt and the user and assistant messages.

      :Parameters:

          **\*\*kwargs** : dict
              Keyword arguments containing values to replace in the template
              components.

      :Returns:

          list
              Formatted template.

      .. !! processed by numpydoc !!
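As a rough illustration of the kind of substitution ``StringFormatter.format()`` performs, here is a minimal, self-contained sketch. The ``Component`` dataclass, the ``format_components`` helper, and the ``{{...}}`` placeholder syntax are assumptions for illustration, not lmflow's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Component:
    type: str      # 'token', 'token_id', or 'string'
    content: str

def format_components(template: List[Component], **kwargs) -> List[Component]:
    # Replace '{{placeholder}}' markers in string components with the provided
    # keyword arguments; token components pass through unchanged.
    out = []
    for comp in template:
        if comp.type == "string":
            content = comp.content
            for key, value in kwargs.items():
                content = content.replace("{{" + key + "}}", value)
            out.append(Component(comp.type, content))
        else:
            out.append(comp)
    return out

template = [
    Component("token", "bos_token"),
    Component("string", "User: {{content}}\n"),
]
formatted = format_components(template, content="hi")
print(formatted[1].content)  # User: hi
```

Note that the ``token`` component is left untouched: resolving ``'bos_token'`` to an actual token id is the tokenizer's job at encoding time, not the formatter's.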
.. py:class:: ListFormatter

   Bases: :py:obj:`Formatter`

   Formatter that formats each template component in a list.

   .. !! processed by numpydoc !!

   .. py:method:: format(**kwargs) -> list


.. py:class:: ConversationTemplate

   .. py:attribute:: user_formatter
      :type: Formatter

   .. py:attribute:: assistant_formatter
      :type: Formatter

   .. py:attribute:: function_formatter
      :type: Optional[Formatter]
      :value: (None,)

   .. py:attribute:: observation_formatter
      :type: Optional[Formatter]
      :value: (None,)

   .. py:attribute:: system_formatter
      :type: Optional[Formatter]
      :value: None

   .. py:attribute:: tools_formatter
      :type: Optional[Formatter]
      :value: None

   .. py:attribute:: separator
      :type: Optional[TemplateComponent]
      :value: None

   .. py:attribute:: special_starter
      :type: Optional[TemplateComponent]
      :value: None

   .. py:attribute:: special_stopper
      :type: Optional[TemplateComponent]
      :value: None

   .. py:attribute:: template_name
      :type: Optional[str]
      :value: None

   .. py:method:: __post_init__()

   .. py:method:: encode_conversation(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: Optional[str] = None, tools: Optional[List[str]] = None, remove_last_sep: bool = False, **kwargs) -> Sequence[Tuple[List[int], List[int]]]

      Messages here should be guaranteed to be in pairs, with the first message
      being the user message and the second message being the assistant message.

      Data example:

      ```json
      {
          "conversation_id": 2,
          "system": "sysinfo1",
          "tools": ["tool_1_desc"],
          "messages": [
              {
                  "role": "user",
                  "content": "hi"
              },
              {
                  "role": "assistant",
                  "content": "Hello!"
              }
          ]
      }
      ```

      .. !! processed by numpydoc !!

   .. py:method:: _encode(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, **kwargs) -> Sequence[Tuple[List[int], List[int]]]
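The user/assistant pairing that ``encode_conversation`` expects can be sketched in plain Python. ``pair_messages`` is a hypothetical helper for illustration, not part of lmflow.

```python
from typing import Dict, List, Tuple

def pair_messages(messages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    # Messages must alternate user/assistant, user first.
    assert len(messages) % 2 == 0, "messages must come in user/assistant pairs"
    pairs = []
    for i in range(0, len(messages), 2):
        user, assistant = messages[i], messages[i + 1]
        assert user["role"] == "user" and assistant["role"] == "assistant"
        pairs.append((user["content"], assistant["content"]))
    return pairs

# Mirrors the JSON data example above.
data = {
    "conversation_id": 2,
    "system": "sysinfo1",
    "tools": ["tool_1_desc"],
    "messages": [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
}
print(pair_messages(data["messages"]))  # [('hi', 'Hello!')]
```

Each pair is then formatted and encoded separately, which is why the return type is a sequence of ``(user_token_ids, assistant_token_ids)`` tuples.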
   .. py:method:: _encode_template(template: List[TemplateComponent], tokenizer: transformers.PreTrainedTokenizer, **kwargs) -> List[int]

      Encode template components into token ids.

      :Parameters:

          **template** : List[TemplateComponent]
              Formatted template components.

          **tokenizer** : PreTrainedTokenizer
              Tokenizer to convert tokens into token ids.

      :Returns:

          List[int]
              Encoded token ids.

      .. !! processed by numpydoc !!

   .. py:method:: remove_last_separator(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) -> Sequence[Tuple[List[int], List[int]]]

   .. py:method:: add_special_starter(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) -> Sequence[Tuple[List[int], List[int]]]

   .. py:method:: add_special_stopper(encoded_pairs: Sequence[Tuple[List[int], List[int]]], tokenizer: transformers.PreTrainedTokenizer) -> Sequence[Tuple[List[int], List[int]]]

   .. py:method:: _ensure_id_list(obj: Union[int, List[int]]) -> List[int]

      Make sure the object is a list of integers. Useful for handling token ids.

      .. !! processed by numpydoc !!


.. py:class:: ConversationTemplateForTool

   Bases: :py:obj:`ConversationTemplate`

   .. py:method:: encode_conversation(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: Optional[str] = None, tools: Optional[List[str]] = None, remove_last_sep: bool = False, **kwargs) -> Sequence[Tuple[List[int], List[int]]]

      Messages here should be guaranteed to be in pairs, with the first message
      being the user message and the second message being the assistant message.

      Data example:

      ```json
      {
          "conversation_id": 2,
          "system": "sysinfo1",
          "tools": ["tool_1_desc"],
          "messages": [
              {
                  "role": "user",
                  "content": "hi"
              },
              {
                  "role": "assistant",
                  "content": "Hello!"
              }
          ]
      }
      ```

      .. !! processed by numpydoc !!
   .. py:method:: _encode(tokenizer: transformers.PreTrainedTokenizer, messages: List[Dict[str, str]], system: Optional[str] = None, tools: Optional[str] = None, **kwargs) -> Sequence[Tuple[List[int], List[int]]]

   .. py:method:: _encode_template(template: List[TemplateComponent], tokenizer: transformers.PreTrainedTokenizer, **kwargs) -> List[int]

      Encode template components into token ids.

      :Parameters:

          **template** : List[TemplateComponent]
              Formatted template components.

          **tokenizer** : PreTrainedTokenizer
              Tokenizer to convert tokens into token ids.

      :Returns:

          List[int]
              Encoded token ids.

      .. !! processed by numpydoc !!


.. py:data:: EMPTY_TEMPLATE

.. py:data:: EMPTY_NO_SPECIAL_TOKENS_TEMPLATE
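Putting the per-type encoding rules together, here is a toy end-to-end sketch of what a method like ``_encode_template`` does with each component type. The ``ToyTokenizer``, its four-word vocabulary, and the helper names are hypothetical stand-ins for a real ``transformers`` tokenizer and for lmflow's internals.

```python
from typing import List, Tuple, Union

class ToyTokenizer:
    # Stand-in for transformers.PreTrainedTokenizer with a tiny fixed vocab.
    vocab = {"<s>": 1, "</s>": 2, "hi": 3, "User:": 4}
    bos_token = "<s>"
    eos_token = "</s>"

    def convert_tokens_to_ids(self, token: str) -> int:
        return self.vocab[token]

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [self.vocab[word] for word in text.split()]

def ensure_id_list(obj: Union[int, List[int]]) -> List[int]:
    # Normalize an int or list of ints to a list (cf. _ensure_id_list).
    return [obj] if isinstance(obj, int) else list(obj)

def encode_template(template: List[Tuple[str, object]], tokenizer) -> List[int]:
    ids: List[int] = []
    for ctype, content in template:
        if ctype == "token":
            # 'bos_token'/'eos_token' resolve to the tokenizer's actual tokens.
            if content == "bos_token":
                content = tokenizer.bos_token
            elif content == "eos_token":
                content = tokenizer.eos_token
            ids.append(tokenizer.convert_tokens_to_ids(content))
        elif ctype == "token_id":
            # token_id content is appended directly, without tokenization.
            ids.extend(ensure_id_list(content))
        elif ctype == "string":
            # Strings go through the tokenizer's encode() method.
            ids.extend(tokenizer.encode(content, add_special_tokens=False))
    return ids

tok = ToyTokenizer()
template = [("token", "bos_token"), ("string", "User: hi"), ("token_id", 2)]
print(encode_template(template, tok))  # [1, 4, 3, 2]
```

The key distinction mirrored here is the one described under ``TemplateComponent``: ``token`` content maps through ``convert_tokens_to_ids()``, ``string`` content through ``encode()``, and ``token_id`` content bypasses the tokenizer entirely.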