lmflow.utils.data_utils
=======================

.. py:module:: lmflow.utils.data_utils

.. autoapi-nested-parse::

   This module provides utility functions for setting a random seed,
   loading data from a JSON file, batching data, previewing dataset files,
   and extracting answers from generated text.

   ..
       !! processed by numpydoc !!


Classes
-------

.. autoapisummary::

   lmflow.utils.data_utils.VLLMInferenceResultWithInput
   lmflow.utils.data_utils.RewardModelInferenceResultWithInput


Functions
---------

.. autoapisummary::

   lmflow.utils.data_utils.set_random_seed
   lmflow.utils.data_utils.load_data
   lmflow.utils.data_utils.batchlize
   lmflow.utils.data_utils.preview_file
   lmflow.utils.data_utils.get_dataset_type_fast
   lmflow.utils.data_utils.check_dataset_instances_key_fast
   lmflow.utils.data_utils.answer_extraction
   lmflow.utils.data_utils.process_image_flag


Module Contents
---------------

.. py:function:: set_random_seed(seed: int)

   
   Set the random seed for `random`, `numpy`, `torch`, and `torch.cuda`.


   :Parameters:

       **seed** : int
           The seed value to set.


   ..
       !! processed by numpydoc !!
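
A minimal sketch of what seeding these libraries might look like, assuming NumPy and PyTorch are optional at run time. The helper name ``seed_everything`` is hypothetical and is not the LMFlow implementation; it only illustrates that re-seeding reproduces the same random stream.

```python
import random


def seed_everything(seed: int) -> None:
    """Hypothetical stand-in for set_random_seed (an illustration only)."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency in this sketch
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency in this sketch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Seeding twice with the same value yields the same sequence of draws.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```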

.. py:function:: load_data(file_name: str)

   
   Load a dataset from a JSON file.


   :Parameters:

       **file_name** : str
           The dataset file name.


   :Returns:

       **inputs** : list
           The input texts of the dataset.

       **outputs** : list
           The output texts of the dataset.

       **len** : int
           The length of the dataset.


   ..
       !! processed by numpydoc !!
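
The return values above can be illustrated with a small sketch. The JSON layout used here (a top-level ``instances`` list whose items carry ``input`` and ``output`` fields) is an assumption made for the example, and ``load_pairs`` is a hypothetical helper, not the LMFlow function.

```python
import json
import os
import tempfile


def load_pairs(file_name: str):
    """Hypothetical loader that returns (inputs, outputs, length)."""
    with open(file_name, encoding="utf-8") as f:
        data = json.load(f)
    # Assumed layout: {"instances": [{"input": ..., "output": ...}, ...]}
    inputs = [item["input"] for item in data["instances"]]
    outputs = [item["output"] for item in data["instances"]]
    return inputs, outputs, len(inputs)


sample = {
    "type": "text2text",
    "instances": [
        {"input": "1+1=", "output": "2"},
        {"input": "2+2=", "output": "4"},
    ],
}

# Write the sample to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    inputs, outputs, size = load_pairs(path)

print(inputs, outputs, size)
```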

.. py:function:: batchlize(examples: list, batch_size: int, random_shuffle: bool)

   
   Convert a list of examples into a dataloader of batches.


   :Parameters:

       **examples** : list
           Data list.

       **batch_size** : int
           Number of examples per batch.

       **random_shuffle** : bool
           If true, the dataloader shuffles the examples.


   :Returns:

       dataloader:
           Dataloader with batch generator.


   ..
       !! processed by numpydoc !!
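
As a rough illustration of batching with an optional shuffle, the sketch below splits a list into fixed-size chunks. ``make_batches`` is a hypothetical helper, and returning a plain list of lists is an assumption of this example rather than a statement about the object LMFlow returns.

```python
import random
from typing import List


def make_batches(examples: list, batch_size: int, random_shuffle: bool) -> List[list]:
    """Hypothetical batching helper: optional shuffle, then fixed-size slices."""
    items = list(examples)  # copy so the caller's list is left untouched
    if random_shuffle:
        random.shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


batches = make_batches(list(range(7)), batch_size=3, random_shuffle=False)
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

# With shuffling, the order changes but every example still appears once.
shuffled = make_batches(list(range(7)), batch_size=3, random_shuffle=True)
flat = sorted(x for batch in shuffled for x in batch)
```

The last batch may be shorter than ``batch_size`` when the number of examples is not an exact multiple.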

.. py:function:: preview_file(file_path: str, chars: int = 100)

   
   Returns the first and last specified number of characters from a file
   without loading the entire file into memory, working with any file type.


   :Parameters:

       **file_path** : str
           Path to the file to be previewed.

       **chars** : int, optional
           Number of characters to show from start and end. Defaults to 100.


   :Returns:

       tuple
           ``(first_chars, last_chars)``: the first and last characters from the file.


   ..
       !! processed by numpydoc !!
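
One way to read only the two ends of a file is to open it in binary mode and seek to the tail. The sketch below assumes that counting bytes is an acceptable approximation of counting characters; ``preview`` is a hypothetical helper, not the LMFlow implementation.

```python
import os
import tempfile


def preview(file_path: str, chars: int = 100):
    """Hypothetical preview: read the head, seek near the end, read the tail."""
    size = os.path.getsize(file_path)
    with open(file_path, "rb") as f:
        head = f.read(chars)
        # Jump to the last `chars` bytes (or the start for small files).
        f.seek(max(size - chars, 0))
        tail = f.read(chars)
    # Ignore partial multi-byte sequences cut by the byte boundaries.
    return head.decode("utf-8", errors="ignore"), tail.decode("utf-8", errors="ignore")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("HEAD" + "x" * 1000 + "TAIL")
    first_chars, last_chars = preview(path, chars=4)

print(first_chars, last_chars)  # HEAD TAIL
```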

.. py:function:: get_dataset_type_fast(file_path: str, max_chars: int = 100) -> Union[str, None]

   
   Get the ``type`` value from the first and last ``max_chars`` characters of a large JSON dataset.


   ..
       !! processed by numpydoc !!
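
A hedged sketch of how a ``type`` field might be found in file previews without parsing the whole JSON document. The regular expression and the helper name ``find_dataset_type`` are assumptions for illustration; the snippets stand in for the head and tail of a large file.

```python
import re
from typing import Iterable, Optional

# Assumed pattern: a double-quoted "type" key followed by a double-quoted value.
TYPE_PATTERN = re.compile(r'"type"\s*:\s*"([^"]+)"')


def find_dataset_type(snippets: Iterable[str]) -> Optional[str]:
    """Return the first "type" value found in any snippet, else None."""
    for text in snippets:
        match = TYPE_PATTERN.search(text)
        if match:
            return match.group(1)
    return None


head = '{"type": "text2text", "instances": [{"input": "1+1='
tail = '"output": "4"}]}'

found = find_dataset_type([head, tail])
missing = find_dataset_type([tail])
print(found, missing)  # text2text None
```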

.. py:function:: check_dataset_instances_key_fast(file_path: str, instances_key: str, max_lines: int = 100) -> bool

   
   Check whether the dataset's instances key matches ``instances_key``.


   ..
       !! processed by numpydoc !!
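
The check can be approximated by searching previewed text for the quoted key. This is a sketch under the assumption that the key appears as a quoted JSON string; ``has_instances_key`` is a hypothetical name, and the snippets stand in for the previewed portions of a file.

```python
import re
from typing import Iterable


def has_instances_key(snippets: Iterable[str], instances_key: str) -> bool:
    """Return True if the quoted key appears in any previewed snippet."""
    # re.escape guards against keys containing regex metacharacters.
    pattern = re.compile(r'["\']' + re.escape(instances_key) + r'["\']')
    return any(pattern.search(text) for text in snippets)


head = '{"type": "text2text", "instances": [{"input": "1+1='
tail = '"output": "4"}]}'

matches = has_instances_key([head, tail], "instances")
no_match = has_instances_key([head, tail], "data")
print(matches, no_match)  # True False
```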

.. py:function:: answer_extraction(response, answer_type=None)

   
   Use this function to extract answers from generated text.


   :Parameters:

       **response** : str
           Plain string response.

       **answer_type** : optional
           The answer type used to select the extraction rule. Defaults to None.


   :Returns:

       answer:
           Decoded answer (such as A, B, C, D, E for multiple-choice QA).


   ..
       !! processed by numpydoc !!
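
For the multiple-choice case mentioned above, extraction could look roughly like the following. The rule used here (take the last standalone letter from A to E) is an assumption of this sketch, and ``extract_choice`` is a hypothetical helper rather than the LMFlow logic.

```python
import re
from typing import Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the last standalone option letter (A to E), or None if absent."""
    matches = re.findall(r"\b([A-E])\b", response)
    return matches[-1] if matches else None


simple = extract_choice("Let's think step by step. The answer is (C).")
revised = extract_choice("A is wrong because it is too small, so the answer is B")
absent = extract_choice("no choice given")
print(simple, revised, absent)  # C B None
```

Taking the last match favors the final conclusion over letters mentioned while reasoning, which is a design choice of this example only.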

.. py:function:: process_image_flag(text, image_flag='<ImageHere>')

.. py:class:: VLLMInferenceResultWithInput

   Bases: :py:obj:`TypedDict`


   ..
       !! processed by numpydoc !!

   .. py:attribute:: input
      :type:  str


   .. py:attribute:: output
      :type:  Union[List[str], List[List[int]]]
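
A value of this shape is an ordinary dictionary at run time. The sketch below defines a local mirror of the typed dictionary (named ``VLLMResult`` here, a hypothetical name) so that it runs without LMFlow installed; the sample strings and token ids are made up for illustration.

```python
from typing import List, TypedDict, Union


class VLLMResult(TypedDict):
    """Local mirror of the documented fields, for illustration only."""
    input: str
    output: Union[List[str], List[List[int]]]


# The output may hold decoded strings ...
text_result: VLLMResult = {
    "input": "Translate 'bonjour' to English.",
    "output": ["hello", "good day"],
}

# ... or lists of token ids, one list per generated sequence.
token_result: VLLMResult = {
    "input": "Translate 'bonjour' to English.",
    "output": [[101, 7592, 102]],
}

print(len(text_result["output"]), len(token_result["output"][0]))  # 2 3
```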


.. py:class:: RewardModelInferenceResultWithInput

   Bases: :py:obj:`TypedDict`


   ..
       !! processed by numpydoc !!

   .. py:attribute:: input
      :type:  str


   .. py:attribute:: output
      :type:  List[Dict[str, Union[str, float]]]
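
To make the nested ``output`` type concrete, the sketch below builds one such value and picks the highest-scoring entry. The class name ``RewardResult`` and the dictionary keys ``text`` and ``reward`` are assumptions for this example; the documented type only requires string keys with string or float values.

```python
from typing import Dict, List, TypedDict, Union


class RewardResult(TypedDict):
    """Local mirror of the documented fields, for illustration only."""
    input: str
    output: List[Dict[str, Union[str, float]]]


result: RewardResult = {
    "input": "What is 2 + 2?",
    "output": [
        {"text": "4", "reward": 0.9},  # assumed keys: "text" and "reward"
        {"text": "5", "reward": 0.1},
    ],
}

# Select the candidate with the highest reward score.
best = max(result["output"], key=lambda item: item["reward"])
print(best["text"])  # 4
```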