lmflow.utils.data_utils#

This module provides utility functions for setting random seeds, loading data from a JSON file, batching examples, previewing and inspecting large dataset files, and extracting answers from generated text.

Classes#

`VLLMInferenceResultWithInput`
`RewardModelInferenceResultWithInput`

Functions#

`set_random_seed`(seed)	Set the random seed for random, numpy, torch, torch.cuda.
`load_data`(file_name)	Load data with file name.
`batchlize`(examples, batch_size, random_shuffle)	Convert examples to a dataloader.
`preview_file`(file_path[, chars])	Returns the first and last specified number of characters from a file
`get_dataset_type_fast`(→ Union[str, None])	Get the type values from the first and last n lines of a large json dataset.
`check_dataset_instances_key_fast`(→ bool)	Check if the dataset instances key matches `instances_key`.
`answer_extraction`(response[, answer_type])	Use this function to extract answers from generated text.
`process_image_flag`(text[, image_flag])

Module Contents#

lmflow.utils.data_utils.set_random_seed(seed: int)[source]#

Set the random seed for random, numpy, torch, torch.cuda.

Parameters:

seed (int): The seed value to use.
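As a minimal sketch of what seeding these libraries typically involves (not the LMFlow source), the hypothetical helper `set_random_seed_sketch` below seeds `random` and, when they are installed, `numpy` and `torch`. The optional imports are an assumption made so the example runs without those packages.

```python
import random


def set_random_seed_sketch(seed: int) -> None:
    """Hypothetical stand-in that seeds every available RNG with the same value."""
    random.seed(seed)
    try:
        import numpy as np  # optional in this sketch
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional in this sketch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same draws.
set_random_seed_sketch(42)
first = [random.random() for _ in range(3)]
set_random_seed_sketch(42)
second = [random.random() for _ in range(3)]
```

Seeding every library at once matters because each keeps its own generator state; seeding only one of them leaves the others nondeterministic.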

lmflow.utils.data_utils.load_data(file_name: str)[source]#

Load data with file name.

Parameters:

file_name (str): The dataset file name.

Returns:

inputs (list): The input texts of the dataset.
outputs (list): The output texts of the dataset.
len (int): The length of the dataset.
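The return shape can be illustrated with a small sketch. The file layout assumed here (a top-level `"instances"` list whose items carry `"input"` and `"output"` keys) is an assumption for illustration, and `load_data_sketch` is a hypothetical name, not the library function.

```python
import json
import os
import tempfile


def load_data_sketch(file_name: str):
    """Hypothetical loader returning (inputs, outputs, length)."""
    with open(file_name, encoding="utf-8") as f:
        data = json.load(f)
    instances = data["instances"]  # assumed key
    inputs = [item["input"] for item in instances]    # assumed key
    outputs = [item["output"] for item in instances]  # assumed key
    return inputs, outputs, len(outputs)


# Write a tiny dataset to a temporary file and load it back.
sample = {
    "type": "text2text",
    "instances": [
        {"input": "1+1=", "output": "2"},
        {"input": "2+2=", "output": "4"},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    inputs, outputs, length = load_data_sketch(path)
```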

lmflow.utils.data_utils.batchlize(examples: list, batch_size: int, random_shuffle: bool)[source]#

Convert examples to a dataloader.

Parameters:

examples (list): Data list.
batch_size (int): Number of examples per batch.
random_shuffle (bool): If true, the dataloader shuffles the training data.

Returns:

dataloader: Dataloader with batch generator.
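One plausible reading of "convert examples to a dataloader" is splitting a list into fixed-size batches, with an optional shuffle first. The sketch below uses the hypothetical name `batchlize_sketch` and returns a plain list of lists; the actual return type of `batchlize` is not specified above.

```python
import random
from typing import Any, List


def batchlize_sketch(examples: List[Any], batch_size: int, random_shuffle: bool) -> List[List[Any]]:
    """Hypothetical batching helper: returns a list of batches."""
    items = list(examples)  # copy so the caller's list is not reordered (a choice made in this sketch)
    if random_shuffle:
        random.shuffle(items)
    # The final batch may be shorter than batch_size.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


batches = batchlize_sketch(list(range(7)), batch_size=3, random_shuffle=False)
```

With seven examples and a batch size of three, the last batch holds the single leftover example.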

lmflow.utils.data_utils.preview_file(file_path: str, chars: int = 100)[source]#

Returns the first and last specified number of characters from a file without loading the entire file into memory, working with any file type.

Parameters:

file_path (str): Path to the file to be previewed.
chars (int, optional): Number of characters to show from the start and end. Defaults to 100.

Returns:

tuple: (first_chars, last_chars), the first and last characters from the file.
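Reading only the two ends of a file can be sketched with `seek`. The hypothetical `preview_file_sketch` below opens the file in binary mode, so it counts bytes rather than characters; that simplification is an assumption of this example.

```python
import os
import tempfile


def preview_file_sketch(file_path: str, chars: int = 100):
    """Hypothetical preview: read the head and tail without loading the whole file."""
    size = os.path.getsize(file_path)
    with open(file_path, "rb") as f:
        head = f.read(chars)
        f.seek(max(0, size - chars))  # jump near the end
        tail = f.read(chars)
    # Replace undecodable bytes so a cut multi-byte character does not raise.
    return head.decode("utf-8", errors="replace"), tail.decode("utf-8", errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("abcdefghij")
    first_chars, last_chars = preview_file_sketch(path, chars=3)
```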

lmflow.utils.data_utils.get_dataset_type_fast(file_path: str, max_chars: int = 100) → str | None[source]#

Get the type values from the first and last n lines of a large JSON dataset.
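A fast type check of this kind can be sketched by scanning only the two ends of the file for a `"type"` field. The regular expression, the byte-based reads, and the name `get_dataset_type_sketch` are assumptions for illustration; the library's matching rules may differ.

```python
import os
import re
import tempfile
from typing import Optional

# Assumed pattern: a top-level field written as "type": "<value>".
TYPE_PATTERN = re.compile(r'"type"\s*:\s*"([^"]*)"')


def get_dataset_type_sketch(file_path: str, max_chars: int = 100) -> Optional[str]:
    """Hypothetical helper: look for a "type" value near the start or end of the file."""
    size = os.path.getsize(file_path)
    with open(file_path, "rb") as f:
        head = f.read(max_chars).decode("utf-8", errors="replace")
        f.seek(max(0, size - max_chars))
        tail = f.read(max_chars).decode("utf-8", errors="replace")
    for snippet in (head, tail):
        match = TYPE_PATTERN.search(snippet)
        if match:
            return match.group(1)
    return None


with tempfile.TemporaryDirectory() as tmp:
    typed_path = os.path.join(tmp, "typed.json")
    with open(typed_path, "w", encoding="utf-8") as f:
        f.write('{"type": "text_only", "instances": [{"text": "hi"}]}')
    untyped_path = os.path.join(tmp, "untyped.json")
    with open(untyped_path, "w", encoding="utf-8") as f:
        f.write('{"instances": []}')
    found_type = get_dataset_type_sketch(typed_path)
    missing_type = get_dataset_type_sketch(untyped_path)
```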

lmflow.utils.data_utils.check_dataset_instances_key_fast(file_path: str, instances_key: str, max_lines: int = 100) → bool[source]#

Check if the dataset instances key matches `instances_key`.

lmflow.utils.data_utils.answer_extraction(response, answer_type=None)[source]#

Use this function to extract answers from generated text.

Parameters:

args: Arguments.
response (str): Plain string response.

Returns:

answer: Decoded answer (such as A, B, C, D, E for multiple-choice QA).
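For the multiple-choice case mentioned above, one simple approach is a regular expression that picks out a standalone option letter. This sketch is not the library's logic: `extract_choice_sketch` is a hypothetical name, and taking the last standalone letter is an assumption that suits responses ending with the final answer.

```python
import re
from typing import Optional


def extract_choice_sketch(response: str) -> Optional[str]:
    """Hypothetical extractor: return the last standalone letter A-E, or None."""
    matches = re.findall(r"\b([A-E])\b", response)
    return matches[-1] if matches else None


choice = extract_choice_sketch("Let's compare the options. The answer is (C).")
no_choice = extract_choice_sketch("no option letter here")
```

A heuristic like this can misfire on sentences that use a capital "A" as an article, so real extraction code usually anchors on a phrase such as "the answer is".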

lmflow.utils.data_utils.process_image_flag(text, image_flag='<ImageHere>')[source]#

class lmflow.utils.data_utils.VLLMInferenceResultWithInput[source]#

Bases: TypedDict

input: str[source]#

output: list[str] | list[list[int]][source]#

class lmflow.utils.data_utils.RewardModelInferenceResultWithInput[source]#

Bases: TypedDict

input: str[source]#

output: list[dict[str, str | float]][source]#
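Both classes are `TypedDict`s, so values are ordinary dictionaries with the documented keys. The sketch below re-declares the two types locally, so it runs without LMFlow installed, and builds one value of each. The example strings and the `"text"` and `"reward"` keys inside the reward output are hypothetical.

```python
from typing import Dict, List, TypedDict, Union


class VLLMInferenceResultWithInput(TypedDict):
    input: str
    output: Union[List[str], List[List[int]]]


class RewardModelInferenceResultWithInput(TypedDict):
    input: str
    output: List[Dict[str, Union[str, float]]]


# Text outputs; token-id lists (List[List[int]]) are the other allowed form.
vllm_result: VLLMInferenceResultWithInput = {
    "input": "What is 2 + 2?",
    "output": ["4", "The answer is 4."],
}

# Each output entry maps string keys to a string or float (keys here are hypothetical).
reward_result: RewardModelInferenceResultWithInput = {
    "input": "What is 2 + 2?",
    "output": [{"text": "4", "reward": 0.9}],
}
```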
