
lmflow.datasets

This package defines the Dataset class, which provides methods for initializing, loading, and manipulating datasets from different backends such as Hugging Face and JSON.

The Dataset class can load a dataset from a dictionary or from a Hugging Face dataset, map over a dataset, and return the backend dataset and the data arguments.

Submodules

Package Contents

Classes

Dataset

Initializes the Dataset object with the given parameters.

CustomMultiModalDataset

Dataset for Multi Modal data

class lmflow.datasets.Dataset(data_args=None, backend: str = 'huggingface', *args, **kwargs)

Initializes the Dataset object with the given parameters.

Parameters:
data_args : DatasetArguments object.

Contains the arguments required to load the dataset.

backend : str, default="huggingface"

A string representing the dataset backend. Defaults to "huggingface".

args : Optional.

Positional arguments.

kwargs : Optional.

Keyword arguments.

__len__()
_check_data_format()

Checks whether the data type and data structure match.

Raises an error with hints if they do not match.

from_dict(dict_obj: dict, *args, **kwargs)

Create a Dataset object from a dictionary.

Return a Dataset given a dict with format:

    {
        "type": TYPE,
        "instances": [
            {
                "key_1": VALUE_1.1,
                "key_2": VALUE_1.2,
                …
            },
            {
                "key_1": VALUE_2.1,
                "key_2": VALUE_2.2,
                …
            }
        ]
    }

Parameters:
dict_obj : dict.

A dictionary containing the dataset information.

args : Optional.

Positional arguments.

kwargs : Optional.

Keyword arguments.

Returns:
self : Dataset object.
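
The layout expected by from_dict can be checked before the dict is passed in. The following is a minimal sketch, not LMFlow's own validation: `validate_dataset_dict` is a hypothetical helper, the rule that every instance carries the same keys is an assumption inferred from the format shown above, and the "text_only" type with a "text" field is an illustrative value.

```python
def validate_dataset_dict(dict_obj):
    """Check that dict_obj follows the {"type": ..., "instances": [...]} layout.

    Hypothetical helper for illustration; it is not part of lmflow.
    Returns the sorted list of instance keys (empty if there are no instances).
    """
    if not isinstance(dict_obj, dict):
        raise ValueError("dataset must be a dict")
    for required in ("type", "instances"):
        if required not in dict_obj:
            raise ValueError(f'missing required key "{required}"')

    instances = dict_obj["instances"]
    if not isinstance(instances, list):
        raise ValueError('"instances" must be a list')
    if not instances:
        return []

    for index, instance in enumerate(instances):
        if not isinstance(instance, dict):
            raise ValueError(f"instance {index} is not a dict")

    # Assumption: all instances share the keys of the first instance.
    expected = set(instances[0])
    for index, instance in enumerate(instances):
        if set(instance) != expected:
            raise ValueError(
                f"instance {index} has keys {sorted(instance)}, "
                f"expected {sorted(expected)}"
            )
    return sorted(expected)


example = {
    "type": "text_only",  # illustrative type name
    "instances": [{"text": "first sample"}, {"text": "second sample"}],
}
print(validate_dataset_dict(example))
```

A dict that passes this check has the shape described above; whether a given "type" value is accepted is still decided by the library.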
classmethod create_from_dict(dict_obj, *args, **kwargs)
Returns:
A Dataset object created from the given dict.
to_dict()
Returns:
A dict that represents the dataset:

    {
        "type": TYPE,
        "instances": [
            {
                "key_1": VALUE_1.1,
                "key_2": VALUE_1.2,
                …
            },
            {
                "key_1": VALUE_2.1,
                "key_2": VALUE_2.2,
                …
            }
        ]
    }

A Python dict object representing the content of this dataset.
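
Columnar backends such as Hugging Face datasets keep one list per field rather than one dict per instance, so producing the layout above amounts to a transpose. Here is a minimal sketch of that conversion in both directions, assuming every instance has the same keys. `dataset_dict_to_columns` and `columns_to_dataset_dict` are hypothetical helpers, and whether LMFlow performs the conversion this way is an assumption.

```python
def dataset_dict_to_columns(dict_obj):
    """Transpose {"instances": [row, ...]} into {key: [value, ...]}."""
    instances = dict_obj["instances"]
    if not instances:
        return {}
    keys = list(instances[0])
    return {key: [instance[key] for instance in instances] for key in keys}


def columns_to_dataset_dict(dataset_type, columns):
    """Rebuild the {"type": ..., "instances": [...]} layout from columns."""
    keys = list(columns)
    length = len(columns[keys[0]]) if keys else 0
    instances = [{key: columns[key][i] for key in keys} for i in range(length)]
    return {"type": dataset_type, "instances": instances}


original = {
    "type": "text2text",  # illustrative type name and field names
    "instances": [
        {"input": "2 + 2", "output": "4"},
        {"input": "3 + 5", "output": "8"},
    ],
}
columns = dataset_dict_to_columns(original)
restored = columns_to_dataset_dict(original["type"], columns)
print(restored == original)
```

The round trip is lossless only when all instances share the same keys, which is why the sketch states that as a precondition.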
to_list()

Returns a list of instances.

map(*args, **kwargs)
Parameters:
args : Optional.

Positional arguments.

kwargs : Optional.

Keyword arguments.

Returns:
self : Dataset object.
get_backend() -> str | None
Returns:
self.backend
get_backend_dataset()
Returns:
self.backend_dataset
get_fingerprint()
Returns:
Fingerprint of the backend_dataset, which controls the cache.
get_data_args()
Returns:
self.data_args
get_type()
Returns:
self.type
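
The accessors above can be gathered into a single summary for logging or debugging. This is a minimal sketch: `summarize` is a hypothetical helper that relies only on methods listed on this page, and the guarded usage assumes that LMFlow is installed and that "text_only" with a "text" field is an accepted dataset type.

```python
def summarize(dataset):
    """Collect values from the documented accessors into a plain dict."""
    return {
        "backend": dataset.get_backend(),
        "type": dataset.get_type(),
        "size": len(dataset),
        "instances": dataset.to_list(),
    }


try:
    from lmflow.datasets import Dataset  # needs LMFlow and its dependencies
except ImportError:
    Dataset = None  # summarize() still works with any object exposing these methods

if Dataset is not None:
    # Assumption: "text_only" instances carry a single "text" field.
    dataset = Dataset.create_from_dict(
        {"type": "text_only", "instances": [{"text": "hello"}, {"text": "world"}]}
    )
    print(summarize(dataset))
```

Because the helper only calls the public accessors, it can also be exercised with a small stand-in object in unit tests.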
class lmflow.datasets.CustomMultiModalDataset(dataset_path: str, data_args: lmflow.args.DatasetArguments)

Bases: torch.utils.data.Dataset

Dataset for Multi Modal data

__len__()
register_tokenizer(tokenizer, image_processor=None)
__getitem__(i)