lmflow.datasets.dataset#

This module defines the Dataset class, with methods for initializing, loading, and manipulating datasets from different backends such as Hugging Face and JSON.

The Dataset class includes methods for loading datasets from a dictionary and a Hugging Face dataset, mapping datasets, and retrieving the backend dataset and arguments.

Attributes#

`logger`
`DATASET_TYPES`
`KEY_TYPE`
`KEY_INSTANCES`
`KEY_SCORE`

Classes#

Dataset

Initializes the Dataset object with the given parameters.

Module Contents#

lmflow.datasets.dataset.logger[source]#

lmflow.datasets.dataset.DATASET_TYPES = ['text_only', 'text2text', 'float_only', 'image_text', 'conversation', 'paired_conversation',...[source]#

lmflow.datasets.dataset.KEY_TYPE = 'type'[source]#

lmflow.datasets.dataset.KEY_INSTANCES = 'instances'[source]#

lmflow.datasets.dataset.KEY_SCORE = 'score'[source]#

class lmflow.datasets.dataset.Dataset(data_args: lmflow.args.DatasetArguments = None, backend: str = 'huggingface', *args, **kwargs)[source]#

Initializes the Dataset object with the given parameters.

Parameters:

data_args (DatasetArguments object): Contains the arguments required to load the dataset.
backend (str, default="huggingface"): A string representing the dataset backend.
args (optional): Positional arguments.
kwargs (optional): Keyword arguments.

data_args = None[source]#

backend = 'huggingface'[source]#

backend_dataset = None[source]#

type = None[source]#

dataset_path[source]#

__len__()[source]#

_check_instance_format()[source]#: Checks whether the data (instances) have the required fields. Raises an error with hints if they do not match.
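To illustrate the kind of check described here, the following is a minimal sketch in plain Python. It is not the library's implementation: the helper name `check_instance_format`, the `REQUIRED_FIELDS` mapping, and the field names listed for each type are all assumptions made for the example.

```python
# Hypothetical mapping from dataset type to the fields each instance must have.
# The field names are assumptions for illustration only.
REQUIRED_FIELDS = {
    "text_only": ["text"],
    "text2text": ["input", "output"],
}


def check_instance_format(dataset_type, instances):
    """Raise ValueError with a hint if any instance lacks a required field."""
    required = REQUIRED_FIELDS.get(dataset_type)
    if required is None:
        raise ValueError(f"Unknown dataset type: {dataset_type!r}")
    for index, instance in enumerate(instances):
        missing = [key for key in required if key not in instance]
        if missing:
            raise ValueError(
                f"Instance {index} of type {dataset_type!r} is missing "
                f"fields {missing}; expected fields are {required}."
            )


# A well-formed instance list passes without raising.
check_instance_format("text2text", [{"input": "2+2=", "output": "4"}])
```

Reporting the instance index and the expected fields in the message is one way to provide the "hints" mentioned above.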

_check_hf_json_format(data_files: list[str])[source]#

from_dict(dict_obj: dict, *args, **kwargs)[source]#

Create a Dataset object from a dictionary.

Return a Dataset given a dict with the format:

    {
        "type": TYPE,
        "instances": [
            {"key_1": VALUE_1.1, "key_2": VALUE_1.2, ...},
            {"key_1": VALUE_2.1, "key_2": VALUE_2.2, ...}
        ]
    }

Parameters:

dict_obj (dict): A dictionary containing the dataset information.
args (optional): Positional arguments.
kwargs (optional): Keyword arguments.

Returns:

self (Dataset object).
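As a concrete instance of the dict format described for from_dict, the sketch below builds such a dict in plain Python. The `"text"` field name used for `text_only` instances is an assumption for this example, and the constants are redefined locally to mirror the module-level values so that the snippet runs without LMFlow installed.

```python
import json

# Local copies of the module-level constants documented on this page.
KEY_TYPE = "type"
KEY_INSTANCES = "instances"

# The "text" field name for text_only instances is an assumption.
dict_obj = {
    KEY_TYPE: "text_only",
    KEY_INSTANCES: [
        {"text": "The first training example."},
        {"text": "The second training example."},
    ],
}

# Basic structural checks on the documented format.
assert isinstance(dict_obj[KEY_INSTANCES], list)
assert all(isinstance(instance, dict) for instance in dict_obj[KEY_INSTANCES])

print(json.dumps(dict_obj, indent=2))

# With LMFlow installed, a dict like this could be passed to
# Dataset.create_from_dict(dict_obj), or to from_dict on an existing Dataset.
```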

classmethod create_from_dict(dict_obj, *args, **kwargs)[source]#

Returns:

A Dataset object created from the given dict.

to_dict()[source]#

Returns:

A Python dict object that represents the content of this dataset:

    {
        "type": TYPE,
        "instances": [
            {"key_1": VALUE_1.1, "key_2": VALUE_1.2, ...},
            {"key_1": VALUE_2.1, "key_2": VALUE_2.2, ...}
        ]
    }

to_list()[source]#: Returns a list of instances.

map(*args, **kwargs)[source]#

Parameters:

args (optional): Positional arguments.
kwargs (optional): Keyword arguments.

Returns:

self (Dataset object).

get_backend() → str | None[source]#

Returns:

self.backend

get_backend_dataset()[source]#

Returns:

self.backend_dataset

get_fingerprint()[source]#

Returns:

Fingerprint of the backend_dataset which controls the cache

get_data_args()[source]#

Returns:

self.data_args

get_type() → str[source]#

Returns:

self.type

save(file_path: str, format: str = 'json')[source]#

Save the dataset to a JSON file.

Parameters:

file_path (str): The path to the file where the dataset will be saved.
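The idea of saving a dataset to JSON can be sketched with the standard library alone. This sketch assumes that the file mirrors the `{"type": ..., "instances": [...]}` dict format; that layout and the helper names `save_dict_as_json` and `load_dict_from_json` are assumptions for illustration, not confirmed details of save().

```python
import json
import os
import tempfile


def save_dict_as_json(dataset_dict, file_path):
    """Write a dataset dict to file_path as UTF-8 JSON."""
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(dataset_dict, f, ensure_ascii=False, indent=2)


def load_dict_from_json(file_path):
    """Read a dataset dict back from a JSON file."""
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)


dataset_dict = {
    "type": "text_only",
    "instances": [{"text": "hello"}, {"text": "world"}],
}

# Round-trip through a temporary file to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.json")
    save_dict_as_json(dataset_dict, path)
    restored = load_dict_from_json(path)
```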

sample(n: int, seed: int = 42)[source]#

Sample n instances from the dataset.

Parameters:

n (int): The number of instances to sample from the dataset.

Returns:

sample_dataset (Dataset object): A new dataset object containing the sampled instances.
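Seeded sampling of this kind can be sketched on a plain dataset dict. The function below is a hypothetical stand-in that uses Python's `random` module; it is not how sample() is implemented, and it only shows how a fixed seed makes the selection reproducible.

```python
import random


def sample_instances(dataset_dict, n, seed=42):
    """Return a new dataset dict with n instances drawn without replacement."""
    instances = dataset_dict["instances"]
    if n > len(instances):
        raise ValueError(f"Cannot sample {n} from {len(instances)} instances.")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return {
        "type": dataset_dict["type"],
        "instances": rng.sample(instances, n),
    }


full = {"type": "text_only", "instances": [{"text": str(i)} for i in range(10)]}
subset_a = sample_instances(full, n=3, seed=42)
subset_b = sample_instances(full, n=3, seed=42)
```

Because both calls use the same seed, `subset_a` and `subset_b` contain the same instances, and the original dict is left unchanged.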

train_test_split(test_size: float = 0.2, shuffle: bool = True, seed: int = 42)[source]#

Split the dataset into training and testing sets.

Parameters:

test_size (float, default=0.2): The proportion of the dataset that will be used for testing.

Returns:

train_dataset (Dataset object): A new dataset object containing the training instances.
test_dataset (Dataset object): A new dataset object containing the testing instances.
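The effect of `test_size`, `shuffle`, and `seed` can be illustrated with a small sketch on a plain dataset dict. The function `split_instances` is hypothetical, and rounding the test set size up with `math.ceil` is a choice made for this example rather than a documented behavior of train_test_split().

```python
import math
import random


def split_instances(dataset_dict, test_size=0.2, shuffle=True, seed=42):
    """Split a dataset dict into (train, test) dataset dicts."""
    instances = list(dataset_dict["instances"])  # copy so the input is untouched
    if shuffle:
        random.Random(seed).shuffle(instances)
    # Assumed rounding rule: round the test set size up.
    n_test = math.ceil(len(instances) * test_size)
    test = {"type": dataset_dict["type"], "instances": instances[:n_test]}
    train = {"type": dataset_dict["type"], "instances": instances[n_test:]}
    return train, test


full = {"type": "text_only", "instances": [{"text": str(i)} for i in range(10)]}
train, test = split_instances(full, test_size=0.2, shuffle=True, seed=42)
```

With 10 instances and `test_size=0.2`, this yields 8 training instances and 2 testing instances, and every original instance lands in exactly one of the two sets.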

drop_instances(indices: list)[source]#

Drop instances from the dataset.

Parameters:

indices (list): A list of indices of the instances to drop from the dataset.
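Dropping instances by index can be sketched as follows. This hypothetical helper returns a new dataset dict instead of modifying anything in place, which may differ from what drop_instances() does on the backend dataset.

```python
def drop_by_indices(dataset_dict, indices):
    """Return a new dataset dict without the instances at the given indices."""
    to_drop = set(indices)  # set lookup keeps the filter linear in dataset size
    kept = [
        instance
        for index, instance in enumerate(dataset_dict["instances"])
        if index not in to_drop
    ]
    return {"type": dataset_dict["type"], "instances": kept}


full = {"type": "text_only", "instances": [{"text": str(i)} for i in range(5)]}
remaining = drop_by_indices(full, indices=[1, 3])
```

Filtering by position in a single pass avoids the index shifting that would occur if items were deleted one at a time.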

sanity_check(drop_invalid: bool = True)[source]#: Perform a sanity check on the dataset.

hf_dataset_sanity_check(drop_invalid: bool = True)[source]#: Perform a sanity check on the HuggingFace dataset.
