lmflow.datasets.dataset#

This module defines the Dataset class, with methods for initializing, loading, and manipulating datasets from different backends such as Hugging Face and JSON.

The Dataset class includes methods for loading datasets from a dictionary and a Hugging Face dataset, mapping datasets, and retrieving the backend dataset and arguments.

Attributes#

Classes#

Dataset

Initializes the Dataset object with the given parameters.

Module Contents#

lmflow.datasets.dataset.logger[source]#
lmflow.datasets.dataset.DATASET_TYPES = ['text_only', 'text2text', 'float_only', 'image_text', 'conversation', 'paired_conversation',...[source]#
lmflow.datasets.dataset.KEY_TYPE = 'type'[source]#
lmflow.datasets.dataset.KEY_INSTANCES = 'instances'[source]#
lmflow.datasets.dataset.KEY_SCORE = 'score'[source]#
class lmflow.datasets.dataset.Dataset(data_args: lmflow.args.DatasetArguments = None, backend: str = 'huggingface', *args, **kwargs)[source]#

Initializes the Dataset object with the given parameters.

Parameters:
data_args : DatasetArguments object.
    Contains the arguments required to load the dataset.

backend : str, default="huggingface"
    A string representing the dataset backend. Defaults to "huggingface".

args : Optional.
    Positional arguments.

kwargs : Optional.
    Keyword arguments.

data_args = None[source]#
backend = 'huggingface'[source]#
backend_dataset = None[source]#
type = None[source]#
dataset_path[source]#
__len__()[source]#
_check_data_format()[source]#

Checks whether the data type and the data structure match.

Raises a message with hints if they do not match.

from_dict(dict_obj: dict, *args, **kwargs)[source]#

Create a Dataset object from a dictionary.

Return a Dataset given a dict with format:

    {
        "type": TYPE,
        "instances": [
            {
                "key_1": VALUE_1.1,
                "key_2": VALUE_1.2,
                …
            },
            {
                "key_1": VALUE_2.1,
                "key_2": VALUE_2.2,
                …
            }
        ]
    }

Parameters:
dict_obj : dict.
    A dictionary containing the dataset information.

args : Optional.
    Positional arguments.

kwargs : Optional.
    Keyword arguments.

Returns:
self : Dataset object.
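
As a minimal sketch of the dictionary layout described above, the snippet below builds such a dict and checks its top-level shape. The check_dataset_dict helper is hypothetical and is not part of lmflow. The "text_only" instances with a "text" key are an assumed example of one type's fields, and KNOWN_TYPES holds only the entries visible in the truncated DATASET_TYPES listing on this page.

```python
# Hypothetical helper, not part of lmflow: checks the top-level layout only.
KEY_TYPE = "type"            # mirrors lmflow.datasets.dataset.KEY_TYPE
KEY_INSTANCES = "instances"  # mirrors lmflow.datasets.dataset.KEY_INSTANCES

# Only the entries visible in the (truncated) DATASET_TYPES listing.
KNOWN_TYPES = [
    "text_only", "text2text", "float_only",
    "image_text", "conversation", "paired_conversation",
]


def check_dataset_dict(dict_obj: dict) -> None:
    """Raise ValueError with a hint if the dict lacks the expected layout."""
    for key in (KEY_TYPE, KEY_INSTANCES):
        if key not in dict_obj:
            raise ValueError(f"missing required key {key!r}")
    if dict_obj[KEY_TYPE] not in KNOWN_TYPES:
        raise ValueError(f"unrecognized type {dict_obj[KEY_TYPE]!r}")
    if not isinstance(dict_obj[KEY_INSTANCES], list):
        raise ValueError(f"{KEY_INSTANCES!r} must be a list of instances")


# Assumed example: "text_only" instances carrying a single "text" field.
example = {
    "type": "text_only",
    "instances": [
        {"text": "first sample"},
        {"text": "second sample"},
    ],
}
check_dataset_dict(example)  # passes silently for a well-formed dict
```

With lmflow installed, a dict of this shape is what the signatures on this page suggest passing to from_dict or create_from_dict.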
classmethod create_from_dict(dict_obj, *args, **kwargs)[source]#
Returns:
A Dataset object created from the given dict.
to_dict()[source]#
Returns:
A dict that represents the dataset:

    {
        "type": TYPE,
        "instances": [
            {
                "key_1": VALUE_1.1,
                "key_2": VALUE_1.2,
                …
            },
            {
                "key_1": VALUE_2.1,
                "key_2": VALUE_2.2,
                …
            }
        ]
    }

A Python dict object that represents the content of this dataset.
to_list()[source]#

Returns a list of instances.

map(*args, **kwargs)[source]#
Parameters:
args : Optional.
    Positional arguments.

kwargs : Optional.
    Keyword arguments.

Returns:
self : Dataset object.
get_backend() → str | None[source]#
Returns:
self.backend
get_backend_dataset()[source]#
Returns:
self.backend_dataset
get_fingerprint()[source]#
Returns:
Fingerprint of the backend_dataset which controls the cache
get_data_args()[source]#
Returns:
self.data_args
get_type() → str[source]#
Returns:
self.type
save(file_path: str, format: str = 'json')[source]#

Save the dataset to a JSON file.

Parameters:
file_path : str.
    The path to the file where the dataset will be saved.
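
To illustrate what a JSON file in the type/instances layout could look like on disk, here is a sketch that uses only the standard library. The save_dict_as_json function is a hypothetical stand-in, not the implementation of save, whose exact output is not described on this page.

```python
import json
import os
import tempfile


def save_dict_as_json(dataset_dict: dict, file_path: str) -> None:
    """Hypothetical stand-in: write a type/instances dict to a JSON file."""
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(dataset_dict, f, ensure_ascii=False, indent=2)


dataset_dict = {
    "type": "text_only",
    "instances": [{"text": "hello"}, {"text": "world"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.json")
    save_dict_as_json(dataset_dict, path)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)  # read the file back to confirm the round trip
```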

sample(n: int, seed: int = 42)[source]#

Sample n instances from the dataset.

Parameters:
n : int.
    The number of instances to sample from the dataset.

Returns:
sample_dataset : Dataset object.
    A new dataset object containing the sampled instances.
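
The role of the seed argument can be sketched with plain Python lists: a fixed seed makes the selection repeatable. The sample_instances function below is a hypothetical stand-in that assumes sampling without replacement; the real method operates on the backend dataset and may select or order instances differently.

```python
import random


def sample_instances(instances: list, n: int, seed: int = 42) -> list:
    """Hypothetical stand-in: pick n instances without replacement."""
    rng = random.Random(seed)  # local RNG so the global state is untouched
    return rng.sample(instances, n)


instances = [{"text": f"sample {i}"} for i in range(10)]
first = sample_instances(instances, n=3)
second = sample_instances(instances, n=3)  # same seed, same selection
```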

train_test_split(test_size: float = 0.2, shuffle: bool = True, seed: int = 42)[source]#

Split the dataset into training and testing sets.

Parameters:
test_size : float, default=0.2.
    The proportion of the dataset that will be used for testing.

Returns:
train_dataset : Dataset object.
    A new dataset object containing the training instances.

test_dataset : Dataset object.
    A new dataset object containing the testing instances.
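
As a rough illustration of how test_size, shuffle, and seed interact, the sketch below splits a plain list of instances. The split_instances function is hypothetical, and its rounding rule for the test-set size is an assumption; the real method may compute the sizes differently.

```python
import random


def split_instances(instances, test_size=0.2, shuffle=True, seed=42):
    """Hypothetical stand-in: return (train, test) lists of instances."""
    items = list(instances)  # copy so the caller's list is not reordered
    if shuffle:
        random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_size)  # assumed rounding rule
    return items[n_test:], items[:n_test]


instances = [{"text": f"sample {i}"} for i in range(10)]
train, test = split_instances(instances, test_size=0.2)
```

Under these assumptions, ten instances with test_size=0.2 give eight training instances and two testing instances, with no instance appearing in both.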

drop_instances(indices: list)[source]#

Drop instances from the dataset.

Parameters:
indices : list.
    A list of indices of the instances to drop from the dataset.
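
The effect of dropping by index can be sketched on a plain list. The drop_by_index function is a hypothetical stand-in that returns a new list; whether the real method modifies the dataset in place or returns a new one is not stated on this page.

```python
def drop_by_index(instances: list, indices: list) -> list:
    """Hypothetical stand-in: keep the instances whose index is not dropped."""
    to_drop = set(indices)  # set membership keeps the filter linear in size
    return [item for i, item in enumerate(instances) if i not in to_drop]


instances = [{"text": t} for t in ["a", "b", "c", "d"]]
kept = drop_by_index(instances, indices=[1, 3])
```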

sanity_check(drop_invalid: bool = True)[source]#

Perform a sanity check on the dataset.

hf_dataset_sanity_check(drop_invalid: bool = True)[source]#

Perform a sanity check on the HuggingFace dataset.