lmflow.datasets
This module defines the Dataset class, which provides methods for initializing, loading, and manipulating datasets from different backends such as Hugging Face and JSON.
The class includes methods for loading a dataset from a dictionary or from a Hugging Face dataset, mapping over a dataset, and retrieving the backend dataset and its arguments.
Submodules
Classes
Dataset | Initializes the Dataset object with the given parameters.
Functions
Package Contents
- class lmflow.datasets.Dataset(data_args: lmflow.args.DatasetArguments = None, backend: str = 'huggingface', *args, **kwargs)
Initializes the Dataset object with the given parameters.
- Parameters:
- data_args : DatasetArguments object.
Contains the arguments required to load the dataset.
- backend : str, default="huggingface"
A string naming the dataset backend.
- args : Optional.
Positional arguments.
- kwargs : Optional.
Keyword arguments.
- data_args
- backend
- backend_dataset = None
- type = None
- dataset_path
- _check_data_format()
Checks whether the data type and the data structure match.
Raises an error message with hints if they do not.
- from_dict(dict_obj: dict, *args, **kwargs)
Create a Dataset object from a dictionary.
- Returns a Dataset given a dict with the format:
    {
        "type": TYPE,
        "instances": [
            {"key_1": VALUE_1.1, "key_2": VALUE_1.2, …},
            {"key_1": VALUE_2.1, "key_2": VALUE_2.2, …}
        ]
    }
- Parameters:
- dict_obj : dict.
A dictionary containing the dataset information.
- args : Optional.
Positional arguments.
- kwargs : Optional.
Keyword arguments.
- Returns:
- self : Dataset object.
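The dictionary layout described for from_dict can be checked before it is passed in. The following is a minimal stand-alone sketch in plain Python that does not call LMFlow. `validate_dataset_dict` is a hypothetical helper, the requirement that every instance share the same keys is an assumption read off the format above, and the `"text_only"` type with a `"text"` key is illustrative data only.

```python
def validate_dataset_dict(dict_obj: dict) -> None:
    """Raise ValueError if dict_obj lacks the documented top-level layout.

    Hypothetical helper for illustration; not part of LMFlow.
    """
    if "type" not in dict_obj:
        raise ValueError('missing top-level key "type"')

    instances = dict_obj.get("instances")
    if not isinstance(instances, list):
        raise ValueError('"instances" must be a list of dicts')

    # Assumption: all instances carry the same set of keys.
    expected_keys = None
    for index, instance in enumerate(instances):
        if not isinstance(instance, dict):
            raise ValueError(f"instance {index} is not a dict")
        if expected_keys is None:
            expected_keys = set(instance)
        elif set(instance) != expected_keys:
            raise ValueError(
                f"instance {index} has keys {sorted(instance)}, "
                f"expected {sorted(expected_keys)}"
            )


# Illustrative data; the type name and key are assumptions.
example = {
    "type": "text_only",
    "instances": [
        {"text": "first sample"},
        {"text": "second sample"},
    ],
}
validate_dataset_dict(example)  # passes silently
```

A check like this fails early with a readable message instead of surfacing later as a backend loading error.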
- classmethod create_from_dict(dict_obj, *args, **kwargs)
- Returns:
- Returns a Dataset object given a dict.
- to_dict()
- Returns:
- Returns a dict that represents the dataset:
    {
        "type": TYPE,
        "instances": [
            {"key_1": VALUE_1.1, "key_2": VALUE_1.2, …},
            {"key_1": VALUE_2.1, "key_2": VALUE_2.2, …}
        ]
    }
- A Python dict object that represents the content of this dataset.
- map(*args, **kwargs)
- Parameters:
- args : Optional.
Positional arguments.
- kwargs : Optional.
Keyword arguments.
- Returns:
- self : Dataset object.
- save(file_path: str, format: str = 'json')
Save the dataset to a JSON file.
- Parameters:
- file_path : str.
The path to the file where the dataset will be saved.
- format : str, default="json"
The output file format.
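As a rough illustration of what saving to JSON involves, the sketch below writes a dictionary in the to_dict layout to disk and reads it back. It uses only the standard library and does not call LMFlow. `save_dict_as_json` is a hypothetical helper, and the assumption that the saved file mirrors the to_dict structure is not confirmed by this page.

```python
import json
import os
import tempfile


def save_dict_as_json(dataset_dict: dict, file_path: str) -> None:
    """Write dataset_dict to file_path as UTF-8 JSON (hypothetical helper)."""
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(dataset_dict, f, ensure_ascii=False, indent=2)


# Illustrative data; the type name and key are assumptions.
dataset_dict = {
    "type": "text_only",
    "instances": [{"text": "first sample"}, {"text": "second sample"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.json")
    save_dict_as_json(dataset_dict, path)

    # Read the file back to confirm the round trip preserves the content.
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)
```

Passing `ensure_ascii=False` keeps non-ASCII text readable in the file rather than escaping it.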
- sample(n: int, seed: int = 42)
Sample n instances from the dataset.
- Parameters:
- n : int.
The number of instances to sample from the dataset.
- seed : int, default=42.
The random seed used for sampling.
- Returns:
- sample_dataset : Dataset object.
A new dataset object containing the sampled instances.
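Seeded sampling can be sketched on the plain dictionary form of a dataset. This is a stand-alone illustration using Python's `random` module, not the LMFlow implementation, which may draw its sample differently. `sample_instances` is a hypothetical helper and the data is made up for the example.

```python
import random


def sample_instances(dataset_dict: dict, n: int, seed: int = 42) -> dict:
    """Return a new dict holding n instances drawn without replacement.

    Hypothetical helper; raises ValueError if n exceeds the number of
    instances, because random.Random.sample does.
    """
    rng = random.Random(seed)  # local generator, so global state is untouched
    picked = rng.sample(dataset_dict["instances"], n)
    return {"type": dataset_dict["type"], "instances": picked}


# Illustrative data; the type name and key are assumptions.
full = {
    "type": "text_only",
    "instances": [{"text": f"sample {i}"} for i in range(10)],
}

first = sample_instances(full, n=3, seed=42)
second = sample_instances(full, n=3, seed=42)
```

Because both calls use the same seed, `first` and `second` hold the same instances in the same order, which is the point of exposing a `seed` parameter.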
- train_test_split(test_size: float = 0.2, shuffle: bool = True, seed: int = 42)
Split the dataset into training and testing sets.
- Parameters:
- test_size : float, default=0.2.
The proportion of the dataset that will be used for testing.
- shuffle : bool, default=True.
Whether to shuffle the dataset before splitting.
- seed : int, default=42.
The random seed used when shuffling.
- Returns:
- train_dataset : Dataset object.
A new dataset object containing the training instances.
- test_dataset : Dataset object.
A new dataset object containing the testing instances.
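The effect of `test_size`, `shuffle`, and `seed` can be sketched on the plain dictionary form of a dataset. The code below is a stand-alone illustration, not the LMFlow implementation. `split_instances` is a hypothetical helper, and rounding the test count with `round` is a choice made for this sketch; the library may round differently.

```python
import random


def split_instances(dataset_dict, test_size=0.2, shuffle=True, seed=42):
    """Split instances into (train_dict, test_dict). Hypothetical helper."""
    instances = list(dataset_dict["instances"])  # copy; the input is not mutated
    if shuffle:
        random.Random(seed).shuffle(instances)

    # Assumption of this sketch: the test count is rounded to the nearest integer.
    n_test = round(len(instances) * test_size)
    test_part = instances[:n_test]
    train_part = instances[n_test:]

    dataset_type = dataset_dict["type"]
    return (
        {"type": dataset_type, "instances": train_part},
        {"type": dataset_type, "instances": test_part},
    )


# Illustrative data; the type name and key are assumptions.
full = {
    "type": "text_only",
    "instances": [{"text": f"sample {i}"} for i in range(10)],
}

train, test = split_instances(full, test_size=0.2, shuffle=True, seed=42)
```

With ten instances and `test_size=0.2`, the split holds eight training and two testing instances, and every instance appears in exactly one of the two parts.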