lmflow.datasets#
This module defines the Dataset class, which initializes, loads, and manipulates datasets from different backends such as Hugging Face and JSON.
The Dataset class includes methods for loading a dataset from a dictionary or a Hugging Face dataset, mapping over a dataset, and retrieving the backend dataset and arguments.
Submodules#
Classes#
Dataset | Initializes the Dataset object with the given parameters.
Package Contents#
- class lmflow.datasets.Dataset(data_args: lmflow.args.DatasetArguments = None, backend: str = 'huggingface', *args, **kwargs)[source]#
 Initializes the Dataset object with the given parameters.
- Parameters:
 - data_args: DatasetArguments object.
 Contains the arguments required to load the dataset.
- backend: str, default="huggingface"
 A string representing the dataset backend. Defaults to "huggingface".
- args: Optional.
 Positional arguments.
- kwargs: Optional.
 Keyword arguments.
- data_args = None#
 
- backend = 'huggingface'#
 
- backend_dataset = None#
 
- type = None#
 
- dataset_path#
 
- _check_instance_format()[source]#
 Checks whether the data instances have the required fields. Raises an error with hints if they do not match.
- from_dict(dict_obj: dict, *args, **kwargs)[source]#
 Create a Dataset object from a dictionary.
- Return a Dataset given a dict with format:
 {
   "type": TYPE,
   "instances": [
     {"key_1": VALUE_1.1, "key_2": VALUE_1.2, …},
     {"key_1": VALUE_2.1, "key_2": VALUE_2.2, …}
   ]
 }
- Parameters:
 - dict_obj: dict.
 A dictionary containing the dataset information.
- args: Optional.
 Positional arguments.
- kwargs: Optional.
 Keyword arguments.
- Returns:
 - self: Dataset object.
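
The dict layout expected by from_dict can be sketched in plain Python. The helper below is hypothetical and is not part of lmflow; it only checks the top-level "type" and "instances" keys described above. The "text_only" type name and the "text" key are assumptions chosen for illustration, since this page does not list the accepted types.

```python
from typing import Any


def check_dataset_dict(dict_obj: dict[str, Any]) -> None:
    """Hypothetical helper: verify the top-level layout documented above."""
    for key in ("type", "instances"):
        if key not in dict_obj:
            raise ValueError(f'missing required key "{key}"')
    if not isinstance(dict_obj["instances"], list):
        raise ValueError('"instances" must be a list')
    for index, instance in enumerate(dict_obj["instances"]):
        if not isinstance(instance, dict):
            raise ValueError(f"instance {index} is not a dict")


# Assumed example values; the real type names depend on the library.
example = {
    "type": "text_only",
    "instances": [
        {"text": "first sample"},
        {"text": "second sample"},
    ],
}

check_dataset_dict(example)  # passes silently when the layout matches
```

A dict that passes this check has the shape shown in the format above; whether the library accepts it also depends on the per-type fields that _check_instance_format verifies.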
 
- classmethod create_from_dict(dict_obj, *args, **kwargs)[source]#
 - Returns:
 - A Dataset object created from the given dict.
 
- to_dict()[source]#
 - Returns:
 - Return a dict representing the dataset:
 {
   "type": TYPE,
   "instances": [
     {"key_1": VALUE_1.1, "key_2": VALUE_1.2, …},
     {"key_1": VALUE_2.1, "key_2": VALUE_2.2, …}
   ]
 }
- A Python dict object representing the content of this dataset.
 
- map(*args, **kwargs)[source]#
 - Parameters:
 - args: Optional.
 Positional arguments.
- kwargs: Optional.
 Keyword arguments.
- Returns:
 - self: Dataset object.
 
- save(file_path: str, format: str = 'json')[source]#
 Save the dataset to a JSON file.
- Parameters:
 - file_path: str.
 The path to the file where the dataset will be saved.
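
As a rough illustration of what saving to JSON amounts to, the sketch below writes a dict in the documented {"type", "instances"} layout to a file and reads it back. It uses only the standard library. The function save_as_json is a hypothetical stand-in, and it assumes that save serializes the same dict that to_dict() returns, which this page does not state explicitly.

```python
import json
import tempfile
from pathlib import Path


def save_as_json(dataset_dict: dict, file_path: str) -> None:
    """Hypothetical stand-in for save(): write the dict form as JSON."""
    with open(file_path, "w", encoding="utf-8") as handle:
        json.dump(dataset_dict, handle, indent=4, ensure_ascii=False)


# Assumed example content in the documented layout.
dataset_dict = {
    "type": "text_only",
    "instances": [{"text": "first sample"}, {"text": "second sample"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "dataset.json"
    save_as_json(dataset_dict, str(out_path))
    # Read the file back to confirm the round trip preserves the content.
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```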
- sample(n: int, seed: int = 42)[source]#
 Sample n instances from the dataset.
- Parameters:
 - n: int.
 The number of instances to sample from the dataset.
- Returns:
 - sample_dataset: Dataset object.
 A new dataset object containing the sampled instances.
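
The effect of the n and seed arguments can be illustrated on the plain dict form. This is a sketch of seeded sampling with the standard library, not the library's implementation: sample_instances is a hypothetical helper, and the backend may select instances differently for the same seed.

```python
import random


def sample_instances(dataset_dict: dict, n: int, seed: int = 42) -> dict:
    """Hypothetical helper: draw n instances without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    picked = rng.sample(dataset_dict["instances"], n)
    return {"type": dataset_dict["type"], "instances": picked}


# Assumed example content in the documented layout.
full = {
    "type": "text_only",
    "instances": [{"text": f"sample {i}"} for i in range(10)],
}

subset_a = sample_instances(full, n=3)
subset_b = sample_instances(full, n=3)  # same seed, so the same draw
```

As with the documented method, the original data is left untouched and a new object holding the sampled instances is returned.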
- train_test_split(test_size: float = 0.2, shuffle: bool = True, seed: int = 42)[source]#
 Split the dataset into training and testing sets.
- Parameters:
 - test_size: float, default=0.2.
 The proportion of the dataset that will be used for testing.
- Returns:
 - train_dataset: Dataset object.
 A new dataset object containing the training instances.
- test_dataset: Dataset object.
 A new dataset object containing the testing instances.
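
The roles of test_size, shuffle, and seed can be sketched on the plain dict form. The helper split_instances below is hypothetical and uses only the standard library; the rounding rule for the test set size is an assumption, and the actual backend may round or order instances differently.

```python
import random


def split_instances(
    dataset_dict: dict,
    test_size: float = 0.2,
    shuffle: bool = True,
    seed: int = 42,
) -> tuple[dict, dict]:
    """Hypothetical helper: split instances into train and test dicts."""
    instances = list(dataset_dict["instances"])  # copy, keep input intact
    if shuffle:
        random.Random(seed).shuffle(instances)
    n_test = round(len(instances) * test_size)  # assumed rounding rule
    test = {"type": dataset_dict["type"], "instances": instances[:n_test]}
    train = {"type": dataset_dict["type"], "instances": instances[n_test:]}
    return train, test


# Assumed example content in the documented layout.
full = {
    "type": "text_only",
    "instances": [{"text": f"sample {i}"} for i in range(10)],
}

train, test = split_instances(full, test_size=0.2)
```

With ten instances and test_size=0.2, this sketch puts two instances in the test set and eight in the training set, and every instance lands in exactly one of the two.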