lmflow.datasets
===============

.. py:module:: lmflow.datasets

.. autoapi-nested-parse::

   This module defines the :class:`Dataset` class, with methods for
   initializing, loading, and manipulating datasets from different backends
   such as Hugging Face and JSON. The :class:`Dataset` class includes methods
   for loading datasets from a dictionary or from a Hugging Face dataset,
   mapping datasets, and retrieving the backend dataset and its arguments.

   .. !! processed by numpydoc !!


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/lmflow/datasets/dataset/index
   /autoapi/lmflow/datasets/multi_modal_dataset/index


Classes
-------

.. autoapisummary::

   lmflow.datasets.Dataset


Functions
---------

.. autoapisummary::

   lmflow.datasets.is_multimodal_available


Package Contents
----------------

.. py:function:: is_multimodal_available()


.. py:class:: Dataset(data_args: lmflow.args.DatasetArguments = None, backend: str = 'huggingface', *args, **kwargs)

   Initializes the Dataset object with the given parameters.

   :Parameters:

       **data_args** : DatasetArguments object.
           Contains the arguments required to load the dataset.

       **backend** : str, default="huggingface"
           A string representing the dataset backend.

       **args** : Optional.
           Positional arguments.

       **kwargs** : Optional.
           Keyword arguments.

   .. !! processed by numpydoc !!

   .. py:attribute:: data_args

   .. py:attribute:: backend

   .. py:attribute:: backend_dataset
      :value: None

   .. py:attribute:: type
      :value: None

   .. py:attribute:: dataset_path

   .. py:method:: __len__()

   .. py:method:: _check_data_format()

      Checks whether the data type and data structure match; raises an
      error with hints if they do not.

      .. !! processed by numpydoc !!

   .. py:method:: from_dict(dict_obj: dict, *args, **kwargs)

      Create a Dataset object from a dictionary.

      Return a Dataset given a dict with format::

          {
              "type": TYPE,
              "instances": [
                  {
                      "key_1": VALUE_1.1,
                      "key_2": VALUE_1.2,
                      ...
                  },
                  {
                      "key_1": VALUE_2.1,
                      "key_2": VALUE_2.2,
                      ...
                  },
                  ...
              ]
          }

      :Parameters:

          **dict_obj** : dict.
              A dictionary containing the dataset information.

          **args** : Optional.
              Positional arguments.

          **kwargs** : Optional.
              Keyword arguments.

      :Returns:

          **self** : Dataset object.

      .. !! processed by numpydoc !!

   .. py:method:: create_from_dict(dict_obj, *args, **kwargs)
      :classmethod:

      :Returns:

          Returns a Dataset object given a dict.

      .. !! processed by numpydoc !!

   .. py:method:: to_dict()

      :Returns:

          A Python dict object that represents the content of this dataset,
          in the format::

              {
                  "type": TYPE,
                  "instances": [
                      {
                          "key_1": VALUE_1.1,
                          "key_2": VALUE_1.2,
                          ...
                      },
                      {
                          "key_1": VALUE_2.1,
                          "key_2": VALUE_2.2,
                          ...
                      },
                      ...
                  ]
              }

      .. !! processed by numpydoc !!

   .. py:method:: to_list()

      Returns a list of instances.

      .. !! processed by numpydoc !!

   .. py:method:: map(*args, **kwargs)

      :Parameters:

          **args** : Optional.
              Positional arguments.

          **kwargs** : Optional.
              Keyword arguments.

      :Returns:

          **self** : Dataset object.

      .. !! processed by numpydoc !!

   .. py:method:: get_backend() -> Optional[str]

      :Returns:

          self.backend

      .. !! processed by numpydoc !!

   .. py:method:: get_backend_dataset()

      :Returns:

          self.backend_dataset

      .. !! processed by numpydoc !!

   .. py:method:: get_fingerprint()

      :Returns:

          Fingerprint of the backend_dataset, which controls the cache.

      .. !! processed by numpydoc !!

   .. py:method:: get_data_args()

      :Returns:

          self.data_args

      .. !! processed by numpydoc !!

   .. py:method:: get_type() -> str

      :Returns:

          self.type

      .. !! processed by numpydoc !!

   .. py:method:: save(file_path: str, format: str = 'json')

      Save the dataset to a file.

      :Parameters:

          **file_path** : str.
              The path to the file where the dataset will be saved.

          **format** : str, default="json"
              The format in which the dataset will be saved.

      .. !! processed by numpydoc !!

   .. py:method:: sample(n: int, seed: int = 42)

      Sample n instances from the dataset.

      :Parameters:

          **n** : int.
              The number of instances to sample from the dataset.

          **seed** : int, default=42
              The random seed used for sampling.

      :Returns:

          **sample_dataset** : Dataset object.
              A new dataset object containing the sampled instances.

      .. !! processed by numpydoc !!

   .. py:method:: train_test_split(test_size: float = 0.2, shuffle: bool = True, seed: int = 42)

      Split the dataset into training and testing sets.

      :Parameters:

          **test_size** : float, default=0.2.
              The proportion of the dataset that will be used for testing.

          **shuffle** : bool, default=True
              Whether to shuffle the dataset before splitting.

          **seed** : int, default=42
              The random seed used for shuffling.

      :Returns:

          **train_dataset** : Dataset object.
              A new dataset object containing the training instances.

          **test_dataset** : Dataset object.
              A new dataset object containing the testing instances.

      .. !! processed by numpydoc !!

   .. py:method:: drop_instances(indices: list)

      Drop instances from the dataset.

      :Parameters:

          **indices** : list.
              A list of indices of the instances to drop from the dataset.

      .. !! processed by numpydoc !!

   .. py:method:: sanity_check(drop_invalid: bool = True)

      Perform a sanity check on the dataset.

      .. !! processed by numpydoc !!

   .. py:method:: hf_dataset_sanity_check(drop_invalid: bool = True)

      Perform a sanity check on the HuggingFace dataset.

      .. !! processed by numpydoc !!
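The ``{"type": ..., "instances": [...]}`` format documented for ``from_dict`` and ``to_dict`` can be illustrated with a short, self-contained sketch. The ``check_data_format`` helper below is a hypothetical stand-in for what ``_check_data_format`` conceptually verifies, not lmflow's actual implementation; it depends only on the dict layout shown above and runs without ``lmflow`` installed.

```python
def check_data_format(dict_obj):
    """Illustrative validation of the documented dataset dict format.

    NOTE: this is a sketch, not lmflow's _check_data_format. It checks the
    {"type": ..., "instances": [...]} layout shown in from_dict/to_dict.
    """
    if not isinstance(dict_obj, dict):
        raise ValueError("dataset must be a dict")
    if "type" not in dict_obj:
        raise ValueError('missing "type" field')
    instances = dict_obj.get("instances")
    if not isinstance(instances, list):
        raise ValueError('"instances" must be a list')
    keys = None
    for i, inst in enumerate(instances):
        if not isinstance(inst, dict):
            raise ValueError(f"instance {i} is not a dict")
        # Every instance is expected to carry the same set of keys
        # (key_1, key_2, ... in the format above).
        if keys is None:
            keys = set(inst)
        elif set(inst) != keys:
            raise ValueError(f"instance {i} has mismatched keys")
    return True

# A dict in the documented shape ("text_only" is an illustrative type name).
example = {
    "type": "text_only",
    "instances": [
        {"text": "hello"},
        {"text": "world"},
    ],
}
check_data_format(example)
```

A dict that passes such a check is the shape expected by ``Dataset.create_from_dict`` and produced by ``Dataset.to_dict``.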