lmflow.models.vision2seq_model#

Module Contents#

Classes#

CustomAutoVision2SeqModel

class lmflow.models.vision2seq_model.CustomAutoVision2SeqModel(config: transformers.Blip2Config, image_encoder_name_or_path=None, qformer_name_or_path=None, language_model_name_or_path=None, low_resource=False)[source]#

Bases: transformers.Blip2ForConditionalGeneration, lmflow.models.base_model.BaseModel
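
A minimal instantiation sketch (the BLIP-2 checkpoint name below is only an illustrative assumption; any Blip2Config works):

from transformers import Blip2Config
from lmflow.models.vision2seq_model import CustomAutoVision2SeqModel

# Build the config from an existing BLIP-2 checkpoint (illustrative name).
config = Blip2Config.from_pretrained("Salesforce/blip2-opt-2.7b")

# Construct the wrapped vision-to-sequence model; low_resource=True
# presumably enables a memory-saving loading path for the language model.
model = CustomAutoVision2SeqModel(config, low_resource=False)

# The underlying transformers model is available via get_backend_model().
backend = model.get_backend_model()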

get_backend_model()[source]#
vision_model_from_pretrained(pretrained_path)[source]#
qformer_from_pretrained(pretrained_path)[source]#
language_model_from_pretrained(pretrained_path, low_resource=False, use_prompt_cache=False)[source]#
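
Assuming each of these loaders swaps in the weights of the corresponding sub-module from a local or Hub checkpoint, usage might look like the following (paths are placeholders):

# Placeholder checkpoint paths; replace with real local paths or Hub ids.
model.vision_model_from_pretrained("path/to/vision_encoder")
model.qformer_from_pretrained("path/to/qformer")
model.language_model_from_pretrained(
    "path/to/language_model",
    low_resource=False,       # presumably a memory-saving loading mode when True
    use_prompt_cache=False,   # whether to prepare the model for prompt caching
)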
vision_feature_select(image_forward_outs)[source]#
register_prompt_cache(prompt_ids, prompt_keys_values)[source]#

Update the prompt ids and embedding for reuse in the future.

Args:

prompt_ids (torch.LongTensor): The ids of the prompt.

prompt_keys_values (torch.FloatTensor): The embedding of the prompt.

Returns:

None
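
A hedged sketch of registering a cache for a fixed text prompt; it assumes that the key/value states produced by the underlying language model with use_cache=True are the intended prompt_keys_values payload:

import torch

tokenizer = model.get_tokenizer()
prompt_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids

# Run the prompt once through the backend language model to obtain its
# cached key/value states (assumption: this is what register_prompt_cache expects).
with torch.no_grad():
    lm_out = model.language_model(input_ids=prompt_ids, use_cache=True)

model.register_prompt_cache(prompt_ids, lm_out.past_key_values)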

save_prompt_cache(path)[source]#

Save prompt embedding and id.

Args:

path: The path to save the prompt embedding and id.

Returns:

None

load_prompt_cache(path)[source]#

Load prompt embedding and id.

Args:

path: The path to load the prompt embedding and id.

Returns:

None
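
Together with register_prompt_cache, these two methods allow a prompt cache to be reused across sessions; a minimal round trip with a hypothetical file name:

cache_path = "prompt_cache.pth"   # hypothetical file name

# Persist the currently registered prompt ids and embedding ...
model.save_prompt_cache(cache_path)

# ... and restore them later (e.g. in a new process) before generation.
model.load_prompt_cache(cache_path)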

get_tokenizer()[source]#
forward(input_ids: torch.LongTensor = None, pixel_values: torch.FloatTensor | None = None, images: torch.FloatTensor | None = None, attention_mask: torch.Tensor | None = None, past_key_values: List[torch.FloatTensor] | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, image_token_indexes: List | None = [0], one_sample_multiple_images: bool = False) Tuple | transformers.modeling_outputs.CausalLMOutputWithPast[source]#
processor_image_token_in_minigpt4(input_ids, language_model_inputs, attention_mask, image_token_indexes, pixel_values, batch_size=1)[source]#
generate(pixel_values: torch.FloatTensor, input_ids: torch.LongTensor | None = None, attention_mask: torch.LongTensor | None = None, image_token_indexes: List | None = [0], one_sample_multiple_images: bool | None = False, images: torch.LongTensor | None = None, **generate_kwargs) torch.LongTensor[source]#

Overrides the generate function so that the model can be used as a conditional generator.

Args:
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)):

Input images to be processed.

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

The sequence used as a prompt for the generation.

attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices.

image_token_indexes (list, optional):

The indexes at which the image tokens are inserted into the prompt.

one_sample_multiple_images (bool, optional):

Flag for inference where the input batch size is 1 but the sample contains multiple images.

Returns:

captions (list): A list of strings of length batch_size * num_captions.
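
A hedged end-to-end sketch; Blip2Processor, the checkpoint name, and the image path are assumptions used only to produce pixel_values and input_ids of the expected shapes:

import torch
from PIL import Image
from transformers import Blip2Processor

# Illustrative processor; any processor that yields pixel_values/input_ids works.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")                        # placeholder image path
prompt = "Question: what is shown in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        pixel_values=inputs.pixel_values,
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        image_token_indexes=[0],     # insert the image tokens at position 0
        max_new_tokens=32,
    )

captions = model.get_tokenizer().batch_decode(output_ids, skip_special_tokens=True)
print(captions)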