lmflow.models.vision2seq_model#

Module Contents#

Classes#

CustomAutoVision2SeqModel

class lmflow.models.vision2seq_model.CustomAutoVision2SeqModel(config: transformers.Blip2Config, image_encoder_name_or_path=None, qformer_name_or_path=None, language_model_name_or_path=None, low_resource=False)[source]#

Bases: transformers.Blip2ForConditionalGeneration, lmflow.models.base_model.BaseModel
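
A minimal instantiation sketch (the BLIP-2 checkpoint name below is only an illustrative assumption; any Blip2Config works):

from transformers import Blip2Config
from lmflow.models.vision2seq_model import CustomAutoVision2SeqModel

# Build the config from an existing BLIP-2 checkpoint (illustrative name).
config = Blip2Config.from_pretrained("Salesforce/blip2-opt-2.7b")

# Construct the wrapped vision-to-sequence model; low_resource=True
# presumably enables a memory-saving loading path for the language model.
model = CustomAutoVision2SeqModel(config, low_resource=False)

# The underlying transformers model is available via get_backend_model().
backend = model.get_backend_model()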

get_backend_model()[source]#
vision_model_from_pretrained(pretrained_path)[source]#
qformer_from_pretrained(pretrained_path)[source]#
language_model_from_pretrained(pretrained_path, low_resource=False, use_prompt_cache=False)[source]#
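
Assuming each of these loaders swaps in the weights of the corresponding sub-module from a local or Hub checkpoint, usage might look like the following (paths are placeholders):

# Placeholder checkpoint paths; replace with real local paths or Hub ids.
model.vision_model_from_pretrained("path/to/vision_encoder")
model.qformer_from_pretrained("path/to/qformer")
model.language_model_from_pretrained(
    "path/to/language_model",
    low_resource=False,       # presumably a memory-saving loading mode when True
    use_prompt_cache=False,   # whether to prepare the model for prompt caching
)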
vision_feature_select(image_forward_outs)[source]#
register_prompt_cache(prompt_ids, prompt_keys_values)[source]#

Update the prompt ids and embedding for reuse in the future.

Args:

prompt_ids (torch.LongTensor): The ids of the prompt.

prompt_keys_values (torch.FloatTensor): The embedding of the prompt.

Returns:

None
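
A hedged sketch of registering a cache for a fixed text prompt; it assumes that the key/value states produced by the underlying language model with use_cache=True are the intended prompt_keys_values payload:

import torch

tokenizer = model.get_tokenizer()
prompt_ids = tokenizer("Describe the image:", return_tensors="pt").input_ids

# Run the prompt once through the backend language model to obtain its
# cached key/value states (assumption: this is what register_prompt_cache expects).
with torch.no_grad():
    lm_out = model.language_model(input_ids=prompt_ids, use_cache=True)

model.register_prompt_cache(prompt_ids, lm_out.past_key_values)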

save_prompt_cache(path)[source]#

Save prompt embedding and id.

Args:

path: The path to save the prompt embedding and id.

Returns:

None

load_prompt_cache(path)[source]#

Load prompt embedding and id.

Args:

path: The path to load the prompt embedding and id.

Returns:

None
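
Together with register_prompt_cache, these two methods allow a prompt cache to be reused across sessions; a minimal round trip with a hypothetical file name:

cache_path = "prompt_cache.pth"   # hypothetical file name

# Persist the currently registered prompt ids and embedding ...
model.save_prompt_cache(cache_path)

# ... and restore them later (e.g. in a new process) before generation.
model.load_prompt_cache(cache_path)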

get_tokenizer()[source]#
forward(input_ids: torch.LongTensor = None, pixel_values: torch.FloatTensor | None = None, images: torch.FloatTensor | None = None, attention_mask: torch.Tensor | None = None, past_key_values: List[torch.FloatTensor] | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, image_token_indexes: List | None = [0], one_sample_multiple_images: bool = False) Tuple | transformers.modeling_outputs.CausalLMOutputWithPast[source]#
processor_image_token_in_minigpt4(input_ids, language_model_inputs, attention_mask, image_token_indexes, pixel_values, batch_size=1)[source]#
generate(pixel_values: torch.FloatTensor, input_ids: torch.LongTensor | None = None, attention_mask: torch.LongTensor | None = None, image_token_indexes: List | None = [0], one_sample_multiple_images: bool | None = False, images: torch.LongTensor | None = None, **generate_kwargs) torch.LongTensor[source]#

Overrides the generate function so that the model can be used as a conditional generator.

Args:
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)):

Input images to be processed.

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional):

The sequence used as a prompt for the generation.

attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional):

Mask to avoid performing attention on padding token indices.

image_token_indexes (list, optional):

The indexes at which the image tokens are inserted into the prompt.

one_sample_multiple_images (bool, optional):

Flag for inference where the input batch size is 1 but the sample contains multiple images.

Returns:

captions (list): A list of strings of length batch_size * num_captions.
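
A hedged end-to-end sketch; Blip2Processor, the checkpoint name, and the image path are assumptions used only to produce pixel_values and input_ids of the expected shapes:

import torch
from PIL import Image
from transformers import Blip2Processor

# Illustrative processor; any processor that yields pixel_values/input_ids works.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")                        # placeholder image path
prompt = "Question: what is shown in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        pixel_values=inputs.pixel_values,
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        image_token_indexes=[0],     # insert the image tokens at position 0
        max_new_tokens=32,
    )

captions = model.get_tokenizer().batch_decode(output_ids, skip_special_tokens=True)
print(captions)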