
lmflow.models.vision_encoder.clip_encoder

Module Contents

Classes

CLIPVisionTower

Functions

build_vision_tower(vision_tower_cfg, **kwargs)

lmflow.models.vision_encoder.clip_encoder.build_vision_tower(vision_tower_cfg, **kwargs)[source]
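
A minimal usage sketch. The config field names below (mm_vision_tower, mm_vision_select_layer, mm_vision_select_feature) follow the LLaVA convention this module derives from and are assumptions; verify them against your LMFlow version.

```python
from types import SimpleNamespace

from lmflow.models.vision_encoder.clip_encoder import build_vision_tower

# Hypothetical config object; field names follow the LLaVA convention
# and may differ in your LMFlow version.
vision_tower_cfg = SimpleNamespace(
    mm_vision_tower="openai/clip-vit-large-patch14",  # HF model name or local path
    mm_vision_select_layer=-2,         # hidden layer to take features from
    mm_vision_select_feature="patch",  # "patch" drops the CLS token
)

vision_tower = build_vision_tower(vision_tower_cfg)
print(vision_tower.hidden_size, vision_tower.num_patches)
```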
class lmflow.models.vision_encoder.clip_encoder.CLIPVisionTower(vision_tower, args, delay_load=False)[source]

Bases: torch.nn.Module

property dummy_feature[source]
property dtype[source]
property device[source]
property config[source]
property hidden_size[source]
property num_patches[source]
load_model()[source]
encode_images(images, language_projection)[source]
feature_select(image_forward_outs)[source]
forward(images)[source]
prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images, language_projection=None, language_model=None, **kwargs)[source]

Copied from the LLaVA code base; should be polished.
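
A sketch of encoding a batch of images with the tower, assuming forward(images) returns per-patch features of shape (batch, num_patches, hidden_size), as in the LLaVA implementation this class is copied from; the random tensor stands in for properly preprocessed images.

```python
import torch

# Continuing from the build_vision_tower sketch above.
vision_tower.load_model()  # loads the CLIP weights if delay_load was set

# CLIP ViT-L/14 expects 224x224 RGB inputs; real code would run images
# through the tower's CLIP image processor instead of using random data.
images = torch.randn(2, 3, 224, 224,
                     dtype=vision_tower.dtype,
                     device=vision_tower.device)

with torch.no_grad():
    features = vision_tower(images)  # calls forward(images)

# Expected: (2, num_patches, hidden_size), e.g. (2, 256, 1024) for
# ViT-L/14 with the CLS token removed by feature_select.
print(features.shape)
```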