How to use pdufour/Llama-3.2-11B-Vision-Instruct-WebSight with PEFT:
from peft import PeftModel
from transformers import AutoModelForVision2Seq
# Load the vision-language base model, then attach the LoRA adapter.
base_model = AutoModelForVision2Seq.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
model = PeftModel.from_pretrained(base_model, "pdufour/Llama-3.2-11B-Vision-Instruct-WebSight")

Llama 3.2 Vision Instruct fine-tuned on 10k samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight.
Inference example (base model loaded in 4-bit, LoRA adapter applied on top):
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
import torch

# Load the base model in 4-bit (requires bitsandbytes), then attach the LoRA adapter.
model = PeftModel.from_pretrained(
    AutoModelForVision2Seq.from_pretrained(
        "meta-llama/Llama-3.2-11B-Vision-Instruct",
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    ),
    "pdufour/Llama-3.2-11B-Vision-Instruct-WebSight",
)
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-Instruct-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# Encode the prompt together with the screenshot.
inputs = processor(text="Generate code for a web page that looks exactly like this. <|image|>", images=Image.open("fashion.jpg"), return_tensors="pt").to(model.device)

with torch.no_grad():
    # Pass every processor output (pixel values, masks), not only input_ids, so the model actually sees the image.
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
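
The decoded string printed above typically begins with the prompt text, because `generate` returns the prompt tokens followed by the newly generated ones. A minimal sketch of pulling only the HTML document out of that string, assuming the model emits a complete `<html>...</html>` block. The helper name `extract_html` is hypothetical and is not part of the model card, transformers, or peft:

```python
import re
from typing import Optional

# Hypothetical helper: matches an optional doctype followed by <html> ... </html>.
_HTML_DOC = re.compile(
    r"(?:<!DOCTYPE html[^>]*>\s*)?<html\b.*?</html>",
    re.IGNORECASE | re.DOTALL,
)


def extract_html(decoded: str) -> Optional[str]:
    """Return the first complete HTML document found in `decoded`, or None."""
    match = _HTML_DOC.search(decoded)
    return match.group(0) if match else None


# Stand-in for the decoded model output (prompt text followed by generated markup).
sample = (
    "Generate code for a web page that looks exactly like this. "
    "<html><body><h1>Shop</h1></body></html> trailing text"
)
print(extract_html(sample))
```

If the generation is cut off by `max_new_tokens` before the closing `</html>` tag, the helper returns `None`, which is a convenient signal to retry with a larger token budget.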
Training data: 10k samples from HuggingFaceM4/WebSight, a vision-language dataset, used for instruction tuning.
Model type: LoRA-tuned vision-language model based on the Llama architecture.
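
For context on what "LoRA-tuned" means: LoRA freezes a weight matrix of shape d x k and learns a low-rank update B·A, with B of shape d x r and A of shape r x k, so only r(d + k) parameters are trained per adapted matrix. A rough sketch of the resulting savings follows. The dimensions and rank are illustrative assumptions, not the settings used for this adapter:

```python
from typing import Tuple


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # updating the weight matrix directly
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora


# Illustrative values only (hypothetical, not read from the adapter config).
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```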
For questions about this model, please file an issue on the GitHub repository.
Base model: meta-llama/Llama-3.2-11B-Vision-Instruct