mrankitvish577/Qwen3-4B-Instruct-2507-GGUF
This repository hosts a fine-tuned, quantized version of Qwen3-4B-Instruct-2507, trained with Unsloth for efficient fine-tuning. The model was fine-tuned on Maxime Labonne's FineTome-100k dataset and converted to GGUF format for use with llama.cpp and Ollama.
Model Details
- Base Model: unsloth/Qwen3-4B-Instruct-2507
- Fine-tuning Library: Unsloth AI
- Fine-tuning Dataset: mlabonne/FineTome-100k
- Quantization Methods:
q4_k_m, q5_k_m, q8_0 (GGUF)
How to use with llama.cpp / Ollama
These GGUF files are designed for use with llama.cpp or Ollama. You can download the .gguf files directly and use them with the respective tools.
Example llama.cpp usage:
./llama.cpp/llama-cli --model qwen3-4b-instruct-2507.Q5_K_M.gguf -p "<|im_start|>user\nContinue the sequence: 1, 1, 2, 3, 5, 8,<|im_end|>\n<|im_start|>assistant\n"
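The prompt passed with -p follows Qwen's ChatML chat template. As an illustration, a small (hypothetical) Python helper can render a list of messages into that exact format:

```python
# Hypothetical helper illustrating the ChatML prompt format used above.
def build_chatml_prompt(messages):
    """Render {"role", "content"} dicts into Qwen's ChatML format,
    ending with an open assistant turn so the model continues from there."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt(
    [{"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"}]
)
print(prompt)
```

In practice, tokenizer.apply_chat_template (shown below) does this for you; the helper only makes the template's structure explicit.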
Example Ollama usage:
If you've used ollama create with the provided Modelfile (available in this repository), you can run:
ollama run mrankitvish577/Qwen3-4B-Instruct-2507-GGUF
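The Modelfile shipped in this repository is authoritative. Purely as an illustration, a minimal Modelfile for the Q5_K_M quant might look like the following (the filename and sampling parameters here are assumptions, not the repository's actual settings):

```
FROM ./qwen3-4b-instruct-2507.Q5_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
TEMPLATE """<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
```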
How to load and use this model (Unsloth)
If you want to load the LoRA adapters or the merged model back into Unsloth, you can do so as follows:
Loading LoRA adapters (requires the original base model):
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Instruct-2507",  # The base model
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Load the LoRA adapters
model.load_adapter("mrankitvish577/qwen_lora")  # Assuming you also pushed the LoRA adapters

# Prepare for inference
messages = [
    {"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

from transformers import TextStreamer

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1000,
    temperature = 0.7, top_p = 0.8, top_k = 20,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
Loading the merged 4-bit or 16-bit model (if merged versions were pushed):
from unsloth import FastLanguageModel
# For 4-bit merged model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mrankitvish577/qwen_finetune_4bit",  # Or your merged 16-bit model
    max_seq_length = 2048,
    load_in_4bit = True,  # Use load_in_4bit = False for 16-bit merged models
)

# Prepare for inference
messages = [
    {"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

from transformers import TextStreamer

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1000,
    temperature = 0.7, top_p = 0.8, top_k = 20,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
License
This model is licensed under LGPL-3.0.
Acknowledgements
This model was fine-tuned using Unsloth AI, which provides efficient tools for LLM fine-tuning.