mrankitvish577/Qwen3-4B-Instruct-2507-GGUF

This repository hosts a fine-tuned and quantized version of the Qwen3-4B-Instruct-2507 model, optimized for efficiency and performance with Unsloth. The model was fine-tuned on Maxime Labonne's FineTome-100k dataset and converted to GGUF format for use with llama.cpp and Ollama.

Model Details

- Base model: Qwen3-4B-Instruct-2507 (4B parameters, qwen3 architecture)
- Fine-tuning: LoRA with Unsloth on the FineTome-100k dataset
- Format: GGUF, with 4-bit, 5-bit, and 8-bit quantizations available

How to use with llama.cpp / Ollama

These GGUF files are designed for use with llama.cpp or Ollama. You can download the .gguf files directly and load them with either tool.

Example llama.cpp usage:

./llama.cpp/llama-cli --model qwen3-4b-instruct-2507.Q5_K_M.gguf -p "<|im_start|>user\nContinue the sequence: 1, 1, 2, 3, 5, 8,<|im_end|>\n<|im_start|>assistant\n"
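The prompt in the llama-cli example uses Qwen's ChatML format. If you are building prompts by hand (e.g. for scripting), a minimal sketch of a helper that produces the same layout (the function name is illustrative) could look like:

```python
def build_chatml_prompt(user_message: str) -> str:
    # ChatML wraps each turn in <|im_start|>role ... <|im_end|> markers;
    # ending with an open assistant turn cues the model to respond.
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("Continue the sequence: 1, 1, 2, 3, 5, 8,")
print(prompt)
```

In practice, `tokenizer.apply_chat_template` (shown in the Unsloth examples below) produces this formatting for you and is the safer choice.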

Example Ollama usage:

If you've used ollama create with the provided Modelfile (available in this repository), you can run:

ollama run mrankitvish577/Qwen3-4B-Instruct-2507-GGUF
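If you need to create the model locally yourself, a minimal Modelfile might look like the following sketch (the .gguf filename and template are illustrative; adjust them to the quantization you downloaded):

```
FROM ./qwen3-4b-instruct-2507.Q5_K_M.gguf

TEMPLATE """<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
PARAMETER stop "<|im_end|>"
```

You would then register and run it with `ollama create my-qwen3-finetome -f Modelfile` followed by `ollama run my-qwen3-finetome` (the model name here is just an example).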

How to load and use this model (Unsloth)

If you want to load the LoRA adapters or the merged model back into Unsloth, you can do so as follows:

Loading LoRA adapters (requires the original base model):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Instruct-2507", # The base model
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Load the LoRA adapters (the repo name assumes the adapters were pushed separately)
model.load_adapter("mrankitvish577/qwen_lora")

# Prepare for inference
FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference mode
messages = [
    {"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1000,
    temperature = 0.7, top_p = 0.8, top_k = 20,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Loading the merged 4-bit or 16-bit model (if merged versions were pushed):

from unsloth import FastLanguageModel

# For 4-bit merged model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mrankitvish577/qwen_finetune_4bit", # Or your merged 16bit model
    max_seq_length = 2048,
    load_in_4bit = True, # Use load_in_4bit=False for 16bit merged models
)

# Prepare for inference
FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference mode
messages = [
    {"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1000,
    temperature = 0.7, top_p = 0.8, top_k = 20,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
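If you want the completion as a string rather than streamed output, note that `model.generate` returns sequences that include the prompt tokens. A small sketch of slicing off the prompt before decoding (shown with plain lists standing in for token-id tensors so it runs standalone):

```python
def strip_prompt_tokens(output_ids, prompt_len):
    # Each row of generate() output is [prompt tokens..., new tokens...];
    # drop the first prompt_len entries to keep only the completion.
    return [row[prompt_len:] for row in output_ids]

# Illustration: a 4-token prompt followed by one newly generated token.
outputs = [[101, 7592, 2088, 102, 999]]
completions = strip_prompt_tokens(outputs, prompt_len=4)
print(completions)  # [[999]]
```

In the examples above, `prompt_len` would be `inputs["input_ids"].shape[1]` (with `inputs = tokenizer(text, return_tensors="pt")`), and you would pass the sliced row to `tokenizer.decode(..., skip_special_tokens=True)`.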

License

This model is licensed under the LGPL-3.0 license.

Acknowledgements

This model was fine-tuned using Unsloth AI, which provides efficient tools for LLM fine-tuning.
