TrOCR Fine-Tuned for Nepali Language

Model Description

This model is a fine-tuned version of Microsoft's TrOCR model for optical character recognition (OCR), trained to transcribe Nepali text from images of handwritten or printed text. It uses a VisionEncoderDecoder architecture with a DeiT-based encoder and a BERT-based decoder.

Model Architecture

Encoder: Vision Transformer (DeiT)
Decoder: BERT-like architecture adapted for OCR tasks
Pretrained Base: microsoft/trocr-base-handwritten
Tokenizer: Nepali BERT tokenizer from Shushant/nepaliBERT

Training Details

Dataset: Fine-tuned using a Nepali dataset consisting of handwritten and printed text.
Objective: Generate accurate Nepali text outputs from images containing textual content.
Decoding: Text is generated with beam search and a length penalty to improve output quality. These are generation-time settings, not part of the training objective.
Beam Search Parameters:
- num_beams = 8
- length_penalty = 2.0
- max_length = 47
- no_repeat_ngram_size = 3
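
To make the effect of length_penalty = 2.0 concrete, here is a minimal sketch of length-normalised beam scoring, following the behaviour described in the Hugging Face generation documentation: the summed log-probability of a hypothesis is divided by its length raised to length_penalty. The function name beam_score and the two example hypotheses are illustrative assumptions, not values from this model.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float = 2.0) -> float:
    """Length-normalised score of one finished beam hypothesis.

    sum_logprobs is the (negative) sum of token log-probabilities.
    """
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical hypotheses with the same average log-probability per token (-0.5).
short_score = beam_score(sum_logprobs=-2.0, length=4)   # -2.0 / 4**2
long_score = beam_score(sum_logprobs=-5.0, length=10)   # -5.0 / 10**2

# With length_penalty=2.0 the longer hypothesis ranks higher;
# with length_penalty=0.0 (no normalisation) the shorter one would win.
print(short_score, long_score)
```

Because log-probabilities are negative, a positive length_penalty favours longer transcriptions, which is one plausible reason to pair it with the max_length cap of 47.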

Usage

Inference Example

To run OCR with this model:

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the fine-tuned model and processor
model = VisionEncoderDecoderModel.from_pretrained("rockerritesh/trOCR_ne")
processor = TrOCRProcessor.from_pretrained("rockerritesh/trOCR_ne")

# Load an image
image = Image.open("path_to_image.jpg").convert("RGB")

# Preprocess image and generate predictions
pixel_values = processor(images=image, return_tensors="pt").pixel_values
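# Pass the generation settings documented above explicitly
output_ids = model.generate(pixel_values, num_beams=8, max_length=47, length_penalty=2.0, no_repeat_ngram_size=3, early_stopping=True)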
decoded_text = processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

print("Recognized Text:", decoded_text)

Hugging Face Hub

The model and its processor are hosted in the same repository on the Hugging Face Hub:

Model and Processor: rockerritesh/trOCR_ne

Features

OCR for Nepali: Trained to recognize Nepali text in both handwritten and printed images.
Nepali Tokenizer: Uses the Nepali BERT tokenizer (Shushant/nepaliBERT) to encode and decode the output text.
Configurable Decoding: Supports beam search and a length penalty to improve generation quality.

Fine-Tuning Details

Hyperparameters

Hyperparameter	Value
Batch Size	16
Learning Rate	5e-5
Epochs	5
Optimizer	AdamW
Number of Beams	8
Max Length	47
Length Penalty	2.0
No Repeat N-Gram Size	3
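
The No Repeat N-Gram Size setting can be illustrated with a small, self-contained sketch. It shows the general idea behind the constraint (at each step, ban any token that would complete an n-gram already present in the output); it is not the library's actual implementation, and the helper name banned_next_tokens and the token ids are hypothetical.

```python
from typing import List, Set


def banned_next_tokens(tokens: List[int], ngram_size: int = 3) -> Set[int]:
    """Return token ids that would repeat an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned: Set[int] = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tokens[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# The sequence ends in (5, 7), and the trigram (5, 7, 9) already occurred,
# so token 9 may not be generated next.
print(banned_next_tokens([5, 7, 9, 2, 5, 7]))  # {9}
```

With a size of 3, legitimate repeated two-token sequences remain possible, while three-token loops are suppressed.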

Model Configuration

The model was configured as follows:

Decoder

Activation Function: ReLU
Attention Heads: 8
Layers: 6
Hidden Size: 256
FFN Size: 1024

Encoder

Hidden Size: 384
Layers: 12
Attention Heads: 6
Image Size: 384
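
As a quick sanity check on the numbers above, the following sketch derives the per-head attention dimension for each stack and the number of image patches the encoder sees. The patch size of 16 is an assumption (it is not stated in this card), and TransformerDims is a hypothetical helper class used only for this arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerDims:
    hidden_size: int
    num_heads: int
    num_layers: int

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden size evenly across heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


decoder = TransformerDims(hidden_size=256, num_heads=8, num_layers=6)
encoder = TransformerDims(hidden_size=384, num_heads=6, num_layers=12)

IMAGE_SIZE = 384
PATCH_SIZE = 16  # assumption: not given in this card

patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2

print(decoder.head_dim, encoder.head_dim, num_patches)  # 32 64 576
```

Under that patch-size assumption, each 384x384 input becomes a sequence of 576 patch embeddings (plus any special tokens) that the decoder attends to.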

Dataset Details

The dataset used for fine-tuning consists of diverse handwritten and printed Nepali text from publicly available and custom datasets.

Limitations and Bias

The model's performance depends on the quality and diversity of the fine-tuning dataset.
It may not generalize well to unseen handwriting styles or printed text with unconventional fonts.

Citation

If you use this model in your research or applications, please cite:

@article{rockerritesh-trocr-nepali,
  title={Fine-Tuned TrOCR Model for Nepali Language},
  author={Sumit Yadav},
  year={2024},
  url={https://huggingface.co/rockerritesh/trOCR_ne}
}

License

Apache-2.0

Model Size

44.4M parameters, stored as Safetensors in F32.