PhoBERT Vietnamese News Recommendation Model

This model is fine-tuned from vinai/phobert-base for Vietnamese news recommendation using contrastive learning on news categories.

Dataset Structure

The model was trained on a Vietnamese news dataset with the following columns (an illustrative record is sketched after the list):

  • URL: Article URL
  • Title: Article title
  • Summary: Article summary
  • Contents: Full article content
  • Date: Publication date
  • Author(s): Article author(s)
  • Category: Article category (used as labels)
  • Tags: Related tags
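
For illustration, one record might look like the sketch below; every field value is an invented placeholder, not a row from the actual dataset.

article = {
    "URL": "https://example.vn/the-thao/vi-du-bai-viet",  # placeholder URL
    "Title": "Tiêu đề bài viết",        # article title
    "Summary": "Tóm tắt ngắn.",         # short summary
    "Contents": "Toàn văn bài viết.",   # full text
    "Date": "2024-01-01",               # publication date
    "Author(s)": "Tên tác giả",         # author(s)
    "Category": "Thể thao",             # used as the training label
    "Tags": ["bóng đá", "đội tuyển"],   # related tags
}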

Model Details

  • Base Model: vinai/phobert-base
  • Task: Content-Based News Recommendation
  • Language: Vietnamese
  • Method: Classification-based contrastive learning + FAISS similarity search
  • Input: Title + Summary (concatenated; see the construction sketch below)
  • Parameters: ~135M (F32)
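
The exact concatenation format is not documented in this card; a minimal sketch, assuming Title and Summary are simply joined with a space, is shown below. Note that PhoBERT expects word-segmented Vietnamese input, so queries should be preprocessed the same way the training data was.

def build_input_text(title, summary):
    # Assumed format: plain space-joined concatenation of Title and Summary
    return f"{title} {summary}"

text = build_input_text("Tiêu đề bài viết", "Tóm tắt bài viết")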

Usage

Load Model and Generate Recommendations

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
import numpy as np
import pickle
from sklearn.preprocessing import normalize
from huggingface_hub import hf_hub_download

# Load model and tokenizer
model_name = "htNghiaaa/phobert-vietnamese-recommendation-1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference only: disable dropout

# Load FAISS index and metadata
index_path = hf_hub_download(repo_id=model_name, filename="faiss_index.index")
metadata_path = hf_hub_download(repo_id=model_name, filename="metadata.pkl")

index = faiss.read_index(index_path)
with open(metadata_path, "rb") as f:
    metadata = pickle.load(f)

# Embed a text: mean-pool the last hidden state, then L2-normalize so that
# inner-product search in FAISS is equivalent to cosine similarity
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model.roberta(**inputs)  # base encoder of the classification model
        embedding = outputs.last_hidden_state.mean(dim=1).numpy()
    return normalize(embedding, norm='l2').astype('float32')

# Function to get recommendations
def get_recommendations(query_text, top_k=5):
    query_embedding = get_embedding(query_text)
    similarities, indices = index.search(query_embedding, top_k)
    
    results = []
    for sim, idx in zip(similarities[0], indices[0]):
        results.append({
            'title': metadata['titles'][idx],
            'summary': metadata['summaries'][idx],
            'category': metadata['categories'][idx],
            'tags': metadata['tags'][idx],
            'url': metadata['urls'][idx],
            'similarity': float(sim)
        })
    return results

# Example query: "Bóng đá Việt Nam" ("Vietnamese football")
recommendations = get_recommendations("Bóng đá Việt Nam")
for rec in recommendations:
    print(f"[{rec['category']}] {rec['title']} (sim: {rec['similarity']:.4f})")

Training Details

  • Model: vinai/phobert-base
  • Training Method: Classification on news categories
  • Embedding Method: Mean pooling of last hidden state
  • Similarity Metric: Cosine similarity (via FAISS)
  • GPU: NVIDIA P100/T4
  • Batch Size: 16
  • Epochs: 3
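
The training script itself is not included in the repo. A minimal sketch of classification-based fine-tuning consistent with the numbers above might look like the following; the dataset loading, column handling, and placeholder rows are assumptions, not the author's actual code.

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from sklearn.preprocessing import LabelEncoder

# Placeholder rows; replace with the real dataset described above
articles = [
    {"Title": "Tiêu đề 1", "Summary": "Tóm tắt 1", "Category": "Thể thao"},
    {"Title": "Tiêu đề 2", "Summary": "Tóm tắt 2", "Category": "Kinh tế"},
]

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform([a["Category"] for a in articles])
texts = [f'{a["Title"]} {a["Summary"]}' for a in articles]  # concatenation format assumed
encodings = tokenizer(texts, truncation=True, padding=True, max_length=256)

class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base", num_labels=len(label_encoder.classes_))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phobert-news",
                           per_device_train_batch_size=16,  # batch size from Training Details
                           num_train_epochs=3),             # epochs from Training Details
    train_dataset=NewsDataset(encodings, labels),
)
trainer.train()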

Files Included

  • Model weights and config
  • Tokenizer files
  • faiss_index.index: FAISS index for similarity search
  • embeddings.npy: Pre-computed embeddings
  • metadata.pkl: Article metadata (urls, titles, summaries, categories, tags, authors, dates)
  • label_encoder.pkl: Category label encoder
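
The shipped index can also be rebuilt from embeddings.npy. A minimal sketch, assuming the file holds the article embeddings in the same row order as metadata.pkl (an inner-product index over L2-normalized vectors yields the cosine similarities described above):

import faiss
import numpy as np
from huggingface_hub import hf_hub_download

repo_id = "htNghiaaa/phobert-vietnamese-recommendation-1"
emb_path = hf_hub_download(repo_id=repo_id, filename="embeddings.npy")
embeddings = np.load(emb_path).astype("float32")

faiss.normalize_L2(embeddings)  # defensive re-normalization (no-op if already unit norm)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(embeddings)
faiss.write_index(index, "faiss_index.index")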