# PhoBERT Vietnamese News Recommendation Model

This model is fine-tuned from `vinai/phobert-base` for Vietnamese news recommendation, using contrastive learning over news categories.
## Dataset Structure

The model was trained on a Vietnamese news dataset with the following columns:

- URL: Article URL
- Title: Article title
- Summary: Article summary
- Contents: Full article content
- Date: Publication date
- Author(s): Article author(s)
- Category: Article category (used as labels)
- Tags: Related tags
## Model Details
- Base Model: vinai/phobert-base
- Task: Content-Based News Recommendation
- Language: Vietnamese
- Method: Classification-based contrastive learning + FAISS similarity search
- Input: Title + Summary (concatenated)
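The model card does not specify the exact concatenation format, so the separator below is an assumption; a minimal sketch of preparing a model input from an article's title and summary:

```python
def build_input(title: str, summary: str) -> str:
    # Hypothetical helper: join title and summary into a single string,
    # mirroring the "Title + Summary (concatenated)" input scheme above.
    # The ". " separator is an assumption, not documented by the model.
    return f"{title}. {summary}"

text = build_input("Bóng đá Việt Nam thắng lớn", "Đội tuyển giành chiến thắng 3-0.")
```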
## Usage

### Load Model and Generate Recommendations
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
import numpy as np
import pickle
from sklearn.preprocessing import normalize
from huggingface_hub import hf_hub_download

# Load model and tokenizer
model_name = "htNghiaaa/phobert-vietnamese-recommendation-1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Load FAISS index and metadata
index_path = hf_hub_download(repo_id=model_name, filename="faiss_index.index")
metadata_path = hf_hub_download(repo_id=model_name, filename="metadata.pkl")
index = faiss.read_index(index_path)
with open(metadata_path, "rb") as f:
    metadata = pickle.load(f)

def get_embedding(text):
    """Encode text and mean-pool the last hidden state into one vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding=True, max_length=256)
    with torch.no_grad():
        # model.roberta is the base PhoBERT encoder inside the classification model
        outputs = model.roberta(**inputs)
    embedding = outputs.last_hidden_state.mean(dim=1).numpy()
    return normalize(embedding, norm="l2").astype("float32")

def get_recommendations(query_text, top_k=5):
    """Return the top_k most similar articles from the FAISS index."""
    query_embedding = get_embedding(query_text)
    similarities, indices = index.search(query_embedding, top_k)
    results = []
    for sim, idx in zip(similarities[0], indices[0]):
        results.append({
            "title": metadata["titles"][idx],
            "summary": metadata["summaries"][idx],
            "category": metadata["categories"][idx],
            "tags": metadata["tags"][idx],
            "url": metadata["urls"][idx],
            "similarity": float(sim),
        })
    return results

# Example
recommendations = get_recommendations("Bóng đá Việt Nam")
for rec in recommendations:
    print(f"[{rec['category']}] {rec['title']} (sim: {rec['similarity']:.4f})")
```
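Because the embeddings are L2-normalized, an inner product between them equals cosine similarity. Assuming the FAISS index stores normalized vectors and searches by inner product, the lookup it performs can be sketched in plain NumPy (a toy stand-in, not the repo's actual index):

```python
import numpy as np

def cosine_top_k(query: np.ndarray, corpus: np.ndarray, top_k: int = 5):
    # For L2-normalized rows, the dot product is the cosine similarity,
    # which is what an inner-product FAISS index returns.
    sims = corpus @ query
    idx = np.argsort(-sims)[:top_k]
    return sims[idx], idx

# Toy corpus of four unit vectors in 3-D
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.0, 0.0, 1.0]], dtype="float32")
query = np.array([1.0, 0.0, 0.0], dtype="float32")
sims, idx = cosine_top_k(query, corpus, top_k=2)
# idx → [0, 2]: the identical vector first, then its nearest neighbour
```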
## Training Details
- Model: vinai/phobert-base
- Training Method: Classification on news categories
- Embedding Method: Mean pooling of last hidden state
- Similarity Metric: Cosine similarity (via FAISS)
- GPU: NVIDIA P100/T4
- Batch Size: 16
- Epochs: 3
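The mean-pooling step listed above can be written out in NumPy. Note this sketch uses a *masked* mean (averaging only real tokens), which is a common variant and an assumption here; the usage snippet above averages over all positions, padding included:

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # Masked mean pooling: average token embeddings, ignoring padding positions.
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, dim)
    counts = mask.sum(axis=1)                                         # (batch, 1)
    return summed / counts

# One sequence of 3 tokens with a 2-D hidden state; the last token is padding
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])
mask = np.array([[1, 1, 0]])
emb = mean_pool(hidden, mask)  # → [[2.0, 3.0]]
```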
## Files Included
- Model weights and config
- Tokenizer files
- faiss_index.index: FAISS index for similarity search
- embeddings.npy: Pre-computed embeddings
- metadata.pkl: Article metadata (urls, titles, summaries, categories, tags, authors, dates)
- label_encoder.pkl: Category label encoder
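Judging from the keys accessed in `get_recommendations`, `metadata.pkl` appears to be a pickled dict of parallel lists, one entry per indexed article. A sketch of that assumed layout (the field values are illustrative only):

```python
import io
import pickle

# Assumed structure of metadata.pkl, inferred from the keys used in the usage code.
metadata = {
    "urls": ["https://example.com/article-1"],
    "titles": ["Bóng đá Việt Nam thắng lớn"],
    "summaries": ["Đội tuyển giành chiến thắng 3-0."],
    "categories": ["Thể thao"],
    "tags": [["bóng đá", "đội tuyển"]],
}

# Round-trip through pickle, as the repo does with the real file on disk
buf = io.BytesIO()
pickle.dump(metadata, buf)
buf.seek(0)
loaded = pickle.load(buf)

idx = 0  # FAISS result indices map directly into these parallel lists
print(loaded["titles"][idx], loaded["categories"][idx])
```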