I currently have a dataset of 170k documents, each roughly 100-1000 words long, which I want to filter and then use to update a SQL database.
I need to classify two things:
Is this doc relevant to this task? (e.g. does the document talk about code-related tasks or devops at all?)
I am building a curriculum-learning-style dataset, so: is it an advanced doc (talks about advanced concepts) or an entry-level, beginner-friendly one? Rate 1-5.
Afterwards, actually extract the data.
I know embedding models exist that can be used for classification, but I don’t know if they can readily be applied to a task like this.
One part of me says “hey, you are earning some $200 a day at your job, just load it into some OpenAI-compatible API and don’t overoptimize.”
Another part of me says “I’ll do this again, and spending $200 to classify 1/10th of your dataset is a waste.”
How do you filter this kind of data? I know set-based models exist for relevant/irrelevant tasks. Task two should be a 3B model fine-tuned on this data.
My current plan is to do the project in 3 stages: first filter via a tiny model, then the rating, then the extraction.
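That three-stage flow is easy to sketch as a pipeline where each stage only runs on what survived the previous one. This is a minimal stand-in, not a real implementation: the keyword filter and constant rating are placeholders for whatever tiny model and fine-tuned 3B model end up doing those jobs.

```python
# Hypothetical three-stage pipeline: filter -> rate -> extract.
# Every stage body here is a placeholder, not a real model.

def is_relevant(doc: str) -> bool:
    # Stand-in for the tiny filter model (stage 1).
    keywords = ("code", "devops", "deploy", "function", "pipeline")
    return any(k in doc.lower() for k in keywords)

def rate_difficulty(doc: str) -> int:
    # Stand-in for the 1-5 rating stage (stage 2, e.g. a fine-tuned 3B model).
    return 3

def extract(doc: str) -> dict:
    # Stand-in for the final extraction stage (stage 3).
    return {"text": doc, "difficulty": rate_difficulty(doc)}

def run_pipeline(docs):
    # Only relevant docs pay for the later, more expensive stages.
    return [extract(d) for d in docs if is_relevant(d)]

rows = run_pipeline([
    "How to deploy a service with CI/CD",
    "My favourite pasta recipe",
])
print(len(rows))  # only the devops doc survives the filter
```

The point of the shape is cost control: the cheap filter sees all 170k docs, while the expensive stages see only the relevant slice.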
If you did go with something like a BERT-based classification model, you would still need training data.
These days, with LLMs, it’s kind of a no-brainer to bootstrap this initial annotation step with a frontier model for, say, 1,000 docs, then reassess from there.
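A rough sketch of that bootstrap step against any OpenAI-compatible endpoint, using only the stdlib. The model name, endpoint URL, and JSON label schema are all my own assumptions; asking for a strict JSON object keeps the parsing trivial.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"  # any OpenAI-compatible endpoint
MODEL = "gpt-4o-mini"  # placeholder model name

def build_prompt(doc: str) -> str:
    # Ask for a strict JSON object so parsing stays trivial.
    return (
        "Label the document below. Reply with JSON only, e.g. "
        '{"relevant": true, "difficulty": 3}.\n'
        "relevant: does it discuss code or devops at all?\n"
        "difficulty: 1 (beginner) to 5 (advanced).\n\n"
        f"Document:\n{doc}"
    )

def parse_label(reply: str) -> dict:
    # Tolerate models that wrap the JSON in ```json fences.
    reply = reply.strip().removeprefix("```json").removesuffix("```").strip()
    label = json.loads(reply)
    return {"relevant": bool(label["relevant"]),
            "difficulty": int(label["difficulty"])}

def label_doc(doc: str, api_key: str) -> dict:
    # One doc per call: the naive baseline to measure optimisations against.
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": build_prompt(doc)}],
    }).encode()
    req = urllib.request.Request(
        API_URL, data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())["choices"][0]["message"]["content"]
    return parse_label(reply)
```

Run it over 1,000 docs, spot-check the labels yourself, and you have both a quality estimate for the LLM and a seed training set.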
Depending on how you implement it, the LLM calls don’t need to break the bank. For a tool I’m building (everyrow.io), there are quite a few cost optimisations available beyond a naïve per-document LLM evaluation, including batching and dynamically adjusting reasoning effort based on how obvious the examples are.

Also, depending on how much you care about a globally consistent definition of your 1-5 rating, I would do that in a separate stage after the initial filtering, so you can spend more compute per document getting the ratings cohesive without wasting time on the irrelevant docs.
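Batching is the easiest of those wins: packing several docs into one call amortises the prompt overhead. A sketch of the pack/unpack side, assuming a JSON-array reply format; the numbering scheme and schema are my own, not how everyrow.io does it.

```python
import json

def build_batch_prompt(docs: list[str]) -> str:
    # Number the docs so the model can return labels keyed by position.
    numbered = "\n\n".join(f"### Doc {i}\n{d}" for i, d in enumerate(docs))
    return (
        "For each doc below, output a JSON array of objects "
        '[{"id": 0, "relevant": true}, ...], one object per doc.\n\n'
        + numbered
    )

def parse_batch_reply(reply: str, n_docs: int) -> list[bool]:
    labels = json.loads(reply)
    if len(labels) != n_docs:
        # Batched calls fail in new ways; retry or fall back to per-doc calls.
        raise ValueError("model returned wrong number of labels")
    # Sort by id so a shuffled reply still lines up with the input order.
    labels.sort(key=lambda x: x["id"])
    return [bool(x["relevant"]) for x in labels]
```

The length check matters: the main failure mode of batching is the model dropping or duplicating an item, and you want that to surface as a retry rather than silently misaligned labels.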
From there, either conclude that the cost is worth it and scale to the whole dataset, or use what you now have as near-perfect training data to distill into a smaller model you can train yourself, following something like Text classification.
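Turning the frontier-model labels into inputs for that smaller model is mostly bookkeeping: a deterministic shuffle, a held-out eval slice, and JSONL files most trainers accept. A stdlib-only sketch, with field names that are assumptions rather than a required format:

```python
import json
import random

def make_splits(labeled, eval_frac=0.1, seed=0):
    # labeled: list of {"text": ..., "label": ...} dicts from the LLM pass.
    rows = list(labeled)
    random.Random(seed).shuffle(rows)  # seeded, so the split is reproducible
    n_eval = max(1, int(len(rows) * eval_frac))
    return rows[n_eval:], rows[:n_eval]

def write_jsonl(path, rows):
    # One JSON object per line: the format most fine-tuning scripts accept.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Dummy labels standing in for the LLM-annotated docs.
labeled = [{"text": f"doc {i}", "label": i % 5 + 1} for i in range(100)]
train, eval_ = make_splits(labeled)
print(len(train), len(eval_))  # 90 10
```

The eval slice is what lets you compare the distilled model against the frontier labels before trusting it on the remaining 150k+ docs.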