I currently have a dataset of 170k documents, each roughly 100-1000 words long, which I want to filter and then use to update a SQL database.
I need to classify two things:
Is this doc relevant to this task? (e.g. does the document talk about code-related tasks or devops at all?)
I am building a curriculum-learning-style dataset, so: is it an advanced doc (talks about advanced concepts) or an entry-level, beginner-friendly one? Rate 1-5.
Afterwards, actually extract the data.
I know embedding models exist that can be used for classification, but I don’t know if they can readily be applied to a task like this.
One part of me says “hey, you are earning some $200 a day at your job, just load it into some OpenAI-compatible API and don’t overoptimize.”
Another part of me says “I’ll do this again, and spending $200 to classify 1/10th of your dataset is a waste.”
How do you filter this kind of data? I know set-based models exist for relevant/irrelevant tasks. Task two should be a 3B model fine-tuned on this data.
My current plan is to do the project in 3 stages: first filter via a tiny model, then the rating, then the extraction.
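That three-stage flow is easy to sketch as a pipeline where each stage only runs on what survived the previous one. This is a minimal stand-in, not a real implementation: the keyword filter and constant rating are placeholders for whatever tiny model and fine-tuned 3B model end up doing those jobs.

```python
# Hypothetical three-stage pipeline: filter -> rate -> extract.
# Every stage body here is a placeholder, not a real model.

def is_relevant(doc: str) -> bool:
    # Stand-in for the tiny filter model (stage 1).
    keywords = ("code", "devops", "deploy", "function", "pipeline")
    return any(k in doc.lower() for k in keywords)

def rate_difficulty(doc: str) -> int:
    # Stand-in for the 1-5 rating stage (stage 2, e.g. a fine-tuned 3B model).
    return 3

def extract(doc: str) -> dict:
    # Stand-in for the final extraction stage (stage 3).
    return {"text": doc, "difficulty": rate_difficulty(doc)}

def run_pipeline(docs):
    # Only relevant docs pay for the later, more expensive stages.
    return [extract(d) for d in docs if is_relevant(d)]

rows = run_pipeline([
    "How to deploy a service with CI/CD",
    "My favourite pasta recipe",
])
print(len(rows))  # only the devops doc survives the filter
```

The point of the shape is cost control: the cheap filter sees all 170k docs, while the expensive stages see only the relevant slice.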
If you did go with something like a BERT-based classification model, you would still need training data.
These days, with LLMs, it’s kind of a no-brainer to bootstrap this initial annotation step with a frontier model for, say, 1,000 docs, then reassess from there.
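A rough sketch of that bootstrap step against any OpenAI-compatible endpoint, using only the stdlib. The model name, endpoint URL, and JSON label schema are all my own assumptions; asking for a strict JSON object keeps the parsing trivial.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"  # any OpenAI-compatible endpoint
MODEL = "gpt-4o-mini"  # placeholder model name

def build_prompt(doc: str) -> str:
    # Ask for a strict JSON object so parsing stays trivial.
    return (
        "Label the document below. Reply with JSON only, e.g. "
        '{"relevant": true, "difficulty": 3}.\n'
        "relevant: does it discuss code or devops at all?\n"
        "difficulty: 1 (beginner) to 5 (advanced).\n\n"
        f"Document:\n{doc}"
    )

def parse_label(reply: str) -> dict:
    # Tolerate models that wrap the JSON in ```json fences.
    reply = reply.strip().removeprefix("```json").removesuffix("```").strip()
    label = json.loads(reply)
    return {"relevant": bool(label["relevant"]),
            "difficulty": int(label["difficulty"])}

def label_doc(doc: str, api_key: str) -> dict:
    # One doc per call: the naive baseline to measure optimisations against.
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": build_prompt(doc)}],
    }).encode()
    req = urllib.request.Request(
        API_URL, data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())["choices"][0]["message"]["content"]
    return parse_label(reply)
```

Run it over 1,000 docs, spot-check the labels yourself, and you have both a quality estimate for the LLM and a seed training set.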
Depending on how you implement it, the LLM calls don’t need to break the bank. For a tool I’m building (everyrow.io), there are quite a few cost optimisations available beyond a naïve per-document LLM evaluation, including batching and dynamically adjusting reasoning effort based on how obvious the examples are.

Also, depending on how much you care about a globally consistent definition of your 1-5 rating, I would do that in a separate stage after the initial filtering, so you can spend more compute per document getting the ratings cohesive without wasting time on the irrelevant docs.
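Batching is the easiest of those wins: packing several docs into one call amortises the prompt overhead. A sketch of the pack/unpack side, assuming a JSON-array reply format; the numbering scheme and schema are my own, not how everyrow.io does it.

```python
import json

def build_batch_prompt(docs: list[str]) -> str:
    # Number the docs so the model can return labels keyed by position.
    numbered = "\n\n".join(f"### Doc {i}\n{d}" for i, d in enumerate(docs))
    return (
        "For each doc below, output a JSON array of objects "
        '[{"id": 0, "relevant": true}, ...], one object per doc.\n\n'
        + numbered
    )

def parse_batch_reply(reply: str, n_docs: int) -> list[bool]:
    labels = json.loads(reply)
    if len(labels) != n_docs:
        # Batched calls fail in new ways; retry or fall back to per-doc calls.
        raise ValueError("model returned wrong number of labels")
    # Sort by id so a shuffled reply still lines up with the input order.
    labels.sort(key=lambda x: x["id"])
    return [bool(x["relevant"]) for x in labels]
```

The length check matters: the main failure mode of batching is the model dropping or duplicating an item, and you want that to surface as a retry rather than silently misaligned labels.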
From there, either conclude that the cost is worth it and scale to the whole dataset, or use what you now have as near-perfect training data to distill into a smaller model you can train yourself, following something like Text classification.
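Turning the frontier-model labels into inputs for that smaller model is mostly bookkeeping: a deterministic shuffle, a held-out eval slice, and JSONL files most trainers accept. A stdlib-only sketch, with field names that are assumptions rather than a required format:

```python
import json
import random

def make_splits(labeled, eval_frac=0.1, seed=0):
    # labeled: list of {"text": ..., "label": ...} dicts from the LLM pass.
    rows = list(labeled)
    random.Random(seed).shuffle(rows)  # seeded, so the split is reproducible
    n_eval = max(1, int(len(rows) * eval_frac))
    return rows[n_eval:], rows[:n_eval]

def write_jsonl(path, rows):
    # One JSON object per line: the format most fine-tuning scripts accept.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Dummy labels standing in for the LLM-annotated docs.
labeled = [{"text": f"doc {i}", "label": i % 5 + 1} for i in range(100)]
train, eval_ = make_splits(labeled)
print(len(train), len(eval_))  # 90 10
```

The eval slice is what lets you compare the distilled model against the frontier labels before trusting it on the remaining 150k+ docs.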