Good model, but with limitations - Good at English credentials but has worse performance in non-English ones

by dof-studio-org - opened 15 days ago

Discussion

dof-studio-org

15 days ago

PAY ATTENTION GUYS:

Good at English credentials but has worse performance in non-English ones

For example, if an address contains non-Latin characters, detection may fail!

e.g. Baby、今夜大和ホテル鳳舞館に行くよ。そこの304で待っててね

In my view, the exact room number should be masked, or I should at least have the option to mask it.

But IN ENGLISH

Babe I'm gonna be at the Yamato Hotel Hobukan Room tonight. Room 304—wait for me there!

THE ADDRESS IS MARKED. You can test this case.

Just a warning for those who want to use it in multilingual settings. :)

NodeLinker

15 days ago

Not just worse, it seems to ignore other languages altogether.

hamzah0asadullah

15 days ago

•

edited about 13 hours ago

@dof-studio-org , you are hereby greeted,

The model card states this (as of commit 7ffa9a043d54d1be65afb281eddf0ffbe629385b on file README.md):

Line 148: - Language(s): Primarily English; selected multilingual robustness evaluation reported
Line 167: Performance may drop on non-English text, non-Latin scripts, protected-group naming patterns, or domains that are out of distribution compared to model training.

Your example (Baby、今夜大和ホテル鳳舞館に行くよ。そこの304で待っててね) will therefore fail because:

Line 148 reports that the model was trained primarily on English, so other languages (including Japanese) are not guaranteed to work.
Line 167 reports that performance "may" (it's better to interpret this as "will") drop on non-English text and non-Latin scripts, which implies poor (or no) support for languages like Japanese, Arabic and Thai.
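
For anyone who wants to guard against this in practice, here is a minimal sketch that flags input containing letters outside the Latin script, so such text can be sent for extra review instead of trusting the detector blindly. This is my own assumption about a reasonable pre-check, not something taken from the model card or the model's API; the function name `has_non_latin_letters` is hypothetical.

```python
import unicodedata


def has_non_latin_letters(text: str) -> bool:
    """Return True if any alphabetic character is not a Latin-script letter.

    Hypothetical helper: it relies on Unicode character names, so it is a
    heuristic rather than a full script detector.
    """
    for ch in text:
        if ch.isalpha():
            # unicodedata.name returns the default ("") for unnamed characters
            name = unicodedata.name(ch, "")
            if "LATIN" not in name:
                return True
    return False


# Example: the Japanese sentence from this thread is flagged, the English one is not.
japanese = "Baby、今夜大和ホテル鳳舞館に行くよ。そこの304で待っててね"
english = "Babe I'm gonna be at the Yamato Hotel Hobukan Room tonight."
print(has_non_latin_letters(japanese), has_non_latin_letters(english))
```

A check like this only tells you that the text is outside the model's main training language; it does not detect or mask anything by itself.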

You are hereby bidden farewell.

NodeLinker

11 days ago

•

edited 11 days ago

> e.g. Baby、今夜大和ホテル鳳舞館に行くよ。そこの304で待っててね

Nvidia's model is more powerful and even multilingual, and it was originally trained as a classifier. You can try it here.

By the way, your example works on it; it handles it just fine.

mihaimaruseac

OpenAI org 11 days ago

We released the model with some support for other languages, but with the idea that the model can be fine-tuned to support them (as well as other labels), much faster.

We can improve in the next version.

> By the way, your example works on it; it handles it just fine.

I wouldn't call it "just fine". But we should not get into these types of comparisons. Running a benchmark against real datasets is better than relying on just one example.
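
As a rough illustration of what benchmarking against a dataset means here, a minimal sketch of span-level scoring could look like the following. It assumes exact-match spans given as `(start, end, label)` tuples and micro-averages over all documents; the function name `micro_prf` and the labels in the example are hypothetical, and real PII benchmarks may score partial overlaps differently.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def micro_prf(gold_docs: List[Set[Span]],
              pred_docs: List[Set[Span]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact-match spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # spans found with the right boundaries and label
        fp += len(pred - gold)   # predicted spans that are not in the gold set
        fn += len(gold - pred)   # gold spans the detector missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy data with made-up labels: one missed span and one spurious span.
gold_docs = [{(0, 5, "NAME"), (10, 13, "ROOM")}, {(2, 8, "ADDRESS")}]
pred_docs = [{(0, 5, "NAME")}, {(2, 8, "ADDRESS"), (9, 12, "PHONE")}]
print(micro_prf(gold_docs, pred_docs))
```

With many documents per language, the same function can be run per language to see where a detector actually falls behind, which a single example cannot show.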

NodeLinker

10 days ago

> I wouldn't call it "just fine". But we should not get into these types of comparisons. Running a benchmark against real datasets is better than just one example

Regarding fine-tuning, I already mentioned it in another discussion. Thank you for paying attention to this example and for running it on Nvidia's model yourself. I tried a few examples on Nvidia and on yours — there's no clear leader in terms of quality, both have issues. It's just that Nvidia initially has more classes (even risk levels) and multilingual support, as well as a public dataset on this topic (I don't know if they actually trained on it, so I won't make that claim).

mihaimaruseac

OpenAI org 10 days ago

We are not in competition. We will benchmark against GLiNER and can work towards improving.

We selected fewer classes because we also wanted a smaller model for this release. Since people can fine-tune for other classes, we didn't feel that was a big issue.

mihaimaruseac changed discussion status to closed 10 days ago

mihaimaruseac

OpenAI org 3 days ago

For transparency: A third-party benchmark against GLiNER dropped.

There are places where this model works better, there are places where we need to work more.

NodeLinker

1 day ago

> For transparency: A third-party benchmark against GLiNER dropped.
>
> There are places where this model works better, there are places where we need to work more.

Writing in the discussion you closed yourself 👍

I saw this benchmark a few days ago on Reddit.
https://www.reddit.com/r/LocalLLaMA/comments/1t0sov1/openais_privacy_filter_vs_gliner_on_600_pii/

The author has a GPT image generation project on GitHub, so perhaps I drew his attention to the Nvidia model, and he in turn ran a benchmark and published it on Reddit.

In any case, the community has become more aware of PII models, despite the bile in the discussions.
