Turns out: if we predict 🌏 Earth, we can save a lot of time, spending more of it looking for interesting things and less of it looking at things we already expect to see.
Sentinel-2 imagery 🛰️ takes a long time to downlink to Earth, so our "near real time" systems are quite far from real time in practice.
Meanwhile, if we "predict" what we will see based on what we do see, we can send down much less data in a timely way and prioritize the 📡 Earth-bound response.
I'm talking about illegal fishing, logging, mining, or building in nature reserves: the more of that we predict early, the more of it we can stop in time.
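To make the idea concrete, here is a minimal sketch of prediction-based downlink prioritization. It is not the actual pipeline: the function name `select_tiles_to_downlink`, the flat per-tile pixel lists, the mean-absolute-error score, and the `threshold` value are all assumptions made for illustration. The point is only that tiles matching the prediction stay on board, while surprising tiles are sent first.

```python
def select_tiles_to_downlink(observed, predicted, threshold):
    """Return indices of tiles whose prediction error exceeds `threshold`,
    ordered from most to least surprising.

    `observed` and `predicted` are lists of tiles; each tile is a flat list
    of pixel values (a hypothetical layout chosen for this sketch).
    """
    selected = []
    for index, (obs_tile, pred_tile) in enumerate(zip(observed, predicted)):
        # Mean absolute error between what we saw and what we expected.
        error = sum(abs(o - p) for o, p in zip(obs_tile, pred_tile)) / len(obs_tile)
        if error > threshold:
            selected.append((index, error))
    # Most surprising tiles first, so they get bandwidth priority.
    selected.sort(key=lambda item: item[1], reverse=True)
    return [index for index, _ in selected]


observed = [
    [0.10, 0.12, 0.11, 0.10],  # tile 0: almost exactly as predicted
    [0.50, 0.90, 0.85, 0.40],  # tile 1: large unexpected change
    [0.30, 0.31, 0.45, 0.30],  # tile 2: small unexpected change
]
predicted = [
    [0.10, 0.12, 0.10, 0.10],
    [0.20, 0.20, 0.25, 0.20],
    [0.30, 0.30, 0.30, 0.30],
]

priority = select_tiles_to_downlink(observed, predicted, threshold=0.03)
print(priority)  # [1, 2]
```

Only two of the three tiles are queued, and the one with the largest deviation goes first. A real system would likely use a learned predictor and a more robust change score, but the ordering logic would look similar.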
Just sharing a little breakthrough with Gherbal LID: we managed to distinguish 15 variants of Arabic, with 6 variants above 90% accuracy and 10 variants above 85%, which means we can practically tell apart Moroccan and Algerian (which overlap massively).
It also embraces the duality of MSA and Arabic variants pioneered in ALDi by @AMR-KELEG et al.
Now we're only bottlenecked by the availability of high-quality data for the low-scoring variants such as Iraqi, Libyan, Sudanese, Adeni ...
We got Qwen 3.5 to count Rs in Strawberry correctly! 🚨
Building on Sawtone, we’ve been testing a different way to feed language into an LLM to build the next generation of multilingual AI.
The usual setup gives the model tokenized text and asks it to perform various linguistic tasks. That works surprisingly well, until it doesn’t. Accents disappear. Words get mangled. Internal structure gets blurred away. And the cost of that gets higher once you move into multilingual and lower-resource settings.
So we tried adding a second path.
In addition to the normal text input, the model also receives Sawtone: a byte-level word representation that preserves how a word is written, how it sounds, and how it is structured.
Same LLM. Better interface.
In this proof of concept with Qwen 3.5 0.8B, that pushed our eval from 64% to 88%. The gains showed up exactly where tokenized models usually get shaky: diacritics, character order, exact spelling, and other form-sensitive behavior.
Sawtone itself is tokenizer-free, byte-level, and pre-trained across 507 languages.
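A small illustration of why byte-level access helps with form-sensitive tasks. This is not Sawtone's actual representation, which is not specified here; it uses plain UTF-8 bytes as a stand-in, and the helper names `to_bytes`, `strip_diacritics`, and `count_letter` are made up for this sketch. It shows that accents and character order survive at the byte level even when a lossy normalization step erases them.

```python
import unicodedata


def to_bytes(word):
    """Encode a word as a list of UTF-8 byte values."""
    return list(word.encode("utf-8"))


def strip_diacritics(word):
    """Simulate a lossy pipeline: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_letter(word, letter):
    """Count occurrences of a letter, case-insensitively."""
    return word.lower().count(letter.lower())


plain = "resume"
accented = "r\u00e9sum\u00e9"  # "résumé" in precomposed (NFC) form

# After lossy normalization the two words are indistinguishable...
print(strip_diacritics(accented) == plain)  # True
# ...but their byte sequences differ, so the accent is still visible.
print(to_bytes(plain))     # [114, 101, 115, 117, 109, 101]
print(to_bytes(accented))  # [114, 195, 169, 115, 117, 109, 195, 169]

# Character order is explicit at the byte level as well.
print(to_bytes("form") != to_bytes("from"))  # True

# With direct access to characters, letter counting is trivial.
print(count_letter("Strawberry", "r"))  # 3
```

A subword tokenizer may merge several of these characters into one opaque token, which is where exact-spelling questions tend to go wrong.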
• 10 Regional Leaderboards
• 17 LID models (+7 new, incl. non-fastText based)
• 449 languages in total (200+ additional)
• Fixed: F1 macro reporting error
• Normalized language codes for more accurate results
The dataset is also updated, now with individual model predictions to reproduce and validate our findings.
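For anyone checking the numbers against the per-model predictions, here is a minimal sketch of macro F1 in plain Python. It is not the leaderboard's own scoring code; the function name `macro_f1` and the toy labels are assumptions for illustration. Macro F1 averages the per-language F1 scores with equal weight, so a weak low-resource language pulls the score down as much as a large one.

```python
def macro_f1(y_true, y_pred):
    """Return (macro-averaged F1, per-label F1 dict) for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    per_label = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never appears.
        per_label[label] = 2 * tp / denom if denom else 0.0
    return sum(per_label.values()) / len(per_label), per_label


# Toy example with three language codes (illustrative data, not real results).
y_true = ["ary", "ary", "arq", "arq", "arz", "arz"]
y_pred = ["ary", "arq", "arq", "arq", "arz", "arz"]

score, per_label = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.8222
print({label: round(value, 4) for label, value in per_label.items()})
# {'arq': 0.8, 'ary': 0.6667, 'arz': 1.0}
```

Plain accuracy on the same toy data is 5/6 (about 0.8333), so the two metrics can diverge even on small examples.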