Masakhane NLP

community

https://www.masakhane.io/

MasakhaneNLP

masakhane-io


AI & ML interests

NLP for African languages, MT, NER, POS, QA, ...

Recent Activity

dutkulang updated a Space 6 days ago

masakhane/dialogue-chat

abumafrim updated a dataset 11 days ago

masakhane/afriscience_mt

abumafrim published a dataset 11 days ago

masakhane/afriscience_mt


dutkulang

updated a Space 6 days ago

Chat with Masakhane Dialogue Models

🌍

abumafrim

updated a dataset 11 days ago

masakhane/afriscience_mt

Viewer • Updated 11 days ago • 2.03M • 129 • 1

abumafrim

published a dataset 11 days ago

masakhane/afriscience_mt

Viewer • Updated 11 days ago • 2.03M • 129 • 1

israel

updated a collection 23 days ago

AfrIFact

Collection

a multilingual information retrieval, evidence retrieval and fact checking benchmark covering healthcare, culturally grounded content • 4 items • Updated 23 days ago

Tonic

posted an update about 1 month ago

Post

2928

🙋🏻‍♂️ Hey there folks ,

Turns out: if we predict 🌏 Earth, we can save a lot of time when looking for interesting things and spend less time looking at things that we expect to see.

Sentinel-2 imagery 🛰️ basically takes a long time to downlink to Earth, so our "near real time" systems are quite far from that in practical terms.

Meanwhile, if we "predict" what we will see, based on what we do see, we can send down much less data in a timely way and prioritize 📡 Earth-bound response.

I'm talking about illegal fishing, logging, mining or building in nature reserves: the more of that we predict early, the more we're able to stop it in time.

At least that's the concept!

Check out the blog: https://huggingface.co/blog/Tonic/save-patagonia-by-predicting-earth

- Collection: https://huggingface.co/collections/NuTonic/earth-observation-with-temporal-and-general-understanding
- Code: https://github.com/Josephrp/Nutonic
- Dataset: NuTonic/sat-vl-sft-training-ready-v1
- Model: NuTonic/lspace
- Training: NuTonic/lspace-trackio
- Evals: NuTonic/Patagonia_Eval
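
The "predict, then send only the surprises" idea in the post can be sketched in a few lines. This is only an illustration under stated assumptions, not the NuTonic pipeline: the function names, the flat list of per-tile values, and the threshold are all hypothetical, and the prediction is assumed to be available on both the satellite and the ground.

```python
def changed_tiles(predicted, observed, threshold):
    """Satellite side (hypothetical): keep only tiles whose absolute
    prediction error exceeds the threshold, keyed by tile index."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return {
        i: obs
        for i, (pred, obs) in enumerate(zip(predicted, observed))
        if abs(obs - pred) > threshold
    }


def reconstruct(predicted, updates):
    """Ground side (hypothetical): start from the shared prediction
    and overwrite the tiles that were actually transmitted."""
    frame = list(predicted)
    for i, value in updates.items():
        frame[i] = value
    return frame


# Toy per-tile values (made up for illustration, e.g. a vegetation index).
predicted = [0.20, 0.21, 0.80, 0.35]
observed = [0.21, 0.20, 0.30, 0.36]

updates = changed_tiles(predicted, observed, threshold=0.1)
print(updates)                          # only the surprising tile is sent
print(reconstruct(predicted, updates))  # ground-side estimate of the scene
```

In this toy setup only one of four tiles is transmitted, and that tile is also the one worth a closer look. The tiles that are not sent are reconstructed from the prediction, so they carry an error of up to the chosen threshold.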

2 replies

Tonic

posted an update about 2 months ago

Post

4334

🙋🏻‍♂️ Hey there folks,

Since everyone liked my previous announcement post ( https://huggingface.co/posts/Tonic/338509028435394 ) so much, I'm back with more high-quality procedural datasets in the geospatial domain for SFT training!

Check this one out:
NuTonic/sat-bbox-metadata-sft-v1

The goal is to be able to train vision models on multiple images for remote sensing analysis with one shot.

hope you like it ! 🚀

2 replies

Tonic

posted an update about 2 months ago

Post

3669

🙋🏻‍♂️ Hey there folks ,

I'm sharing Hugging Face's largest dataset of annotated satellite images today.

Check it out here: NuTonic/sat-image-boundingbox-sft-full

I hope you like it; the idea is to be able to use this with small vision models 🚀

omarkamali

posted an update 2 months ago

Post

1006

Just sharing a little breakthrough with Gherbal LID: we managed to distinguish 15 variants of Arabic, with 6 variants above 90% accuracy and 10 variants above 85%, and can now practically distinguish Moroccan from Algerian (which overlap massively).

It also embraces the duality of MSA and Arabic variants pioneered in ALDi by @AMR-KELEG et al.

Now we're only bottlenecked by the availability of high-quality data for the low-scoring variants such as Iraqi, Libyan, Sudanese, Adeni ...

More on Gherbal at:
https://omneitylabs.com/models/gherbal

1 reply

omarkamali

posted an update 2 months ago

Post

4595

We got Qwen 3.5 to count Rs in Strawberry correctly! 🚨

Building on Sawtone, we’ve been testing a different way to feed language into an LLM to build the next generation of multilingual AI.

The usual setup gives the model tokenized text and asks it to perform various linguistic tasks. That works surprisingly well, until it doesn’t. Accents disappear. Words get mangled. Internal structure gets blurred away. And the cost of that gets higher once you move into multilingual and lower-resource settings.

So we tried adding a second path.

In addition to the normal text input, the model also receives Sawtone: a byte-level word representation that preserves how a word is written, how it sounds, and how it is structured.

Same LLM. Better interface.

In this proof of concept with Qwen 3.5 0.8B, that pushed our eval from 64% to 88%. The gains showed up exactly where tokenized models usually get shaky: diacritics, character order, exact spelling, and other form-sensitive behavior.

Sawtone itself is tokenizer-free, byte-level, and pre-trained across 507 languages.

Still early, but promising!
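
To make the "byte-level" point concrete, here is a minimal sketch of why raw bytes keep form-sensitive detail that a lossy text pipeline can drop. It is not Sawtone's actual representation, which the post does not specify: it only uses plain UTF-8 bytes, and the helper names are hypothetical.

```python
import unicodedata


def byte_view(word: str) -> list[int]:
    """Return the UTF-8 byte values of a word, in written order."""
    return list(word.encode("utf-8"))


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter by matching its bytes, case-insensitively."""
    return word.lower().encode("utf-8").count(letter.lower().encode("utf-8"))


def strip_accents(word: str) -> str:
    """Simulate a lossy pipeline that drops diacritics (for contrast only)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "caf\u00e9"  # "café" with a precomposed e-acute

print(count_letter("Strawberry", "r"))  # 3
print(byte_view(word))                  # [99, 97, 102, 195, 169]
print(byte_view(strip_accents(word)))   # [99, 97, 102, 101]
```

At the byte level every written character is visible, so exact spelling and diacritics are trivially recoverable. The accent-stripped variant shows the kind of information loss the post describes.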

5 replies

israel

authored 6 papers 2 months ago

CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Paper • 2505.24456 • Published May 30, 2025

AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text

Paper • 2503.18247 • Published Mar 24, 2025

Afri-MCQA: Multimodal Cultural Question Answering for African Languages

Paper • 2601.05699 • Published Jan 9 • 3

Accept or Deny? Evaluating LLM Fairness and Performance in Loan Approval across Table-to-Text Serialization Approaches

Paper • 2508.21512 • Published Aug 29, 2025

Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages

Paper • 2603.23654 • Published Mar 24

AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages

Paper • 2604.00706 • Published Apr 1

israel

updated a collection 3 months ago

AfrIFact

Collection

a multilingual information retrieval, evidence retrieval and fact checking benchmark covering healthcare, culturally grounded content • 4 items • Updated 23 days ago

omarkamali

posted an update 3 months ago

Post

233

🌐 LID Benchmark update:

• 10 Regional Leaderboards
• 17 LID models (+7 new, incl. non-fastText based)
• 449 languages in total (200+ additional)
• Fixed: F1 macro reporting error
• Normalized language codes for more accurate results

The dataset is also updated, now with individual model predictions to reproduce and validate our findings.

omneity-labs/lid-benchmark
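
For readers unfamiliar with the two fixes mentioned in the list above, the following is a rough sketch of macro-averaged F1 computed after language-code normalization. It is not the benchmark's own scoring code: the alias table, the normalization rule, and the choice to average over gold labels only are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical alias table mapping variant codes to one canonical code.
ALIASES = {"ara": "ar", "arb": "ar", "fra": "fr", "fre": "fr"}


def normalize(code: str) -> str:
    """Lowercase, drop script/region suffixes, then apply the alias table."""
    base = code.strip().lower().split("_")[0].split("-")[0]
    return ALIASES.get(base, base)


def macro_f1(gold, pred) -> float:
    """Unweighted mean of per-language F1, averaged over gold labels."""
    gold = [normalize(g) for g in gold]
    pred = [normalize(p) for p in pred]
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative
    labels = sorted(set(gold))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["ara", "ar", "fr", "sw"]
pred = ["ar", "arb_Arab", "sw", "sw"]
print(round(macro_f1(gold, pred), 4))  # 0.5556
```

Without normalization, "ara" versus "ar" would count as an error even though both name the same language. Macro averaging weights each language equally, so a missed low-resource language lowers the score as much as a missed high-resource one.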

israel

updated a dataset 3 months ago

masakhane/AfrIFact

Viewer • Updated Apr 2 • 9.81k • 137 • 2

Team members 113
