Match Product Names

Is a pre-trained model available to identify common product names between two datasets where the names are not a perfect match? For instance, the datasets may contain similar but not identical names for the same product, such as “Filippo Berio Extra Virgin Olive Oil” and “Filippo Berio Olive Oil (Extra Virgin)”, but both refer to the same product.


Hey, I have a similar problem. Did you manage to solve it? I’m very interested in the solution you implemented. Thanks!


I asked Hugging Chat first, and it seems that a BERT-series model is sufficient.


Answer by HuggingChat:

To address the task of identifying common product names between two datasets where the names are not exact matches, such as “Filippo Berio Extra Virgin Olive Oil” and “Filippo Berio Olive Oil (Extra Virgin),” a model-based approach leveraging semantic similarity can be employed. Here’s a structured solution:

  1. Model Selection: Utilize models capable of generating semantic embeddings, such as those based on sentence-BERT or other transformer architectures fine-tuned for semantic similarity. These models can convert product names into numerical representations that capture their meaning.

  2. Dataset Preparation: Seek existing datasets on Hugging Face that include product names with similar variants. If none are available, prepare a dataset by collecting pairs of product names that refer to the same product but are phrased differently.

  3. Fine-Tuning (if needed): If a pre-trained model doesn’t meet requirements, fine-tune it on the prepared dataset. This might involve supervised learning where the model learns to recognize similar product names as matches.

  4. Implementation Strategy:

    • Embedding Generation: Use the chosen model to generate embeddings for each product name in both datasets.
    • Similarity Calculation: Compare these embeddings to identify product names that are semantically similar, even if their strings aren’t identical.
  5. Tools and Resources:

    • Hugging Face Model Hub: Explore available models for semantic similarity.
    • Hugging Face Dataset Hub: Search for relevant product name datasets.
    • Preprocessing Scripts: Use tools mentioned in Source 1 for dataset preparation and conversion.

By following these steps, you can leverage Hugging Face’s ecosystem to identify similar product names across datasets effectively.

Hugging Face provides a variety of models that can be used for tasks like identifying common product names between datasets, even when the names are not perfect matches. One notable option for semantic similarity is sentence-BERT, or transformer models like RoBERTa or BERT fine-tuned for semantic matching. You might also consider cross-encoders or all-MiniLM-L6-v2, which are designed for semantic similarity and handle product name variations effectively.


Suggested Answer (by Alejandro Arroyo de Anda & Clara Isabel):

Hi! I was reading your post and noticed you’re trying to identify similar product names across datasets even when they are not an exact match, like:
“Filippo Berio Extra Virgin Olive Oil” vs. “Filippo Berio Olive Oil (Extra Virgin)”.

Some replies suggest using semantic similarity models, which is valid. However, I’d like to propose a different method — one that leans on arithmetic principles rather than model inference.

Instead of relying entirely on embeddings, we can assign specific weights to each word. Then, for each word, we assign two extra specific weights: one for the first letter, and one for the last letter.

From here, we compute an interaction rule:

  • When two specific weights “collide” or align during comparison (e.g. similar positions or structural meaning), we multiply them.
  • That result then gets transferred to the next interaction, but via summation, not multiplication; this mimics a kind of arithmetic echo or propagation.
  • In particular, this works across pairs of consecutive words: the first–last letter combination of word A gets arithmetically tied to the next pair in sequence, and so on.

My hypothesis is that this arithmetic structure retains flexibility while sharply reducing conflict from swapped or reordered terms, especially in cases of product name variants.
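Just to make the discussion concrete: below is a toy sketch of one possible reading of this scheme. The weight function, the collision test, and the normalization are all my own guesses at the unspecified details, not the authors' definition:

```python
import re

def word_signature(word):
    """Illustrative weights: one for the whole word, plus two extra
    specific weights for its first and last letters (a guess at the spec)."""
    whole = sum(ord(c) for c in word) / 1000.0  # arbitrary word weight
    return whole, word[0], word[-1]

def similarity(name_a, name_b):
    """Compare two names word by word. When two words 'collide' (equal, or
    agreeing on first and last letter), multiply their word weights; the
    products are carried forward by summation -- the 'arithmetic echo'."""
    words_a = re.sub(r"[^\w\s]", "", name_a.lower()).split()
    words_b = re.sub(r"[^\w\s]", "", name_b.lower()).split()
    echo = 0.0
    matched = 0
    remaining = list(words_b)
    for wa in words_a:
        wgt_a, first_a, last_a = word_signature(wa)
        for wb in remaining:
            wgt_b, first_b, last_b = word_signature(wb)
            if wa == wb or (first_a == first_b and last_a == last_b):
                echo += wgt_a * wgt_b   # multiply on collision, propagate by sum
                matched += 1
                remaining.remove(wb)    # each word matches at most once
                break
    # fraction of words matched; `echo` could serve as a secondary signal
    return matched / max(1, len(words_a), len(words_b)), echo
```

Because matching here is word-set based, reordered variants like the olive-oil example score 1.0 while unrelated names score near 0. Whether that beats plain token-set fuzzy metrics is exactly the kind of thing to test against a labelled sample.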

I’m currently checking how it behaves under a 5σ statistical confidence — but I’d love your input. Would this idea complement semantic models, or offer a lighter-weight, interpretable alternative?

Thanks for your great contributions, always.

Warm regards,
Alejandro Arroyo de Anda
Architect – Symbolic Arithmetic Systems

Clara Isabel
Programmer


How many rows does your data have? It feels like the optimal approach depends a lot on that. LLMs are great at this, but they just don’t scale well enough without bankrupting you or taking forever.

I worked on something like this for our company. What we ended up doing was
A) fuzzy string matching first. Doesn’t solve your specific problem, but maybe catches a few duplicates. In order to do that, I suggest standardising the strings first:

  • turn everything into lowercase
  • remove/standardize any spaces, hyphens, underscores, punctuation etc. (careful: removing punctuation can introduce false positives, e.g. “5, 43th St.” and “54, 3th St.” will now both be “543thst”; replacing everything with a standardized character may be better)
  • create a list of stop words, such as “Inc”, “Corp”, “LLC” etc., and remove those (of course, leave the relevant ones in, if any)

–> That should make any kind of fuzzy string matching easier. There are a bunch of Python packages that can do this. As far as I can tell, both https://pypi.org/project/textdistance/ and https://moj-analytical-services.github.io/splink are good. Splink is great as they do some extra clever stuff related to feature frequencies. Getting the threshold right is the tricky part: you’ll have to balance false negatives and false positives.
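A minimal stdlib-only version of the standardization plus fuzzy-matching step looks like this. The stop-word list and the 0.85 threshold are placeholders to tune on your own data, and textdistance or splink would replace difflib in a real pipeline:

```python
import re
import difflib

STOP_WORDS = {"inc", "corp", "llc", "ltd"}  # extend for your domain

def standardize(name):
    """Lowercase, neutralize punctuation, drop stop words -- per the list above."""
    name = name.lower()
    # Replace punctuation with a space rather than deleting it, to avoid
    # accidental merges like "5, 43th St." / "54, 3th St." -> "543thst"
    name = re.sub(r"[^\w\s]", " ", name)
    words = [w for w in name.split() if w not in STOP_WORDS]
    return " ".join(words)

def fuzzy_match(a, b, threshold=0.85):
    """Return (ratio, is_match). difflib is just the stdlib baseline;
    textdistance/splink offer more sophisticated metrics."""
    ratio = difflib.SequenceMatcher(None, standardize(a), standardize(b)).ratio()
    return ratio, ratio >= threshold
```

Note that the olive-oil pair from the original question only scores about 0.72 here because of the reordered words, which is exactly why step B (embeddings) is still needed for those cases; the fuzzy pass mainly catches casing, punctuation, and suffix variants cheaply.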

B) Depending on the scale of your data, I’d try embeddings before LLMs as they are just a lot cheaper.

C) Based on experience, I would focus more on aggressively reducing the size of the dataset (as opposed to finding the cheapest LLM and running it on more data). But the optimal choice may depend on the data.

Our own tool is a Python SDK called everyrow io (apologies for the self-plug), but it currently leans very much on “use smart models” to eke out maximum intelligence. We’re still working on scalability and getting costs down. I’d be very curious to learn what others are doing and how you handle deduplication at scale.
