A curated guide to Hebrew language AI models
A comprehensive collection of Hebrew language AI models on Hugging Face, covering LLMs, speech recognition, sentiment analysis, and more.
Living in Israel and working extensively with AI, I have a natural interest in the state of Hebrew language models — and a personal stake in their quality. Hebrew presents some genuinely fascinating challenges for NLP that make it one of the more interesting languages to watch in the AI space. It is a Semitic language with a non-Latin, right-to-left script and extraordinarily complex morphology: a single word can simultaneously encode subject, object, tense, and prepositions. On top of that, modern written Hebrew typically omits vowel diacritics, so the same sequence of consonants can represent entirely different words depending on context. For someone who uses Hebrew daily and AI tools constantly, the quality of Hebrew language models directly affects how useful these tools are for half of my communication. I put together a curated collection of Hebrew AI models on GitHub to serve as a starting point for anyone working in this space.
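To make the diacritics point concrete, here is a minimal sketch of how the ambiguity arises. The `strip_niqqud` helper is my own illustration, not anything from the collection: it removes Hebrew vowel points using the standard library's Unicode tables, showing how three different vocalised words collapse to one consonantal skeleton.

```python
import unicodedata

def strip_niqqud(text: str) -> str:
    """Remove Hebrew vowel points (niqqud) and other combining marks,
    leaving only the consonantal skeleton of the text."""
    # Decompose so any diacritics become separate combining characters,
    # then drop everything in the "Mn" (nonspacing mark) category,
    # which is where Hebrew vowel points and the dagesh live.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Three distinct words -- sefer "book", sapar "barber", safar "he counted" --
# all reduce to the same three consonants once the vowel points are gone.
vocalized = ["סֵפֶר", "סַפָּר", "סָפַר"]
skeletons = {strip_niqqud(word) for word in vocalized}
print(skeletons)  # a single consonantal form remains
```

This is exactly the form a model usually sees: everyday Hebrew text ships without the vowel points, so disambiguation has to come entirely from context.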
A pathfinder repo (index) to some Hebrew language LLMs on Hugging Face
What the collection covers
The guide covers the full spectrum of Hebrew AI models available on Hugging Face and beyond. On the LLM side, there are Mistral, Mixtral, and Gemma fine-tunes specifically trained for Hebrew, plus niche models for tasks that matter in practice: summarisation, biblical text analysis (Israel being the kind of country where that's a real use case), metaphor detection in Hebrew literature, translation between Hebrew and other languages, offensive language detection for content moderation, punctuation restoration (critical for a language whose written form typically drops vowel marks and, in informal use, punctuation), and sentiment analysis tuned for Hebrew's unique morphological patterns. There are also specialised models for Hebrew-to-SQL conversion and named entity recognition for medical terms, which are exactly the kind of vertical applications that signal a language's NLP ecosystem is maturing beyond basic translation.
Speech models, benchmarks, and what they reveal
The repo includes ASR models — wav2vec2 and Whisper fine-tunes for Hebrew speech recognition — which I care about personally, since I dictate a lot of my work and Hebrew dictation accuracy is still noticeably behind English. I also link to the Hebrew LLM Leaderboard, which reveals a somewhat counterintuitive finding: large multilingual models generally outperform specialised Hebrew models, simply because their much larger parameter counts compensate for having seen less Hebrew-specific training data. The specialised models perform impressively given their size constraints, but the practical takeaway for most users is that GPT-4 or Claude will give you better Hebrew than a purpose-built Hebrew model with a fraction of the parameters. Whether that changes as Hebrew-focused training data grows remains to be seen.
The guide also includes links to key organisations like Dicta (which does excellent work on Hebrew NLP) and MAFAT's National Natural Language Processing Plan of Israel, academic papers on Hebrew LLM classification, and researchers worth following like Yam Peleg on Hugging Face. Israel being a technology hub means there's genuine momentum in this space, and I update the collection as new models appear. The full list is on GitHub.