A friend of mine, Mike, sent me a WhatsApp message the other day. "Somebody told me Gemini is much better for research into Jewish texts than ChatGPT, because it makes stuff up a lot less. Any thoughts on the best AI for textual research?"
It's a great question — and the answer is that it's the wrong question. The whole frame ("which model is best at X domain?") is pointing at the part of the system that doesn't actually determine the outcome. I ended up answering Mike with an entire podcast episode, but here's the short version.
Reasoning is not the bottleneck
When ML people talk about model comparisons, they talk about tasks. So what task is "research into Jewish texts"? It's analysis, which is a reasoning problem. And reasoning is not the bottleneck. Gemini 3, GPT-5, Claude, DeepSeek, Qwen — they all clear the bar for processing a passage and making observations about it. That's PhD-level reasoning territory; the benchmarks have been saturated for a while.
Mike's actual problem is hallucinations. He gave Claude an Eliezer Berkovits text and got back a translation of Midrash Rabba — a different work entirely. That's not a reasoning failure. That's a knowledge failure. And the two are very different things.
Hallucinations are a knowledge problem
A frontier model knows, at best, what made it into its training corpus. The training corpus is a generalist scrape — the Common Crawl, plus whatever else. Niche source material — a 15th-century rabbinical commentary, a specific Berkovits essay — has roughly zero chance of being in there with any fidelity. Copyright pressure makes this worse: ambiguous-rights material gets filtered out of training pipelines. Nobody at OpenAI is hand-picking obscure responsa to make sure they're in the next checkpoint.
So when you ask a generalist model to translate or analyse a niche specialist text, it does what generalist models do under pressure: it pattern-matches, infers what the answer probably looks like, and confabulates. The confabulation is fluent, grammatical, and confident. And wrong.
You can't fix this by picking a different model. You fix it by giving the model the actual text.
Grounding: RAG vs MCP
"Grounding" is the term of art for connecting a model to a verified source of information. Two main flavours.
RAG (retrieval-augmented generation) — the traditional answer. Build a vector database of your source material, retrieve relevant chunks at query time, stuff them into the prompt. Works. But it's a build project: you have to ingest the corpus, chunk it, embed it, host the vector store, maintain it as the source updates.
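To make the plumbing concrete, here's a minimal sketch of the retrieve-and-stuff step. The embedding model, the sample chunks, and the helper names (`retrieve`, `build_prompt`) are all my choices for illustration, not a prescription:

```python
# Minimal RAG sketch: embed a corpus, retrieve the closest chunks,
# stuff them into the prompt. Chunks here are toy stand-ins for a
# real ingested corpus.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# In practice these come from ingesting and splitting the source texts.
chunks = [
    "Berkovits, Faith after the Holocaust, ch. 4: ...",
    "Midrash Rabbah on Genesis 1:1: ...",
    "Berkovits, Not in Heaven, on halakhic authority: ...",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity (vectors are normalised)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(query: str) -> str:
    """Ground the model by pasting the retrieved text into the prompt."""
    context = "\n---\n".join(retrieve(query))
    return f"Answer using ONLY these sources:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does Berkovits say about halakhic authority?"))
```

Every piece of that — ingestion, chunking, the vector store, keeping it fresh — is yours to build and maintain. That's the cost MCP lets you skip.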
MCP (Model Context Protocol) — Anthropic's protocol for letting an agent talk to an external API in a structured way. Instead of building your own data pipeline, you point the agent at someone else's. If a domain authority has already digitised the corpus and exposed it via MCP, you skip the entire RAG plumbing.
For Jewish texts, Sefaria is the obvious example. They've spent years digitising and structuring the rabbinical canon. They publish an MCP server. An agent connected to it can search, fetch, and cross-reference texts directly from Sefaria's library — not from a probabilistic guess about what's in the model's weights.
This matters more than it sounds. Sefaria's corpus isn't static — they're continuously digitising new material. A RAG store you built in 2024 is already drifting from reality. An MCP server tracks the source automatically.
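For a feel of what talking to an MCP server looks like, here's a sketch using the official Python SDK (`pip install mcp`) over the SSE transport. The Sefaria endpoint URL, the tool name `get_text`, and its arguments are placeholders I've invented for illustration — the whole point of the protocol is that you discover the real ones via `list_tools()`:

```python
# Sketch: connect to an MCP server, list its tools, call one.
# URL and tool name below are hypothetical, not Sefaria's actual API.
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

SEFARIA_MCP_URL = "https://example-sefaria-mcp.org/sse"  # hypothetical endpoint

async def main() -> None:
    async with sse_client(SEFARIA_MCP_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Ask the server what it can do -- no guessing at its API.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            # Fetch a text; tool name and arguments are illustrative.
            result = await session.call_tool("get_text", {"ref": "Genesis 1:1"})
            print(result.content)

asyncio.run(main())
```

In an agent setup you rarely write this loop yourself — the agent framework does the listing and calling — but it's worth seeing that there's no magic underneath.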
How to actually build it
If you're Mike and you want a reliable Jewish-text agent — say, a sermon-writer that pulls real sources — the recipe is roughly this (there's a wiring sketch after the list):
1. Pick any modern model. Genuinely, any of them. The model is the reasoning engine; it's not where the knowledge comes from.
2. Connect it to Sefaria via MCP. That's the source-of-truth for the texts.
3. Write a system prompt that constrains the role. "You are a sermon-drafting assistant. You retrieve all source material via the Sefaria tool. If a source can't be found, say so. Never invent citations."
4. Optionally add a guardrail. A whitelist of trusted external domains for supplementary material. This is the most brittle layer — "only go to these domains" is hard to enforce reliably — but the tooling for enforcing it is improving.
5. Add a second integration if you want output to land somewhere useful. Google Drive MCP, for example, so the drafted sermon writes itself into a Doc instead of just appearing in chat.
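Here's what steps 1–3 could look like in one API call, assuming Anthropic's MCP connector (a beta feature at the time of writing — the beta flag string, the model ID, and the Sefaria endpoint below are all assumptions to verify against current docs, not a recipe to copy blindly):

```python
# Sketch of steps 1-3: model + MCP grounding + role-constraining prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a sermon-drafting assistant. You retrieve all source material "
    "via the Sefaria tool. If a source can't be found, say so. "
    "Never invent citations."
)

response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",       # step 1: any modern model
    max_tokens=2048,
    system=SYSTEM_PROMPT,                   # step 3: constrain the role
    mcp_servers=[{                          # step 2: ground via MCP
        "type": "url",
        "url": "https://example-sefaria-mcp.org/mcp",  # hypothetical endpoint
        "name": "sefaria",
    }],
    betas=["mcp-client-2025-04-04"],        # assumed beta flag; check the docs
    messages=[{
        "role": "user",
        "content": "Draft a short sermon on Psalm 23, citing real sources.",
    }],
)
print(response.content)
```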
ChatGPT's Custom GPTs and the new agent surface will let you do most of this without writing code. So will Claude's projects/skills system. The pieces are off-the-shelf now in a way they weren't twelve months ago.
Bonus: clean your context
One aside that's worth flagging because it's a common cause of model weirdness: every message you send to a chat model includes the entire prior conversation. If you've been talking to ChatGPT about flights to Barbados for forty minutes and then ask it about Berkovits, the model's working state is full of unrelated context, and reasoning quality degrades. Start new threads aggressively, or compact the conversation when the tool supports it. This alone reduces hallucinations more than people expect.
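Seen mechanically (the messages below are invented for illustration), the problem is just a growing list:

```python
# A chat API resends the whole history on every turn, so the Barbados
# chat rides along with the Berkovits question.
history = [
    {"role": "user", "content": "Find me flights to Barbados in March."},
    {"role": "assistant", "content": "...forty minutes of travel planning..."},
]

# Same thread: the new question arrives buried in unrelated context.
history.append({"role": "user", "content": "Translate this Berkovits passage: ..."})

# New thread: the model sees only what matters.
fresh = [{"role": "user", "content": "Translate this Berkovits passage: ..."}]
```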
The summary for Mike
There's no "best AI for Jewish textual research". There's no best AI for legal research, or PubMed research, or any specialist-corpus research. The premise is wrong. The model is a reasoning engine; the corpus is a separate concern, and it's the corpus that determines whether the answers are real or fluent fiction.
If you depend on reliable retrieval from a specialist body of text — Jewish sources, Israeli case law, medical literature, your own company's documentation — the move is the same every time: take a competent model, connect it to a curated source via RAG or MCP, and constrain the role with a system prompt. The differences between models at the reasoning layer are real, but they're not what's making your assistant cite a translation of the wrong book.
The reason all of this feels suddenly tractable is that the connective tissue — MCP, agent skills, well-built domain servers like Sefaria's — has matured to the point where you can wire it up in an afternoon. That's the part of the stack worth paying attention to. Not which model has a marginal edge on a benchmark you'll never run.