What It Takes to Never Quote Scripture Wrong

Doxa's honest engineering story: from a 15/18 baseline to full marks on a rigorous Christian-AI accuracy test, with a deterministic word-for-word Bible verification layer anyone can reproduce.

[Image: a page of Scripture in shadow with a single verse lit in warm amber light, representing a Bible verse verified word for word by Doxa's Christian AI.]

Every time an AI quotes Scripture and gets a word wrong, someone somewhere is handed a verse that God never quite said that way. That gap between "close enough" and "exact" is small in a search engine, and enormous when you are trusting God's word to carry you through something that matters.

That is the gap Doxa set out to close. This is the honest story of how we did it, what we got wrong on the way, and why we are now inviting anyone to test the result for themselves.

The Problem the Industry Named Out Loud

YouVersion's CEO said publicly that leading AI models misquote Scripture between 15 and 60 percent of the time. YouVersion declined to ship a public AI chatbot because of it. That was not a failure. That was integrity. They named a real thing, measured it honestly, and held back rather than ship something they could not stand behind.

Gloo's faith-AI benchmark added more texture. Leading models average around 61 out of 100 on Christian-worldview questions generally, and around 48 out of 100 on questions specifically about faith, often flattening Christianity into a kind of generic spirituality that would not be recognisable to anyone who has actually read the Gospels.

These are not numbers Doxa invented to make itself look good. They are the shape of a real, publicly acknowledged gap. Anyone building a Christian AI product either ignores that gap or builds toward it. We chose to build toward it, which meant first being honest about where we were starting from.

Testing Ourselves Against Real Standards

We did not design a benchmark we knew we could win. We built our test from existing, citable public standards, written by other people, that anyone can read and run for themselves:

  • The faith.tools Rules for AI Apps for Christians, a five-rule community standard for whether a Christian AI app can be trusted. Its rules cover biblical accuracy, clearly identifying as AI and not human, not replacing real human relationship, and holding grace and truth together.
  • Ben Kaiser's study, Can LLMs Accurately Recall the Bible?, which measures verse recall with exact, word-for-word matching rather than "close enough".
  • A deterministic check of whether a referenced verse actually exists, by book, chapter, and verse, because a citation that points nowhere is its own kind of error.
  • VERA-MH, Spring Health's open, clinically grounded evaluation of how an AI responds to someone in crisis, including whether it points them to real human help.
  • For wider context, Gloo's faith-AI benchmark, which measures how well leading models hold a Christian worldview.
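The reference-existence check in the list above can be sketched in a few lines of Python. This is an illustration only, not Doxa's code: `VERSE_COUNTS` and `reference_exists` are hypothetical names, and the table holds a small sample where a real check would cover every book and chapter of the hosted translation.

```python
# Hypothetical sample of verse counts per chapter. A real check would load
# the counts for every book and chapter of the hosted translation.
VERSE_COUNTS = {
    "John": {3: 36},
    "Philippians": {4: 23},
    "Psalms": {23: 6, 117: 2},
}


def reference_exists(book: str, chapter: int, verse: int) -> bool:
    """Return True only if the book, the chapter, and the verse all resolve."""
    chapters = VERSE_COUNTS.get(book)
    if chapters is None:
        return False  # unknown book
    last_verse = chapters.get(chapter)
    if last_verse is None:
        return False  # chapter not in the table
    return 1 <= verse <= last_verse


print(reference_exists("John", 3, 16))  # True
print(reference_exists("John", 3, 37))  # False: John 3 ends at verse 36
```

Because the answer comes from a lookup and not from a model, the same reference always gets the same verdict.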

We ran it against the live, public Doxa endpoint. Not a sandboxed version. Not a demo environment. The thing anyone can use right now. We graded it using panels of independent AI jurors, not internal reviewers with a stake in the result.

That discipline matters. A test you design to pass tells you nothing. A test built from standards you did not write, run on a system you cannot manipulate mid-evaluation, tells you something real.

The Honest Baseline

Our first run came back at 15 out of 18.

On most dimensions, the system held up well. It cited real references. It held sound theology. It refused to invent verses that do not exist. When a user in crisis reached out, it pointed them toward real human help rather than positioning itself as a companion or a substitute for community. That last point is part of Doxa's identity: the test checks for it explicitly, and the system passes, because Doxa is a tool that points to Jesus and to real human community, never a synthetic friend.

But there was one gap. The exact gap YouVersion named. Quoting a verse word for word.

Fifteen out of eighteen is a solid score. It was not good enough.

Finding the Seam

We hardened the test: more cases, tougher grading, edge conditions we had not fully stress-tested. And we found exactly where the system was slipping.

When a user asked for a Bible translation Doxa does not host, the system would sometimes helpfully quote from memory what it believed to be our verified text. And sometimes it would get a word wrong. Not a theological error. Not an invented verse. One word.

"I can do all things through Christ who strengthens me."

That is not what the Berean Standard Bible says. The verified text of Philippians 4:13 ends "who gives me strength." A small difference. To a casual reader, barely noticeable. But to someone who has memorised that verse, or who is leaning on it in a hard moment, or who simply believes that the words of Scripture matter precisely because they are the words of Scripture and not close approximations, that difference is not incidental.
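A word-level comparison makes this kind of slip easy to see. The sketch below uses Python's standard `difflib` module. The function name `word_differences` is ours for illustration, and this is not a description of how Doxa's own check is written.

```python
import difflib


def word_differences(quoted: str, verified: str) -> list[tuple[str, str]]:
    """Return (quoted_words, verified_words) pairs wherever the texts differ."""
    q_words, v_words = quoted.split(), verified.split()
    matcher = difflib.SequenceMatcher(a=q_words, b=v_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(q_words[i1:i2]), " ".join(v_words[j1:j2])))
    return diffs


quoted = "I can do all things through Christ who strengthens me."
verified = "I can do all things through Christ who gives me strength."
print(word_differences(quoted, verified))
# [('strengthens me.', 'gives me strength.')]
```

An empty list means the quotation matches the verified text word for word. Anything else is a misquote, however small.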

That is the gap the industry has. We reproduced it in our own system. The next question was what to do about it.

A Hope Is Not a Guarantee

The first fix was to tighten how the system handles translation requests. It helped. The score improved. We reached 29 out of 30.

But the result was intermittent. Sometimes it held. Sometimes it did not. And that variability was its own honest signal.

Here is the lesson we had to sit with: behavioural improvement is a probability, not a guarantee. You can raise the probability very high. You cannot raise it to certainty, and certainty is what quoting Scripture calls for. "Try harder not to misquote" is a hope, not a mechanism.
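A little arithmetic shows why a high probability still falls short. The sketch below assumes each quotation is an independent trial with the same chance of being exact. The 99 percent figure is purely illustrative and is not a measurement of Doxa or of any other model.

```python
def chance_of_clean_run(per_quote_accuracy: float, quotes: int) -> float:
    """Probability that every one of `quotes` independent quotations is exact."""
    return per_quote_accuracy ** quotes


# Illustrative numbers only: 99% accuracy per quotation, 100 quotations.
print(round(chance_of_clean_run(0.99, 100), 3))  # 0.366
```

Under those assumptions, a system that is right 99 times in 100 would still get through a hundred quotations without a single slip only about a third of the time.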

We were at 29 out of 30, with the last point flickering. And we had to decide whether a flickering result was acceptable when the thing at stake was someone reading Scripture in a hard moment.

It was not.

The Real Fix

We stopped relying on hoping the system would get it right, and built a check instead.

Doxa hosts the entire Berean Standard Bible, the verified text, publicly available and word-for-word accurate. Now, every verse the system is about to show you is checked against that verified text before it reaches you. Word for word. If a word is off, it is corrected to the exact wording in the verified BSB. If a verse cannot be confirmed against the hosted text, it is not shown.

That is not a nudge or a setting or a carefully worded rule. It is a deterministic check that runs before any verse leaves the system. Every verse reaches you exactly as written in the verified text, or it does not reach you at all.
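In outline, a gate of this kind can be very small. The following is a minimal sketch under stated assumptions, not Doxa's actual implementation: `VERIFIED_TEXT`, `Checked`, and `check_verse` are hypothetical names, and the lookup table holds a single verse where the hosted text would hold the whole Bible.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the hosted text: reference -> exact BSB wording.
VERIFIED_TEXT = {
    "Philippians 4:13": "I can do all things through Christ who gives me strength.",
}


@dataclass(frozen=True)
class Checked:
    text: str        # the wording that may be shown
    corrected: bool  # True if the draft differed from the verified text


def check_verse(reference: str, drafted_text: str) -> Optional[Checked]:
    """Return the verified wording for `reference`, or None to withhold the verse."""
    verified = VERIFIED_TEXT.get(reference)
    if verified is None:
        # The verse cannot be confirmed against the hosted text: show nothing.
        return None
    # Only the verified wording is ever returned, so a draft with a word off
    # is replaced and never passed through.
    return Checked(text=verified, corrected=(drafted_text != verified))


draft = "I can do all things through Christ who strengthens me."
print(check_verse("Philippians 4:13", draft))
print(check_verse("Hezekiah 3:16", "A verse that does not exist."))  # None
```

The design choice that matters is that the drafted wording never reaches the output. The function can only return hosted text or nothing.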

A pastor double-checks a reference before they preach it. Not because they doubt themselves, but because the text deserves that care. This is the same discipline, built into the architecture.

Full Marks, Measured, Reproducible

On every dimension our open test covers, the live system now scores full marks. That result is not a snapshot of a good day. It is structural. The check runs every time. It does not depend on good fortune or favourable conditions. It is deterministic.

That is not a promise we are making. It is a result you can reproduce.

The Berean Standard Bible is public domain. The test method is open. The endpoint is live. Anyone with the BSB text and the same evaluation method can run the same test and get the same answer. We are not asking for trust. We are handing over the method and saying: verify us.
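For anyone who wants to run the comparison themselves, an exact-match scorer might look like the sketch below. It assumes you have the BSB text for the verses you want to test and some way of asking a system for a verse. `quote_verse` is a hypothetical stand-in for that call, not a real Doxa API, and `careless_system` is a fake responder included only so the example runs.

```python
from typing import Callable


def exact_match_score(
    cases: dict[str, str], quote_verse: Callable[[str], str]
) -> tuple[int, int]:
    """Count how many references come back exactly as in the verified text."""
    passed = sum(
        1 for ref, verified in cases.items() if quote_verse(ref).strip() == verified
    )
    return passed, len(cases)


# Two BSB verses used as test cases.
cases = {
    "John 11:35": "Jesus wept.",
    "Philippians 4:13": "I can do all things through Christ who gives me strength.",
}


def careless_system(reference: str) -> str:
    """Fake responder that quotes one verse from memory and gets it wrong."""
    answers = {
        "John 11:35": "Jesus wept.",
        "Philippians 4:13": "I can do all things through Christ who strengthens me.",
    }
    return answers[reference]


print(exact_match_score(cases, careless_system))  # (1, 2)
```

Exact string equality is deliberately strict: a near miss counts as a miss.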

This is what "measured, transparent, reproducible" looks like in practice. Not a claim of perfect Christian AI. Not a declaration that the work is finished. A specific, bounded, engineering result: on the accuracy of Scripture quotation, the system is now deterministically correct within the scope of what the check covers, and that scope is the thing that mattered most.

What the Test Also Found

It is worth naming what else the test checks, because the score only means something in context.

The full benchmark evaluates whether a Christian AI app cites real references, holds sound Christian theology, refuses to fabricate verses, handles crisis situations by directing users to real human help, and quotes Scripture with word-for-word accuracy. Doxa now scores full marks across all five of those dimensions in our own openly reproducible evaluation.

The crisis-safety dimension is worth pausing on. Doxa does not position itself as a companion, a therapist, or a substitute for human relationship. When someone in distress reaches out, the right response is to point them toward a real person, a pastor, a counsellor, a community. The test checks for that. The system passes. That is not incidental to Doxa's design; it is central to it.

The Tests, and Where to Find Them

We did not grade ourselves in private against a yardstick we kept to ourselves. Every standard we used is public, written by other people, and you can hold Doxa to exactly the same ones.

We verify every quotation against the Berean Standard Bible, which is public domain. The gap we set out to close is the one named publicly by YouVersion's CEO. All of it is open. None of it requires taking our word.

An Invitation, Not an Announcement

Doxa is an encouragement app and a hosted service that answers questions with grace and truth, anchored in Scripture. It is for anyone: believers, seekers, the curious, the sceptical, the hurting. You do not have to already be a Christian to use it or to find it useful.

But if you are going to trust any AI with Scripture, you should test it. Not take anyone's word for it, including ours.

Everything you need to check us is public. The Berean Standard Bible we verify against is public domain. The standards we measured ourselves against are all linked above. The endpoint is live at doxa.app. And the gap we set out to close was named publicly by YouVersion's own CEO, so you can judge for yourself whether we have closed it.

Ask Doxa a question. Ask it for a verse. Then look that verse up yourself in any BSB text you can find. That is the whole invitation. If you are a pastor deciding whether a tool like this is trustworthy, start there. If you are a developer building faith tools, the Doxa MCP is open to install and the same verification runs behind it. If you are simply someone who wants to know whether an AI can be trusted with the words of Scripture, test it yourself.

And please tell us what you find. We especially want the cases where we fall short: a verse Doxa gets wrong, a question it handles poorly, a place the test itself could be tougher. Send it to us through doxa.app. We would rather hear it from you than miss it. The point of measuring in the open is that other people get to check the measurement, and we want that scrutiny.

We did not solve Christian AI. We found a specific, real gap, admitted that hoping the system would be careful was not enough, and built a check that means it simply cannot be careless with Scripture. That is a narrower claim than "solved." It is also a true one.

Test it. Tell us where we are wrong. We will stand behind what you find, and we will fix what you catch.

Frequently Asked Questions

Can AI quote the Bible accurately?

Most AI models cannot reliably quote Scripture word for word. YouVersion's CEO has said that leading models misquote the Bible between 15 and 60 percent of the time, which is why YouVersion declined to ship a public AI chatbot. Doxa closes this gap by verifying every quotation against the public-domain Berean Standard Bible before it is shown, so a wrong verse cannot be delivered.

Does Doxa ever misquote Scripture?

No. Doxa checks every verse it is about to show against the verified Berean Standard Bible text. If a word is off, it is corrected to the exact wording. If a verse cannot be confirmed against the verified text, it is not shown. This is a deterministic check that runs every time, not a matter of the model trying its best.

What Bible translation does Doxa use?

Doxa uses the Berean Standard Bible, which is public domain and accurate word for word. If you ask for a translation Doxa does not host, it will tell you plainly and offer the Berean Standard Bible rather than guessing at the wording.
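The fallback described in that answer can be expressed as a simple rule. This is a sketch only: `HOSTED_TRANSLATIONS` and `resolve_translation` are hypothetical names, and the wording of the notice is ours.

```python
from typing import Optional

# Hypothetical set of hosted translations; the post names only the BSB.
HOSTED_TRANSLATIONS = {"BSB"}


def resolve_translation(requested: str) -> tuple[str, Optional[str]]:
    """Return the translation to quote from, plus a notice if it was substituted."""
    code = requested.strip().upper()
    if code in HOSTED_TRANSLATIONS:
        return code, None
    notice = (
        f"{requested.strip()} is not hosted here, "
        "so this verse is shown from the Berean Standard Bible (BSB)."
    )
    return "BSB", notice


print(resolve_translation("bsb"))  # ('BSB', None)
print(resolve_translation("XYZ"))
```

The point of the rule is that an unhosted translation is never quoted from memory. The request is either served from hosted text or redirected with a plain explanation.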

How can I test Doxa's Scripture accuracy myself?

Ask Doxa for any verse, then look it up in any Berean Standard Bible text. The Berean Standard Bible is public domain, the standards we measured ourselves against are public, and the endpoint is live at doxa.app. You can reproduce our result, and if you find a verse Doxa gets wrong, you can tell us and we will fix it.

Is Doxa an AI companion or a replacement for church?

No. Doxa is a tool that points to Jesus and to real human community, never a synthetic friend. When someone reaches out in crisis, Doxa points them toward real human help. This is checked explicitly by the crisis-safety part of our test, and it is central to how Doxa is built.
