Last week one of us needed two-factor authentication for a bank account and pulled up Microsoft Authenticator. The App Store said 4.7 stars. Microsoft, how bad could it be?

Then the habit kicked in. Scroll down, read a few reviews before tapping Get. The last dozen or so were almost all 1-star. People yelling about failed logins, broken sync with new phones, "this app used to work." A 4.7-star app shouldn't read like a bug report thread.

So we spent the weekend pulling the most recent English-language App Store reviews for 230 apps on our deal-tracker watchlist. 14,694 reviews total, January 25 through April 21 of this year. Then we compared each app's displayed rating to the average of its recent reviews. Here's what we found.

What did we find?

Of 120 popular iOS apps with ≥30 recent reviews, 27 sit two stars or more below their App Store displayed rating; three sit three stars or more below. The worst high-volume case is Microsoft Authenticator: 4.70★ headline, 1.42★ across its 132 most recent reviews (Jan–Apr 2026).

The pattern repeats across every major consumer category we cover. Some headline numbers from the analysis:

  1. 27 of 120 apps (with ≥30 recent reviews) sit two stars or more below their App Store rating.
  2. 3 of those cross a three-star gap: Hypic, Microsoft Authenticator, and Eatr.
  3. Microsoft Authenticator is the worst high-volume case — 4.70★ headline, 1.42★ across 132 recent reviews.
  4. AI-labeled apps are over-represented in the high-gap list at 22.2%, more than double their 10.8% baseline share.
  5. 51.7% of all apps in our high-volume sample have recent averages more than one star below their displayed rating.
  6. The full review distribution is bimodal — 26% one-star, 52% five-star, only 22% in the two-to-four-star middle.
  7. The pattern spans every major category we tested: VPN, meditation, photo editing, password managers, habit trackers, banking utilities.
[Chart: the 27 popular iOS apps with an App Store rating 2+ stars above their recent reviewer average, colored by AI-labeled vs. other. The full list, with cumulative and recent ratings, is in the Data appendix below. Source: AppRundown analysis of 14,694 App Store reviews, April 2026.]

The apps on this list aren't fly-by-night junk. Headspace, 1Password, NordVPN, BetterSleep, Nextdoor: most of them have appeared on someone's "Best Of" list in the last two years. So how does an app land at 4.7 stars on its product page while its recent review feed reads like a wake?

How we pulled the data

We analyzed 14,694 English-language App Store reviews from 230 popular iOS apps using Sensor Tower's review API, sampled between January 25 and April 21, 2026. Inclusion threshold for analysis: 30 or more recent reviews per app. The SQL and a downloadable CSV are linked at the end of this study.

Data source

The 230-app list is the AppRundown deal-tracker watchlist: apps we monitor for price drops, version changes, and rating movements. It skews toward consumer-paid iOS apps with active user bases, which is the population most likely to have meaningful recent review volume in the first place.

We pulled each app's most recent English-language reviews, capped at the volume Sensor Tower returns per call. That's the same population that Apple surfaces in the "Most Recent" sort on the App Store product page today.

Sample

| Parameter | Value |
|---|---|
| Sample size | 14,694 reviews |
| Time period | 2026-01-25 to 2026-04-21 |
| Source | App Store via Sensor Tower API |
| Language filter | English-only |
| App selection | 230 apps from AppRundown's deal-tracker watchlist |
| Inclusion threshold | ≥30 recent reviews per app |

Analysis approach

For each app, we computed the arithmetic mean of its recent review ratings and subtracted it from the app's displayed App Store rating to get the gap. We bucketed apps into gap > 1, gap > 2, and gap > 3 tiers, and tagged apps as "AI-labeled" when "AI" appeared as a standalone word in the App Store name. The query is straightforward enough that we'll re-run it on the 25th of each month against the same watchlist.
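To make the bucketing concrete, here's a sketch in the same Postgres dialect as the appendix query, against the same app_reviews / store_apps tables. The tier labels and the \y word-boundary regex (Postgres syntax) are our choices for illustration, not a fixed schema:

WITH recent AS (
  SELECT store_app_id,
         COUNT(*)             AS recent_count,
         AVG(rating::numeric) AS recent_avg
  FROM   app_reviews
  WHERE  detected_language = 'EN'
  GROUP  BY store_app_id
  HAVING COUNT(*) >= 30                             -- inclusion threshold
)
SELECT s.name,
       ROUND(s.rating::numeric - r.recent_avg, 2)   AS gap,
       CASE                                         -- gap tiers from the analysis
         WHEN s.rating::numeric - r.recent_avg > 3 THEN 'gap > 3'
         WHEN s.rating::numeric - r.recent_avg > 2 THEN 'gap > 2'
         WHEN s.rating::numeric - r.recent_avg > 1 THEN 'gap > 1'
         ELSE 'within 1 star'
       END                                          AS tier,
       s.name ~ '\yAI\y'                            AS ai_labeled  -- "AI" as a whole word
FROM   recent r
JOIN   store_apps s ON s.id = r.store_app_id
WHERE  s.platform = 'ios'
  AND  s.rating IS NOT NULL
ORDER  BY gap DESC;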

Limitations

We're not the first to notice this. Terry Godier's argument that App Store reviews are busted, and Daring Fireball's take on the same theme, both made the qualitative case in April. One commenter on the original Reddit thread of this analysis pushed back with the right question:

"Are you comparing written reviews with star-only reviews? Because that's not a fair comparison. Most people leave a 5 star and don't write a review..."

Yes. We are. And it's the right comparison for this question. We're not claiming the cumulative rating is wrong. We're claiming the sample Apple actually displays to a new user, the recent and most-helpful written reviews, has drifted from that cumulative average, sometimes by more than two stars. That drift is what new downloaders see when they scroll the App Store page today, before tapping Get.

We followed Apple's official ratings and reviews policy for what counts as a "review" (written content + star). Star-only ratings sit in the cumulative number Apple displays at the top, but they don't show up in the scroll-feed below. Two different signals, and right now they tell two very different stories.

Why are AI-labeled apps over-represented?

AI-labeled iOS apps make up 10.8% of our 120-app sample but 22.2% of the 27-app high-gap list, over-represented at 2x baseline. Six of the 27 high-gap apps carry an AI label in their App Store name (sampled April 2026): Hypic, Photoroom, Beatron, Pingo AI, Shoom, and AI Song Generator – Mozart.

That's a surprising cluster. We didn't go looking for it. The hypothesis we'd test next: AI-feature apps tend to ship faster, refactor harder, and run heavier launch-window review campaigns to seed their App Store rating. All three behaviors widen the gap between the cumulative number (set during a heavily promoted launch) and the recent average (what users say after the novelty wears off and they hit the rate-limit / paywall / hallucination wall).

[Chart: AI-labeled apps appear in the high-gap list at twice their baseline rate: 10.8% of the 120-app sample vs. 22.2% of the gap > 2★ subset. Source: AppRundown analysis, April 2026.]

There's a separate line of attack on this category in Den Delimarsky's piece on deceiving authenticator apps, which found a cluster of marginal-quality "AI" or "Authenticator" apps with inflated cumulative ratings and recent reviews mentioning unsolicited paywalls. We can't tell from this data whether the AI-app over-representation is a story about deception, a story about product velocity outrunning maintenance, or both. What we can say: if you're scrolling the App Store and the icon has "AI" in the name, the displayed rating is twice as likely to be misleading by 2+ stars as it is for a random app in our sample.

This isn't a smear of the category; there are AI-labeled apps with solid recent ratings in our sample too. But the over-representation in the high-gap tier is real, and it's the strongest categorical signal in the dataset.

Which categories show the largest gap?

The largest gaps cluster in three categories: AI-labeled creative tools, password and identity managers, and lock-in utilities (banking, VPN, 2FA). These three groups account for the majority of the gap > 2 list and two of the three gap > 3 outliers.

| Category | Apps in sample | Avg recent gap |
|---|---|---|
| AI-labeled creative | 13 | 1.7★ |
| Password / 2FA / identity | 8 | 2.1★ |
| VPN | 5 | 2.2★ |
| Meditation / sleep | 7 | 1.9★ |
| Photo editing | 11 | 1.7★ |
| Habit tracking | 4 | 2.0★ |
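The rollup behind this table is one GROUP BY on top of the same recent CTE. A sketch, with the caveat that the category column is our own hand-tagged field (an assumption about our internal schema, not something Apple or Sensor Tower provides):

WITH recent AS (
  SELECT store_app_id,
         AVG(rating::numeric) AS recent_avg
  FROM   app_reviews
  WHERE  detected_language = 'EN'
  GROUP  BY store_app_id
  HAVING COUNT(*) >= 30
)
SELECT s.category,                                      -- hand-tagged bucket (assumed column)
       COUNT(*)                                         AS apps,
       ROUND(AVG(s.rating::numeric - r.recent_avg), 1)  AS avg_recent_gap
FROM   recent r
JOIN   store_apps s ON s.id = r.store_app_id
WHERE  s.platform = 'ios'
  AND  s.rating IS NOT NULL
GROUP  BY s.category
ORDER  BY avg_recent_gap DESC;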

Two of those categories share a structural condition that the others don't.

The forced-use hypothesis

When a user is locked into an app (their bank picked it, their employer requires it, or they've paid for a multi-year VPN subscription), they can't easily defect when the app degrades. They stay, they keep submitting star ratings (often grudgingly), and the cumulative number stays inflated relative to satisfaction. A reader of the original Reddit thread made this point with a personal example:

"Halifax mobile banking app 4.8 stars - recent upgrade is absolutely awful. Slow, clunky and less useful than earlier versions"

This pattern keeps showing up in our data. Halifax isn't unique. Bank apps, government apps, employer-mandated 2FA tools, VPN bundles you've already paid for: all of them carry switching costs that mute negative star ratings. The recent written reviews tell you what people actually think; the cumulative number tells you what people who couldn't leave still tap.

VPN apps are the most consistent example in our high-gap list: three of the five in our sample appear, with NordVPN (4.63 → 2.11 across 187 reviews) and Surfshark (4.71 → 2.54 across 104 reviews) anchoring the cluster. Once you're inside an annual subscription, "uninstall" isn't really an option; "leave a 1-star and a written review" is.

[Chart: cumulative vs. recent rating by category, averaged across apps with sample ≥30. AI-labeled creative (n=13): 4.5 → 2.8. Password / 2FA / identity (n=8): 4.6 → 2.5. VPN (n=5): 4.6 → 2.4. Meditation / sleep (n=7): 4.7 → 2.8. Photo editing (n=11): 4.7 → 3.0. Habit tracking (n=4): 4.5 → 2.5. Source: AppRundown analysis, April 2026.]

If you want to filter for this pattern at the category level, the recent-rating-ranked version of these apps is what we maintain on AppRundown's /best/ index.

What does the rating distribution look like?

Across all 14,694 reviews, 26.1% are 1-star and 51.9% are 5-star, a bimodal "love/hate" distribution with only 21.9% in the two-to-four-star middle. The arithmetic mean of a bimodal distribution is structurally misleading; it picks a number in the middle that almost no actual user assigned.
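That distribution is one GROUP BY over the full sample. A minimal sketch against the same app_reviews table as the appendix query:

SELECT rating,
       COUNT(*)                                           AS reviews,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM   app_reviews
WHERE  detected_language = 'EN'
GROUP  BY rating
ORDER  BY rating;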

[Chart: bimodal "love/hate" distribution across 14,694 reviews; 1★ and 5★ together are 78.0% of the total. 1★: 26.1% (3,841). 2★: 6.6% (969). 3★: 7.1% (1,046). 4★: 8.2% (1,205). 5★: 51.9% (7,633). Source: AppRundown analysis, April 2026.]

That last point matters more than it sounds. When 78% of your users are at the extremes, "average" describes nobody. A 4.5-star app with 200,000 ratings can be 87.5% delighted users and 12.5% furious ones, and the 4.5 number describes neither group. 9to5Mac's coverage of the broken-rating argument made this point qualitatively last week: even a sincere 4-star review can pull an app's average down. The distribution shape we measured is the structural reason that's true.
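A quick check on that split, assuming a purely bimodal population of 5★ and 1★ ratings; the share $p$ of five-stars implied by a displayed average $r$ is:

$$p \cdot 5 + (1 - p) \cdot 1 = r \quad\Rightarrow\quad p = \frac{r - 1}{4}, \qquad r = 4.5 \;\Rightarrow\; p = 0.875.$$

So a 4.5 average in a pure love/hate population forces an 87.5 / 12.5 split; there's no middle for the average to describe.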

What this means for a downloader: the displayed star isn't a useful single-number summary of how people feel about an app. It's a summary of what the app accumulated over a years-long ratings history. If you actually want to know how the app feels in 2026, you have to look at the recent reviews, which is where the gap lives.

Surprises & Outliers

Three findings surprised us during this analysis.

The 27 number is structurally stable. When we first ran this analysis a week earlier, we counted 27 apps with gap > 2 across a slightly smaller sample (11,923 reviews / 114 apps with sample ≥30). After expanding to 14,694 reviews and 120 apps, we still have exactly 27. One app dropped out (InPulse, gap fell to 1.98 with new reviews); one entered (Photo Editor-, gap 2.19). The composition isn't a sampling fluke; it's the genuine equilibrium of the watchlist.

Three apps cross a three-star gap. Hypic (4.72 → 1.45), Microsoft Authenticator (4.70 → 1.42), and Eatr (4.62 → 1.33) are all extreme. What they share: rapid release cadence and heavy launch-window review-farming. The 3-star gap is what happens when the cumulative reflects a moment that's now several years and several refactors in the past.

The AI clustering wasn't predicted. We started this expecting password managers and lock-in utilities to dominate. They do appear, but the AI-app concentration was the strongest categorical pattern, and it's entirely driven by 2025-2026 launches.

[Chart: 30-day rolling average (solid) vs. App Store displayed rating (dashed) for two hero apps, late January through mid-April. Microsoft Authenticator: cumulative 4.70, rolling average falling from 1.8 to 1.42. Headspace: cumulative 4.81, rolling average falling from 2.6 to 2.17 over the same window. Source: AppRundown analysis, App Store data via Sensor Tower, 90-day window.]
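For anyone reproducing that chart: the rolling line is a single window function. A minimal sketch, assuming app_reviews carries a review_date column (our guess at the schema; the appendix query doesn't show it):

SELECT store_app_id,
       review_date,
       AVG(rating::numeric) OVER (
         PARTITION BY store_app_id
         ORDER BY review_date
         RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
       ) AS rolling_30d_avg                 -- the solid line in the chart
FROM   app_reviews
WHERE  detected_language = 'EN'
ORDER  BY store_app_id, review_date;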

What does this study not cover?

Three boundaries to keep in mind when applying these findings.

Small indie apps. Apps with fewer than ~100 cumulative ratings have a cumulative average that's approximately equal to their recent average; there isn't enough history for the two to drift apart. This came up directly when a Reddit reader asked about their own game:

"I am curious how my game would rank with this… could you give it a try if is not too much to ask? https://apps.apple.com/app/leons-mahjong/id6504485812"

We pulled it. Cumulative 4.53 across 32 ratings, only 2 recent reviews in our window (both 5-star, no written content). That's not a flaw in the app; it's the boundary of this analysis. The gap framework only finds signal in apps that were once highly rated and have since degraded; small recent apps don't have the "once" yet.

Star-only ratings. As covered in the Methodology section, we're working with written reviews only. Star-only ratings flow into the cumulative number Apple displays at the top of every product page but don't appear in the scroll-feed below. Different signal, different drift.

Country and language scope. English-only, US App Store reviews. A French-language analysis or a non-Western-market analysis would tell a different story. We don't extend our findings beyond the sample.

We're saving three open questions for follow-up posts.

What would better look like?

Apple already collects far more rating data than it surfaces — App Store Connect's review-monitoring tools give developers per-version ratings, review history, and the same recent-review feed that any user can scroll. The data exists internally; it just isn't aggregated into a rolling number on the consumer App Store page. The minimum viable consumer-facing fix is to surface a 90-day or 1-year rolling average alongside the cumulative on every product page. A reader of the original analysis put it well:

"Well done. Apple needs to learn from this and show something like 3-month, 1-year, total averages."

A reply on the same comment chain agreed: "Ya that would be useful!" That's a product spec a junior PM could ship in a sprint.

There's a partial workaround that's existed since 2022: the reset-on-major-update option that lets a developer zero out cumulative ratings when shipping a new version. Almost no app uses it. The reason is rational from the developer's perspective: a reset surfaces recent reality, and recent reality often looks worse than years of legacy goodwill. Daring Fireball's follow-up clarifying the rating-system critique covered this dynamic in detail. The system Apple gave developers to fix the gap is a tool no developer wants to use.

What can you do today?

If you're an iOS user, you can spot a high-gap app in 30 seconds without leaving the App Store.

For users (4 steps):

  1. Tap into the app's product page and sort the reviews by Most Recent.
  2. Check the date density: if the last 20 reviews span the last 30 days, you have meaningful recent volume.
  3. Skim the 1-star and 2-star reviews for repeated keyword themes (logins broken, paywalled features, sync failures).
  4. Compare against the displayed star at the top of the page. If they're telling different stories, trust the recent ones.

For developers: Use rolling rating, not cumulative rating, as the product-team metric of record. Apple gives you both in App Store Connect. The cumulative number is a lagging legacy indicator; the rolling number is the one your new users actually see scrolling.

For Apple: Surface a 90-day rolling average alongside the cumulative on every product page. The data is collected; the design isn't hard.

A reader of the original Reddit post asked us to carry this further as an in-product feature, not just an article:

"the best of lists are good, I'd keep those, but I'd recommend having a separate list with more apps in it too. Like a list of the 25 most downloaded VPN apps with rating trends and some basic info, and a separate curated top 5 VPNs with a review of each of them."

That's the version of this work we're already maintaining at AppRundown. If you want to act on this analysis directly, start with the recent-rating-ranked lists on our /best/ index.

Frequently Asked Questions

How was this data collected?

We pulled 14,694 English-language reviews from 230 popular iOS apps via the Sensor Tower review API, sampled between January 25 and April 21, 2026. We restricted the analysis to apps with 30 or more recent reviews. Full SQL and methodology details are in the How we pulled the data section above.

Does this prove the App Store cumulative rating is wrong?

No. We're claiming the sample Apple actually displays to a new user, the recent and most-helpful written reviews, has drifted from the cumulative average, sometimes by more than 2 stars. The cumulative is a years-long lifetime average; the drift is what new downloaders see when they scroll the App Store page today before tapping Get.

Why doesn't this analysis apply to small indie apps?

Apps with fewer than ~100 cumulative ratings have a cumulative average that's approximately equal to their recent average; there isn't enough history for the two to drift apart. The gap analysis only finds signal in apps that were once highly rated and have since degraded. New or low-volume apps don't have the historical drag for the gap to develop.

Can I cite or republish this study?

Yes. Please cite as: AppRundown, "App Store Rating Gap Study," apprundown.com/blog/app-store-rating-gap-study-14694-reviews, April 2026. The full 27-app dataset is available as CSV in the Data appendix. We're happy to provide raw data to journalists, email rajesh@apprundown.com.

When will this data be updated?

We re-run this analysis on the 25th of each month against the same 230-app watchlist and post the refreshed gap list. Subscribe at the bottom of this post for monthly updates.

Data appendix

Full 27-app gap > 2★ list, sorted by gap magnitude. Sampled April 26, 2026.

| # | App | Cumulative | Recent | Sample | Gap | AI-labeled |
|---|---|---|---|---|---|---|
| 1 | Microsoft Authenticator | 4.70 | 1.42 | 132 | 3.29 | — |
| 2 | Eatr: Tasty Cooking Recipes | 4.62 | 1.33 | 42 | 3.29 | — |
| 3 | Hypic - Photo Editor & AI Art | 4.72 | 1.45 | 42 | 3.27 | Yes |
| 4 | Beatron : AI Song, Music Maker | 4.31 | 1.43 | 56 | 2.89 | Yes |
| 5 | Photoroom: AI Photo Editor | 4.83 | 2.02 | 129 | 2.81 | Yes |
| 6 | Nextdoor: Neighborhood Network | 4.69 | 1.93 | 206 | 2.76 | — |
| 7 | Fabulous: Daily Habit Tracker | 4.47 | 1.73 | 154 | 2.75 | — |
| 8 | Sniffspot - Private Dog Parks | 4.91 | 2.18 | 39 | 2.73 | — |
| 9 | HelloTalk - Language Learning | 4.60 | 1.90 | 49 | 2.70 | — |
| 10 | 1Password: Password Manager | 4.55 | 1.86 | 101 | 2.69 | — |
| 11 | Headspace: Meditation & Sleep | 4.81 | 2.17 | 107 | 2.64 | — |
| 12 | AI Music, Song Generator・Shoom | 4.13 | 1.58 | 118 | 2.55 | Yes |
| 13 | NordVPN: VPN Fast & Secure | 4.63 | 2.11 | 187 | 2.52 | — |
| 14 | AT&T ActiveArmor® | 4.50 | 2.01 | 113 | 2.49 | — |
| 15 | Brainrot: Screen Time Control | 4.34 | 1.85 | 154 | 2.49 | — |
| 16 | Language Learning: Pingo AI | 4.62 | 2.13 | 83 | 2.48 | Yes |
| 17 | BeautyPlus-Selfie Photo Editor | 4.78 | 2.33 | 51 | 2.45 | — |
| 18 | Photoshop Express Photo Editor | 4.72 | 2.29 | 123 | 2.43 | — |
| 19 | VSCO: Photo Editor & Presets | 4.64 | 2.44 | 48 | 2.21 | — |
| 20 | Calorie Counter & Food Tracker | 4.62 | 2.43 | 128 | 2.20 | — |
| 21 | Photo Editor- | 4.62 | 2.44 | 32 | 2.19 | — |
| 22 | Surfshark VPN: Fast VPN App | 4.71 | 2.54 | 104 | 2.17 | — |
| 23 | Proton VPN: Fast & Secure | 4.52 | 2.36 | 87 | 2.16 | — |
| 24 | Jigsawscapes® - Jigsaw Puzzles | 4.69 | 2.54 | 127 | 2.14 | — |
| 25 | AI Song Generator - Mozart | 4.20 | 2.11 | 105 | 2.09 | Yes |
| 26 | Endel: Focus & Sleep Sounds | 4.64 | 2.57 | 131 | 2.06 | — |
| 27 | BetterSleep: Relax and Sleep | 4.72 | 2.68 | 132 | 2.04 | — |

The exact query that produced this table:

-- Per-app recent averages: written English-language reviews only,
-- keeping apps with at least 30 of them (the inclusion threshold).
WITH recent AS (
  SELECT store_app_id,
         COUNT(*)             AS recent_count,
         AVG(rating::numeric) AS recent_avg
  FROM   app_reviews
  WHERE  detected_language = 'EN'
  GROUP  BY store_app_id
  HAVING COUNT(*) >= 30
)
-- Gap = displayed cumulative rating minus recent average; keep gap > 2.
SELECT s.name,
       ROUND(s.rating::numeric, 2)                AS cumulative,
       ROUND(r.recent_avg, 2)                     AS recent_avg,
       r.recent_count                             AS sample,
       ROUND(s.rating::numeric - r.recent_avg, 2) AS gap
FROM   recent r
JOIN   store_apps s ON s.id = r.store_app_id
WHERE  s.platform = 'ios'
  AND  s.rating IS NOT NULL
  AND  (s.rating::numeric - r.recent_avg) > 2
ORDER  BY gap DESC;

Download raw data: rating-gap-april-2026.csv (CC-BY 4.0).