AI for Customer Insights That Actually Move the Roadmap
10 min read · 20 July 2025

Open the deck slide where the team presents their latest customer-insight findings. Pasted at the top is a five-bullet summary from ChatGPT, generated from 800 reviews exported out of Shopify last Friday. Bullet one: "Customers love the product quality." Bullet two: "Shipping times are a concern." Bullet three: "Packaging could be improved." Bullets four and five are variations of the same. The product team nods, the marketing team nods, and not a single line of the roadmap moves because of it.
This is the universal failure mode of AI customer insight in DTC. The output is technically correct, generic enough to apply to any of your competitors, and useless as a decision input. The fix is not a smarter prompt. It is a structural rethink of what AI is being asked to do.
The Generic-Insight Trap That Kills Roadmap Decisions
Gartner predicts that, through 2026, organisations will abandon 60 percent of AI projects that are not supported by AI-ready data. Drill into the same Gartner research and you find that 63 percent of organisations either lack the right data management practices for AI or are unsure whether they have them. That is the headline number, but the failure mode for voice-of-customer work is more specific. It is not that the data is missing. It is that the data is unstructured, and operators are asking an LLM to do the structuring on the fly.
The unstructured-text problem shows up in every channel. Reviews land in a Shopify export with no codes attached. Support tickets land in Gorgias with a free-text body. Social mentions land in a Sprout Social or a Brand24 dump as raw strings. When you paste 800 of these into ChatGPT and ask for "the top customer complaints", the model does what LLMs do: it summarises. Summarisation is not classification. Summarisation pulls the most frequent surface themes regardless of whether those themes are decision-relevant. That is why every brand running this workflow gets back the same five generic bullets. Their competitors are running the same workflow and getting back the same bullets, because the underlying data is similarly unstructured and the model is similarly running zero-shot.
There is a second, quieter failure. LLM outputs are not deterministic. Run the same prompt twice on the same review file with the model's default temperature, and you get two different summaries with different ordering and different emphasis. Operators rarely realise this until two team members run the same exercise and produce non-matching findings. By that point the roadmap conversation has happened, the bullets have been treated as fact, and the underlying inconsistency is invisible.
Real voice-of-customer work looks nothing like this. One account of Glossier's customer-led product development documents how the brand's Milky Jelly Cleanser was developed from a 400-comment community thread that was coded by hand into specific adjective clusters: "mild", "glowy", "moist". Those are not generic themes. They are brand-owned vocabulary that translated directly into product specification. The team did not paste the thread into a summariser. They built a small, specific taxonomy and forced the data through it. The output drove a real product launch.
The other thing the Glossier example shows is that the taxonomy is the asset, not the AI. Once "mild, glowy, moist" exists as a coding scheme, anyone can apply it to new comments, new reviews, new social mentions. The taxonomy is portable across data sources and stable across runs. The AI summary is portable across nothing and stable across nothing. The brands using AI to do voice-of-customer work without first building the taxonomy are running insight theatre. They are getting an output. They are not getting an asset.
Academic work backs this up. A study of LLM review classification published by Springer compared zero-shot LLM classification against smaller, fine-tuned models on the same data. The fine-tuned models matched or beat the LLMs on classification accuracy at a fraction of the inference cost. The lesson is not that LLMs are bad at this. The lesson is that classification against a defined taxonomy beats freelance summarisation, regardless of which model is doing the work. Get the taxonomy right and a small model does fine. Get the taxonomy wrong and the most expensive frontier model produces noise.
The Insight Taxonomy Engine
The fix is The Insight Taxonomy Engine. It is a five-stage discipline that puts a brand-specific coding scheme on top of every AI customer-insight workflow. The AI is not freelancing themes. It is classifying customer text against a taxonomy the brand already cares about, with seeded prompts and pinned model versions so the output is repeatable across runs.
The five stages are: motivations, friction points, switching triggers, brand-owned vocabulary, and churn drivers. Motivations are why the customer bought (the rational and emotional reasons). Friction points are where the experience breaks (sizing, checkout, shipping, returns). Switching triggers are what would push them to a competitor (price, availability, brand misalignment). Brand-owned vocabulary is the language unique to your category (Glossier's "mild, glowy, moist"). Churn drivers are the patterns that precede a customer leaving (typically a sequence of friction points compounded over two to three orders).
Each stage gets defined as a list of codes with examples. "Sizing" is not a code. "Sizing runs small at chest" is a code with three example sentences attached. "Slow shipping" is not a code. "Promised 5-7 day shipping arrived in 14+ days" is a code. The codes are specific enough that two team members reading the same review would tag it the same way. That is the test of a real taxonomy.
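A codebook entry of this kind can be written down as a small data structure. The sketch below is one possible shape, not a prescribed format: the `Code` class, the `FRC-03` ID scheme, and the three example sentences are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass

# The five stages named in the text.
STAGES = (
    "motivation",
    "friction_point",
    "switching_trigger",
    "brand_vocabulary",
    "churn_driver",
)


@dataclass(frozen=True)
class Code:
    code_id: str               # hypothetical ID scheme, e.g. "FRC-03"
    stage: str                 # must be one of STAGES
    definition: str            # the specific statement, not a broad theme
    examples: tuple[str, ...]  # example sentences taken from real interactions

    def __post_init__(self) -> None:
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")
        if len(self.examples) < 2:
            raise ValueError(f"{self.code_id}: attach at least two example sentences")


# Illustrative entry; the example sentences are made up for this sketch.
sizing_chest = Code(
    code_id="FRC-03",
    stage="friction_point",
    definition="Sizing runs small at chest",
    examples=(
        "Fits everywhere except the chest, which is tight.",
        "Had to size up because the chest was too snug.",
        "My usual medium would not button across the chest.",
    ),
)
```

Rejecting a code that has no example sentences is a cheap way to enforce the "two team members would tag it the same way" test before the codebook reaches a model.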
The Insight Taxonomy Engine is built once, then maintained quarterly. New codes get added when a critical mass of customer mentions does not fit existing codes. Old codes get retired when the underlying issue has been fixed and mentions stop arriving. The taxonomy is a living document, but it is owned by a single named person on the team, usually the customer experience lead or the senior product manager. Without an owner, the taxonomy drifts, and a drifting taxonomy is no taxonomy at all.
I have built taxonomies with operators across categories. The pattern that works is to start with 25 to 40 codes, distributed across the five stages. Beauty brands tend to have richer brand-owned vocabulary. Apparel brands tend to have richer friction points (sizing, returns). Subscription food brands tend to have richer churn drivers (palate fatigue, dietary changes). The exact composition matters less than the discipline of doing the codification work before any AI workflow is built on top.
Enterpret's voice-of-customer platform implements this pattern as a product. It unifies feedback from over 50 sources and auto-classifies against a brand-owned taxonomy that the customer team curates. It is not the only path. Plenty of brands run The Insight Taxonomy Engine on a Google Sheet, a carefully tuned Claude prompt, and a weekly review meeting. The point is the taxonomy comes first. The tool is downstream.
Phase 1: Taxonomy Definition Workshop (Days 1-30)
Day 1 of The Insight Taxonomy Engine is a customer-research workshop, not a tool selection. Get the customer experience lead, the senior product manager, and one or two members of the support team in a room. Pull a representative sample of 100 to 200 customer interactions: reviews, tickets, post-purchase survey responses, social DMs, return-request notes. The sample should span at least 90 days and cover all major SKUs.
Read every interaction. Yes, all of them. The team writes down candidate codes in real time as they read. By interaction 50, the codes are settling into clusters. By interaction 100, the team is recognising repetition and can start grouping. By interaction 200, you have a draft taxonomy of 30 to 50 codes across the five stages.
This is slow work. It typically takes two full days. Operators resist this because it feels like the work the AI is supposed to do. It is not. The AI cannot do this work because it does not know which codes matter to your brand. The team has to make those calls, and the codes that emerge are the brand's competitive intelligence asset. Outsource this and you outsource the asset.
By Day 10, the draft taxonomy is shared with the broader team for sense-checking. The merchandising lead reviews the friction-point codes. The CFO reviews the churn driver codes. The brand director reviews the brand-owned vocabulary. Each owner can add, merge, or split codes. By Day 20, the taxonomy is locked. Print it. Pin it to the wall. Get the team to refer to it in standups.
By Day 30, the taxonomy is ready to feed into an LLM coding pass. The deliverable from Phase 1 is the codebook, with definitions, examples, and an owner per stage. Without this, Phase 2 produces hallucinations dressed as insight.
Phase 2: The Seeded LLM Coding Pass (Month 2-3)
Phase 2 is where the AI does work. Take the locked taxonomy and build a single, seeded prompt that asks the model to classify any customer interaction against the codes. Prompt structure matters. Provide the code list with definitions and two to three example sentences per code. Ask the model to return code IDs, not free-text summaries. Pin the model version (do not let it auto-upgrade). Set the temperature to zero, which reduces run-to-run variation but does not always eliminate it, so test for it. Run the same prompt against a 50-interaction holdout twice. The two outputs should match within 5 percent. If they do not, the prompt needs refining or the codebook needs sharpening.
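The prompt structure and the repeatability check can be sketched without committing to any particular model API. In the sketch below, `build_prompt`, `disagreement_rate`, and `is_stable` are hypothetical helper names, the prompt wording is illustrative, and "match within 5 percent" is read as "no more than 5 percent of holdout interactions receive a different set of code IDs across the two runs", which is an assumption about what the threshold means.

```python
def build_prompt(codebook: list[dict], interaction: str) -> str:
    """Render the codebook and one interaction into a classification prompt."""
    lines = [
        "Classify the customer interaction against the codes below.",
        "Return only a comma-separated list of code IDs, or NONE.",
        "",
    ]
    for code in codebook:
        lines.append(f"{code['code_id']}: {code['definition']}")
        for example in code["examples"]:
            lines.append(f"  example: {example}")
    lines += ["", f"Interaction: {interaction}"]
    return "\n".join(lines)


def disagreement_rate(run_a: list[set[str]], run_b: list[set[str]]) -> float:
    """Share of interactions whose code-ID sets differ between two runs."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and cover the same interactions")
    mismatches = sum(1 for a, b in zip(run_a, run_b) if a != b)
    return mismatches / len(run_a)


def is_stable(run_a: list[set[str]], run_b: list[set[str]],
              tolerance: float = 0.05) -> bool:
    """True when the two runs differ on at most `tolerance` of interactions."""
    return disagreement_rate(run_a, run_b) <= tolerance
```

Each run is a list with one set of code IDs per holdout interaction, in the same order. Comparing sets rather than raw strings means a change in the order of returned IDs does not count as a mismatch.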
Once the prompt is stable, batch-run it against the full backlog of customer interactions. Most $1M-$10M brands have between 5,000 and 50,000 untagged interactions sitting in Shopify reviews, Gorgias tickets, and post-purchase surveys. The batch run takes a few hours and a few hundred dollars in API spend. The output is a structured table with one row per interaction and one or more code IDs per row. That table is the asset. Every dashboard, every roadmap input, every CX briefing reads from it.
The key discipline at this stage is the 50-mention rule. A code with under 50 mentions does not drive a roadmap decision. It might be a real signal, but the volume is not high enough to invest against without further validation. Codes with 50 to 199 mentions become roadmap candidates. Codes with 200 or more mentions become roadmap priorities. The threshold prevents the team from chasing low-volume noise that surfaces because the LLM was creative on a small sample.
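Applied to the structured table from the batch run, the rule is a count followed by a threshold check. A minimal sketch, assuming each row is a dict with a `code_ids` list (a hypothetical layout) and using "watch" as an invented label for codes below the threshold:

```python
from collections import Counter


def tier(mentions: int) -> str:
    """Map a mention count to a roadmap tier using the 50-mention rule."""
    if mentions >= 200:
        return "priority"
    if mentions >= 50:
        return "candidate"
    return "watch"  # label is an assumption; the rule only says "not a driver"


def tier_codes(tagged_rows: list[dict]) -> dict[str, tuple[int, str]]:
    """Count mentions per code and attach a tier.

    A code listed twice on the same interaction counts as one mention.
    """
    counts = Counter(
        code_id for row in tagged_rows for code_id in set(row["code_ids"])
    )
    return {code_id: (n, tier(n)) for code_id, n in counts.items()}
```

Counting each code once per interaction keeps a single long, repetitive ticket from inflating a code past the threshold.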
Research on LLM analysis of e-commerce reviews frames this trade-off as the difference between perceptual signals and actionable ones. Perceptual signals are everything the customer mentions. Actionable signals are the subset that meet a volume and severity threshold worth acting on. The Insight Taxonomy Engine is biased toward actionable. Operators who treat every coded mention as worth pursuing burn the roadmap. Operators who use the volume threshold get the actual signal.
Gorgias AI Agent plays a useful role here on the support-ticket leg. Gorgias surfaces ticket themes natively and can feed into the same taxonomy via webhook. The support-ticket data is typically the highest-volume input, and integrating it correctly takes about a sprint. Once it is wired, the support team's daily workload becomes a continuous feed into the same code system the product team is reading from.
Phase 3: The Sprint-Cadence Review (Every Two Weeks, Plus a Quarterly Taxonomy Review)
Phase 3 is the operating rhythm. Every two weeks, the customer experience lead pulls the latest tagged data, sorts codes by mention volume and trend (rising, flat, declining), and walks the product and merchandising teams through the top 10 movers. Codes that crossed the 50-mention threshold this period are flagged as new roadmap candidates. Codes that have been declining for three consecutive periods get noted as fixes that worked.
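The review itself reduces to a few comparisons over per-period mention counts. The sketch below assumes each code has a list of counts, one per two-week period, oldest first. The 10 percent band that separates "flat" from "rising" or "declining" is an invented default, and ranking movers by absolute change is one reading of "top movers", not something the process prescribes.

```python
def trend(previous: int, current: int, flat_band: float = 0.10) -> str:
    """Label the period-over-period change as rising, flat, or declining."""
    if previous == 0:
        return "rising" if current > 0 else "flat"
    change = (current - previous) / previous
    if change > flat_band:
        return "rising"
    if change < -flat_band:
        return "declining"
    return "flat"


def crossed_threshold(previous: int, current: int, threshold: int = 50) -> bool:
    """True when a code moved from below the threshold to at or above it."""
    return previous < threshold <= current


def declining_three_periods(history: list[int]) -> bool:
    """True if each of the last three periods is lower than the one before."""
    if len(history) < 4:
        return False
    last = history[-4:]
    return all(later < earlier for earlier, later in zip(last, last[1:]))


def top_movers(history_by_code: dict[str, list[int]],
               n: int = 10) -> list[tuple[str, int, str]]:
    """Rank codes by absolute change between the last two periods."""
    rows = []
    for code_id, history in history_by_code.items():
        if len(history) < 2:
            continue  # a single period has no trend yet
        previous, current = history[-2], history[-1]
        rows.append((code_id, current - previous, trend(previous, current)))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)[:n]
```

Whatever band and ranking the team settles on, fixing them in code keeps the definition of "rising" the same from one review to the next.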
A VoC roadmap template provides a useful sprint-cadence skeleton. Adapt the structure to your team's planning rhythm. The non-negotiable elements are the 50-mention rule and the requirement that every roadmap input cite a specific code with mention count and trend, not a generic summary. "Improve checkout" is not a roadmap input. "Code FRC-12, 'Apple Pay button missing on cart drawer', 84 mentions, rising 30 percent month-over-month" is a roadmap input.
Quarterly, the taxonomy itself gets reviewed. The owner walks through the codes, flags any that have been silent for two quarters (candidates for retirement), and proposes new codes for issues the existing taxonomy does not capture. The codebook is never frozen. It evolves as the product evolves and as the customer base shifts. The discipline is the regular review, not the static structure.
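Both halves of the quarterly review can be prepared from the tagged data before the meeting. A minimal sketch, assuming per-code mention counts by quarter (oldest first) and rows with a `code_ids` list; the function names are hypothetical, and using the share of uncoded interactions as the prompt for new codes is an assumption about how to spot gaps.

```python
def retirement_candidates(quarterly_counts: dict[str, list[int]],
                          silent_quarters: int = 2) -> list[str]:
    """Codes with zero mentions in each of the last `silent_quarters` quarters."""
    return sorted(
        code_id
        for code_id, counts in quarterly_counts.items()
        if len(counts) >= silent_quarters
        and all(count == 0 for count in counts[-silent_quarters:])
    )


def uncoded_share(tagged_rows: list[dict]) -> float:
    """Share of interactions that matched no code at all.

    A rising share suggests issues the current taxonomy does not capture.
    """
    if not tagged_rows:
        return 0.0
    uncoded = sum(1 for row in tagged_rows if not row["code_ids"])
    return uncoded / len(tagged_rows)
```

The output is a shortlist for the owner to review, not an automatic retirement: a code can go silent because the issue was fixed or because the prompt stopped recognising it.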
From Insight Theatre to Roadmap Asset
The shift The Insight Taxonomy Engine produces is not a smarter dashboard. It is a different category of decision-making. Before, the team ran an AI summary, got bullets, nodded, and the roadmap moved by gut. After, every roadmap input traces back to a coded mention count, a trend, and a brand-owned definition. The AI is doing what AI is good at (classification at scale), and the team is doing what only the team can do (deciding which codes matter and what to do about them).
The metric that signals success is the share of roadmap items in the current quarter that cite a specific code from the taxonomy. At Day 0, that share is zero. After two full quarters of running The Insight Taxonomy Engine, it should be 70 to 90 percent. The remaining 10 to 30 percent will be strategic bets that do not originate in customer voice (a new category launch, a margin-protection rebuild). That is the right ratio. Voice of customer is most of the roadmap, but not all of it.
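The metric is simple enough to compute from a roadmap export. A minimal sketch, assuming each roadmap item is a dict with an optional `code_ids` list (a hypothetical layout) and counting an item as code-cited only if at least one cited ID exists in the current codebook:

```python
def code_cited_share(roadmap_items: list[dict], codebook_ids: set[str]) -> float:
    """Share of roadmap items that cite at least one code from the codebook."""
    if not roadmap_items:
        return 0.0
    cited = sum(
        1
        for item in roadmap_items
        if set(item.get("code_ids", [])) & codebook_ids
    )
    return cited / len(roadmap_items)
```

Checking cited IDs against the live codebook means an item that references a retired or mistyped code does not count toward the target.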
The brands still pasting reviews into ChatGPT will continue to get the same five generic bullets. The brands that build the taxonomy will get something rarer: a customer-insight asset that compounds. Each quarter the codebook gets sharper. Each quarter the AI coding gets cheaper because the prompt is more specific. Each quarter the roadmap gets more defensible because every line item cites a real code. That is the difference between insight theatre and a real customer-insight system.