Welcome to my blog, where I share insights on product management, combining lessons from education, certifications, and experience. From tackling challenges to refining processes and delivering results, I offer practical advice and perspectives for product managers and teams. Whether you’re new or experienced, I hope these articles inspire and inform your journey.
That Time Our File Upload 'Worked' But Nobody Could Find Their Files
"The file upload is working perfectly," our frontend developer announced during standup. "I can drag and drop, see the progress bar, get the success message—everything's smooth."
"Great," I said. "Have you tested it with the client's content team?"
Twenty minutes later, my Slack lit up with messages from the client.
"I uploaded the PDF but can't find it anywhere."
"The files are uploading but not showing in the media library."
"Are the documents supposed to disappear after upload?"
I logged into their system and immediately saw the problem. The files were uploading successfully—they were just being stored in a completely different location than where the interface was looking for them. Our upload was working perfectly. Our file browsing was working perfectly. They just weren't talking to each other.
The Path Configuration Maze
Here's what we'd built: a beautiful, intuitive file upload interface that dropped files into /storage/uploads/files/. And an equally beautiful file browser that looked for files in /public/media/documents/. Both components worked flawlessly in isolation. Together, they were useless.
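Stripped to its essence, the disconnect looked something like the sketch below. The directory names are the ones from this story; the function names and Node-style file calls are purely illustrative, not Twill's actual code.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// The uploader and the file browser each had their own idea of where files live.
const UPLOAD_DIR = "/storage/uploads/files";          // where the upload handler wrote
const MEDIA_LIBRARY_DIR = "/public/media/documents";  // where the browser looked

function handleUpload(filename: string, contents: string): void {
  // The upload "succeeds": the file lands on disk and the UI shows a green check.
  fs.mkdirSync(UPLOAD_DIR, { recursive: true });
  fs.writeFileSync(path.join(UPLOAD_DIR, filename), contents);
}

function listMediaLibrary(): string[] {
  // The browser reads a different directory, so fresh uploads never appear here.
  return fs.existsSync(MEDIA_LIBRARY_DIR) ? fs.readdirSync(MEDIA_LIBRARY_DIR) : [];
}
```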
The problem wasn't obvious during development because we'd been testing with the same few files over and over. Upload test.pdf, see test.pdf in the browser, everything works. But we hadn't been testing the full workflow—we'd been testing individual components.
When the International Energy Agency's content team started uploading their actual documents—dozens of research reports, policy briefs, and data sheets—every file vanished into a digital black hole. Successfully uploaded, completely unfindable.
The support ticket they filed was diplomatic but pointed: "File uploads appear to work, but we can't access any uploaded files through the admin interface. Are we missing a step?"
The Development vs Production Reality Gap
During local development, everything had worked fine because we'd been lazy about configuration. Local file paths, development storage settings, simplified folder structures—all fine when there's one developer testing with the same three files.
But production environments have different rules:
Security policies that restrict where files can be written
CDN configurations that affect how assets are served
Load balancers that might serve requests from different servers
Docker containers with ephemeral filesystems
Cloud storage that maps local paths to remote buckets
Each hosting setup had its own way of handling file storage, and we'd made assumptions about how paths would work that were true for our development environment but false everywhere else.
The worst part was that the error wasn't visible. Files uploaded successfully (from the server's perspective) and the database recorded their locations correctly (from the application's perspective). But the media library interface couldn't find them because it was looking in the wrong place (from the user's perspective).
The Mental Model Mismatch
The deeper issue wasn't technical—it was conceptual. We'd built the file system around how developers think about file storage (absolute paths, directory structures, filesystem hierarchies) instead of how content teams think about file organization (categories, projects, usage contexts).
Content teams wanted to upload a document to "the media library" and find it in "the documents section." They didn't care about storage/uploads vs public/media. They didn't want to understand directory structures. They just wanted to put a file somewhere and get it back later.
But our interface was exposing all the technical complexity of file paths and storage configurations. Upload a PDF, and you'd need to know which folder it ended up in, what the final URL structure was, and how the CDN was configured to serve it.
During a screen-sharing session with Nike's content team, I watched their marketing coordinator spend ten minutes looking for a file she'd just uploaded. She clicked through every folder in the media library, searched by filename, filtered by date—nothing. The file existed, but it was in a location that didn't correspond to any category or folder structure she could see.
"This is confusing," she said, with the kind of patience that comes from dealing with broken tools all day. "I just want to upload a PDF and be able to find it again."
The Configuration Explosion
Our first attempt to fix this was to make file paths configurable. Add environment variables for upload directories, storage locations, public URLs, CDN prefixes—give administrators complete control over where files go and how they're accessed.
This made the problem worse.
Now instead of files disappearing into one wrong location, they could disappear into dozens of wrong locations depending on how someone configured their environment variables. Our GitHub issues exploded with path-related problems:
"Files upload to local storage but URLs point to S3" (Issue #79)
"Media library shows files that don't exist" (Issue #383)
"Upload path works on dev but breaks on production" (Issue #456)
"Files upload successfully but serve 404 errors" (Issue #521)
Each issue required forensic debugging to figure out which combination of configuration settings had created which particular flavor of broken file handling.
The Support Ticket That Changed Everything
The breaking point was a support ticket from a client that I'll never forget:
"We've been using Twill for three months. Our content team has uploaded over 200 files. We just realized that none of them are actually accessible on the live website. All the upload confirmations were lies. Do we need to re-upload everything?"
I stared at that ticket for a long time. We'd built a file upload system that could successfully lie to users for months. Files appeared to upload correctly, showed up in admin interfaces, but weren't actually available to website visitors because of a mismatch between internal storage paths and public serving URLs.
The content team had been building pages, adding documents, creating workflows—all based on the assumption that their files were working correctly. They weren't malicious or careless; they were trusting the interface to tell them the truth about whether their uploads had succeeded.
What We Actually Built
The fix wasn't more configuration options—it was fewer configuration options with better defaults.
Unified storage handling: Files go in one place, get served from one place, with automatic path resolution that works the same way across environments.
Upload validation: After a file uploads successfully, the system immediately tries to access it via the public URL. If that fails, the upload fails with a clear error message (sketched after this list).
Visual confirmation: The media library shows actual file previews and download links, so users can immediately verify that uploads worked correctly.
Environment detection: The system automatically detects common hosting configurations and sets appropriate defaults instead of requiring manual path configuration.
Error surfacing: When file paths are misconfigured, the system shows clear error messages instead of pretending everything is working.
Path testing tools: Built admin tools that let developers verify file upload and serving configurations before content teams start using them.
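To make the upload-validation idea concrete, here is a minimal sketch of the round-trip check, written in TypeScript rather than Twill's actual stack; the upload callback and the UploadResult shape are hypothetical stand-ins for whatever the storage layer returns.

```typescript
// Sketch of post-upload verification: don't report success until the file is
// actually reachable at the URL the media library and frontend will use.
interface UploadResult {
  publicUrl: string; // the URL visitors and the media library will fetch
}

async function uploadAndVerify(upload: () => Promise<UploadResult>): Promise<UploadResult> {
  const result = await upload();

  // Immediately try to fetch the file the same way a visitor would.
  const response = await fetch(result.publicUrl, { method: "HEAD" });

  if (!response.ok) {
    // Surface a real error instead of letting the UI show a green checkmark
    // for a file nobody can reach.
    throw new Error(
      `File stored, but not reachable at ${result.publicUrl} (HTTP ${response.status}). ` +
      `Check the storage disk and public URL configuration.`
    );
  }
  return result;
}
```

The point of the check is that success is defined from the user's perspective (the file can be fetched), not the server's (the file was written somewhere).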
The Lesson About User Mental Models
The file upload disaster taught me that users don't care about your technical architecture—they care about their mental model of how the system should work.
Content teams think in terms of "I uploaded a file, so now I should be able to use that file." They don't think about storage backends, CDN configurations, or directory structures. When the system confirms that an upload succeeded, they trust that confirmation.
If your technical implementation doesn't match that mental model, you need to either change the implementation or change the interface. You can't just document the complexity and hope users will understand it.
The most dangerous kind of broken feature is the one that appears to work. Failed uploads are annoying but obvious. Successful uploads that create unusable files are silently destructive—they let users build workflows on quicksand.
These days, when we design file handling features, we test the complete round trip: upload, storage, serving, access, download. We don't just test that uploads succeed—we test that uploaded files can actually be used for their intended purpose.
And we've learned to be suspicious of features that work perfectly in development. If path configuration works great on your laptop with simplified settings, it probably doesn't work great on a production server with security policies, CDN layers, and multiple environments.
Why I Stopped Promising 'It Works Out of the Box' (Especially with Imgix)
"Image optimization is built right in," I told the prospect during our demo call. "We integrate with Imgix, so your images automatically get optimized for every device and connection speed. It's seamless—just upload and go."
The demo looked perfect. Upload an image to Twill's media library, and it appeared on the frontend crisp, fast, and properly sized. No configuration, no technical setup, just beautiful images that loaded instantly.
Three weeks after they signed the contract, I got a very different kind of call.
"Tom, none of our images are displaying. Everything's showing broken image icons."
I remoted into their staging environment and immediately saw the problem. The Imgix configuration was pointing to a test account that had expired. What looked like "seamless integration" in our demo was actually a house of cards built on hardcoded credentials that worked great until they didn't.
The Demo Magic Behind the Curtain
Here's what our "it works out of the box" demo actually required:
A pre-configured Imgix account with specific domain settings
Environment variables set up exactly right across dev, staging, and production
S3 bucket permissions that had been tweaked through hours of trial and error
CDN settings that our DevOps engineer had quietly fixed after the first dozen attempts
Fallback image handling for edge cases that we'd never mentioned
CORS configurations that worked for our demo domain but broke on client domains
None of this was visible during the demo. Upload an image, see it display optimized—magic! But behind that magic was a configuration maze that would take any new team hours to navigate.
The worst part? Every client environment was slightly different. Different hosting setups, different security policies, different domain configurations, different image workflows. What worked perfectly in our controlled demo environment broke in a dozen creative ways when deployed to the real world.
The GitHub Issues That Told the Real Story
Within six months, our most common support tickets all had the same theme:
"Imgix integration not working on production" (Issue #196)
"Images display locally but not after deployment" (Issue #314)
"Imgix URLs generating but returning 403 errors" (Issue #610)
"Development vs production image paths inconsistent" (Issue #743)
Each ticket represented hours of back-and-forth troubleshooting. Screenshot exchanges. Environment comparisons. Configuration file deep-dives. The kind of technical archaeology that makes everyone involved question their life choices.
The pattern was always the same:
Client follows our "simple" setup instructions
Something doesn't work in their specific environment
We spend 3-5 hours debugging their particular combination of hosting, domain setup, and security settings
We eventually find the one configuration detail that was different from our demo environment
We add another bullet point to our setup documentation
The next client hits a different combination of issues
Our documentation grew from a clean two-page setup guide to a sprawling troubleshooting manual full of "if you're using CloudFlare, then..." and "some hosting providers require..." disclaimers.
The Configuration Nightmare Reality
The real problem wasn't Imgix—it's actually a great service. The problem was that we'd marketed a complex integration as simple. Image optimization involves a lot of moving pieces:
Source configuration: Where are your images stored? S3? Local filesystem? Different buckets for different environments?
Domain setup: What domain should Imgix use to fetch your images? How does that work with CDNs? What about HTTPS certificates?
Path mapping: How does Twill's internal file structure translate to Imgix URLs? What happens with nested folders? Special characters in filenames?
Security settings: Who can access your images? How do you handle private content? What about hotlinking protection?
Fallback handling: What displays when Imgix is down? When an image doesn't exist? When the optimization service is overloaded?
Performance tuning: Which optimization settings work for your content? How do you handle different image types? What about progressive loading?
Each of these had multiple valid approaches, depending on the client's infrastructure and requirements. But our demo assumed one specific setup that worked great in our controlled environment and broke in creative ways everywhere else.
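To give a feel for how those pieces interact, here is a small sketch of just the path-mapping and fallback parts. The configuration shape and names are hypothetical; the w and auto query parameters are standard Imgix rendering options. The fallback branch is the part our demo never exercised.

```typescript
// Sketch: translate an internally stored file path into the URL actually served.
interface ImgixConfig {
  host?: string;       // e.g. "your-source.imgix.net" (hypothetical, environment-specific)
  pathPrefix?: string; // e.g. "uploads" (must match how the Imgix source was created)
}

function imageUrl(cfg: ImgixConfig, storedPath: string, width: number): string {
  if (!cfg.host) {
    // Fallback: serve the original, unoptimized file rather than a broken image.
    return `/${storedPath}`;
  }
  const cleanPath = [cfg.pathPrefix, storedPath]
    .filter(Boolean)
    .join("/")
    .replace(/\/{2,}/g, "/");
  // "w" and "auto" are standard Imgix rendering parameters.
  return `https://${cfg.host}/${cleanPath}?w=${width}&auto=format,compress`;
}
```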
The OpenAI Reality Check
The breaking point came when we were setting up Twill for OpenAI's content team. This should have been straightforward—they had sophisticated infrastructure, experienced developers, and clear requirements.
It took two weeks to get images working correctly.
Not because anyone was incompetent, but because their security policies, CDN configuration, and deployment pipeline introduced variables we'd never encountered. Our "simple" Imgix integration required custom middleware, modified environment handling, and a completely different approach to asset URLs.
"This doesn't feel like 'plug and play,'" their developer told me during one of our debugging sessions. "It feels like we're building a custom integration from scratch."
He was right. We'd packaged a complex system as a simple feature, and the complexity was exploding in their environment instead of being handled gracefully in ours.
What We Actually Built Instead
The fix wasn't technical—it was philosophical. We stopped trying to hide complexity and started helping people manage it.
Configuration templates: Instead of one "simple" setup, we created different configuration templates for common hosting scenarios. AWS + CloudFront. Heroku + S3. Traditional hosting. Each with its own specific instructions.
Environment detection: Built tools that could identify common configuration issues and suggest fixes instead of just failing silently.
Fallback strategies: Made the system degrade gracefully when integrations weren't configured perfectly, so clients could launch with basic functionality while sorting out optimization.
Modular integration: Broke the Imgix integration into optional components so teams could adopt parts of it without needing the full setup working perfectly.
Honest documentation: Rewrote our setup guides to be upfront about complexity instead of hiding it. "This will take 2-3 hours to configure properly" instead of "works seamlessly out of the box."
Configuration validation: Added tools that could test whether integrations were working correctly before deployment, catching issues in development instead of production.
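The configuration-validation tooling boiled down to exercising the full fetch path with a known asset before anyone depends on it. A rough sketch of that kind of check, with hypothetical names and nothing Twill-specific:

```typescript
// Sketch of a pre-deployment check: take a known test asset and verify that both
// the raw storage URL and the optimized URL actually return an image.
interface CheckResult {
  name: string;
  ok: boolean;
  detail: string;
}

async function checkUrl(name: string, url: string): Promise<CheckResult> {
  try {
    const res = await fetch(url, { method: "HEAD" });
    const type = res.headers.get("content-type") ?? "";
    const ok = res.ok && type.startsWith("image/");
    return { name, ok, detail: ok ? "ok" : `HTTP ${res.status}, content-type "${type}"` };
  } catch (err) {
    return { name, ok: false, detail: String(err) };
  }
}

async function validateImagePipeline(storageUrl: string, optimizedUrl: string): Promise<void> {
  const results = await Promise.all([
    checkUrl("raw storage", storageUrl),
    checkUrl("optimized (Imgix)", optimizedUrl),
  ]);
  for (const r of results) {
    console.log(`${r.ok ? "PASS" : "FAIL"} ${r.name}: ${r.detail}`);
  }
}
```

Running something like this in staging catches the "works locally, 403s in production" class of issue before a content team ever sees a broken image icon.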
The Conversation That Changed Everything
The shift happened during a client onboarding call where I decided to be completely honest about what "integration" actually meant.
"Twill includes Imgix support," I told them, "but getting it configured for your specific environment will probably take an afternoon. Here's what you'll need to set up, here are the common issues people run into, and here's how we can help you through it."
Instead of being frustrated, they were relieved.
"Thank you for being upfront about that," their technical lead said. "We can plan for it properly instead of being surprised when it doesn't work immediately."
That client onboarding went smoother than any of our "seamless" demo-driven setups. They allocated proper time for configuration, asked good questions upfront, and ended up with a more robust setup than clients who expected everything to work magically.
The Lesson About Third-Party Integrations
The Imgix integration taught me that "seamless" third-party integrations are usually an illusion. Every external service introduces variables you can't control: their API changes, their service limits, their security requirements, their pricing models, their uptime.
When you promise that an integration "just works," you're not just promising that your code works—you're promising that your code will work with their service, in the client's environment, under their constraints, with their security policies, and with their particular combination of other tools.
That's a promise you can't keep, because you don't control all those variables.
What you can promise is good documentation, helpful error messages, clear fallback behavior, and support when things don't work as expected. That's less sexy in a demo, but it's way more valuable in production.
These days, when we demo third-party integrations in Twill, we show both the happy path and the configuration process. We talk about what happens when things go wrong. We're upfront about the time investment required to get everything working smoothly.
It makes for less magical demos, but much happier clients. Because the goal isn't to win the demo—it's to deliver a system that actually works in their environment, with their constraints, for their team.
The best marketing isn't promising that complex things are simple. It's being honest about complexity while providing great tools to manage it.
"It works out of the box" became "It works reliably once you've set it up properly, and we'll help you get there." Less catchy, way more true.
The Great Media Library Cleanup: When Storage Costs Became a Wake-Up Call
The AWS bill that landed in my inbox on a Thursday morning made me do a double-take. Our S3 storage costs had tripled in six months. Not gradually—suddenly, dramatically, like someone had flipped a switch.
"Are we backing up everything twice now?" I asked our DevOps engineer.
"No, it's all media files," he replied. "Your Twill projects are uploading a lot of assets."
I pulled up the storage analytics and stared at the numbers. 847GB of images, videos, and documents across our client projects. That seemed... excessive. Especially since most of our sites were fairly straightforward corporate and campaign pages.
"How much of this is actually being used?" I asked.
That's when we discovered the problem that every CMS faces but nobody talks about: digital hoarding.
The Upload-and-Forget Problem
Here's how it typically happened: A content team would get access to Twill's media library for the first time. They'd see the clean, organized interface—folders, tags, search functionality—and think, "Perfect, let's get everything uploaded so we have it when we need it."
Then they'd dump in every asset from their last three campaigns. Every photo from the product shoot (including the 47 takes of the same angle). Every version of every logo (including the ones with the tiny typo that got fixed). Every video file (including the raw footage that was only supposed to be for internal review).
The media library was so easy to use that it became a digital junk drawer. Upload first, organize later. Except "later" never came.
We started noticing patterns in our GitHub issues:
"Media library becomes slow with large numbers of files" (Issue #154)
"Search functionality times out with 10k+ assets" (Issue #298)
"File browser pagination breaks with massive datasets" (Issue #445)
But the real wake-up call was realizing that most of these files were digital ghosts. Uploaded with good intentions, never actually used in any published content, but still sitting in S3 accumulating charges month after month.
The Reference Tracking Nightmare
The obvious solution seemed simple: build a tool to identify unused media files and delete them. How hard could it be?
Very hard, as it turns out.
Twill's flexible architecture meant that media files could be referenced in dozens of different ways:
Direct references in block content
Background images in CSS customizations
Featured images attached to modules
Gallery collections that might be used across multiple pages
PDF files linked in rich text fields
Video files embedded in custom blocks
Images used in email templates
Assets referenced in JSON fields for API integrations
Just because a file wasn't visibly displayed on a page didn't mean it wasn't being used. And just because it was being used today didn't mean it would be used tomorrow when the content team updated that campaign page.
Our first attempt at cleanup was a disaster. We built a script that identified "unused" files by checking database references, ran it on a staging environment, and confidently deleted 200+ assets that appeared to be orphaned.
Then we pushed to production and watched as half of the client's image galleries turned into broken image placeholders.
The files weren't unused—they were being referenced dynamically through a custom field structure that our cleanup script didn't know about. We spent the next six hours frantically restoring files from backups while the client's marketing team wondered why their perfectly good website had suddenly broken.
The Real Cost of Digital Hoarding
Storage costs were just the tip of the iceberg. The bigger problems were more subtle:
Performance degradation: Media library interfaces that worked fine with 100 files became unusable with 10,000. Content teams started complaining that finding the right image took longer than creating the content that used it.
Decision paralysis: When you have 500 product photos in a folder, choosing the right one becomes overwhelming. Content creators would spend more time browsing assets than actually building pages.
Version confusion: Multiple uploads of similar files led to constant questions: "Is this the approved logo?" "Which version of this image should I use?" "Is the high-res version the same as the web version?"
Backup complexity: Our deployment and backup processes slowed down dramatically as they tried to sync massive asset directories across environments.
Support overhead: Every "my image isn't displaying" ticket required forensic work to figure out which of the twelve similar filenames was the right one.
The International Energy Agency project made this painfully clear. They'd upload batches of charts and infographics for their energy reports, but by the time they were ready to publish, they couldn't remember which files were the final versions and which were works-in-progress. We spent more time on asset archaeology than actual content management.
What Actually Worked
Instead of building a perfect automated cleanup system, we learned to attack the problem from multiple angles:
Upload governance: Added file naming conventions and folder structures during client onboarding. Not exciting, but way more effective than trying to organize chaos after the fact.
Version control: Built simple tools for marking files as "draft," "approved," or "archived" so teams could manage their own asset lifecycles without accidentally deleting something important.
Usage tracking: Instead of trying to reverse-engineer what was being used, we started tracking when files were actually served to end users. Files that hadn't been requested in 6+ months got flagged for review (sketched after this list).
Bulk operations: Added tools for content managers to select and delete multiple files at once, making cleanup feel manageable instead of overwhelming.
Storage policies: Implemented automatic archiving for files older than a certain age, with easy restoration if someone actually needed something from the archives.
Client education: Started having explicit conversations during project kickoffs about asset management strategies, not just technical capabilities.
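The usage-tracking piece is worth sketching, because it replaced reference archaeology with a much simpler question: has anyone actually requested this file recently? The field names and the six-month threshold below are illustrative, not the exact implementation.

```typescript
// Sketch: flag assets for review based on when they were last served,
// rather than trying to trace every place they might be referenced.
interface AssetRecord {
  id: number;
  filename: string;
  lastServedAt: Date | null; // updated whenever the file is delivered to a visitor
}

const SIX_MONTHS_MS = 1000 * 60 * 60 * 24 * 182;

function flagForReview(assets: AssetRecord[], now: Date = new Date()): AssetRecord[] {
  return assets.filter(
    (a) => a.lastServedAt === null || now.getTime() - a.lastServedAt.getTime() > SIX_MONTHS_MS
  );
}

// Flagged assets go into a human review queue; nothing is deleted automatically,
// which is the lesson the staging-cleanup disaster taught us.
```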
The Hard Conversation
The most important change wasn't technical—it was cultural. We had to start having honest conversations with clients about their upload habits.
"I know it feels safer to upload everything," I'd tell new content teams, "but every file you upload becomes someone else's problem six months from now. Either they'll spend time managing it, or they'll spend money storing it, or they'll spend frustration trying to find the right version."
Some clients pushed back. "Storage is cheap," they'd say. "Why not just keep everything?"
But storage costs weren't really the issue. The issue was that unlimited upload capacity created unlimited organizational debt. Every additional asset made the media library slightly less usable for everyone else on the team.
The clients who got the best results from Twill were the ones who treated their media library like a tool, not a warehouse. They uploaded what they needed, organized it as they went, and regularly cleaned up what they weren't using anymore.
The Lesson That Stuck
The great media library cleanup taught me that technical solutions can't fix organizational problems. We could build the most sophisticated asset management system in the world, but if people uploaded files without thinking about long-term consequences, we'd still end up with digital hoarding.
The real solution was designing workflows that made good asset hygiene feel natural instead of burdensome. Making it easy to organize files as you upload them. Building tools that helped people find what they needed instead of browsing through everything they'd ever uploaded. Creating gentle reminders about storage costs and performance implications.
These days, when clients ask about Twill's media capabilities, we don't just demo the upload interface. We show them the file organization tools, the bulk management features, and the usage analytics. Because the goal isn't just to store their assets—it's to help them manage their assets in a way that serves their team instead of overwhelming it.
Our AWS bills are back to reasonable levels. More importantly, content teams actually use their media libraries instead of treating them like digital attics. Because the best CMS feature isn't unlimited storage—it's helping people stay organized so they can find what they need when they need it.
The cleanup is never really finished. But now it's part of the workflow instead of a crisis waiting to happen.
It's Not a Demo Unless Something Breaks
The Perfect Demo Curse
I've given hundreds of product demos over the years, and I can tell you with absolute certainty: the demos you remember aren't the flawless ones. They're the disasters.
The demos where the API goes down mid-presentation. Where the feature you've been hyping for months decides to crash in front of your biggest prospect. Where the "simple workflow" becomes a comedy of errors that leaves everyone questioning your competence.
These moments are mortifying when they happen. But they're also the most honest glimpses into how software actually works—and more importantly, how it fails.
The Demo That Launched a Thousand Bug Reports
My most memorable demo disaster happened three years ago during a sales call for a project management tool we'd built. The prospect was a fast-growing startup that needed to coordinate between their engineering, design, and marketing teams.
"Let me show you how easy it is to create a cross-functional project," I said confidently, screen-sharing our beautifully polished interface.
I clicked "New Project." The modal opened perfectly.
I entered "Website Redesign" as the project name. Clean autocomplete suggested relevant templates.
I selected team members from each department. The interface smoothly showed their roles and current workload.
I clicked "Create Project."
Nothing happened.
I clicked again. Still nothing. The button had that subtle loading spinner that every SaaS developer knows means "something is happening but we're not sure what."
"Just give it a second," I said, with the forced casualness of someone watching their demo slowly implode. "The system is processing the cross-team coordination logic."
Thirty seconds passed. The spinner kept spinning.
"You know what," said the prospect, "this actually happens to us with our current tool all the time. How do you handle it when this occurs in production?"
That question saved the demo—not because I had a great answer prepared, but because it turned a technical failure into a conversation about real problems and real solutions.
When Breaking Becomes Teaching
The fascinating thing about demo failures is how they cut through the marketing polish and reveal the actual user experience. When something breaks during a demo, you're forced to have an honest conversation about:
How errors are handled: Does your system fail gracefully or catastrophically? Are error messages helpful or cryptic? Can users recover or are they stuck?
What the edge cases look like: Perfect demos use perfect data. Broken demos reveal what happens with messy, real-world inputs.
How your support process works: When things go wrong, what happens next? How quickly can issues be diagnosed and resolved?
What your priorities actually are: Do you panic and try to hide the problem, or do you acknowledge it and explain how you're solving it?
Some of our best customer relationships started with terrible demos that led to great conversations about the real challenges of building software that works reliably at scale.
The Anatomy of Useful Demo Failures
Not all demo failures are created equal. The useful ones share some common characteristics:
They Reveal Real Constraints
The best demo failures expose the actual limitations of your system, not just cosmetic glitches.
During a demo of our content management system, we tried to upload a 50MB video file. The upload progress bar got to 99% and then... nothing. Just hung there forever.
Instead of awkwardly moving on, I used it as a teaching moment: "This is exactly why we implemented the chunked upload system in v2. Large files are a real problem in content workflows, and most systems just pretend they work until they don't."
The prospect immediately understood why this mattered—they'd been burned by file upload failures in their current system.
They Show How You Handle Pressure
How you respond to demo failures reveals more about your company culture than any slide deck about "our values."
Do you blame the intern? Make excuses about the demo environment? Frantically try to fix it while everyone watches?
Or do you calmly explain what's happening, acknowledge the issue, and turn it into a discussion about how these problems get resolved in production?
The latter approach builds trust. It shows that you understand software is complex, failures happen, and what matters is how you handle them.
They Create Authentic Moments
Perfect demos feel like theater. Broken demos feel like real life.
When something goes wrong during a demo, the room changes. People lean forward instead of sitting back. They start asking questions about their actual pain points instead of politely listening to your feature tour.
These authentic moments often lead to the most productive conversations about whether your product actually solves their problems.
The Demo Environment Dilemma
There's an ongoing debate in the software world about demo environments. Should you use:
Pristine Demo Data: Perfect examples that show your features in the best light, but don't reflect real-world messiness.
Production-Like Data: Realistic examples that expose edge cases and actual user workflows, but might reveal embarrassing bugs or performance issues.
Live Production: The actual system your customers use, with all the chaos and unpredictability that entails.
Each approach has trade-offs, but I've become convinced that the middle ground—production-like data in a stable environment—often provides the worst of both worlds. You get messiness without authenticity, complexity without the real stakes that make failures meaningful.
The best demos I've seen either embrace the polish (acknowledging that it's a curated experience) or embrace the chaos (doing live demos in production with all the risks that entails).
What Demo Failures Teach Us About Product Development
Demo disasters aren't just sales experiences—they're product intelligence. They reveal gaps between how you think your software works and how it actually behaves under pressure.
The Performance Reality Check
During a demo to a large enterprise client, our "lightning-fast search" feature took 15 seconds to return results. Embarrassing? Absolutely. But it forced us to confront the fact that our search optimization worked great with 1,000 records and terribly with 100,000.
We'd been so focused on feature completeness that we'd ignored performance at scale. The failed demo became the catalyst for a major infrastructure overhaul that made the product genuinely better.
The User Experience Mirror
When demos break, you see your interface through fresh eyes. Features that seemed intuitive suddenly feel clunky. Error states you never considered become glaringly obvious problems.
A demo failure once revealed that our "intuitive" navigation was actually incomprehensible to new users. When the main feature didn't work, the prospect couldn't figure out how to get back to the dashboard. This led to a complete UX audit and redesign.
The Feature Priority Reset
Nothing clarifies your product priorities like watching a prospect's reaction when your marquee feature doesn't work.
If they shrug and ask about something else, maybe that "revolutionary" feature isn't as important as you thought. If they immediately start problem-solving with you, you've found something that matters to them.
The Art of the Graceful Failure
The best sales engineers I know have mastered the art of turning demo failures into demo strengths. Here's how they do it:
Acknowledge Immediately
Don't pretend nothing happened. Don't blame external factors. Just acknowledge that something went wrong: "Well, that's not supposed to happen. Let me show you what should have occurred and talk about how we handle these edge cases."
Explain the Context
Help people understand what they're seeing: "This error suggests that there's a database connection issue, which in production would trigger our automatic failover system."
Show the Resolution Process
Even if you can't fix it during the demo, you can demonstrate how issues like this get resolved: "When this happens in production, our monitoring system automatically alerts the engineering team, and here's how we'd investigate and fix it."
Connect to Real Value
Tie the failure back to genuine business value: "This is exactly why we built the redundancy features I mentioned earlier. Systems fail, and what matters is how gracefully you handle those failures."
When Not to Demo
Sometimes the most honest thing you can do is not demo at all.
If your system is in a genuinely broken state, forcing a demo helps nobody. It wastes the prospect's time and damages your credibility in ways that go beyond a simple technical glitch.
But there's a difference between "the system is fundamentally broken" and "there might be edge cases that surface during a demo." The latter is just software being software.
The key is setting appropriate expectations: "We're going to do a live demo in our production environment. This gives you the most realistic view of the system, but it also means we might encounter real-world issues that we can discuss and address."
The Demo Failure Hall of Fame
Some demo failures become legendary stories that prospects remember years later:
The Infinite Loop: A workflow automation demo that accidentally created a recursive loop, sending hundreds of notifications until someone manually killed the process. The prospect loved it because it showed exactly what they needed protection against.
The Honest API: An integration demo where the third-party API started returning brutally honest error messages like "Your request is bad and you should feel bad." Turned into a great conversation about error handling and API reliability.
The Accidental Stress Test: A simple search demo that somehow triggered a massive database query that brought the entire system to its knees. Led to the discovery of a major performance bottleneck and a much stronger relationship with a prospect who appreciated the transparency.
Building Software That Fails Well
The most important lesson from demo failures isn't about giving better demos—it's about building software that fails gracefully.
Systems that fail well share some common characteristics:
Informative Error Messages: Instead of "Something went wrong," they explain what happened and what the user can do about it.
Graceful Degradation: When one feature breaks, the rest of the system continues working normally.
Easy Recovery: Users can get back to a working state without losing their work or starting over.
Clear Escalation Paths: When self-service recovery isn't possible, users know exactly how to get help.
Learning from Failure: Each failure provides information that helps prevent similar issues in the future.
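As a rough illustration of what the first and fourth of those characteristics can look like in practice, here is a hypothetical error shape; the field names are invented, and the idea is simply that every failure carries an explanation, a next step, and an escalation path.

```typescript
// Sketch of a "fails well" error shape: instead of a bare "Something went wrong",
// each failure explains what happened, what the user can do, and where to go next.
interface FriendlyError {
  what: string;        // what actually happened, in user terms
  recovery?: string;   // a concrete next step the user can take themselves
  escalation: string;  // where to go when self-service recovery isn't possible
  reference: string;   // an id support can use to find the underlying logs
}

function uploadTooLargeError(maxMb: number): FriendlyError {
  return {
    what: `This file is larger than the ${maxMb} MB upload limit.`,
    recovery: "Compress the file or split it into smaller parts, then try again.",
    escalation: "If you need the limit raised, contact your site administrator.",
    reference: `ERR-${Date.now().toString(36)}`,
  };
}
```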
The Honesty Advantage
In a world of polished marketing and carefully curated customer success stories, authentic failure experiences can be a competitive advantage.
Prospects are sophisticated. They know software breaks. They've been burned by vendors who oversold and underdelivered. They're looking for partners who are honest about limitations and proactive about solutions.
A demo that breaks and gets handled well can build more trust than ten perfect demos that feel too good to be true.
Embracing the Break
The next time you're preparing for a demo, resist the urge to script every interaction and control every variable. Instead:
Test in realistic conditions with real data and real network conditions.
Prepare for common failure modes by understanding what might go wrong and how you'll handle it.
Practice the recovery as much as you practice the perfect path.
Welcome the unexpected as an opportunity to have authentic conversations about real problems and real solutions.
Remember: prospects aren't buying your demo. They're buying your solution to their problems. Sometimes a broken demo is the most honest way to show how you solve problems when things don't go according to plan.
And in software, things never go exactly according to plan.
Conclusion
Perfect demos are forgettable. Broken demos that lead to great conversations are the foundation of lasting customer relationships.
The goal isn't to break things on purpose—it's to be prepared when breaks happen naturally, and to use those moments as opportunities for authentic engagement about the real challenges of building and using software.
So embrace the chaos. Plan for the unexpected. And remember: it's not a demo unless something breaks—because that's when the real demo begins.
When Your Block Editor Demo Looked Perfect But Broke With Real Content
The demo was flawless. Nike's content team watched as I effortlessly dragged blocks around, rearranged sections, and built a complex landing page in under five minutes. The block editor was responsive, intuitive, and exactly what they'd asked for when they said they wanted "full creative control over page layouts."
"This is perfect," their creative director said. "When can we start using it?"
Six weeks later, I got a very different kind of call.
"Tom, we need to talk. The editor is... struggling."
The Demo That Wasn't Reality
Here's what our demo looked like: a clean, purposeful page with maybe 12 blocks total. A hero section, some text, a few images, a quote, a call-to-action. Each block rendered instantly. Drag-and-drop was buttery smooth. The interface felt modern and capable.
Here's what Nike's actual content looked like: 180+ blocks on a single campaign page. Multiple nested sections. Image galleries with 30+ photos. Embedded videos, complex forms, dynamic content pulls, and about fifteen different types of call-to-action blocks scattered throughout.
The first time they tried to reorder a section near the bottom of that page, the browser froze for twelve seconds. When it finally responded, the layout had broken, half the blocks were displaying placeholder content, and their carefully crafted page looked like a digital crime scene.
"Is this normal?" their content manager asked during our emergency call.
I stared at my screen, watching the block editor struggle to render their actual content structure, and realized we'd built something that worked beautifully for the way we thought people would use it, not for the way they actually needed to use it.
The Scale Problem We Didn't See Coming
When we designed Twill's block editor, we made a lot of assumptions. Content pages would be reasonably sized. Users would create focused, structured layouts. People would use blocks thoughtfully, not just throw everything they could think of into a single page.
Those assumptions were wrong.
Nike's content team wasn't being reckless—they were being thorough. A campaign landing page needed to tell a complete story, showcase multiple product lines, include social proof, provide detailed specifications, and convert visitors across a dozen different user journeys. In their world, 180 blocks wasn't excessive; it was comprehensive.
But our block editor was choking. Every drag operation had to recalculate the position of every other block. Every content change triggered a full re-render of the editing interface. The nested block structure that looked so clean in our demo became a performance nightmare when multiplied by real-world complexity.
The worst part? It wasn't just slow—it was unpredictable. Sometimes dragging worked fine. Sometimes it took fifteen seconds. Sometimes it would work for a while, then suddenly break when you hit some invisible threshold of complexity.
The Support Ticket Avalanche
Within two weeks of Nike going live, our GitHub issues started filling up with performance complaints. Not just from Nike, but from other clients who were pushing Twill in ways we hadn't anticipated.
"Block editor becomes unusable after ~50 blocks" (Issue #847)
"Drag and drop performance degrades with large datasets" (Issue #902)
"Browser crashes when trying to reorder complex page layouts" (Issue #1156)
Each ticket was polite but pointed. These weren't edge cases—they were fundamental limitations that people were hitting as soon as they tried to use Twill for anything more complex than our demo scenarios.
The support load was crushing. Every performance complaint required investigation. Every browser crash needed debugging. Every frustrated user deserved a thoughtful response, even when that response was essentially "yes, we know about this problem, and we're working on it."
Meanwhile, our engineering team was spending more time diagnosing performance issues than building new features. Every sprint planning session included some variation of "we can't tackle the permissions system until we fix the block editor performance."
The Real Problem Wasn't Technical
The performance issues were real, but they weren't the core problem. The core problem was that we'd optimized for the demo, not for the workflow.
During our demos, we showed people how easy it was to build a page from scratch using blocks. Drag in a hero section, add some text, insert an image, done. Clean, linear, satisfying.
But that's not how content teams actually work. They don't start with a blank page and thoughtfully compose a narrative. They start with a mountain of content—product specs, marketing copy, legal disclaimers, social media assets, video files, testimonials, competitive comparisons—and they need to organize all of that into something coherent.
Our block editor was designed for creation, but they needed it for organization. We'd built a tool for storytellers, but they needed a tool for librarians.
The Humbling Realization
The moment it all clicked was during a screen-sharing session with Nike's content team. I watched their content manager work, and realized she wasn't building pages—she was managing information architecture in real-time.
She'd start by dumping everything onto the page: every piece of content, every asset, every component they might need. Then she'd spend hours organizing, grouping, reordering, and refining. She was using our block editor like a giant digital whiteboard, not like a publishing tool.
"I know it looks messy," she said, scrolling through their 200-block page, "but this is how we think through campaign structure. We need to see everything before we can figure out what goes where."
That's when I understood why our performance optimization wasn't just a technical problem—it was a product design problem. We'd built an interface that worked great for 12 thoughtfully chosen blocks, but fell apart when someone needed to manage 200 blocks of raw material.
What We Actually Built Next
The fix wasn't just about performance (though we did implement virtual scrolling, lazy loading, and smarter DOM management). The bigger change was rethinking how the block editor worked for real content workflows.
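Before getting to the workflow changes, the gist of the virtual-scrolling fix is worth sketching: render only the blocks near the viewport and leave the rest out of the DOM. This is a minimal windowing sketch under a uniform-row-height assumption, not Twill's actual implementation.

```typescript
// Sketch of list windowing: given a scroll position, compute which block indexes
// actually need to be in the DOM. Everything outside the window stays unrendered
// (or becomes a fixed-height placeholder).
interface WindowedRange {
  start: number;    // first block index to render
  end: number;      // one past the last block index to render
  offsetPx: number; // top padding so the rendered slice sits at the right scroll offset
}

function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  rowHeight: number,
  totalBlocks: number,
  overscan = 5 // render a few extra rows above/below to avoid flicker while dragging
): WindowedRange {
  const first = Math.floor(scrollTop / rowHeight);
  const visibleCount = Math.ceil(viewportHeight / rowHeight);

  const start = Math.max(0, first - overscan);
  const end = Math.min(totalBlocks, first + visibleCount + overscan);

  return { start, end, offsetPx: start * rowHeight };
}

// With 200 blocks and roughly 8 visible at a time, the editor keeps about 18 block
// components mounted instead of 200, which is what keeps drag reordering responsive.
```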
We added:
Section grouping - Users could collapse groups of blocks into manageable chunks, so they weren't staring at 200 individual items all at once.
Bulk operations - Select multiple blocks and move them together, instead of dragging them one by one.
Content preview modes - Switch between "editing mode" (optimized for management) and "preview mode" (closer to the final layout).
Performance budgets - Warning alerts when pages were getting too complex, with suggestions for optimization.
Template sections - Pre-built block combinations that teams could reuse instead of rebuilding common patterns from scratch.
The Lesson That Stuck
The biggest lesson wasn't about performance optimization or even user experience design. It was about the gap between demo scenarios and real-world usage patterns.
When you're building a product, especially one that other people will adopt and rely on, your demo is your hypothesis about how the product will be used. But it's just a hypothesis. Real users will always find ways to push your tool beyond what you imagined, and they'll do it not because they're trying to break things, but because they're trying to solve real problems.
Nike's 200-block pages weren't a bug—they were a feature request we hadn't recognized. They were showing us what the block editor needed to become, not what was wrong with it.
These days, when we demo Twill's block editor, we don't just show the clean, perfect use case. We also show what it looks like to manage complex, messy, real-world content. Because that's when the tool really proves its value—not when everything is perfect, but when everything is complicated and you still need to get work done.
The block editor performance issues eventually got resolved. But more importantly, we learned to design for the chaos of real content workflows, not just for the elegance of perfect demos. Because perfect demos don't ship products—real users with messy problems do.
The Slack Message That Changed How I Think About Technical Debt
"The drag-and-drop is getting really slow with large datasets. We need to talk about this."
The message came from one of our engineers on a Tuesday afternoon, and I knew exactly what he was referring to. Six months earlier, when we were pushing to get Twill's block editor ready for a major client demo, we'd implemented a drag-and-drop interface that worked beautifully—for the modest datasets we were testing with.
"How slow are we talking?" I replied.
"It's fine with 10-15 items, but some of our clients have 200+ blocks. The interface basically locks up."
I stared at that message, remembering the conversations we'd had during development. "We'll optimize this later," I'd said. "The client demo is next week."
The Demo-Driven Decision
Context matters here. I was wearing two hats—Group Production Director managing client projects and Product Manager for Twill. We had Nike coming in for a demo of their editorial workflow, and the drag-and-drop functionality was a key selling point. The "proper" solution would have involved virtualization, pagination, or a complete rethink of how we handled large datasets.
We had four days.
"Can we just add a loading spinner and maybe batch the DOM updates?" I suggested during our sprint planning. "It's not perfect, but it'll get us through the demo."
The engineer nodded. "Yeah, I can make that work. But Tom, if they start using this heavily..."
"I know," I cut him off. "We'll circle back after we close this deal."
The demo went great. Nike loved the interface. We won the project. The drag-and-drop worked flawlessly with their demo content—maybe 20 blocks total.
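For what it's worth, the stopgap we shipped was roughly the shape of the sketch below: coalesce style changes during drag events and apply them once per animation frame. The names are illustrative, and the limitation is visible in the code itself; the per-frame work still grows with how many blocks need repositioning, which is why it held up at 20 blocks and fell over at 200.

```typescript
// Sketch of the "batch the DOM updates" shortcut: queue style changes during drag
// events and flush them once per animation frame, so the browser isn't forced into
// layout work on every single mousemove.
type DomUpdate = () => void;

const pending: DomUpdate[] = [];
let frameScheduled = false;

function queueUpdate(update: DomUpdate): void {
  pending.push(update);
  if (!frameScheduled) {
    frameScheduled = true;
    requestAnimationFrame(() => {
      // Apply everything queued since the last frame in one pass.
      for (const apply of pending) apply();
      pending.length = 0;
      frameScheduled = false;
    });
  }
}

// Usage during a drag event (illustrative): instead of mutating styles inline,
// queueUpdate(() => { blockEl.style.transform = `translateY(${offset}px)`; });
```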
The Slow Creep
Here's the thing about technical debt in an open-source product: it doesn't just affect you. It affects everyone who adopts your tool.
Within two months of that Nike demo, we started getting GitHub issues. Not angry bug reports, but polite questions: "Is there a way to optimize the block editor for larger content sets?" "Performance seems to degrade with more than 50 blocks—any suggestions?"
Each issue took time to respond to. Not just to say "we're working on it," but to provide workarounds, suggest alternative approaches, or help users optimize their content structure. Meanwhile, the real fix kept getting pushed down the backlog.
Then the International Energy Agency project landed. They needed to manage hundreds of content blocks across multiple publications. Our hacky drag-and-drop solution wasn't just slow—it was unusable.
The Real Cost
"We need to rebuild the drag-and-drop from scratch," the engineer told me during our retrospective. "And it's going to take three weeks."
"Three weeks? You said it would take a few days six months ago."
"That was before we built the new media library integration, the nested block system, and the collaborative editing features. Everything hooks into the current drag-and-drop implementation. Changing it means touching all of that."
The technical debt had compounded. What started as a quick fix had become foundational to other features. Fixing it properly now meant unraveling half a dozen other systems that had been built on top of the assumption that the drag-and-drop worked the way it currently worked.
But the worst part wasn't the development time. It was the opportunity cost. During those three weeks, we couldn't ship the A/B testing features that clients were asking for. We couldn't work on the advanced permissions system that would help us win enterprise deals. We were stuck paying down debt instead of building value.
The Community Impact
This is where managing an open-source product gets complicated. Our technical debt wasn't just our problem—it was everyone's problem.
I started seeing forks of Twill on GitHub where developers had attempted their own fixes to the performance issues. Some worked, some made things worse, but all of them meant fragmentation in our community. Instead of contributing features back to the main project, people were working around our limitations.
The support load increased too. Every new user who tried to use Twill with a large dataset would eventually hit the performance wall, and they'd come to our Discord or GitHub issues for help. That was time our team could have spent on new features, redirected instead to managing the consequences of a shortcut we'd taken months earlier.
What I Actually Learned
The lesson wasn't "never take shortcuts"—sometimes you have to ship something imperfect to keep the lights on. The real lesson was about how to think about the true cost of those shortcuts, especially in an open-source context.
Community debt compounds faster than internal debt. When you're managing a product that other people rely on, your technical debt becomes their technical debt. Every workaround they have to implement, every performance issue they encounter, every GitHub issue they file—that's all compound interest on your original shortcut.
"We'll fix it later" usually means "the community will work around it." In a closed product, technical debt stays contained. In an open-source project, people start building solutions on top of your limitations. By the time you fix the underlying issue, you've also got to consider how it affects all the workarounds people have implemented.
The cost isn't just engineering time. That three-week fix cost us three weeks of new feature development, dozens of hours of community support, and probably a few potential contributors who got frustrated with the performance issues and moved on to other tools.
How I Approach It Now
These days, when we're considering a shortcut in Twill, I ask different questions:
How many users will hit the limits of this approach?
What happens when they do hit those limits?
How will this affect our community support load?
If someone contributes a fix, how hard will it be to integrate?
I also block time in every release cycle specifically for debt paydown. Not "if we have time" work, but scheduled, prioritized maintenance. Because in an open-source project, technical debt isn't just your problem—it's everyone's problem.
The drag-and-drop eventually got rebuilt. It's now one of Twill's strongest features, handling thousands of blocks without breaking a sweat. But getting there cost us way more than those original few days of proper implementation would have.
The real kicker? Nike, the client that drove the original timeline pressure, ended up using a completely different content structure that barely used the drag-and-drop functionality at all.
Sometimes the demo that seems critical in the moment turns out to be just another Tuesday. But the technical debt you create trying to nail that demo? That sticks around for months.
That Time I Shipped a Feature Nobody Actually Wanted
I was so proud of it. Three months of development, countless stakeholder meetings, and what I thought was rock-solid user research backing a feature that was going to be a game-changer. The rollout went smoothly, no bugs, clean implementation—everything a PM dreams of.
Then I checked the usage analytics two weeks later.
8% adoption. Eight percent. Of our most active users, only 8% had even tried the feature. Of those who tried it, most used it once and never came back. I stared at my dashboard feeling like I'd just been punched in the gut.
The Feature That Wasn't
Without getting too specific about the product, let's call it a "smart scheduling assistant." The idea came from what seemed like a perfect storm of validation. Our customer success team kept mentioning scheduling conflicts in user feedback. Sales was getting questions about calendar integration. I'd personally experienced the pain point myself—trying to coordinate meetings across time zones while juggling different calendar systems.
The user interviews seemed to confirm it. When I asked about scheduling challenges, people lit up. "Oh my god, yes, it's such a nightmare," they'd say. "I spend so much time going back and forth on email." Classic pain point validation, right?
I put together user stories, wireframes, the whole nine yards. Leadership loved it. Engineering was excited about the technical challenge. We even had a few beta users who seemed enthusiastic during early demos.
But here's the thing about customer discovery—there's a massive difference between agreeing that something is a problem and actually changing your behavior to solve it.
Where I Went Wrong
Looking back, my mistakes were painfully obvious, but they felt so subtle in the moment:
I asked leading questions. "How much time do you spend coordinating schedules?" practically begs for a complaint. Of course people are going to tell you it's a pain point when you frame it that way. What I should have asked was broader: "Walk me through how you typically set up a meeting with someone external."
I confused vocal frustration with actual priority. People complain about scheduling the same way they complain about traffic or long grocery store lines—it's universally annoying, but that doesn't mean they're actively seeking solutions. The real question isn't "Is this annoying?" but "Is this annoying enough that you'll change how you work?"
I fell in love with my own problem. Because scheduling was a genuine pain point for me personally, I assumed others shared my level of frustration. I was projecting my own needs onto the user base instead of staying objective about what they actually needed.
I didn't test behavior, just opinions. Every piece of validation I collected was hypothetical. "Would you use this?" "Does this seem helpful?" I never asked anyone to actually change their current workflow, even temporarily, to test whether they'd really adopt a new solution.
The Uncomfortable Conversations
The worst part wasn't the low adoption numbers—it was the conversations afterward.
"I thought you said users were asking for this," my engineering lead said during our retrospective. He wasn't being accusatory, just genuinely confused. And he was right. I had said that. I'd presented the feature as user-driven when really, it was assumption-driven.
The customer success team was diplomatic but puzzled. "We're still getting the same scheduling complaints," one of them mentioned. "It's like the feature doesn't exist for most users."
That stung because it was true. We'd built something adjacent to the real problem. Users were complaining about scheduling, but what they really meant was they were frustrated with specific people who were bad at responding to emails, or with clients who kept changing meeting times. Our "smart scheduling assistant" didn't solve those human problems—it just added another step to their workflow.
What I Should Have Done
The fix wasn't better user research—it was different user research. Here's what I learned to do instead:
Test micro-behaviors before building macro-solutions. Instead of asking "Would you use a scheduling assistant?", I should have said, "Here's a simple scheduling link tool—can you try using it for your next three external meetings and tell me what happens?"
Follow the breadcrumb trail. When someone says scheduling is a pain point, the next question shouldn't be "What would solve this?" It should be "Show me the last time this happened." Then you dig into the specifics of that exact situation.
Look for people already hacking solutions. The users who would actually adopt a scheduling feature are probably already using workarounds—shared calendars, booking links, assistant coordination. Find those people first.
Separate nice-to-have problems from painful problems. A painful problem is one people are already spending time or money trying to solve, even imperfectly. A nice-to-have problem is one they complain about but haven't tried to fix.
The Real Lesson
The hardest part of this whole experience wasn't admitting I'd misread the market—it was accepting that good intentions and solid processes can still lead to wasted effort. I'd followed all the "right" steps: user interviews, stakeholder alignment, iterative development, data-driven decisions. But I'd optimized for validation instead of truth.
The feature didn't get killed immediately. We spent another month trying to boost adoption with better onboarding, email campaigns, and UI improvements. Nothing moved the needle significantly. Eventually, it became one of those features that exists in the product but that nobody talks about—a monument to good intentions and poor discovery.
These days, I'm much more paranoid about the difference between what people say they want and what they'll actually use. I push harder on specifics during user interviews. I look for evidence of people already trying to solve a problem, not just evidence that a problem exists.
And I've gotten comfortable with that uncomfortable moment in user interviews when someone says, "Actually, now that I think about it, this isn't really that big of an issue for me." That's not a failed interview—that's discovery working exactly as it should.
The feature nobody wanted taught me that customer discovery isn't about confirming your ideas—it's about killing the bad ones before they become someone else's problem to maintain. Sometimes the most valuable thing you can discover is that you shouldn't build something at all.
When Twill's 'Modular Architecture' Became a House of Cards
The Lego Block Dream
Twill's modular architecture was exactly what we'd been waiting for. Instead of building monolithic, inflexible CMSs, we could create elegant, composable systems where each module handled one thing well.
Our client was launching an ambitious digital platform—part e-commerce store, part editorial publication, part community forum, part event management system. Perfect for showcasing Twill's modular approach.
"We'll build this like a modern application," I explained to the team. "Clean separation of concerns, reusable components, pluggable modules that can be mixed and matched as needed."
The module list was impressive:
Articles Module: Editorial content management
Products Module: E-commerce catalog and inventory
Events Module: Calendar and ticket management
Users Module: Community profiles and authentication
Media Module: Asset management and galleries
Reviews Module: User-generated content and ratings
Newsletter Module: Email campaigns and subscriber management
Analytics Module: Custom reporting and metrics
Comments Module: Threaded discussions
Tags Module: Cross-content taxonomy
Search Module: Full-text search across all content types
Notifications Module: Real-time alerts and messaging
Each module was beautifully self-contained. Clean APIs, well-defined boundaries, minimal coupling. During development, we could work on different modules simultaneously without stepping on each other's toes.
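To make "self-contained" concrete, here's a rough sketch of how a module's wiring looked, using hypothetical class and path names rather than Twill's actual scaffolding:

```php
<?php

// Hypothetical module wiring, not Twill's actual scaffolding: each module
// carried its own service provider, routes, migrations, and views.

namespace Modules\Reviews;

use Illuminate\Support\ServiceProvider;

class ReviewsServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Everything the module needs lives under its own directory,
        // which is why each one looked so tidy in isolation.
        $this->loadRoutesFrom(__DIR__ . '/routes.php');
        $this->loadMigrationsFrom(__DIR__ . '/database/migrations');
        $this->loadViewsFrom(__DIR__ . '/resources/views', 'reviews');
    }
}
```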
"This is how CMSs should be built," I told the client during our architecture presentation. "Modular, scalable, maintainable. You want to add a new content type? Just plug in another module. Need to remove functionality? Unplug the module. It's that simple."
The demo was flawless. We enabled and disabled modules in real-time, showing how the system gracefully adapted to different configurations.
The First Cracks
Two weeks after launch, the client requested what seemed like a simple change: "Can we show related articles at the bottom of product pages?"
Easy enough. The Products Module would just query the Articles Module for content with matching tags. I added the integration and deployed.
The feature worked perfectly—until someone deleted a tag that was being referenced by both articles and products. The product page threw a 500 error because the Articles Module's tag relationship was broken, but the Products Module had no way to know that.
"Quick fix," I thought, and added some error handling. Products would gracefully handle missing tags from the Articles Module.
But then the client wanted product reviews to show up in the site-wide search results. So the Search Module needed to understand the Reviews Module's data structure. And reviews needed to link to user profiles, so the Reviews Module needed to integrate with the Users Module.
Each integration felt reasonable in isolation. But the dependency web was growing.
The Integration Cascade
Within six months, our beautifully modular system had become a tangled mess of interdependencies:
The Articles Module needed:
Tags Module (for categorization)
Users Module (for author attribution)
Media Module (for featured images)
Comments Module (for reader engagement)
Analytics Module (for view tracking)
Newsletter Module (for article promotion)
Search Module (for content discoverability)
The Products Module needed:
Tags Module (for product categorization)
Media Module (for product images)
Reviews Module (for customer feedback)
Users Module (for purchase attribution)
Analytics Module (for sales tracking)
Search Module (for product discovery)
Events Module (for product launches)
The Events Module needed:
Users Module (for attendee management)
Products Module (for ticket sales)
Media Module (for event photos)
Newsletter Module (for event promotion)
Analytics Module (for attendance tracking)
Notifications Module (for event reminders)
Every module was connected to every other module. We'd built a distributed monolith disguised as a modular system.
The Cascade Failure
The house of cards collapsed during a routine maintenance update.
I was upgrading the Tags Module to add hierarchical taxonomy support. The change seemed contained—just extending the tag data model and adding some new API endpoints. I tested the Tags Module in isolation, and everything worked perfectly.
But when I deployed to production, the entire site went down.
The cascade failure was spectacular:
Tags Module upgrade changed the tag data structure
Articles Module couldn't parse the new tag format and started throwing exceptions
Products Module couldn't load because it shared tag relationships with Articles
Search Module crashed because it was indexing both articles and products
Reviews Module failed because it linked to products that couldn't load
Users Module errored because it displayed user reviews that were now broken
Analytics Module couldn't track anything because all the tracked content was failing
Notifications Module started sending error alerts for every failed page load
A single module update had brought down the entire platform.
The Debugging Nightmare
The worst part wasn't the outage—it was trying to understand what had broken and why.
In a traditional monolithic application, you can trace through the code and see exactly how components interact. But in our modular system, the interactions were spread across multiple modules, each with their own APIs, data models, and integration points.
The Mystery of the Phantom Dependencies: We discovered modules were depending on other modules in ways that weren't documented anywhere. The Reviews Module had somehow become dependent on the Newsletter Module because someone had added a "subscribe to review notifications" feature six months earlier.
The API Version Hell: Different modules were calling different versions of each other's APIs. The Search Module was still using the old Tags API format, while the Articles Module had upgraded to the new format. There was no central API versioning strategy.
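As an illustration of the mismatch, here's the kind of compatibility shim this situation tends to force on you. The "old" flat tag payload and the "new" hierarchical one below are made-up shapes, not Twill's actual API formats:

```php
<?php

// Illustrative shim: the "old" flat tag payload and the "new" hierarchical
// one are invented shapes, shown only to make the version mismatch concrete.

final class TagPayloadNormalizer
{
    /**
     * Accept either format and always return
     * [['id' => ?int, 'name' => string, 'parent_id' => ?int], ...].
     */
    public static function normalize(array $payload): array
    {
        return array_map(function ($tag) {
            // Old format: a plain string name, no id, no hierarchy.
            if (is_string($tag)) {
                return ['id' => null, 'name' => $tag, 'parent_id' => null];
            }

            // New format: associative array with hierarchy fields.
            return [
                'id'        => $tag['id'] ?? null,
                'name'      => $tag['name'] ?? '',
                'parent_id' => $tag['parent_id'] ?? null,
            ];
        }, $payload);
    }
}
```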
The Database Schema Maze: Each module managed its own database tables, but they were all interconnected through foreign keys and shared data structures. A schema change in one module could break three others in subtle ways that didn't surface until specific combinations of data were accessed.
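To picture how "module-owned" tables ended up chained together, here's a hypothetical Laravel migration; the table and column names are illustrative, not our actual schema:

```php
<?php

// Hypothetical migration: table and column names are illustrative, but they
// show how a table "owned" by one module references another module's table.

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        // Lives in the Products module, but the tag_id foreign key points at
        // the Tags module's table, so a Tags schema change is also a Products
        // schema change, whether anyone planned for that or not.
        Schema::create('product_tag', function (Blueprint $table) {
            $table->foreignId('product_id')->constrained()->cascadeOnDelete();
            $table->foreignId('tag_id')->constrained()->cascadeOnDelete();
            $table->primary(['product_id', 'tag_id']);
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('product_tag');
    }
};
```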
The Configuration Nightmare: Module settings were scattered across multiple configuration files, environment variables, and database settings. Understanding how the system was configured required knowledge of all modules simultaneously.
The Performance Domino Effect
Once we got the system stable again, we discovered a new problem: performance had degraded catastrophically.
Our modular architecture meant that loading a single page often required API calls across multiple modules:
Article page: Query Articles Module, then Tags Module for categories, then Users Module for author info, then Media Module for images, then Comments Module for discussion, then Analytics Module to log the view
Product page: Query Products Module, then Reviews Module for ratings, then Users Module for reviewer info, then Events Module for launch dates, then Media Module for product gallery, then Tags Module for categories
What should have been single database queries had become complex, multi-step API orchestrations. A simple product page was making 15+ internal API calls across 8 different modules.
We'd optimized each module individually, but we'd never optimized the system as a whole.
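For contrast, here's roughly what that product page read looks like once products, reviews, and media share one model graph, as they eventually did after the rebuild described below. The model and relation names are hypothetical, not the project's actual classes:

```php
<?php

// Illustrative only: Product and its relations are hypothetical Eloquent
// models, not the project's actual classes.

function loadProductPage(int $id): Product
{
    // Under the modular setup, this page triggered one internal API call per
    // module (products, reviews, users, media, tags, ...). With the data in
    // one domain, it collapses into a single eager-loaded query tree that
    // runs as a handful of SQL statements.
    return Product::query()
        ->with(['reviews.author', 'medias', 'tags'])
        ->findOrFail($id);
}
```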
The Maintenance Multiplication
The modular approach had promised easier maintenance. Instead, it created maintenance multiplication.
Every security update needed to be tested across all module combinations. A bug fix in one module required regression testing of every module that integrated with it. Deploying a single feature often meant coordinating changes across multiple modules.
The Update Matrix Problem: We had 12 modules, each with their own release schedule. Testing every combination of module versions was combinatorially out of reach, so we never knew which combinations were safe to deploy together.
The Knowledge Silos: Different team members had become experts in different modules, but nobody understood the whole system anymore. Debugging cross-module issues required assembling a committee of specialists.
The Documentation Explosion: Each module had its own documentation, API specs, and integration guides. New developers needed to understand not just individual modules, but dozens of integration patterns between them.
The Simplification Solution
We rebuilt the system around domain boundaries instead of technical boundaries:
Content Domain: Articles, media, tags, and search unified into a single module
Commerce Domain: Products, reviews, and analytics combined into one cohesive system
Community Domain: Users, comments, and notifications integrated together
Events Domain: Calendar, tickets, and promotions as a single unit
Instead of 12 loosely coupled modules, we had 4 tightly cohesive domains. Each domain could still be developed and deployed independently, but the internal complexity was contained.
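A small example of what "contained" meant day to day: because articles, tags, and media now sit in the same domain, publishing can be one local database transaction. The class and relation names below are hypothetical, not the actual Twill code:

```php
<?php

// Illustrative sketch of the consolidated Content domain; class and relation
// names are hypothetical, not the project's actual Twill classes.

namespace App\Domains\Content;

use Illuminate\Support\Facades\DB;

class PublishArticle
{
    public function handle(array $input): Article
    {
        // One local transaction instead of a distributed dance across the
        // old Articles, Tags, and Media modules.
        return DB::transaction(function () use ($input) {
            $article = Article::create([
                'title' => $input['title'],
                'body'  => $input['body'],
            ]);

            $article->tags()->sync($input['tag_ids'] ?? []);
            $article->medias()->attach($input['media_ids'] ?? []);

            return $article;
        });
    }
}
```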
The result:
90% fewer API calls between system components
Single database transactions instead of distributed operations
Domain experts instead of module specialists
Integrated testing within each domain boundary
Coherent documentation organized around business functions
What We Actually Learned
1. Modularity Isn't Free
Every module boundary creates integration complexity. The cost of coordination often exceeds the benefits of separation.
2. Domain Boundaries Beat Technical Boundaries
Modules should be organized around business domains, not technical capabilities. "Users who buy products and write reviews" is one domain, not three separate modules.
3. Dependencies Are Technical Debt
Every inter-module dependency is a maintenance burden. Minimizing dependencies is more important than maximizing modularity.
4. Integration Complexity Is Exponential
The complexity of a modular system grows far faster than the number of modules. Six modules have at most 15 possible pairwise integrations; twelve have 66, and the matrix of module-version combinations you'd have to test grows exponentially on top of that. 12 modules aren't just twice as complex as 6 modules.
The Real Architecture
The best modular architecture isn't the one with the most modules. It's the one with the fewest necessary boundaries, placed in the right locations.
Twill's modular capabilities are genuinely powerful, but power requires restraint. The question isn't "can we separate this into its own module?" but "should we?"
Sometimes the most elegant architecture is the one that keeps related things together, even if they could technically be separated.
Have you built modular systems that became too modular for their own good? Share your stories of beautiful architectures that collapsed under the weight of their own complexity.