
This is a short post, but I hope it helps others as it helped me. These questions came out of an experience troubleshooting a deployment issue I was called into some time ago. The problem got solved, but the approach to identifying the problem wasn't . . . ideal. Among the challenges were pressure from above and the group dynamics of a thirty-person Zoom meeting.

Afterward, I performed a personal post-mortem to identify heuristics that would help with this general class of problems, the "It's not working and we don't know why but it needs to be fixed right now we're watching you" situation.

Actually, these are good questions for any debugging/troubleshooting, not just the under-pressure variety.

Questions

  1. What's the symptom? Get clear on this from the start and respectfully question what you're first told. "The prices are all wrong!" "Would you show me exactly what and where you're seeing that?" "Sure. See? All (and only) the shirt prices are multiplied by ten on this screen."
  2. What changed? In my situation, the exact same code was deployed to a different environment. The code worked in the other three environments. So, the problem wasn't the code. We needed to review the environment, not dive into the debugger. (It was an incorrectly set environment variable in the deployment.)
  3. What's the dumbest thing it could be? Did something get misspelled? Did you push the wrong version?
  4. Working backward, what controls that behavior? It's so easy to assume you know what "must be" causing the symptom. Stop. Don't think "must be," but instead "might be" and especially "shouldn't be but let's check." What class method displays that text? What service calls that class?
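
For the "What changed?" question, here is a minimal Python sketch of comparing a known-good environment's settings against the failing one. The function name diff_config and the settings shown (PRICE_SCALE, CATALOG_URL, LOG_LEVEL) are hypothetical, invented for illustration. They are not the actual variables from my incident. In practice the two dictionaries might come from os.environ on each host or from exported config files.

```python
from typing import Dict, Optional, Tuple


def diff_config(
    known_good: Dict[str, str], failing: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every key whose value differs between the two environments.

    Each result maps key -> (known_good_value, failing_value).
    A value of None means the key is missing on that side.
    """
    differences = {}
    for key in sorted(known_good.keys() | failing.keys()):
        good_value = known_good.get(key)
        failing_value = failing.get(key)
        if good_value != failing_value:
            differences[key] = (good_value, failing_value)
    return differences


# Hypothetical settings, made up for this example.
staging = {
    "PRICE_SCALE": "1",
    "CATALOG_URL": "https://catalog.example.test",
    "LOG_LEVEL": "info",
}
production = {
    "PRICE_SCALE": "10",
    "CATALOG_URL": "https://catalog.example.test",
}

for key, (good, bad) in diff_config(staging, production).items():
    print(f"{key}: known-good={good!r} failing={bad!r}")
```

Settings that match are left out, so the list to review is only what actually differs, including keys that are missing on one side.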

Other Principles

  • Ask, "What's the real problem?" For example, the symptom might be "we're getting a 404 on this page." But the real problem might be, "Our main customer is blocked from completing a critical report." Maybe the customer problem can be addressed without immediately fixing the symptom.
  • Don't rush. Pressure is the enemy, and often an illusion. Solving the problem fast isn't as important as solving it right. Wanting to be the hero and feeling the pressure of being the expert inevitably cause delays and failure. David Marquet calls this "control the clock, don't obey the clock."
  • Change one thing at a time. This is a hard-won skill. Change one thing, and confirm other variables are "known good." Decades ago, I called my boss for help with a non-working printer. He asked, "Are you using a known-good cable?" "Uh, no." "You have to make sure the cable works!" He was right.
  • Always know what you changed. If you must change multiple variables, keep track.
  • Fix the problem, not the blame. Now isn't the time for a retrospective, or questioning why the code is written the way it is. That just derails the conversation and is evidence of a pathological organization. But do have a post-mortem when heads are clear.
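
"Change one thing at a time" and "always know what you changed" can be backed by something as simple as a running log. Below is a rough Python sketch under my own assumptions: the names Change, ChangeLog, record, observe, and undo_order are all hypothetical, and the example entries are invented. This version is deliberately strict. It refuses to record a second change until you have noted what the first one did.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Change:
    what: str
    before: str
    after: str
    result: str = "untested"  # filled in once the effect has been observed


@dataclass
class ChangeLog:
    changes: List[Change] = field(default_factory=list)

    def record(self, what: str, before: str, after: str) -> Change:
        """Log a change, but only if the previous change has been checked."""
        if self.changes and self.changes[-1].result == "untested":
            raise RuntimeError(
                f"Check the effect of '{self.changes[-1].what}' before changing anything else."
            )
        change = Change(what, before, after)
        self.changes.append(change)
        return change

    def observe(self, result: str) -> None:
        """Note what happened after the most recent change."""
        if not self.changes:
            raise RuntimeError("No change has been recorded yet.")
        self.changes[-1].result = result

    def undo_order(self) -> List[Change]:
        """Changes listed newest first, the order in which to revert them."""
        return list(reversed(self.changes))


# Hypothetical troubleshooting session, made up for this example.
log = ChangeLog()
log.record("PRICE_SCALE env var", before="10", after="1")
log.observe("shirt prices display correctly")
log.record("LOG_LEVEL env var", before="(unset)", after="info")
log.observe("no visible effect")

for change in log.undo_order():
    print(f"{change.what}: {change.before} -> {change.after} ({change.result})")
```

A shared text file or the meeting chat works just as well. What matters is that every change has a recorded "before" value and an observed result, so you can back out cleanly.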

Of all these, "what's the symptom?", "what changed?", and "don't rush" will take you very far.

"It's important to think when things are going crazy, if you want to take the smartest action to get them sane again." --Harry Dresden, Battle Ground