Bifurcate the Problem Space

I recently read Hillel Wayne's newsletter issue on debugging. In it, Hillel advises the reader to "ask broad questions" in order to improve their debugging skills. As I was reading, it occurred to me that this was a technique I'd seen advocated before in Stuart Halloway's Debugging with the Scientific Method. Halloway calls it "carving the world in half" or "proportional reduction." I watched his talk when it was released in 2015 and somehow internalized the phrase, bifurcate the problem space.†

Bifurcating the problem space means running a test that rules in or rules out a large number of possible root causes. When playing 20 questions, instead of immediately guessing "Kevin Bacon," you start by asking, "Is it a man?" Whatever the answer, you have eliminated half of all possible people. Likewise, when solving a problem, you want to run tests that rule out a large swath of possible root causes.
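
The same idea is what makes binary search and git bisect so effective: order your suspects, test the midpoint, and throw away half of them at every step. Here is a generic halving sketch in JavaScript. It is not code from the incident below; it simply assumes an ordered list of candidates where every entry before the first bad one is good.

// Generic halving sketch, not code from this story. Assumes `candidates`
// is ordered so that all good entries come before the first bad one, and
// that at least one bad entry exists.
function findFirstBad(candidates, isBad) {
  let lo = 0;
  let hi = candidates.length - 1;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (isBad(candidates[mid])) {
      hi = mid;      // the first bad entry is at mid or earlier
    } else {
      lo = mid + 1;  // the first bad entry is after mid
    }
  }
  return candidates[lo]; // each probe ruled out half the remaining suspects
}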

Reading Wayne's newsletter reminded me that this technique is both unreasonably effective and yet, somehow, not terribly widespread. It is, admittedly, counterintuitive. Instead of asking the natural question, "What is the answer?" this approach pushes you to ask some version of, "What might not be the answer?"

That said, the cost-to-benefit ratio for adopting this strategy is outrageously good compared to many strategies which, for whatever reason, did manage to catch on. Asking broad questions usually costs some number of minutes, and the benefit is that you solve your problem in minutes or hours instead of weeks or never.

As I was reading Wayne's post, I began to think that others might find it helpful to see an example of how to use this approach in real life. I can think of a few such examples, but one incident in particular stands out in my memory.

The Case of the Missing Jobs

Some time ago, I was working at a company that had the following backend architecture:

  1. A Web Server receives a request
  2. The Web Server calls a Job Runner to start a job
  3. The Web Server stores the Job Runner ID in the Database
  4. The Job Runner finishes and calls the Web Server to indicate the job is done
  5. The Web Server stores the completed job details in the Database
+------------+               +------------+
|            |               |            |
| Job Runner | <------------ | Web Server | <----- [Client Request]
|     A      |               |            |
+------------+               +------------+
                                   |
                             [Running on A]
                                   |
                                   v
                               +-------+
                               |  DB   |
                               +-------+

later....
+------------+               +------------+
|            |               |            |
| Job Runner | ------------> | Web Server |
|     A      |               |            |
+------------+               +------------+
                                   |
                               [Job Done]
                                   |
                                   v
                               +-------+
                               |  DB   |
                               +-------+

Of course, as is the case with all architecture diagrams, the above picture is massively simplified. In reality, there were multiple web servers sitting behind load-balancing proxies, multiple databases of varying types, many hundreds of job runners spread across multiple datacenters (also sitting behind load-balancing proxies), and a few queues tossed in for good measure.

As one might guess, occasionally Job Runner A would reject a job. When that happened, the job was supposed to be sent to another job runner, Job Runner B.

+------------+               +------------+
| ¡¡OUTAGE!! |               |            |
| Job Runner | <------------ | Web Server | <----- [Client Request]
|     A      | ---[Error]--> |            |
+------------+               |            |
                     +------ |            |
                     |       +------------+
+------------+       |             |
|            |       |       [Running on B]
| Job Runner | <-----+             |
|     B      |                     v
+------------+                 +-------+
                               |  DB   |
                               +-------+

One day we deployed a new version of the Job Runner. That deploy included a misconfiguration of Job Runner A that caused it to reject every job. However, instead of falling back to Job Runner B as it was supposed to, the Web Server reported in the database that Job Runner A had accepted the job.

+------------+               +------------+
| ¡¡OUTAGE!! |               |            |
| Job Runner | <------------ | Web Server | <----- [Client Request]
|     A      | ---[?????]--> |            |
+------------+               |            |
                             +------------+
+------------+                     |
|            |               [Running on A]
| Job Runner |                     |
|     B      |                     v
+------------+                 +-------+
                               |  DB   |
                               +-------+

This was not good. The end result was that jobs would appear to clients as if they were running forever.

Part One: The System Architecture

So the problem could be in a couple of places:

  1. The Web Server logic
  2. The Job Runner return value

Now the question is: What do you do?

You could assume the problem is in the Web Server and start debugging the Web Server code. Alternatively, you could assume the problem is in the Job Runner and start debugging that code.

However, if you were trying to bifurcate the problem space, you would look for a way to disqualify either the Web Server or the Job Runner. Doing so isolates the problem to one codebase or the other.

In our case, we decided to test the Job Runner by removing the Web Server as a variable. We replicated the exact calls made to the Job Runner from the command line. If we could get the Job Runner to erroneously return "Success" instead of "Error," we would know the problem was in the Job Runner. Otherwise, we would have very strong evidence (but not proof) that the problem was in the Web Server.
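
We did this with plain command-line HTTP calls, but a small script captures the idea. The endpoint, payload shape, and field names below are hypothetical stand-ins; the real request was copied verbatim from what the Web Server sends.

// A rough sketch of replaying the Web Server's call by hand (run as an ES
// module, e.g. `node check.mjs`). The URL and payload are hypothetical
// stand-ins for the real values taken from the Web Server's request.
const payload = { jobName: "example-job", args: [] };

const response = await fetch("https://job-runner-a.internal/jobs", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(payload),
});

// If the misconfigured runner ever answers with "Success" here, the bug is
// in the Job Runner. If it consistently answers with an error, suspicion
// shifts to how the Web Server handles that error.
console.log(response.status, await response.text());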

After half a dozen tries, we were unable to get the Job Runner to give us an incorrect "Success" status. Awesome! The problem is likely in the Web Server. We just bifurcated a problem space that included two machines down to a problem space that includes only one!

Part Two: The Code Architecture

Assuming we didn't completely flub the DB write (possible, but unlikely), the problem was somewhere in the code path that handles job submission. Unsurprisingly, there were multiple layers of code where the error could be occurring. A simplified version of the code looks something like:

function callJobRunner(runner, request) {
  // prep the request
  return http.request(runner, request);
}


function withCircuitBreaker(circuitBreaker, runner, request) {
  // check circuit breaker
  return callJobRunner(runner, request);
}


function withRetry(retryOpts, circuitBreaker, runner, request) {
  // retry if failed
  return withCircuitBreaker(circuitBreaker, runner, request);
}


function retryWithOtherRunner(runner, request) {
  response = withRetry(retryOpts, circuitBreaker, runner, request);
  // record which runner actually handled the job
  response.setJobRunner(runner);
  return response;
}


function checkIfActuallyReceived(runner, request) {
  // ask the runner whether it actually has the job
  return listJobs(runner, request.jobName).length > 0;
}


function submit(runner, request) {
  response = withRetry(retryOpts, circuitBreaker, runner, request);
  if (isFailed(response)) {
    if (retryCheckFeatureFlag) {
      // double-check with the runner before falling back
      if (checkIfActuallyReceived(runner, request)) {
        return response;
      } else {
        return retryWithOtherRunner("Job Runner B", request);
      }
    } else {
      return retryWithOtherRunner("Job Runner B", request);
    }
  }
  return response;
}


function handleRequest(request) {
  jobRunner = "Job Runner A";

  response = submit(jobRunner, request);

  jobRunner = response.getJobRunner();

  db.save(request, jobRunner);
}

Except each function was in its own file, each function was itself broken up into smaller bits that lived in their own files, and what was being passed around wasn't a request plus a jobRunner string; it was a bunch of different objects holding a bunch of information.††

So the problem could be in any of the following spots:

  1. callJobRunner
  2. withCircuitBreaker
  3. withRetry
  4. retryWithOtherRunner
  5. checkIfActuallyReceived
  6. submit
  7. handleRequest

Once again, our question is: What do you do?

You could assume the problem is in the HTTP request and start debugging that. Or you could assume it's in the retry code and start there.

Or you could try to bifurcate the code architecture.

We decided to cut out the circuit breaker and retry logic. We commented out the bodies of those functions, turning them into pass-through no-ops, and re-ran a job. It stored the wrong value in the database.
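
Concretely, the temporary versions looked roughly like this, using the simplified function names from above (the real change was a throwaway local patch):

// Temporary pass-through stubs: same signatures, none of the circuit
// breaker or retry behavior. If the bug still reproduces with these in
// place, the circuit breaker and retry code are ruled out.
function withCircuitBreaker(circuitBreaker, runner, request) {
  return callJobRunner(runner, request);
}

function withRetry(retryOpts, circuitBreaker, runner, request) {
  return withCircuitBreaker(circuitBreaker, runner, request);
}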

Nice! Now we know the problem is in one of:

  1. callJobRunner
  2. checkIfActuallyReceived
  3. retryWithOtherRunner
  4. submit
  5. handleRequest

The next part is tricky. We could not easily comment out the remainder and maintain enough functionality to have a viable test. Therefore, we started debugging each function individually, starting with handleRequest. However, we did not set out to prove handleRequest was the culprit. Rather, we set out to prove it was not the culprit. Our goal was to isolate the problem, not to solve the problem.

Instead of carefully reading the code in handleRequest, adding print statements and the like, we simply invoked handleRequest directly to see if we could get it to erroneously report a success. After a few tries, it became clear that it was working properly, so we moved on to the submit call. Invoking it directly, we found that it did, in fact, return a response with the wrong jobRunner.
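
The check itself was nothing fancy. Roughly, and with loadCapturedRequest standing in as a hypothetical helper for however you get a realistic request object in hand:

// Invoke submit directly and inspect the response, skipping handleRequest
// and the DB write entirely. loadCapturedRequest is a hypothetical helper,
// not part of the real codebase.
const request = loadCapturedRequest();
const response = submit("Job Runner A", request);

// During the outage Job Runner A rejected everything, so a correct response
// here should name "Job Runner B". It named "Job Runner A" instead.
console.log(response.getJobRunner());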

Bingo!

Part Three: A Function

We were down to three possible spots:

  1. The conditional logic inside submit
  2. checkIfActuallyReceived
  3. retryWithOtherRunner

Once again, we ask: What do you do?

By now, you know the answer is: Bifurcate the function!

Luckily, there was a built-in way to split this code. Turning off retryCheckFeatureFlag shuts off half of the conditionals in the function, thereby isolating the other half. With the flag off, we were unable to replicate the issue, so the problem wasn't in retryWithOtherRunner or anything else on that path. Now we know for certain that the problem is somewhere in this code:

function checkIfActuallyReceived(runner, request) {
  return listJobs(runner, request.jobName)[0] == null;
}


if (checkIfActuallyReceived(runner, request)) {
  return response;
} else {
  return retryWithOtherRunner("Job Runner B", request);
}

Running checkIfActuallyReceived in isolation showed that it did in fact return true even when the Job Runner failed.††† We did it! Mission accomplished.


Hopefully this was a helpful illustration of how to put Bifurcating the Problem Space into practice. It is a true story. The details are anonymized, but I had a fair amount of notes lying around, so it's a pretty accurate portrayal of what happened.

As you can see, this is a recursive practice. You apply it at the top to cut as much of the problem space as possible. Then you apply it at the next level down, then the next level down, until you get to a single branch in the code.

I also want to highlight a few common bifurcating techniques showcased in this story:

  1. Replicating a request by hand to take an entire component out of the picture
  2. Commenting out code (or stubbing it with no-ops) to rule out whole layers at once
  3. Invoking a function directly, in isolation from its callers
  4. Toggling a feature flag to shut off one branch of the code

Lastly, I will note that while it is generally useful, bifurcating the problem space really shines when you've exhausted the obvious leads. It turned out that the bug in checkIfActuallyReceived had been living undetected in production for several months. Had it been a recent change, I suspect we would have been able to guess the cause much more quickly.


† If you'd asked me where I learned that phrase before I wrote this article, I would have told you it was from Debugging with the Scientific Method. I searched through the talk transcript, and I can find no mention of "bifurcate" whatsoever. I suppose I either made it up or unknowingly picked it up from someone else.

†† Don't bother reading too much into that code. It looks very little like the actual code anyways. The point here is to show how convoluted a relatively straightforward interaction can become.

††† The actual problem ended up being something to do with the code expecting a list of null when it actually got a list of empty lists in return. Plus one for the type checkers.


A big thank you to Max Shenfield and Chris Sims for their inspiration to start writing again and for their feedback on this article.