New grad vs senior dev

A student whom I used to tutor in CS occasionally sent me a meme yesterday that showed “NEW GRAD vs SENIOR DEVELOPER”; the new grad is all-caps yelling

NO! YOU CAN’T JUST USE BRUTE FORCE HERE! WE NEED TO USE SEGMENT TREES TO GET UPDATE TIME COMPLEXITY DOWN TO O(LOG N)! BREAK THE DATA INTO CHUNKS AT LEAST! OH THE INEFFICIENCY!!!

and the senior developer responds

Ha ha, nested for loops go brrrrrrrrrrrr…

OK, that’s silly and juvenile, but… oh no, I feel a flashback coming on.

It is 1994 and I am a second-year CS student at my first internship at Microsoft, on the Visual Basic compiler team, reading the source code for InStr for the first time. InStr is the function in Visual Basic that takes two strings, call them source and query, and tells you the index at which query first appears as a substring of source. The implementation is naive brute force.

I am shocked to learn this! Shocked, I tell you!

Let me digress slightly here and say what the naive brute force algorithm is for this problem.


Aside: To keep it simple we’ll ignore all the difficulties entailed by the fact that VB was the first Microsoft product where one version worked everywhere in the world, on every version of Windows, no matter how Windows was localized; systems that used Chinese DBCS character encodings ran the same VB binary as systems that used European code pages, and we had to support all these encodings plus Unicode UTF-16. As you might imagine, the string code was a bit of a mess. (And cleaning it up in VBScript was one of my first jobs as an FTE in 1996!)

Today for simplicity we’ll just assume we have a flat, zero-terminated array of chars, one character per char, as was originally intended.


The extremely naive algorithm for finding one string inside another goes something like this pseudo-C:

bool starts(char *source, char *query)
{
  int i = 0;
  while (query[i] != '\0')
  {
    if (source[i] != query[i])
      return false;
    i = i + 1;
  }
  return true;
}
int find(char *source, char *query)
{
  int i = 0;
  while (source[i] != '\0')
  {
    if (starts(source + i, query))
      return i;
    i = i + 1;
  }
  return -1;  
}

The attentive reader will note that this is the aforementioned nested for loop; I’ve just extracted the nested loop into its own helper method. The extremely attentive reader will have already noticed that I wrote a few bugs into the algorithm above; what are they?

Of course there are many nano-optimizations one can perform on this algorithm if you know a few C tips and tricks; again, we’ll ignore those. It’s the algorithmic complexity I’m interested in here.

The action of the algorithm is straightforward. If we want to know if query “banana” is inside source “apple banana orange” then we ask:

  • does “apple banana orange” start with “banana”? No.
  • does “pple banana orange” start with “banana”? No.
  • does “ple banana orange” start with “banana”? No.
  • does “banana orange” start with “banana”? Yes! We’re done.
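If you want to see it run, here is a minimal driver sketch; compile it together with the starts and find above. (These particular inputs happen not to trip any of the bugs I mentioned.)

#include <stdio.h>

int main(void)
{
  // "banana" first appears at zero-based index 6 of the source.
  printf("%d\n", find("apple banana orange", "banana"));  // prints 6
  printf("%d\n", find("apple banana orange", "grape"));   // prints -1
  return 0;
}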

It might not be clear why the naive algorithm is bad. The key is to think about what the worst case is. The worst case would have to be one where there is no match, because that means we have to check the most possible substrings. Of the no-match cases, what are the worst ones? The ones where starts does the most work to return false.  For example, suppose source is “aaaaaaaaaaaaaaaaaaaa” — twenty characters — and query is “aaaab”. What does the naive algorithm do?

  • Does “aaaaaaaaaaaaaaaaaaaa” start with “aaaab”? No, but it takes five comparisons to determine that.
  • Does “aaaaaaaaaaaaaaaaaaa” start with “aaaab”? No, but it takes five comparisons to determine that.
  • … and so on.

In the majority of attempts it takes us the maximum number of comparisons to determine that the source substring does not start with the query. The naive algorithm’s worst case is O(n*m) where n is the length of source and m is the length of the query.
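You can watch the blow-up happen by counting comparisons. Here is a self-contained sketch that instruments the inner loop; starts_counted and the comparisons counter are my additions, purely for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

long comparisons = 0;  // global counter, for illustration only

bool starts_counted(char *source, char *query)
{
  int i = 0;
  while (query[i] != '\0')
  {
    comparisons = comparisons + 1;
    if (source[i] != query[i])
      return false;
    i = i + 1;
  }
  return true;
}

int main(void)
{
  char source[21];
  memset(source, 'a', 20);  // twenty 'a' characters...
  source[20] = '\0';        // ...plus a terminator

  for (int i = 0; source[i] != '\0'; i = i + 1)
    starts_counted(source + i, "aaaab");

  // Almost every starting position costs the full five comparisons,
  // so the total is close to 20 * 5 = 100.
  printf("%ld comparisons\n", comparisons);  // prints 94
  return 0;
}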

There are a lot of obvious ways to make minor improvements to the extremely naive version above, and in fact the implementation in VB was slightly better; it was basically this:

char* skipto(char *source, char c)
{
  char *result = source;
  while (*result != '\0' && *result != c)
    result = result + 1;
  return result;
}
int find(char *source, char *query)
{
  char *current = skipto(source, query[0]);
  while (*current != '\0;)
  {
    if (starts(current, query))
      return current - source;
    current = skipto(current + 1, query[0]);
  }
  return -1;
}

(WOW, EVEN MORE BUGS! Can you spot them? It’s maybe easier this time.)

This is more complicated but not actually better algorithmically; all we’ve done is move the initial check in starts, the one that compares the first characters, into its own helper method. In fact, what the heck, this code looks worse. It does more work and is more complicated. What’s going on here? We’ll come back to this.

As I said, I was a second year CS student and (no surprise) a bit of a keener; I had read ahead and knew that there were string finding algorithms that are considerably better than O(n*m). The basic strategy of these better algorithms is to do some preprocessing of the strings to look for interesting features that allow you to “skip over” regions of the source string that you know cannot possibly contain the query string.
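To give a flavour of that strategy (and to be clear, this is not what VB did), here is a minimal sketch of one classic member of the family, the Boyer-Moore-Horspool “bad character” skip: preprocess the query into a table that says how far the search can safely jump when a match attempt fails.

#include <limits.h>
#include <string.h>

int find_horspool(const char *source, const char *query)
{
  size_t n = strlen(source);
  size_t m = strlen(query);
  if (m == 0)
    return 0;
  if (m > n)
    return -1;

  // Preprocessing: for each possible character, how far can we safely
  // shift the query when a match attempt fails at this alignment?
  size_t shift[UCHAR_MAX + 1];
  for (size_t i = 0; i <= UCHAR_MAX; i = i + 1)
    shift[i] = m;  // character never occurs in query: skip past it entirely
  for (size_t i = 0; i + 1 < m; i = i + 1)
    shift[(unsigned char)query[i]] = m - 1 - i;

  size_t pos = 0;
  while (pos + m <= n)
  {
    if (memcmp(source + pos, query, m) == 0)
      return (int)pos;
    // Jump by the table entry for the source character that lines up
    // with the last character of the query.
    pos = pos + shift[(unsigned char)source[pos + m - 1]];
  }
  return -1;
}

Notice the up-front work: a 256-entry table and a pass over the query before we look at the source at all. Hold that thought; it matters later in this story.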

This is a heavily-studied problem because, first, it is obviously a “foundational” problem (finding substrings is useful in many other algorithms), and second, because we genuinely do have extremely difficult problems to solve in this space. “Find this DNA fragment inside this genome”, for example, involves strings that may be billions of characters long with lots of partial matches.

I’m not going to go into the various different algorithms that are available to solve this problem and their many pros and cons; you can read about them on Wikipedia if you’re interested.

Anyways, where was I, oh yes, CS student summer intern vs Senior Developer.

I read this code and was outraged that it was not the most asymptotically efficient possible code, so I got a meeting with Tim Paterson, who had written much of the string library and had the office next to me.

Let me repeat that for those youngsters in the audience here, TIM FREAKIN’ PATERSON. Tim “QDOS” Paterson, who one fine day wrote an operating system, sold it to BillG, and that became MS-DOS, the most popular operating system in the world. As I’ve mentioned before, Tim was very intimidating to young me and did not always suffer foolish questions gladly, but it turned out that in this case he was very patient with all-caps THIS IS INEFFICIENT Eric. More patient than I likely deserved.

As Tim explained to me, first off, the reason why VB does this seemingly bizarre “find the first character match, then check if query is a prefix of source” logic is because the skipto method is not written in the naive fashion that I showed here. The skipto method is a single x86 machine instruction. (REPNE SCASB, maybe? My x86 machine code knowledge was never very good. It was something in the REP family at least.) It is blazingly fast. It harnesses the power of purpose-built hardware to solve the problem of “where’s that first character at?”
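If you wanted the same trick in portable C today, a sketch might lean on the standard strchr, which compilers and C libraries typically lower to exactly this sort of purpose-built scan; the skipto_fast name is mine:

#include <string.h>

// Same contract as skipto above: a pointer to the first occurrence of c
// in source, or to the terminating '\0' if c never occurs.
char *skipto_fast(const char *source, char c)
{
  const char *result = strchr(source, c);
  return (char *)(result != NULL ? result : source + strlen(source));
}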

That explains that; it genuinely is a big perf win to let the hardware do the heavy lifting here. But what about the asymptotic problem? Well, as Tim patiently explained to me, guess what? Most VB developers are NOT asking whether “aaaab” can be found in “aaaaaaa…”. The vast majority of VB developers are asking “is ‘London’ anywhere in this address?” or similar questions where the strings are normal human-language strings without a lot of repetition, and both the source and query strings are short. Like, very short. Less than 100 characters short. Fits into a cache line short.

Think about it this way: most source strings that VB developers are searching contain any given character maybe 2% of the time, so whatever the first character of the query is, the skipto step is going to find those 2% of possible match positions very quickly. And then the starts step is, the vast majority of the time, going to identify the false matches very quickly. In practice the naive brute force algorithm is almost always O(n + m).
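To put a rough number on it: if each character of the source matches any given character with probability p, the expected number of character comparisons per starting position is about 1 + p + p² + … = 1/(1 − p). With p = 0.02 that is roughly 1.02, so the whole search costs about 1.02 × n comparisons, plus the m comparisons of the one genuine match. (This back-of-the-envelope estimate assumes independent characters, which human-language text only approximates, but it is the right ballpark for the address-searching scenario.)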

Moreover, Tim explained to me, any solution that involves allocating a table, preprocessing strings, and so on, is going to take longer to do all that stuff than the blazingly-fast-99.9999%-of-the-time brute force algorithm takes to just give you the answer. The additional complexity is simply not worth it in scenarios that are relevant to VB developers. VB developers are developing line-of-business solutions, and their line of business is not typically genomics; if it is, they have special-purpose libraries for those problems; they’re not using InStr.

And we’re back in 2020. I hope you enjoyed that trip down memory lane.

It turns out that yes, fresh grads and keener interns do complain to senior developers about asymptotic efficiency, and senior developers do say “but nested for loops go brrrrrrr” — yes, they go brrrrrr extremely quickly much of the time, and senior developers know that!

And now I am the senior developer, and I try to be patient with the fresh grads as my mentors were patient with me.


UPDATE: Welcome, Hacker News readers. I always know when I’m linked from Hacker News because of the huge but short-lived spike in traffic. The original author of the meme that inspired this post has weighed in. Thanks for inspiring this trip back to a simpler time!

41 thoughts on “New grad vs senior dev”

  1. Oh this brings me memories… I once wrote a simple editor in C (well, it did break words in proper Portuguese) that used assembly calls to SCANB / SCANW for searching. It was for DOS and pretty naive, since it loaded the entire text into memory, but it was screaming fast on those old PC-XTs. Good times.

      • Nothing guarantees that the 2 strings have the same length -> core dump if source is shorter than query.
        But I guess there’s more, no?

          • Do you mean if `starts` is only used by `find`?

            I was more looking at `starts` generally: if source = ‘a’ and query = ‘aa’, `if (source[i] != query[i])` will produce unexpected behaviour when i = 1. I think we also need to check whether the end of source has been reached.

          • You might be thinking about C# here. Remember, in C, by custom all strings end in a zero char, so it is defined behaviour; source[i] will be zero and query[i] will not be, so we’ll stop the loop.

          • Forget my last comment, it was late and most of my brain cells were already asleep 😉

            Your post actually reminded me how comments can really be helpful when reading code written by others.

    • I don’t see it. Why would it loop indefinitely? query[i] has to equal ‘\0’ for some i >= 0. i starts at 0 and increments at every iteration.

      The only bug I see is handling of strings beyond int’s range.

      • In an earlier draft of this post I left out the iteration step as one of the bugs, but I decided that was too easy and fixed it.

        Your point about the size of a string being potentially bigger than int is a good one, but in the actual implementation in VB we had gear that ensured that the appropriate integer types were used.

        There are more fundamental problems though. What is the behaviour of the different versions of the algorithm when given empty strings, for instance?

        • I’ll bite.

          The first implementation should have j as an input to starts so you’re not starting from source’s beginning every time. You also need a check in starts’ while loop to see whether you’re at the end of source, and then a check in the if-statement within that loop, to protect against source ending early.

          For the find function, i should be passed as an input to the starts function instead of being added to the source character array.

  2. Great post. What shines for me is the best / worst / average comparison and the interplay between that and how the code will actually be used. Mindful developers often wear “worst case” blinders and oblivious devs “best case”, but often neither of these is really where the focus should be.

    I’m curious – you said:

    > The skipto method is a single x86 machine instruction.

    Was this actually implemented with inline assembly (or something like that), or was the compiler at the time smart enough to recognize this construct in C and optimize it to the single instruction?

    • Great question, and I do not recall exactly. My vague memory is that it was a library method that the optimizing compiler knew was special, and inlined it aggressively.

    • The VC6 compiler (and others at the time) was very aggressive about optimizing for code size, since most systems didn’t have much memory. REP instructions are quite size-efficient, so it detected a lot of patterns where it could use them.

      It’s rarely done today because the REP instructions can be significantly slower on modern CPUs.

  3. Thank you for this telling example.
    Maybe one possible conclusion is that while programming classes often insist on theoretical algorithmic complexity (or OOP design, for that matter), making many students strongly believe in a “right way” to do things, they often neglect to mention that solutions don’t exist in a vacuum and must also take the actual use case and hardware into account. They forget to tell students to profile and measure.

  4. Actual senior dev here.

    Let’s see:

    int find(char *source, char *query)

    Firstly, since this function doesn’t clobber the input strings, those parameters should be const char *.

    Secondly, the function called strstr was already standardized in ANSI C in 1989.

    Nobody who writes strstr from scratch can be called a senior dev.

    • As I noted in the text, the actual job the real function had to do was complicated by the fact that it needed to do more work than the library function strstr. And as I also noted, the code is intended to represent a sense of the algorithm, not to be a shining example of actual C code.

      I note also that someone had to implement strstr; that person did not have it available to call!

  5. Still, there are cases where worst case algorithmic time complexity does matter, no? Especially in standard library algorithms. I’m thinking of cases where a DDoS attack can be mounted against an application that uses quicksort with a deterministic pivot selection so that you can adversarially make their sort routines run in O(N^2). Or when Java had to change their string hashing algorithm because the simple thing they were doing allowed malicious attackers to create many strings with hash values that collided mod small powers of 2 to DDoS Java hash maps.

    A lot of these assumptions (for example, “Most VB programmers aren’t searching for ‘aaaaaab’ in ‘aaaaaaaaaaaaaaaaaaaaaaa'”) fall over when exposed on the web or in general to untrusted inputs.

    • That’s a great point that I did not call out in the article because of course there were no such scenarios for VB in the 1990s. Designing algorithms that have good properties for mitigating attacks is quite tricky!

  6. Pingback: === popurls.com === popular today

  7. Let’s not forget that naive code is easier to implement, modify, read, reuse etc.
    That being said, I think it’s important to have the worst-case scenario in mind when coding.
    You don’t want a fast elevator on average if it’s going to forget some people for a long time.
    In some worst-case scenarios, a progress bar could be enough.

    As for string searching, using preprocessing time and space are both good ideas, while ignoring both those ideas is also interesting 🙂

  8. Pingback: The Morning Brew - Chris Alcock » The Morning Brew #2963

  9. I’m not a C dev, but I’m thinking that in the algorithm using `skipto`, if, in the `find` function, `query` is an empty string, then `query[0]` would be reading outside the string’s bounds.

    Then, in the `starts` function, you’re only checking if `query` reaches its null terminator, but what happens if `source` does? Wouldn’t the if statement read outside of `source`’s bounds? If `query` is a string composed of only ‘0’ chars (‘0000\0’), couldn’t this lead to a false positive match?

  10. All would have been well if there had been comments explaining why the brute force was better for VB’s InStr; then you would not have had to make an appointment with HIM, no?

  11. It sounds to me like there ought to be a comment in that source code explaining the optimization. IMO if some code LOOKS wrong or bad but is actually good, you should always leave a comment lest another developer try to fix it.

  12. It has been a while since my C/C++ days… But null values make the whole thing choke or infinite loop. Also typo in the while loop, ; instead of ‘. In starts you can go past the end of source. That’s all the obvious ones that I see. There’s probably something more in there if I were to write some tests and do some fuzzing.

    • The real code was complicated by the fact that in Visual Basic, null is considered to be semantically the same as an empty string.

      Speaking of which — does this algorithm correctly handle empty strings as I wrote it?

  13. It’s important for all devs, senior or otherwise, to leave documentation (or documentation breadcrumbs) in the code. It is totally reasonable for a fresh grad (or a senior dev who isn’t familiar with the problem domain) to ask questions about why something was done a certain way, especially when that something is not done the textbook way.

    A senior dev who does not suffer foolish questions gladly would do well to leave documentation both as a shield against foolish questions, something to point to should they arise, and a reminder for their future, forgetful self who might have the same question in five years.

  14. I like to say that going to college for computer science to be a programmer is like going to college for car repair to be a truck driver.

    You learn useful stuff in computer science, but for the most part it isn’t needed for the day-to-day of programming. I’ve used dynamic programming once. Funnily enough, it was for analyzing output from Sanger sequencing (an older DNA sequencing technique).

    New grads have all these techniques and tools, and want to do things “The Right Way™”. I can’t blame them, because I myself was like that, too.

  15. Pingback: New Grad vs Senior Dev – 21 Lessons

  16. Thank you Eric for the reminiscence of older days when simple folk could attain ‘giant’ status within the industry.

  17. Pingback: Life, part 11 | Fabulous adventures in coding


Leave a comment