Node.js in Flame Graphs (netflix.com)
778 points by stoey on Nov 19, 2014 | 240 comments


The money quote:

"We made incorrect assumptions about the Express.js API without digging further into its code base. As a result, our misuse of the Express.js API was the ultimate root cause of our performance issue."

This situation is my biggest challenge with software these days. The advice to "just use FooMumbleAPI!" is rampant, and yet the quality of the implemented APIs, and the amount of review they have had, varies all over the map. Consequently, any decision to use such an API seems to require that one first read and review the entire implementation of the API, otherwise you get the experience that Netflix had. It's made worse by the good APIs: you spend all that time reviewing them only to note they are well written, but each new version, which could have not-so-clued-in people committing changes, might need another review. So you can't just leave it there. And when you find the 'bad' ones, you can send a note to the project, which can respond with anything from "great, thanks for the review!" to "if you don't like it, why not send us a pull request with what you think is a better version."

What this means in practice is that companies that use open source extensively in their operation, become slower and slower to innovate as they are carrying the weight of a thousand different systems of checks on code quality and robustness, which people using closed source will start delivering faster and faster as they effectively partition the review/quality question to the person selling them the software and they focus on their product innovation.

There was an interesting, if unwitting, simulation of this going on inside Google when I left, where people could check in changes to the code base that would have huge impacts across the company, causing other projects to slow to a halt (in terms of their own goals) while they ported to the new way of doing things. In this future world, changes like the recently hotly debated systemd change will incur costs while the users of the systems stop to re-implement in the new context, and there isn't anything to prevent them from paying this cost again and again. A particularly Machiavellian proprietary source vendor might fund programmers to create disruptive changes expressly to inflict such costs on their non-customers.

I know, too tin hat, but it is what I see coming.


You're assuming that your closed source vendors are perfectly aligned with you. In practice they almost inevitably seem to cause capture (https://en.wikipedia.org/wiki/Regulatory_capture).

Open/closed is a red herring here. Projects slowing down as they succeed seems to be a universal phenomenon, from startups to civilizations. Specialization leads to capture. I think almost exclusively about how to fix this: http://akkartik.name/about (you've seen and liked this), http://www.ribbonfarm.com/2014/04/09/the-legibility-tradeoff

Disclosure: google employee


Yep, closed source doesn't solve the problem either. If you believe that just because you're paying money for someone to take responsibility for a problem, they will actually solve the problem in a way that's amenable to you...well, there are numerous closed-source software vendors looking to sell you something.

In practice, the way to avoid this is to keep the software as simple as possible. Try to adjust to your user's most pressing current needs, not every need they might conceivably have. Killing features and deleting code is as important as launching features and writing code; make sure that your incentive systems reward this. Very often, third-party code gets pulled in to scratch one particular itch; if it's no longer itching, rip the code out. If it is still itching and you've built significant parts of your system around it, you may want to think about replacing the innards with a home-grown system.


When a software provider I use starts ripping out features I relied upon, I start looking for an alternate provider, one that isn't so eager to kill features. And in particular, I try not to learn or rely on any new features, if there is past behaviour of feature removal by the provider.

It's better to be careful - very careful - about what you add, and to have a story for migration, than to remove features.


This depends on industry, of course - in consumer web it's much better to risk pissing off a few customers but make the majority of them happy than to keep all your existing customers but risk losing out on a new innovation that gives a competitor a toe-hold. Enterprise SaaS probably has different trade-offs, and software infrastructure probably different still.

This paradox, BTW, could be thought of as the full-employment theorem for entrepreneurs. As long as it is rational for a business to avoid change for fear of having to remove or support it later, then there will exist changes that a company with no customers and no codebase could implement that no incumbent would dare. Some of these are bound to be useful to some segment of the market, and that's why you get continued disruption in technology markets.


I agree this is just part of the software iteration game. This has been Microsoft's modus operandi since they started, and Microsoft is still at it with getting people to move to Azure/mobile; the cycling is just happening further up the stack.

Some of the iterations are good, some are just for control; mostly, though, it is to keep developers locked in and chasing rather than innovating. Software has to change and iterate, but going too slow or too fast can be harmful, and the worthiness of iterations varies when platforms change.

On the topic of Node.js and Express, though, I think it is production solid and it is very fast (faster than PHP, Ruby, Python). I love it, and I think here Netflix's developers are at fault for not testing something a bit outside of the thoroughly tested default Express usage. Their solution of changing to Restify so quickly after a problem probably tells you how they got into this problem, even if Restify is indeed better for this purpose. With any new change comes research, testing, and possible problems, open source or closed.

In the end this is why microframeworks, which both of the node frameworks are, do win out as they are easier to inspect and live on after the hype (monolithic frameworks not so much).


> What this means in practice is that companies that use open source extensively in their operation, become slower and slower to innovate as they are carrying the weight of a thousand different systems of checks on code quality and robustness, which people using closed source will start delivering faster and faster as they effectively partition the review/quality question to the person selling them the software and they focus on their product innovation.

To contrast with what you said, I've worked at Microsoft, which is almost the company that invented NIH and we had the same problem. I think it's because, to paraphrase Alan Perlis, programming nowadays is less about building pyramids than fitting fluctuating myriads of simpler organisms into place.


That is an excellent point. That we're building primarily distributed software rather than point software has two effects: one, the number of failure modes is quite high, and two, the 2nd- and 3rd-order interactions are not easily deduced by inspection. I like the organisms picture. I think of them as 'atoms' in a molecule, but organisms is much better; it includes their own idiosyncratic behavior in the mix.


This is why I like choosing open source technologies with some sort of commercial support available. The best support I've ever gotten was from MySQL AB (before Sun) -- the 10k we paid them for two years and three servers was affordable back then even for a small startup. I had a MySQL engineer (if memory serves, he is now a MySQL community manager at Oracle) SSH into my server 34 minutes after a desperate call.

Disclaimer: I work for a company providing commercial support (scaling) for an open source project (Drupal).


Morgan definitely knows what he's doing. (-;


Actually, open source allows something that is not possible with commercial software. You don't have to read all of the code and understand all of the implementation details, but you have that option if you need to. I think having the option is a great benefit as we can see from this example. When they started debugging they could reference the express source and figure out what was going on. If they were using some sort of commercial framework they would have had to either refer to the docs or call the help desk.

Just because you have the option to read the implementation details of every library and service you are using doesn't mean that you have to. You only have to learn enough about it to decide whether you think it is a good addition to your stack, and to use it to do whatever you are trying to get done. But open source gives you the ability to figure out why you're using it wrong, or how it is broken when that time comes.

If you are saying that open source is not documented well enough because the developers fall back on "check the source," that is a different argument where commercial software may be better, but this is not true for any of the more common open source software I've used.

Making a decision of which software to depend on in your application is something that is always difficult whether you have access to the source code of all of the choices or not. It's a decision that you make with limited information. You only have extensive knowledge of the tools that you already have experience with, so using alternatives is always a risk, but understanding and managing that risk is part of the developer's job.

And with regard to keeping up with changes, you always can remain with previous versions for some time to avoid the slowdown associated with shifting to new APIs.


I agree with you that the greatest strength of open source is that you can just go read the code and (optionally) fix it. Netflix did this and it got them past their crisis.

I think you missed my point though, which was that there isn't any sort of fitness function being applied to open source. In the closed source environment that function is 'price' (caveat walled gardens). By paying for the software, customers "vote with their wallet" on the things they like/want/use. What's more, if two folks are putting out the same capability, they are incentivised to compete on a vague sense of value. That creates a 'culling' function which operates independently of the software creators (the only way to sell more is to be "better").

Open source doesn't have that forcing function, there isn't really even a reputation function involved where bad FOSS would give folks a bad reputation that a user could pick up on (yes we probably all know one or more committers who are jerks but that isn't quite the same thing). And there is no barrier to entry so we get multiple variations on the same conceptual solution (I've lost count of the number of CMS's or blogging packages there are out there for example).

Finally there are the various bits of entanglement. You wrote, "And with regard to keeping up with changes, you always can remain with previous versions for some time to avoid the slowdown associated with shifting to new APIs," but that has not been my experience. As you're evolving your product, interdependencies in the APIs and packages you are using force them to upgrade as well. There is no island of stability in the FOSS world (well, maybe OpenBSD servers or something, but I have not seen them in the 'stack'). Backports only go on so long, and suddenly you're no longer getting updates from your source, as Ubuntu showed when they dropped updates for 12.x LTS. All those systems are still Shellshock-able or Heartbleed-able, or something else, because to upgrade that part you need to upgrade the next part, and so on and so on until you've upgraded everything. And you pay that price again and again and again and again.

Much of this is just different from the other model, which is not necessarily a bad model. But from a systemic point of view it is hard to partition the work, and that leads to inefficiencies: a single change in an API implementation which results in dozens of engineering groups re-implementing stuff, versus a migration model that allows people to move over gradually. The promise of open source is "no capital expense," but I worry the downside will become "much higher operational expense." And when the opex cost of open source is greater than the capex + opex cost of closed software, then closed software wins. Of the folks who have tried to keep that equation from shifting in favor of closed source, I only see Red Hat as a success story.


There totally is a fitness function for open-source. People who find that an open-source program creates more hassles than it solves don't use it. Folks who blog about their hassles induce other people not to use it.

There are a number of open-source projects - web.py, Mongo, Angular, Ember - that I am actively avoiding because IMHO their benefits to me don't outweigh risks or previous hassles experienced. Much better to use tried-and-true open source projects like MySQL, Postgres, vanilla JS, and Django that I have had good experiences using. There are many others like Polymer or Docker that I am intrigued by, would love to give a try, but am sitting on the sidelines until other people uncover the most obvious pitfalls.

Economically, open-source projects are just startups where the monetary cost of use is zero. That doesn't mean that the total cost of use is zero, and as you're evaluating potential products, you need to be careful to figure in time expenditure spent debugging as a cost. But you are incredibly naive if you believe that just because you are paying a million dollars for a piece of software, you will spend less time debugging and integrating it than a well-used piece of open-source software. Hell, I once was the college intern writing that million-dollar piece of software. All it takes to sell something is chutzpah.


   People who find that an open-source program creates more hassles than it 
   solves don't use it. Folks who blog about their hassles induce other people 
   not to use it.

Would that it were so, but people still use Mongo. You seem to be positing perfect knowledge, whereas crummy software seems to creep into places either because of marketing or because it's the default in some system.

Docker, btw, is awesome. You can use it to replace virtualenv (which only handles python libs) and system lib isolation in a much more scriptable, more robustly deployable fashion. So you can take python projects developed at different times that use different revs of packages (and those packages can require different libs) and make deploying them way better. Try it, it's just awesome.


The same problems with imperfect knowledge exist with closed-source software too, but they're worse because you can't examine the source yourself and the vendor often has an incentive not to let flaws come to light.


> People who find that an open-source program creates more hassles than it solves don't use it.

But bad open-source software can live on to upset the productive lives of others forever, as the distribution cost is essentially free. With bad closed-source software, if the producer goes out of business (without a buyer), typically it becomes unavailable.


This is both a feature and a bug. It is definitely caveat emptor; you should do your diligence before pulling in any piece of third-party software. However, there are many, many critical systems that incorporate unmaintained open-source libraries. libxml, for example, gets a bugfix release about every year, while TeX has been asymptotically approaching version pi since 1989. That's fine; those libraries are stable, they do what they do, and I'd much rather have them available than have them disappear like closed-source alternatives.


This mirrors my experience as well. Mature well known open source is usually excellent quality. Many of the more "modern" packages are often a lot more hassle.


You can't misuse closed source APIs? How would you know something is O(n) without seeing the source code?


I wondered the same thing too. I think the answer is here:

> people using closed source will start delivering faster and faster as they effectively partition the review/quality question to the person selling them the software

I guess the theory is that if you have a customer/vendor, contractual relationship, then you can task the closed-source vendor to describe, fix, or alter their API.

But my (albeit limited) experience is that the customer is not always right; sometimes the vendor just says "that's the way it is" and you're stuck either living with it or porting to a different company/technology. Conversely, it's often possible to hire a vendor or consultant to provide expertise and support for some aspect of an open-source stack. And because it's open-source, if you don't like your consultant you can go find another one without having to change your technology.

In short, I think the speed of innovation thing might have more to do with internal teams trying to manage too much diversity of technology, than with some inherent shortcoming of open source.


>What this means in practice is that companies that use open source extensively in their operation, become slower and slower to innovate as they are carrying the weight of a thousand different systems of checks on code quality and robustness, which people using closed source will start delivering faster and faster as they effectively partition the review/quality question to the person selling them the software and they focus on their product innovation.

I think...your experiences at Google have altered your worldview to the point where you don't see how things are happening at other organizations. Google's monolithic codebase, where everything builds against everything else, may work(?) for them, but the alternative is disciplined module management.

You don't have to ever bump a version of working code if it's doing its job. Good open source projects should absolutely (publicly) test against performance regressions. New versions should be minor/incremental and source compatible.

I've never worked on a codebase the scale of Google's, but I fail to see how you can't mitigate your concerns, nor do I see commercial software as the solution.


It's funny: my first job out of college was at a startup with an ex-Sun CTO, and he insisted on a single monolithic codebase. After he left I tried to organize an (aborted) project to modularize the codebase, since many of the other engineers had run into problems with it as well.

Having worked at Google in the interim, and having a number of acquaintances at Microsoft (which uses the multiple-repository approach) I can see the pros and cons of both. The biggest benefit of the single codebase isn't technical, it's cultural. When you have a number of interdependent modules, then every change request and new feature has to go through that module's owner. If it's not their top priority (and it won't be), then getting your work done suddenly has a hard dependency on a team who is...generally pretty unresponsive, at least from your POV. The result is a lot of finger-pointing and political infighting, where every division thinks that every other division is a bunch of bozos.

The nice thing about Google's system (which, IIUIC, was Sun's as well, and is also Facebook's) is that when you have a hard dependency on another team and their priority list doesn't line up with yours, you can say "Well, can I make the change myself and you review it?" and the answer is usually yes. That means that people's default worldview is to assume busyness, not malice or stupidity, which makes the company as a whole function much better together. Yes, it creates a huge mess that someone will eventually have to clean up. But now you have a lot of options for how to clean it up, not a single point of failure: you can have the feature implementor do it, or the code owner, or it may get replaced entirely if the system is rewritten, or the feature may be unlaunched and no longer necessary, or someone may write an automated Clang or Refaster tool to fix a bunch of instances at once.

I'd compare it a lot to democratic capitalism: it's the worst system that exists, except for all the rest. When I was at Google, we all complained about how everybody else checked in changes that added complexity to our code. But if we couldn't do that, we'd all be complaining about how everybody else prevented us from getting our work done. I know which problem I'd rather have.


>When you have a number of interdependent modules, then every change request and new feature has to go through that module's owner. If it's not their top priority (and it won't be), then getting your work done suddenly has a hard dependency on a team who is...generally pretty unresponsive, at least from your POV.

Is that true? Again, I've never done development on that scale, so I could totally be wrong, but I see this as a problem with how changes are handled and releases are published.

I do work on some open source projects that are used by some big companies; they have very good reliability because of CI, specification testing, and perf testing for regressions, to the point that merging in minor bug fixes and pushing a minor release is pretty trivial.

Let's say only some of the teams practice good module management, ok fine, you fork the repo, fix the changes, publish your fork as your new dependency and move along on your way.

Does this take discipline? Absolutely, but it severely reduces complexity in the code. It mandates a certain level of pureness by forbidding certain dependencies in the vast majority of the codebase. The upsides are tremendous when compared to the slight inconvenience of fixing the occasional bug in someone else's module.


Imagine what happens at scale if everyone forks the repo when a critical feature they need isn't being worked on by the core team. Suddenly you have dozens of forks. Now imagine that the core team eventually gets around to publishing that bugfix you really do want. All of these incompatible forks need to down-integrate the change and work around the little tweaks they've added. You have a mess.

Open source works because the projects that become popular can stake out a particular territory and say "This is our responsibility, this is your responsibility, these are the hooks you can tap into, and if you don't like it, fork or use a different project." Flip the perspective and imagine yourself as the user of an open-source project. There's always a huge amount of work you have to do to customize exactly what the framework gives you to what the product needs. This is okay, because that's the bargain you know you're signing up for when you use the open-source framework, and it is after all what you're getting paid to do. If you don't believe it saves you more effort than it costs you, then don't use it. (In fact, there's a huge graveyard of projects on GitHub where the author tosses their personal way of doing things over the wall and then wonders why nobody uses it or contributes to it. This is why: the benefits don't outweigh the costs.)

Now imagine that you're at a big corporation. There is a lot of work that somebody has to do to make the product work right. All of you are getting paid by the same entity, the company. Who does the work? The team who wants it? The team whose codebase it's in?

Open source projects have the luxury of defined interfaces; they get to set the boundaries of what their code does themselves, and can turn away customers who need more than they provide. Internal teams do not: if they say no to a critical feature (regardless of how different it is from their existing interface), the product doesn't ship, and then an executive gets mad. So often adding features is non-negotiable, the features are significant enough that there's a real risk of impacting the stability or performance of the core product - and yet it's still the right thing for the business to do. The big problem is that each individual team wants clean code and well-defined interfaces, yet the benefit accrues to the company as a whole...hence, resentment between teams about who's got to spend significant work mucking up their pristine design.


I appreciate the response, but I don't follow the argument.

>All of these incompatible forks need to down-integrate the change and work around the little tweaks they've added. You have a mess.

Is that really better than having a mess without any demarcation of the software? Either way people are going to make a change. Your way involves no disincentive to making a quick fix that just your team needs. A modularized codebase means that before submitting that PR or forking, you're forced to consider how it will affect everyone who uses the library.

I understand your point about modularized codebases causing more friction between teams. That's quite insightful, and I hadn't considered it. However, I think that gets down to a cultural issue of what you value more: clean code that is easier to comprehend, or fostering geniality between teams.


I learned a lot from this post; thanks! As a MS employee (but in research), it is fascinating to study these differences. I also believe that the single codebase view is inevitable given the need for agility in the modern marketplace.


Consequently any decision to use such an API seems to require one first read and review the entire implementation of the API, otherwise you get the experience that NetFlix had.

There's no getting around this: you, or someone that you trust (not necessarily at your company), needs to read and review both the API and the implementation details of the open source software that you use. Open source software isn't a hardware store you can dip into to get the latest parts you need for your project. Adding a dependency makes your code depend on other people's code. Just as internal code should go through a code review, so should external dependencies. This also explains why, for a lot of companies, it's easier to write their own thing than to keep on top of other people's changes.

I happened to write about this yesterday which explains my thoughts further http://danielcompton.net/2014/11/19/dependencies.


Especially in high-availability/high-load systems like Netflix's, IMO you need to reduce complexity, and that means fewer modules that depend on 69 other modules, each of which depends on 69 more, etc. What's going on as this software proliferates is insane.

Then, you've got to be intimately familiar with every piece. There's no excuse when the source is readily available. I give these guys credit though they seem to take responsibility instead of just saying "express sucks". Some of their design choices seemed a little shaky.


This is precisely why — for some products in some industries — NIH is a reasonable strategy for writing good programs.


Please forgive me if I'm misinterpreting you, but the lament you make about open source software seems to me to be more about distributed systems. It is an interesting observation, and now that you've pointed it out I can observe the pattern at past and present employers. But I have seen that pattern on internal software, written by the company for the company. This makes me think the problem is an architectural one.

I hesitate to comment about the relative velocities of open source vs proprietary software because I do not have enough experience with commercial, third party software. My sample size is too small, but I'm inclined to agree with you.

I don't disagree with your Machiavellian conspiracy, either, but I've worked in marketing and advertising so I know that some of the villains are on the payroll. Maybe there needs to be a third category? There's open source software, written by someone who has no particular relationship to you. There's commercial software, written by someone who has a positive economic relationship with you. And then there is ... corporate?... software, written by someone who might think they have a zero sum relationship with you.


I would say this is more a matter of mature software versus new software than of open versus closed source. Node isn't even at version 1 yet.


> It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers.

It's actually quite clear - most routes are defined by a regex rather than a string, so there is no built-in structure (if there's a way at all) to do O(1) lookups in the routing table. A router that only allowed string route definitions would be faster but far less useful.

I can't explain away the recursion, though. That seems wholly unnecessary.

Edit: Actually, I figured that out, too. You can put middleware in a router so it only runs on certain URL patterns. The only difference between a normal route handler and a middleware function is that a middleware function uses the third argument (an optional callback) and calls it when done to allow the route matcher to continue through the routes array. This can be asynchronous (thus the callback), so the router has to recurse through the routes array instead of looping.
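The recursion the parent describes can be sketched in miniature. This is an illustrative model of an Express-style router, not Express's actual source; the layer shapes and names are assumed:

```javascript
// A simplified model of recursive middleware dispatch. Each layer may
// finish asynchronously, so the router advances via a next() callback
// that recurses into the stack rather than using a plain loop.
function dispatch(stack, req, res, i = 0) {
  if (i >= stack.length) return res.end('404');
  const layer = stack[i];
  if (!layer.pattern.test(req.url)) {
    return dispatch(stack, req, res, i + 1); // no match: skip to next layer
  }
  // Matching layers receive next(); calling it resumes the walk, possibly
  // after asynchronous work -- hence the recursion.
  layer.handle(req, res, () => dispatch(stack, req, res, i + 1));
}

const stack = [
  // Middleware: uses the third argument and calls it when done.
  { pattern: /^\/api/, handle: (req, res, next) => { req.seen = true; next(); } },
  // Route handler: ends the response, never calls next().
  { pattern: /^\/api\/users$/, handle: (req, res) => res.end('users') },
];

const res = { end(body) { this.body = body; } };
const req = { url: '/api/users' };
dispatch(stack, req, res);
// res.body === 'users'; req.seen === true
```

The loop-vs-recursion distinction disappears for synchronous handlers, but as soon as one layer defers its `next()` call, only the recursive form works.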


Of course there's a faster way! Combine all the routes into a DFA, then run the DFA over the URL. It's guaranteed to run in constant space and O(n) (n=URL length) time! The union of any set of regular languages is itself a regular language.

You can use Ragel[1] to build your automaton.

[1] http://www.colm.net/open-source/ragel/
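As an illustration of the union idea (not using Ragel itself, and JavaScript's RegExp engine backtracks rather than running a true DFA, so this demonstrates only the single-pass dispatch, not the complexity guarantee; route names are invented):

```javascript
// Fold a route table into one combined regex so a single match call
// resolves the URL, instead of testing each route's regex in turn.
function compileRoutes(routes) {
  // Wrap each pattern in a named group to identify which route matched.
  const source = routes
    .map((r, i) => `(?<r${i}>${r.pattern})`)
    .join('|');
  const combined = new RegExp(`^(?:${source})$`);
  return (url) => {
    const m = combined.exec(url);
    if (!m) return null;
    // The defined group names the winning route.
    const i = Object.keys(m.groups).findIndex((k) => m.groups[k] !== undefined);
    return routes[i].name;
  };
}

const match = compileRoutes([
  { name: 'users', pattern: '/api/users' },
  { name: 'user', pattern: '/api/users/\\d+' },
]);
// match('/api/users/42') === 'user'; match('/nope') === null
```

A Ragel-compiled automaton would give the same interface with the O(URL length) bound the parent describes.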


This. I wrote something that did this a few years ago. It took n patterns (not regex, simpler) and turned them into one DFA state table with a function ptr stored for each final state. Then it had a tiny runtime. Never thought of using it for route tables however.

Nice idea.

Edit: I did consider rewriting the runtime as a forth style interpreter as well but the time never appeared.


How long does it take to build the DFA when a new route is added? That seems like the major performance problem if new routes are added often or regularly.


To be honest, I didn't time it, it wasn't thread safe, and I don't have the source any more, so I can't answer that. It doesn't do a lot, so I imagine it would be relatively cheap.

The function it performed to rebuild the DFA was to create an NFA for the new pattern stream (it worked on char *), then walk both the existing master NFA and the new one and merge them. Then it was converted from NFA to DFA via the Dragon Book algorithm, copied and pasted (my discrete math isn't great). Both NFA and DFA were represented as a bunch of C structs tied together by pointers. Then a table generator walked the new DFA and generated a new state table. It was very naive but covered the common case where large parts of the byte stream to be matched were similar.

I've got MacVim up and writing it again. May post it as a Show HN if I get the time to finish it.


Worst case is absurd, yes. But most (all?) added routes that exhibit that behavior are routes that would exhibit that behavior when run as a regex (and hence converted into a DFA) on their own. (I.e. generally if you would shoot yourself in the foot with this method you've already shot yourself in the foot.)

Your worst case is O(n·d) plus the cost of constructing the new route's DFA, where "n" is the number of nodes in the previous DFA and "d" is the number of nodes in the DFA of the new route.

Note that constructing the new route's DFA can be done beforehand, and hence the actual time the router will be offline is only O(n·d).

(Assuming a simple 2 DFA -> 1 NFA -> DFA merge, your worst case is that every possible pair from the two, plus every singleton from either, is present in the final DFA: {node from old, node from new} -> n·d, plus {node from old} -> n, plus {node from new} -> d, giving O(nd + n + d) = O(n·d). Note that this isn't the classic O(2^n) behavior, because the old DFA and the new route's DFA are both DFAs, so only one node from n and one node from d can be present at a time in the final state set.)

A better bet may be to use the classic "lazily-constructed DFA" approach. In other words, keep it as an NFA while only converting (sets of) nodes to DFA nodes when necessary. If you are worried about memory, you can even do this only for "hot" sections of the NFA, or only for those sections of the NFA that have the most performance gain for the number of nodes added.
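A minimal sketch of that lazy approach, assuming a toy NFA representation (all names hypothetical): the matcher simulates the NFA, but caches each (state-set, character) transition the first time it is taken, so hot paths effectively become a DFA built on demand:

```javascript
// Lazily-constructed DFA: NFA simulation with memoized transitions.
// NFA shape (hypothetical): delta[state][char] -> array of next states.
function lazyMatcher(nfa) {
  const cache = new Map();              // "stateSetKey|char" -> next state set
  const key = set => [...set].sort().join(',');
  return function match(input) {
    let current = new Set([nfa.start]);
    for (const ch of input) {
      const k = key(current) + '|' + ch;
      let next = cache.get(k);
      if (!next) {                      // first time: do the NFA step, then cache it
        next = new Set();
        for (const s of current) for (const t of ((nfa.delta[s] || {})[ch] || [])) next.add(t);
        cache.set(k, next);
      }
      if (!next.size) return false;     // dead state
      current = next;
    }
    return [...current].some(s => nfa.accept.has(s));
  };
}
```

Capping the cache size (or only caching above a hit threshold) gives the "hot sections only" variant.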


I'm not sure I can think of a case where new routes should be added dynamically. Is there one?


That's the exact thing that caused Netflix's problem. Read the article.


This is the reason we need more people in JS land taking compiler courses and the like. There's a lot of tech out there that is speed-sensitive yet doesn't apply the "best" solution to a problem solved in the 70s.


But then you need to seduce the JS runtime into actually performing your optimizations. Maybe time for C/C++ extensions to become more prevalent a la Python?


This approach will of course work, but you can't have a middleware stack with a defined order using that approach, unless I'm mistaken.

Sure, you could use all routes that match, but is there a way to specify the order for all handlers, and whether or not you should continue after one handler is done?


The easiest way is to not conflate middleware handling with routing. Your routing infrastructure should take a dict mapping URL patterns to callables, and returns a callable (which uses this DFA-based approach to dispatch). Your middleware should be a callable that wraps another callable. If you want to apply middleware to just one path, wrap it before you register it with the routing table. If you want to apply middleware to all (or a subset of) paths, wrap the returned routing table, and optionally stick that in some other routing table.
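A sketch of that separation (all names are hypothetical; a plain exact-match lookup stands in for the DFA-based dispatch):

```javascript
// The router takes a table of patterns -> callables and returns one callable.
function makeRouter(table) {
  // Stand-in for the DFA dispatch described above: exact-match lookup.
  return req => (table[req.url] || (() => '404'))(req);
}

// Middleware is just a callable that wraps another callable.
const withLogging = next => req => {
  // (logging would happen here)
  return next(req);
};

// Per-route middleware: wrap before registering.
// Global middleware: wrap the returned router itself.
const app = withLogging(makeRouter({
  '/foo': () => 'foo page',
  '/admin': withLogging(() => 'admin page'),
}));
```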


> You can't have a middleware stack with a defined order using that approach

Sure you can. When you build your NFA, each accepting state is tagged with the route to which it corresponds. Converting to a DFA will merge states. Any accepting state in the DFA tagged with more than one route indicates a possible ambiguity. How you handle that ambiguity is up to you.

One strategy is to just fail at route-building time. Another is to order the routes by priority (with priorities perhaps coming from order in the source code) and eliminate from each accepting state all routes except the one with the highest priority.

The latter approach gives you the "defined order" semantics you want.


Oh that is very cool, I am definitely going to have to spend some time playing with that.


Yeah, but what's n there? Isn't that going to be something like the sum of each route's length? That doesn't buy us anything (well, probably a smaller constant).

Unless you go the NFA route, but I'm pretty sure that costs non-constant space.


No. n in the case of a DFA is the length of the input string. (i.e. the string being matched.)

So, in terms of the GP's post, yes, this is an O(1) (with respect to number of routes) solution.


Oh, right. My CS has gotten fuzzy - the tree gets enormous, but the runtime stays O(n) in input length. Thanks for the explanation.


The automaton isn't even that big, really. The number of NFA states is roughly proportional to the total number of characters in the regular expressions involved, and NFA to DFA conversion usually expands the automaton only by a factor of two or three. Although the subset construction is in theory worst-case exponential, that situation never occurs in practice.


By that logic the app I work on would have a DFA with ~200,000 nodes. That sounds fairly large to me.

I suspect this is a slight overestimate, but the point stands - non-trivial apps could quickly build very large state graphs.


Yes, but unless you're running on embedded hardware (and to some extent even then, unless you're way at the bottom of the power scale), you're running on a supercomputer that will laugh at 200,000 nodes in a DFA, and it's very easy for that DFA to run significantly faster than the tens of thousands of FAs you'd otherwise be running sequentially.


The exponential blowup only occurs in the worst case. I don't have a proof at hand but I suspect that the final automaton shouldn't grow very large if your regexes are modeling web routes, since they are structured as a tree of paths with lots of "prefix-sharing" and no loops.

Other than that, if you are concerned about automaton size you can use a non-deterministic automaton instead of a deterministic one. You spend a bit more time (at each iteration you have to update a list of "active" states, while in the DFA case there is always a single active state) but on the other hand the automaton itself is much more compact.


> By that logic the app I work on would have a DFA with ~200,000 nodes. That sounds fairly large to me.

That's not very large, since you can store a DFA very compactly in a table (C array with structs). The table stores the transitions, which only have the character and the target state. Then you either keep an array with state offsets in the transition table, or you modify transition tables such that each transition points at the first transition of the target state.
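A JS analogue of that layout, using one flat typed array as the transition table (a toy DFA for "ab" hard-coded by hand for illustration; in practice the table would come out of the generator):

```javascript
// Dense transition table: one row per state, one column per input byte.
const NSTATES = 3, NCHARS = 256, DEAD = -1;
const table = new Int32Array(NSTATES * NCHARS).fill(DEAD);

// Toy DFA accepting exactly "ab": 0 -a-> 1 -b-> 2.
table[0 * NCHARS + 'a'.charCodeAt(0)] = 1;
table[1 * NCHARS + 'b'.charCodeAt(0)] = 2;
const accepting = new Set([2]);

function run(input) {
  let state = 0;
  for (let i = 0; i < input.length; i++) {
    state = table[state * NCHARS + input.charCodeAt(i)];
    if (state === DEAD) return false;   // no transition: reject
  }
  return accepting.has(state);
}
```

200,000 states at 256 4-byte entries per row is about 200 MB dense, which is why real implementations compress the rows, but the lookup stays a single indexed load per input byte.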


A lot of people here are right, the right way is with an NFA. I just want to add that the solution is not even hard, you can do it with string concatenation and capture groups using regexps. Regexps are NFAs, and are highly optimized C code in just about every JS engine.

If I have the routes /foo/bar and /foo/bar/(\d+) I can generate the regexp ((^\/foo\/bar$)|(^\/foo\/bar\/\d+$))

I'm not at all surprised, the quality of software in node is pretty low, I've seen numerous issues in node libs being just as boneheaded. I swear, the fact that the express devs overlooked a key optimization is crazy. Rails, by way of example, uses the Journey engine to solve this problem (https://github.com/rails/journey)


> I can generate the regexp ((^\/foo\/bar$)|(^\/foo\/bar\/\d+$))

And how would you know which one got matched? The regex match isn't going to tell you that. Also, it needs to recognize if multiple were matched, which is definitely not going to be done by the built-in regex matcher.

It's certainly possible, but pretending it's trivial isn't helping, either.


For more concrete syntax, consider the following Python:

  >>> import re
  >>> route = re.compile(r'(?P<fb>^/foo/bar$)|(?P<fbd>^/foo/bar/\d+$)')
  >>> route.match('/foo/bar').groupdict()
  {'fb': '/foo/bar', 'fbd': None}
  >>> route.match('/foo/bar/1').groupdict()
  {'fb': None, 'fbd': '/foo/bar/1'}
If the fb group is set, act on the first route. If the fbd group is set, act on the second.


I know very little of NFAs/DFAs/FSMs, or even string parsing in general, but a year ago I built a URL matching engine using exactly this method in Python, in combination with Google's RE2 library (https://code.google.com/p/re2/). It was far faster than anything else I experimented with, and RE2 also improved the speed dramatically by eliminating backtracking.

Nice to know that what I made is considered the best solution algorithmically.


You can check the 2nd and 3rd groups when using /(^\/foo\/bar$)|(^\/foo\/bar\/\d+$)/:

    > /(^\/foo\/bar$)|(^\/foo\/bar\/\d+$)/.exec('/foo/bar')
    [ '/foo/bar',
      '/foo/bar',
      undefined,
      index: 0,
      input: '/foo/bar' ]

    > /(^\/foo\/bar$)|(^\/foo\/bar\/\d+$)/.exec('/foo/bar/3')
    [ '/foo/bar/3',
      undefined,
      '/foo/bar/3',
      index: 0,
      input: '/foo/bar/3' ]
EDIT: Markdown differences.


First off, you don't need to handle the 'multiple' case, since the "|" has precedence rules applied to it. The leftmost match is the match. Second, you know which one matched based on the index of the capture groups, which is deterministic. See this example: http://rubular.com/r/HiW6gjnURe

You could write a simple router based on this in < 50 lines of JS. I'd do it now, but I have work to do.


> First off, you don't need to handle the 'multiple' case, since the "|" has precedence rules applied to it.

Read my top-level post again - multiple routes can be called on the same request, so express has to be able to find all of the matches, not just the first one. This is a mistake in the original article, as the author doesn't appear to understand the power of express routers.

> Second, you know which one matched based on the index of the capture groups, which is deterministic.

Deterministic, yes. But now you're also writing your own regex parser, because you have to know how many capture groups each sub-regex contains in order to figure out the indexing.


Ah, I did miss that, but it does still work. Run the code below, you'll see multiple groups match.

    "/foo/bar/3".match(/((^\/foo\/bar\/.*)|(^\/foo\/bar\/(\d+)$)|(^\/baz))/) => 
Additionally, parsing out the number of capture groups in a regexp is simple, just look for unescaped paren groups. You can do it once at route definition.


> just looked for unescaped paren groups

Don't forget non-matching groups! Which brings up my point yet again - everyone is pretending this is a dead-simple problem to solve, but all of the proposed solutions are buggy. Not really an option for such a widely-used library.

I'm not claiming it's impossible. I'm claiming it's more difficult than anyone is willing to admit.


Counting capturing groups is a solved problem - http://stackoverflow.com/questions/16046620/regex-to-count-t...


Actually, no: if you run this you will see three matching groups, but the first is always the whole match (this is in addition to the capture groups that are defined), the second is the (redundant) capture group you put around the whole thing, and the third is foo/bar/.*. /foo/bar/\d+ is not matched at all, even though it matches the input string.


You are 100% correct, I forgot about that extra group (that doesn't need to be there).

That being said, most frameworks don't have multiple matching routes, and I can't imagine that many people miss them. I've never used a framework that supported that, and I can't say I've ever felt I've missed out. Middleware is the right approach to that problem generally.


The point is to be able to have middleware that only runs on certain urls. For example, you might have middleware that only runs on a particular version of your API (`regexp: ^/1\/.*/`). I've used it fairly extensively.

The point being, now you're describing a completely different, less powerful framework. Which, yes, of course it's faster. You'll notice that's what Netflix ended up doing - they moved to `restify` from `express`.

You might as well just only allow routes to be strings and then you have the constant-time map implementation that the article describes as the expected implementation.


The match object will be of the form [undefined, ..., input_string, undefined, ...], so you can use something like m.indexOf(input_string) to find out the index. Unfortunately that's O(n) again, unless the engine uses a sparse array implementation...


You're right that it's O(n), but it's still faster than express js. It's not iterating through the array that's slow, it's evaluating the members of that array. Express must check each route using JS functions that each execute a regexp match per route. Using a single regexp is much faster because 1.) by unioning the regexps you get the correct NFA, which will run much faster 2.) there's less function call overhead.

Iterating through the array looking for the first non-null value, then doing an O(1) lookup in a JS object mapping capture group positions to routes, is much faster than having to dispatch a bunch of functions. Additionally, it can be JITed to much faster code, I would wager.

So, expressJS = N regexp matches, while this method = 1 regexp match + N null checks in the resulting array + 1 hashmap lookup in an object to return the proper route. The second solution is going to be much faster, even for large routing tables, esp. given how optimizable that array lookup code is.
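A sketch of that scheme (assuming, as a simplification, that no route pattern contains capture groups of its own; since the patterns are anchored, the matching group always equals the full match, so its index identifies the route):

```javascript
// Hypothetical route table: pattern source strings plus handlers.
const routes = [
  { pattern: '^\\/foo\\/bar$',        handler: () => 'bar' },
  { pattern: '^\\/foo\\/bar\\/\\d+$', handler: () => 'bar item' },
];

// Built once at route-definition time: one union regex...
const union = new RegExp(routes.map(r => '(' + r.pattern + ')').join('|'));
// ...and an O(1) map from capture-group index to handler.
const byGroup = {};
routes.forEach((r, i) => { byGroup[i + 1] = r.handler; });

function dispatch(url) {
  const m = union.exec(url);            // 1 regexp match
  if (!m) return null;
  // The winning group equals the full match m[0]; find its index.
  return byGroup[m.indexOf(m[0], 1)]();
}
```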


You can get the index of the match via the .exec function. It returns an array. The first element is the match, and then an element for each capturing group. You'd need to keep track of how many sub-capturing groups each individual top-level capturing group has, but that's not terribly difficult.

As far as multiple matches, after you found a match, you would then test against a new regex that is everything to the right of the previous regex's captured group.

((a)|(b)|(c)|(d)) -> if b matches, you then test against ((c)|(d)), and so on.
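That retesting loop can be sketched like so (assuming anchored alternatives with no inner capture groups, so the matching group equals the full match):

```javascript
// Collect every matching route by repeatedly retesting the alternatives
// to the right of the last match.
function matchAll(patterns, url) {
  const hits = [];
  let remaining = patterns.map((p, i) => ({ p, i }));
  while (remaining.length) {
    const union = new RegExp(remaining.map(r => '(' + r.p + ')').join('|'));
    const m = union.exec(url);
    if (!m) break;
    const g = m.indexOf(m[0], 1) - 1;   // which alternative matched
    hits.push(remaining[g].i);
    remaining = remaining.slice(g + 1); // only later alternatives can still fire
  }
  return hits;
}
```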


This indicates it probably won't be much faster in JS

http://jsperf.com/regex-list-vs-full

Most of the time in the single-regex case is likely spent on allocating the matches array (which for a router with 300 routes would have over 300 undefined elements on every execution)

V8 is not quite the typical dynamic language runtime. Code that would be dead slow in others is optimized really well by the amazing JIT.


Interesting, I'm looking at it right now, and it's saying that the multiple regexes are ~ 50% slower in both chrome and firefox. Am I missing something? It looks like the single regex actually is a significant win.

It is interesting how much allocation changes things though, very cool that you made this.


Yeah multiple regexes are slower, but the performance gain isn't too gigantic. If we had `regex.execIterable(string)` we could avoid allocating massive arrays and maybe get a really significant difference


Regexps with backreferences are not regular (IIRC, right?), does that matter here?


That's right. Typical routing regexes will not use backreferences, so that's not really an issue here. However, most routes do have parameters implemented as capture groups (which, I believe, is also not technically a feature of regular expressions). One simple solution would be to use a big regex (the union of all the routes) to determine which route it is (in O(n) time), and then once you know the route, use another regex to parse the parameters in the URL. So each route lookup requires 2 regex matches rather than N.
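A sketch of that two-pass idea (route patterns and names are made up; counting each route's capture groups lets you map the union's group numbering back to routes even when routes have parameters):

```javascript
const routes = [
  { re: /^\/users\/(\d+)$/,               name: 'user' },
  { re: /^\/users\/(\d+)\/posts\/(\d+)$/, name: 'post' },
];

// Count capture groups in a pattern: exec('') on "pattern|" matches the
// empty alternative and yields an array of length (groups + 1).
const countGroups = src => new RegExp(src + '|').exec('').length - 1;

// Pass 1 regex: union of all routes, each wrapped in one top-level group.
const union = new RegExp(routes.map(r => '(' + r.re.source + ')').join('|'));
const starts = [];                  // top-level group number for each route
let g = 1;
for (const r of routes) { starts.push(g); g += 1 + countGroups(r.re.source); }

function lookup(url) {
  const m = union.exec(url);        // pass 1: which route? O(url length)
  if (!m) return null;
  const i = starts.findIndex(s => m[s] !== undefined);
  // Pass 2: that route's own regex extracts the parameters.
  return { name: routes[i].name, params: routes[i].re.exec(url).slice(1) };
}
```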


Just re-read this and realized it's unclear: when I said O(n) time, I meant linear in the length of the URL to be parsed. The point is that with this technique, it doesn't matter how many routes there are.


Given that combined regex, how do you determine which of the routes that went in to it was the one that matched?


If it matched the first, match[2] is set. If it matched the second, match[3] is set.

Or you use a better regexp library than the default JS one and use named captures.


It looks like routes are either strings (exact match) or regular expression objects. I don't think there's a way to combine regular expressions in JavaScript.

I don't think this was an unreasonable implementation. Sure, it could be optimized, but no one needed it enough to do the work. (Even Netflix didn't need a faster matcher, they needed a matcher that didn't slow down between the 1st match and the nth.)


What I'm saying is that the pipe (or) "|" is a cheap way to combine regexps. It has its limitations, but its better than an O(n) search through an array. The regexp parser can create an NFA that is far more optimized.


In JS, "far more" isn't quite that dramatically far as in other dynamic language runtimes. See http://jsperf.com/regex-list-vs-full


Oh yes, I forgot about the legendary Ruby On Rails code quality: http://puppetlabs.com/security/cve/cve-2013-0277


There is a library (I think it is Google's re2) that supports throwing a bunch of regular expressions into a single structure, matching an input string against all of them at once, and answering which patterns matched. This gets you route lookup in time linear in the input string (or if not linear, still better than checking all the patterns one after another).


I use RE2 and currently achieve this with regular string concatenation. Do you know if they provide an API for doing this without concatenation, and if it would be faster?


The header file is available at https://code.google.com/p/re2/source/browse/re2/set.h , and if I recall correctly, it merges all the regexes into one DFA so that it's faster than just concatenating all your patterns into one string.



The other thing you can't properly get using a map is precedence. By building an array you can go through it in order so that middleware is run in the order it was defined in the code.

That way, if you have any route definitions that match early on, especially if they allow execution to continue ( by calling next() ), they will run before any following code does.

This allows your code to be very expressive and intuitive when reading it, as things have a defined order.

I imagine the recursion is also done so that you can use a route object anywhere in that ordered stack and it will apply all its own handlers, in the order they were defined in the code.


You could compile all the regular expressions into a table. Lex has been doing that since, oh, 1975.


Most routes are prefix matches, so a trie (or one of its variants) would work well.
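A minimal segment-trie sketch (the ':param' wildcard syntax is made up for illustration; literal segments win over wildcards):

```javascript
// Each node maps one URL segment to a child; ':param' segments are wildcards.
function makeTrie() { return { children: {}, wildcard: null, handler: null }; }

function addRoute(root, path, handler) {
  let node = root;
  for (const seg of path.split('/').filter(Boolean)) {
    if (seg.startsWith(':')) {
      node = node.wildcard || (node.wildcard = makeTrie());
    } else {
      node = node.children[seg] || (node.children[seg] = makeTrie());
    }
  }
  node.handler = handler;
}

function route(root, url) {
  let node = root;
  for (const seg of url.split('/').filter(Boolean)) {
    node = node.children[seg] || node.wildcard;   // prefer literal match
    if (!node) return null;
  }
  return node.handler;   // O(number of segments), not O(number of routes)
}
```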


On the other hand, without the recursion I don't see how the flame graph would have been helpful.


In flame graphs heavy iteration tends to give "wider" structures. Recall the stacks aren't ordered on the X axis by time, but by contents.


I'm surprised nobody has mentioned that express has a built in mechanism for sublinear matching against the entire list of application routes. All you have to do is nest Routers (http://expressjs.com/4x/api.html#router) based on URL path steps and you will reduce the overall complexity of matching a particular route from O(n) to near O(log n).
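Not express's actual Router, but a sketch of why nesting helps: dispatching on the first path segment is a single hash lookup, so each sub-router only scans its own short route list instead of the whole application's:

```javascript
// A flat sub-router: linear scan over its own (short) route list.
function subRouter(routes) {            // routes: [[regex, handler], ...]
  return url => {
    for (const [re, h] of routes) if (re.test(url)) return h();
    return null;
  };
}

// The nesting step: O(1) dispatch on the first path segment.
function nest(prefixTable) {            // first segment -> sub-router
  return url => {
    const [, first, rest = '/'] = /^\/([^\/]+)(\/.*)?$/.exec(url) || [];
    const sub = prefixTable[first];
    return sub ? sub(rest) : null;
  };
}
```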


I wonder what the thought process was behind moving their web service stack (partially?) to node.js in the first place. For a company with the scale and resources of Netflix it's not exactly an obvious choice.


What are the arguments against node.js in their use case?

Not looking to start any wars, but I was under the impression that if you know what you're doing* node.js is pretty awesome.

This particular bug had to do with a misunderstanding regarding the express API.

* for the most part: understand async and closures/memory leaks.


To give some arguments in favour:

- "Here we see our latencies drop down to 1 ms and remain there after we deployed our fix." <- Performance (from TFA)

- They can hire from a vastly larger pool of developers. Since they've got both budget and brand recognition, they should have no problems hiring high end JS guys & gals.


Why are these arguments in favor of Node.js over their current Java based stack? Java is both considerably more speedy for this sort of workload and there's a significantly larger pool of high end Java server developers compared to high end JavaScript server developers. Budget and brand recognition are not necessarily in favor of node.js over other stacks. That simply helps with getting the better engineers in general.


They are arguments in favor of node.js being a reasonable tool for the job.

> Java is both considerably more speedy for this sort of workload

They're quoting 1ms latencies, that's pretty speedy. Maybe Java could do <1ms and/or support slightly more concurrent requests. How much difference does it actually make when you're already responding after 1 millisecond?

> there's a significantly larger pool of high end Java server developers compared to high end JavaScript server developers

Oh, I would have thought there are far more JS* folks out there than Java.

> Budget and brand recognition are not necessarily in favor of node.js over other stacks. That simply helps with getting the better engineers in general.

That's what I meant - them being in favour of Netflix as an employer - they aren't "Bob's Digital Agency" so they can have their pick of the litter of node devs.

* I'm not sure if the "server" distinction is meaningful here - if you know JS, you know node.js - simply a matter of familiarizing yourself with node-specific APIs.


I'm not sure I agree. Responding after 1ms is not the same as (or even very much related to) routing a request taking up 1ms of CPU time. Although there are a ton of nuances that make the following a small oversimplification, it basically means that purely for routing alone you have a hard cap of 1k lookups/sec/core. And building server software in any language takes a whole different set of skills and knowledge than knowing how to build front ends. I'm not arguing one is harder or more involved than the other, but they are different enough that language and paradigm familiarity is the only real gain when moving from front-end JS to node.


The parent (and the article) are both talking about 1ms request latencies, not 1ms routing latency.


One more advantage of Node.js over Java is that development is much faster with Node.js.


No, just please no. Any moderately competent programmer can be productive in either Java or Javascript (+ Node.js). Javascript may help incompetent programmers feel productive because Javascript has fewer compile-time constraints but in the long run the cost of maintaining code written by an incompetent more than counteracts any perceived benefit that Javascript offers.


Well obviously the work of an incompetent programmer is going to increase the long run cost of maintaining code. That has nothing to do with the fact that Javascript is faster to develop in.


I was doing some googling about this earlier, IIRC one of the big ones was "Good when your bottleneck is I/O, not good when your bottleneck is CPU."


> "not good when your bottleneck is CPU"

99% of cases, your bottleneck will be I/O.

In the 1% of cases where your webserver's bottleneck is the CPU, you have much bigger problems than using a hipster ;) language: you're doing it wrong (tm) on an architectural level:

(a) Your processor-intensive/long-running tasks need to be in separate worker processes and

(b) you need more webserver instances.

I usually try to avoid absolute statements, but I think this may be accurate:

The only CPU-intensive thing your web server code should ever do, is hash passwords.


I shouldn't have tried to be funny on HN*. For this I apologize.

Can I get a rebuttal to my points along with the downvotes, though?

Is someone disputing that CPU intensive tasks should be moved off the webserver?

Have you personally had experiences where your honest-to-$deity webserver bottleneck was the CPU? I'd be incredibly surprised. In most cases I'd guess you're underprovisioned and/or badly architected.

* This is where we come to somberly discuss interesting things.


You're assuming a number of things that aren't always true:

1. Your networking pipe into the server is slow.

2. You don't have a reverse proxy in front of the webserver to buffer the slow connections.

3. You don't make decisions about what information to show the user in the webserver; your UI isn't tailored for the particular user whose request you're servicing.

4. You haven't made productivity trade-offs to move work from the developer to the computer.

Google's webservers (for Search) are CPU-bound, and I can guarantee that they are neither underprovisioned nor badly architected, and CPU-intensive tasks are moved off the webserver (the total amount of CPUs for webservers is miniscule compared to the number of CPUs used for the indexing & serving system). But Google has the fastest fiber networks in the world, a massive load-balancing infrastructure, its philosophy is to serve each user only the HTML/JS that they need (to cut latency), it relies heavily on A/B tests & experimentation, and it invests significantly in developer-productivity tools so that all the bookkeeping for the last two points are done by computers rather than humans.


I hear this often and it is somewhat misleading. That sentence seems to imply node.js is better in terms of I/O throughput than other technologies, as if it's a trade-off. This is incorrect; it would be more accurate to say "Good enough when your bottleneck is I/O, not good when your bottleneck is CPU.".


They went into this a bit on the NodeUp podcast: http://nodeup.com/seventyone


Are there some salient points that can be summarized here? That's an hour-and-twenty-minute podcast episode.


The motivation part starts around 8:30. It's hard to claim complete objectivity when it comes to things like this but it sounds particularly unconvincing to me. Node.js' sweetspot is providing an easy route to server development for front-end developers and it seems this is roughly what happened.

It's not a huge step for Netflix that already has an API approach that is very UI centric as outlined here http://techblog.netflix.com/2013/01/optimizing-netflix-api.h.... To me personally this doesn't seem like a healthy step forward but I hope someone can provide some reasons why you'd want to move to Node.js from, in the case of Netflix, Java.


I'm a non-frontend dev who uses node.js for server/systems/embedded work.

My preference for node is based primarily on its sane concurrency model. I basically use it as a handy scripting language for doing rapid prototyping of libuv programs. My on-paper plan is to fall back to libuv and C if I find myself really stuck for CPU, but this is something that's never actually happened to me. (I have had to abandon or modify libraries with performance pathologies, though)

There are libraries in other languages with the same sane model, but they tend to be ecosystems within the language community that aren't compatible with the rest of the language. Doing concurrency both correctly and performantly in Java, in particular, is a pain in the ass.


I'm not disagreeing and that's a perfectly valid reason to go for node for smaller projects or for prototyping. What many people fail to realise is that the "concurrency" model (it is solved by simply not having concurrency) isn't a magic bullet. It comes with large drawbacks. Not a lot of things are complicated because people like it that way. Some things are as complicated as they need to be to be used effectively. I agree that doing concurrency in Java can be a pain in the ass (although some microservice libraries can simplify some cases) but for the most part it's a pain in the ass because concurrency is hard, not because Java is bad. Doing it the way node.js does is as easy to do in Java as it is in node.


An example of basic concurrency I have great difficulty doing correctly in Java:

  var n = 0;
  var avg = 0;

  function twothings() {
    thing1(function(ret) { ++n; avg += (ret - avg)/n; });
    thing2(function(ret) { ++n; avg += (ret - avg)/n; });
  }
I want to dispatch two I/O operations to run concurrently and have their completions modify some shared state in a serializable fashion, which can be read at any stage in the process and have a valid result. I shouldn't need to give up on nondeterministic simultaneous execution or faff about with locking mechanisms to get this.


What I think you're missing is that in JS, this is not being run concurrently. Existing JS runtimes are single-threaded—if you actually want to take advantage of multiple CPUs you must run more than one process. It's very easy not to have concurrency bugs when there's no concurrency.

In Java, you have access to real threads and can actually run two pieces of code concurrently. This introduces a lot of potential issues, but also allows you to take advantage of multiple cores.

If all you want is non-blocking IO (what node gives you), there are a handful of great libraries that provide that on the JVM.


If you're having trouble with that, I suggest you take a look at CompletableFuture, or ExecutorCompletionService, or any number of other methods to do this in Java. Just like node has its own lingo, so does Java, and once you learn it, things become real simple.


I share this thought, I'm not trolling, I really believe node is a bad solution for something like Netflix.

Node has its perks but for a money making machine that relies solely on being available and providing a good customer experience, not so much.

I can't imagine the ops nightmares at that size, one buggy code path and the entire cluster could be down. These are issues that drove me away from Node to Go, in my opinion Node has way too many issues to run in money-making scenarios.


> in my opinion Node has way too many issues to run in money-making scenarios.

You can say that about a lot of languages other than node. How many billion dollar companies have their software written in PHP - a language that many people would agree has far more glaring issues than node.

My point is that I believe your comments miss the larger picture. There is far more involved in deciding which language to build a product with than "which language is best."

I don't disagree with you about Go being a great language - but for most companies, it is not even remotely practical to use. Hiring talented Go developers is hard because there are so few. Their current employees may or may not know Go, what do you do about that? Etc.


For sure, but jumping into Node from a clean slate is probably not the best idea, legacy is different.

If anything the comment on hiring highlights the use of Node being problematic. It's difficult to write robust systems in Node even as a seasoned Node developer.


Could you expand more on the ops issues you have encountered in your systems?


Netflix seems to operate like a tech start-up that is trying to glue together a ragtag collection of often unsuitable solutions because of limited funding. It is a deeply perplexing company.

Similar is LinkedIn, as an aside -- despite being fairly formidable now, I regularly have entire feeds disappear, their caching is abhorrent, they can't markup text properly, and so on. It seems very amateur hour, yet they regularly publish "how it's done" documents that see wide applause despite often completely contradicting their prior missives.


Netflix overhires quite a bit. They also pay their (very good) engineers very high salaries (possibly the highest in the valley), which results in very low attrition.

The net result is a lot of engineers having a lot of time and all these engineers experiment away on technology. In fact, they have quite a bit of NIH syndrome internally because the engineers have nothing to do.


I don't think you have those reasons in the proper order. I think the ability of the engineers to experiment and use new tools is why they have the high level of retention. I've seen plenty of places that pay much higher than the competition yet have very high turnover due to a conservative culture that dictates the tools and doesn't encourage experimentation.


"An idle mind is a devil's workshop". I have seen this over and over again when you give engineers a free run. They micro-optimize things. They start rewriting javascript to c++ and c++ to c and asm when the speed improvements are at best marginal and the cost of maintenance of produced code is extremely high.

Don't get me wrong, I am not saying that these optimizations are bad. In fact, these engineers measure things properly and then rewrite. It's just that there is no business case for all this time spent. It's like writing web servers in C. Sure, it's possibly faster than everything else, but seriously? Who maintains all this?


Experimentation is good, but perhaps there is too much going on at some companies? Finding the right balance is a hard problem.


Curious, do you have some anecdotes about that? I've generally heard from coworkers who were at Netflix that they have pretty high attrition at the lower levels, with the "fire the bottom 10% every X months" rules.


The bottom 10% has it hard in any company, let alone Netflix. Almost every company I know has stringent rules for the bottom 10% (read: a step-by-step approach until they get fired). But if you are a decent engineer you have nothing to worry about.


I know I'm being a cocky asshole, but, are their engineers that good?

> It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers.

Well, it was blindingly clear to me and the top commenter in this thread.

I'm not judgmental that it was unclear to the author of this article, after all, I'm unclear on things that should've been stupidly obvious to me all the time. I am totally judgmental that no one on the team proofread the blogpost before publishing it and was like "uh, dude, what would the keys in the map be, regexes?"


Well, the same engineer who didn't figure that out went ahead and profiled things, drew some pretty graphs, actually bothered to write a blog post to put his thoughts for public review. IMO, that is all some good engineering :-)

But you made me rethink what I said. I would say this sort of gets into the programmer / engineer debate. Netflix has good programmers, like most companies. Good engineers (in my book) are very hard to find. These people are good at being at peace with tradeoffs and understand the requirements of a business at a much deeper level than programmers. Programmers just want to have fun with computers (and there is nothing wrong with that).


NIH ?


Not invented here. Redoing stuff which already exists.


Sometimes it's okay to eliminate dependencies: http://www.joelonsoftware.com/articles/fog0000000007.html


TIL, SVGs can display labels on element hover: http://cdn.nflximg.com/ffe/siteui/blog/yunong/200mins.svg

Nice, contained way to show data like this.


SVGs can contain ECMAScript, video, canvases, animation, all sorts of things. They're the replacement for Flash that nobody seems to use.


Well, there was a good reason for that, for a long time: IE didn't even begin to support .svg until version 9, which means that it isn't a realistic option for deployment for anyone who needs to support pre-evergreen browsers.

I investigated them a year or two ago, because I found that AngularJS could work quite nicely within an .svg document, which opened up some exciting possibilities. My recollection is that at the time, there were some critical cross-browser problems with font rendering that made the topic very kludgy and complicated. I expect that matters have improved since, but I do not know to what extent.


And IE still doesn't (nor do they plan to) support declarative animation in SVGs. (SMIL) Someone there decided that script based animation is more robust so they just bypassed that capability. I think it's a terrible idea because you forego having tightly integrated SVG animation modules that can just be dropped in as needed.


They've been around since 2001, but browser support only got good recently. Initially Adobe and Corel made browser plugins but Corel scaled back and Adobe bought Macromedia.

Also SVG GUI tools haven't been great. Inkscape is strongly print focussed, or at least was the last time I used it. There's no reason to set a page size on a web SVG...

Someone needs to build an SVG equivalent of Flash Pro.


You do need to set a page size, it's the reference point everything is measured relative to. It's only a problem if you can't zoom in/out infinitely.


There's a decent amount of JavaScript in there to enable it.


> ...as well as increasing the Node.js heap size to 32Gb.

> ...also saw that the process’s heap size stayed fairly constant at around 1.2 Gb.

This is because 1.2 GB is the max allowed heap size in v8. Increasing beyond this value has no effect.

> ...It’s unclear why Express.js chose not to use a constant time data structure like a map to store its handlers.

It is non-trivial (not possible?) to do this in O(1) for routes that use matching / wildcards, etc. This optimization would only be possible for simple routes.
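One hybrid shape this could take (a sketch, not Express's actual implementation): keep exact paths in a hash map for constant-time hits, and fall back to a linear scan only over the patterned routes.

```javascript
// Hybrid router sketch: O(1) lookup for static paths, linear scan
// only over the routes that actually contain patterns/wildcards.
class HybridRouter {
  constructor() {
    this.staticRoutes = new Map();   // exact path -> handler
    this.patternRoutes = [];         // { regex, handler } pairs
  }

  add(path, handler) {
    if (path.includes(':') || path.includes('*')) {
      // Convert ':param' segments and '*' into a regex (very simplified).
      const source = path
        .replace(/:[^/]+/g, '[^/]+')
        .replace(/\*/g, '.*');
      this.patternRoutes.push({ regex: new RegExp('^' + source + '$'), handler });
    } else {
      this.staticRoutes.set(path, handler);
    }
  }

  lookup(path) {
    const exact = this.staticRoutes.get(path);   // constant time
    if (exact) return exact;
    for (const { regex, handler } of this.patternRoutes) {
      if (regex.test(path)) return handler;      // linear, but only over patterns
    }
    return null;
  }
}
```

Real routers are messier (ordering guarantees, middleware, methods), but this shows why the optimization only cleanly applies to the simple routes.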



That seems like a pretty low size to me... how are people getting around this when they need to handle >1.2GB of data on Node?


Native code modules I assume.


> It it is non-trivial (not possible?) to do this in O(1) for routes that use matching / wildcards

I'd be impressed if they did it consistently in O(1) for static routes. I think they were looking for O(log(number of different routes)) instead of O(n).


If the paths have some kind of non-crazy regex leading up to what gets munched for path variables (e.g. /path1 versus /path2), then you could at least build a tree of maps at each level, which would be roughly constant time for the most common cases.


> roughly constant time for the most common cases.

For small values of n ;)

A tree of maps would be consistently log(n), like any map. Even a hashtable eventually hashes to buckets, and those are log(n).


Not if you're using a Cuckoo hashmap.

Honestly, I have no idea why people aren't taught Cuckoo-maps by default. It's simple, has true O(1) worst-case lookup, and has easy deletion. About the only problem is trying to prove insertion time.
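For anyone who hasn't seen one, a cuckoo map keeps every key in one of exactly two possible slots, so a lookup is at most two probes; inserts displace ("kick out") whatever is in the way. A toy sketch (two tables, naive string hashes, no resizing):

```javascript
// Cuckoo hash map sketch: worst-case O(1) lookup (two probes),
// easy deletion, insertion by displacement.
class CuckooMap {
  constructor(size = 64) {
    this.size = size;
    this.t1 = new Array(size).fill(null);
    this.t2 = new Array(size).fill(null);
  }
  h1(key) { let h = 0; for (const c of key) h = (h * 31 + c.charCodeAt(0)) >>> 0; return h % this.size; }
  h2(key) { let h = 0; for (const c of key) h = (h * 131 + c.charCodeAt(0)) >>> 0; return h % this.size; }

  get(key) {
    const a = this.t1[this.h1(key)];
    if (a && a.key === key) return a.value;
    const b = this.t2[this.h2(key)];
    if (b && b.key === key) return b.value;
    return undefined;                       // at most two probes, always
  }

  set(key, value) {
    // Update in place if the key is already stored.
    const i1 = this.h1(key), i2 = this.h2(key);
    if (this.t1[i1] && this.t1[i1].key === key) { this.t1[i1].value = value; return; }
    if (this.t2[i2] && this.t2[i2].key === key) { this.t2[i2].value = value; return; }

    let entry = { key, value };
    for (let kicks = 0; kicks < 32; kicks++) {
      const inFirst = kicks % 2 === 0;
      const table = inFirst ? this.t1 : this.t2;
      const slot = inFirst ? this.h1(entry.key) : this.h2(entry.key);
      const evicted = table[slot];
      table[slot] = entry;
      if (!evicted) return;                 // found an empty slot, done
      entry = evicted;                      // evicted entry retries in the other table
    }
    throw new Error('insertion cycle: grow the table / pick new hash functions');
  }

  delete(key) {
    const i1 = this.h1(key), i2 = this.h2(key);
    if (this.t1[i1] && this.t1[i1].key === key) { this.t1[i1] = null; return true; }
    if (this.t2[i2] && this.t2[i2].key === key) { this.t2[i2] = null; return true; }
    return false;
  }
}
```

As the parent says, lookup and delete are trivially constant; the hard part to analyze is the kick loop in `set`, which is why insertion is where the proofs get interesting.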


Sounds like a documentation issue, or lack of a staging environment. I've written and maintained countless large Express applications and routing was never even remotely a bottleneck, thus the simple & flexible linear lookup. I believe we had an issue or two open for quite a while in case anyone wanted to report real use-cases that performed poorly.

Possibly worth mentioning, but there's really nothing stopping people from adding dtrace support to Express, it could easily be done with middleware. Switching frameworks seems a little heavy-handed for something that could have been a 20 minute npm module.


I read:

"This turned out be caused by a periodic (10/hour) function in our code. The main purpose of this was to refresh our route handlers from an external source. This was implemented by deleting old handlers and adding new ones to the array"

refresh our route handlers from an external source

This is not something that should be done in a live process. If you are updating the state of the node, you should be creating a new node and killing the old one.

Aside from hitting a somewhat obvious behavior by messing with the state of express in a running process, once you have introduced the idea of programmatically putting state into your running node you have seriously impeded the ability to create a stateless fault tolerant distributed system.


When I concluded what they had to be doing and then read the actual confirmation of what they were doing I was somewhat shocked. Why on Earth would you want to programmatically recreate the routes in an express app?!?!? It would be really interesting to see a write up on what/why they think this kind of behavior is needed in the first place ....


> benchmarking revealed merely iterating through each of these handler instances cost about 1 ms of CPU time

1ms / entry? What is it doing that it's spending 3 million cycles on a single path check?


Running (uncompiled?) regular expressions, it seems.


So I was a bit unclear on the parent's post, but I don't think this time was spent on a route lookup. If I'm reading the thread and post correctly, it was the static file handler getting inserted multiple times. This handler will generally match any route, but then does something like "if the file exists, return the static file; if not, look for the next handler". In this case the "if the file exists" part was the "path check" taking 1 ms, and it was happening multiple times per request.

I could be wrong, but it seems like the design of the route lookup mechanism (the global array) was actually a bit of a red herring; the real issue was the ability to attach multiple instances of the same handler to the same route.

Something was adding the same Express.js provided static route handler 10 times an hour. Further benchmarking revealed merely iterating through each of these handler instances cost about 1 ms of CPU time.


A simple "if file exists" check shouldn't take 1ms on average.

OSes cache directory entries for a reason.

I mean, even Python manages 40,000 checks / second:

  >>> timeit.timeit("os.path.exists(data)", setup="import os; import random; import string; data = os.path.join(r'C:\Windows\System32', ''.join(random.choice(string.ascii_letters) for _ in range(10)))", number=40000)
  0.9998181355403517


V8 compiles regular expressions to machine code.


You're the author of Haraka, a tool I'm prepared to try in production. This makes me _really_ worried about whether I'm making the right choice.

(To explain: obviously C++ churns out machine code, the parent was talking about compiled vs uncompiled regexes – if I'm not terribly wrong, the compilation step is turning the regex into a finite automaton.)


V8 literally compiles regexps to X86 machine code the first time they are executed. They are not compiled into an FSA that gets walked in the traditional sense.

Hopefully that lowers your concern level.


Doesn't that mean that you have exponential worst-case complexity?


Absolutely! Even worse, you may get different results than the expected method (FSA compilation) would yield. Observe:

Javascript, executed in the console of a recent Chrome:

    > 'ab'.match(/a|ab/)  
    ["a"]
BSD Grep:

    > egrep 'a|ab' <<< ab  
    ab


Yow...

I was under the impression that all major regex engines used NFAs converted to DFAs lazily, with fallbacks to a slower engine for features that cannot (or cannot practically) be implemented using an NFA (unbounded backtracking, that sort of thing.)

What is the advantage to doing this?


Interesting, thanks for the info!


> I can’t imagine how we would have solved this problem without being able to sample Node.js stacks and visualize them with flame graphs.

This has me scratching my head. The diagrams are pretty, maybe, but I can't read the process calls from them (the words are truncated because the graphs are too narrow). And I can't see, visually, which calls are repeated. They're stacked, not grouped, and the color palette is quite narrow (color brewer might help here?).

At least, I _can_ imagine how you could characterize this problem without novel eye-candy. Use histograms. Count repeated calls to each method and sort descending. Sampling is only necessary if you've got -- really, truly, got -- big data (which Netflix probably does), but I don't think the author means 'sample' in a statistical sense. It sounds more like 'instrumentation', decorating the function calls to produce additional debugging information. Either way, once you have that, there are various common ways to isolate performance bottlenecks. Few of which probably require visual graphs.

There's also various lesser inefficiencies in the flame graphs: is it useful (non-obvious) that every call is a child of `node`, `node::Start`, `uv_run`, etc.? Vertical real-estate might be put to better use with a log-scale? Etcetera, etc.


> The diagrams are pretty, maybe, but I can't read the process calls from them (the words are truncated because the graphs are too narrow).

Flame Graphs provide SVGs by default. You should be able to zoom if your browser supports it. The current version also supports "zooming" in to any frame in the stack, resetting that frame as the base of the display. Also WRT the base frames of 'node' et al, it's because Flame Graphs are a general use tool for stack visualization; it might be 'main' for a C program, or the scheduler when looking at a system.

> They're stacked, not grouped, and the color palette is quite narrow (color brewer might help here?).

Colors by default have no meaning and the palette is configurable. The current lib can also assign colors by instruction count/ipc and width by call count, if you have access to that.

> Sampling is only necessary if you've got -- really, truly, got -- big data (which Netflix probably does), but I don't think the author means 'sample' in a statistical sense.

It is sampling. Flame graphs are typically used with something like perf/dtrace/oprofile, which dumps stacks at a couple hundred to a few thousand hertz. Actual call tracing is (typically) not feasible for large/prod stacks.


You're looking at a screenshot. The actual diagram isn't static. Hover over it and it will expand each box with all the info you need to see what was run, what called it, etc.


  > our misuse of the Express.js API was the 
  > ultimate root cause of our performance issue
That's unfortunate. Restify is a nice framework too, but mistakes can be made with any of them. Strongloop has a post comparing Express, Restify, hapi and LoopBack for building REST APIs for anyone interested. http://strongloop.com/strongblog/compare-express-restify-hap...


From the article:

> What did we learn from this harrowing experience? First, we need to fully understand our dependencies before putting them into production.

Is that the lesson to learn? That scares me, because a) it's impossible, and b) it lengthens the feedback loop, decreasing systemic ability to learn.

The lesson I'd learn from that would be something like "Roll new code out gradually and heavily monitor changes in the performance envelope."

Basically, I think the approach of trying to reduce mean time between failure is self-limiting, because failure is how you learn. I think the right way forward for software is to focus on reducing incident impact and mean time to recovery.


Without over-training on this one incident, and without guidance on how to get from here to there (I'm still working on that):

1. Don't get suckered by interfaces, share code. If you create code for others to share ("libraries"), stop trying to hide its workings.

2. You don't have to learn how everything works before you do anything. But you should expect to learn about internals proportional to the time you spend on a subsystem. Current software is too "lumpy" -- it requires days or months of effort before yielding large rewards. The first hour of investigation should yield an hour's reward.

3. "Production" is not a real construct. There will always be things that break so gradually that you won't notice until they've gone through all your processes. Give up on up-front prevention, focus instead on practicing online forensics. And that starts with building up experience on your dependencies.

More elaboration: http://akkartik.name/post/libraries2

My attempt at a solution: http://akkartik.name/about

My motto: reward curiosity.


> I think the right way forward for software is to focus on reducing incident impact and mean time to recovery.

So in this case, guarantee you have a strong means of evaluating performance and maybe even include it by default just to be sure.


My biggest takeaway from this article is that Netflix is moving from Express to Restify, and I look forward to watching the massive uptick this has on https://github.com/mcavage/node-restify/graphs/contributors


Yes, but their original bug was from dynamically loading routes from an external source. I don't see how Express is to blame for this. Moving to Restify is not a solution, but they state having different reasons for moving (support for bunyan logging? But Express already supports this too).


the bug was related to dynamically loading routes, but the true cause was that express allowed duplicate handlers. They were loading routes dynamically correctly, that wasn't a problem, it was that when doing that express let them duplicate routes.


That's a feature.

From the API docs:

> Multiple callbacks may be given; all are treated equally, and behave just like middleware. The only exception is that these callbacks may invoke next('route') to bypass the remaining route callback(s). This mechanism can be used to perform pre-conditions on a route, then pass control to subsequent routes if there's no reason to proceed with the current route.


If they were loading duplicate routes, how was it they were "loading routes dynamically correctly"?



While I kind of agree the restify gives you "almost nothing" when compared to some larger frameworks, in my experience "almost nothing" is basically all I need from my framework when writing REST APIs and restify has had me covered every time.


If I had to pick one line to highlight (not to criticize, but was a wise lesson worth sharing) it would be this one:

"First, we need to fully understand our dependencies before putting them into production."


In my experience developers constantly overestimate the gain of using a new dependency and underestimate the amount of effort it will take to sufficiently understand it. (Or fail to make the effort, not understanding the risks.)

This is why developers without significant experience should not be making decisions about the tech stack.


[Subjective] Not to criticize the Express.js code base, but have you tried reading it? It is very complicated and there are a bunch of clever things going on. I think it could have been written simpler and easier to understand. A problem with frameworks is that they are written by people who want to show off how clever they are.


I would like to mention that Netflix could have consulted the express maintainers (us) but didn't.

Source: myself - https://github.com/strongloop/express/pull/2237#issuecomment...


A surprising number of path recognizers are O(n). Paths/routes are a great fit for radix trees, since there's typically repetition, like /projects, /projects/1, and /projects/1/todos. The performance is O(log n).

I built one for Java: https://github.com/augustl/path-travel-agent


How does your radix trie implementation handle variables in the URL paths, in a nutshell?


This is one of the reasons for why I call it a "bastardized radix tree" in the README :)

The routes are stored as "nodes". There's a root node. It has a hash map of child nodes, by name. It also has a list of "parameterized" nodes. When a node gets a path segment, it will first look in its hash map. If nothing is there, it'll call the parameterized nodes in sequence. Typically there's just one parameterized node.

For the following paths:

  /projects
  /projects/new
  /projects/special
  /projects/:project-id
The root node will have a single item in its hash map, "projects". No items in the parameterized node.

The node for "projects" will have two items in its hash map, "new" and "special". It will have a single item in its parameterized list, for :project-id.

I updated the README just now with a slightly more detailed explanation :)
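In JavaScript the node structure described above might look something like this (a simplified sketch of the design, not the actual path-travel-agent code, which is Java):

```javascript
// Each node: exact children in a Map (one hash lookup per segment),
// plus an optional parameterized child tried on a miss.
function makeNode() {
  return { children: new Map(), param: null, handler: null };
}

const root = makeNode();

function addRoute(path, handler) {
  let node = root;
  for (const seg of path.split('/').filter(Boolean)) {
    if (seg.startsWith(':')) {
      node.param = node.param || { name: seg.slice(1), node: makeNode() };
      node = node.param.node;
    } else {
      if (!node.children.has(seg)) node.children.set(seg, makeNode());
      node = node.children.get(seg);
    }
  }
  node.handler = handler;
}

function lookup(path) {
  let node = root;
  const params = {};
  for (const seg of path.split('/').filter(Boolean)) {
    const exact = node.children.get(seg);   // hash map first
    if (exact) { node = exact; continue; }
    if (node.param) {                       // then the parameterized child
      params[node.param.name] = seg;
      node = node.param.node;
      continue;
    }
    return null;
  }
  return node.handler ? { handler: node.handler, params } : null;
}

addRoute('/projects', 'index');
addRoute('/projects/new', 'new');
addRoute('/projects/:project-id', 'show');
```

So /projects/new hits the "new" entry in the hash map, while /projects/42 falls through to the :project-id node and captures the parameter.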


Very cool, thanks for taking the time to explain! I recently learned how tries work so it was cool to see a real-world implementation like yours.


tl;dr:

* Netflix had a bug in their code.

* But Express.js should throw an error when multiple route handlers are given identical paths.

* Also, Express.js should use a different data structure to store route handlers. EDIT: HN commenters disagree.

* node.js CPU Flame Graphs (http://www.brendangregg.com/blog/2014-09-17/node-flame-graph...) are awesome!


It's not just the extra lookups -- static in express is deceptively dog-slow. For every request it processes, it stats every filename that might satisfy the URL. This results in an enormous amount of useless syscall/IO overhead. This bit me pretty hard on a high-throughput webservice endpoint with an unnoticed extra static middleware. I wound up catching it with the excellent NodeTime service.

Now that I look at it, there's a TOCTOU bug on the fstat/open callback, too: https://github.com/tj/send/blob/master/index.js#L570-L605

This should be doing open-then-fstat, not stat-then-open.


I am upset that the title has been changed from "Node.js in Flames". Which is not only the real title of the article, but also a reasonable description of what they've been facing with Node.

#moderationfail


I can see why you would feel that way, but the title, while clever and (I'll take your word for it) fitting, was arguably misleading and unarguably baity. The HN guidelines call for changing such titles, so the moderators were just doing their job. There likely would have been more complaints about the title if we hadn't changed it.


Understood. And thank you for following up, I really do appreciate it.


This is the first I've heard of restify, but it seems like a useful framework for the main focus of most Node developers I know, which is to replace an API rather than a web application.


You're going to love restify. We use it at TrackIf and can't imagine our API running on anything else. Couple it with Swagger and you won't be looking back :-)


> This turned out be caused by a periodic (10/hour) function in our code. The main purpose of this was to refresh our route handlers from an external source. This was implemented by deleting old handlers and adding new ones to the array. Unfortunately, it was also inadvertently adding a static route handler with the same path each time it ran.

I don't understand the need of refreshing route handlers. Could someone explain they needed to do this, and also why from an external source?


We refresh periodically as we dynamically deploy new UI code, which can be accessed at new routes. (/home and /homeV2 for example) This allows us to not have to restart our servers or push out new server code just to serve a new UI at a different (or the same) route.


It sounds to me like some kind of configuration/IT-infrastructure-control issue. I wouldn't be surprised if it relates to some sort of "we want a way to toggle this live without an actual code-push" narrative.


I assume this is some kind of continuous delivery-style thing, where they can push up and amend endpoints without bringing down the entire system.


The express router array is pretty easy to abuse, it's true. For example, as something you probably shouldn't ever do:

https://www.exratione.com/2013/03/nodejs-abusing-express-3-t...

I guess the Netflix situation is one of those that doesn't occur in most common usage; certainly dynamically updating the routes in live processes versus just redeploying the process containers hadn't occurred to me as a way to go.
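A safer shape for live route updates is to build the new routing table off to the side and swap it in with a single assignment, so duplicates can never accumulate in the live structure (a hypothetical sketch, not Express's API):

```javascript
// Refresh pattern: build a fresh route table from the external source,
// then replace the old one in one assignment. The live structure is
// never mutated, so duplicates can't pile up across refreshes.
let activeRoutes = new Map();

function refreshRoutes(externalRouteList) {
  const next = new Map();                   // start from scratch every time
  for (const { path, handler } of externalRouteList) {
    next.set(path, handler);                // Map keys dedupe repeated paths
  }
  activeRoutes = next;                      // atomic swap; old map gets GC'd
}

function dispatch(path) {
  return activeRoutes.get(path) || '404';
}
```

In-flight requests keep using whichever table they started with, and a buggy feed that repeats a path just overwrites one entry instead of growing an array forever.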


Responses are already firing in: https://hackertimes.com/item?id=8632220


I love these kinds of investigations into problems in production. I mean, you really have to admire their determination in getting to the root of the problem.

In some ways, these engineers are not that different from academic researchers, in that they are devising experiments, verifying techniques, all in the pursuit of the question: why?


I would have written my APIs in golang and not nodejs. Go is way faster in my experience, and it feels leaner to create something because writing a web service can be done productively out of the box. Node apps tend to depend on thousands of 3rd party dependencies, which makes the whole thing feel fragile to me.


Would someone explain what I'm missing about the flame graphs? Why are they indispensable here? In a normal profiler, you'd just expand the hot path and see what had the most samples. Apart from making recursion very explicit, what special aspect do flame graphs expose?


Why are they loading in routes from an external source? Is that normal, I have never seen that before.


We like the option of dynamically loading new routes, that point to new endpoints. We also have the ability to release new versions of our UI without redeploying (or restarting) our servers.


Could you elaborate in a bit on this? This is the first time I have heard of this, and I find it very interesting.


Ok you add a new route but how do you reference what code should be executed when that route is hit?


We have something that loads up, via requires, the action (or route) that should be run when a URL is encountered.


Seconding dmak, would love to understand more. I can follow dynamically loading routes but can't follow how that would be implemented end to end. Some things I would be interested in:

- Where do they keep the code that gets executed for new routes? Is that deployed dynamically as well?

- If you are changing routes dynamically, how do you test in non-prod? Are you constantly syncing non-prod with prod?

- How do you control what you deploy dynamically vs what you migrate through the environments?


Interesting article. I have a lot of experience dealing with ETLs in WPA on the Windows side - it's an awesome tool that gives you similar insights. I haven't used it for looking at javascript stacks before though, so I don't know if it'll do that.


> We also saw that the process’s heap size stayed fairly constant at around 1.2 Gb.

> Something was adding the same Express.js provided static route handler 10 times an hour.

Why didn't it increase the heap size? Maybe it was too small to be noticeable?


Second, given a performance problem, observability is of the utmost importance

I couldn't agree with this more. Understanding where time is being spent and where pools etc. are being consumed is critical in these sorts of exercises.


So the lesson is to actually know the code you deploy to prod? Is that not obvious?


Your webserver, your dns resolver, your database, your operating system, the compilers they used, how about the bios, or the northbridge?


Not everyone digs into the code behind their dependencies.


Everyone should read the terms & agreements too.


The Express project already has a similar issue open about recursive route matching.

https://github.com/strongloop/express/issues/2412


Doesn't this seem like a bug in the express router? All of the additional routes in the array are dead (can't be routed to).


No, because the route list is also used for middleware. The recursive search is because middleware routes have a third next parameter that allows the search to run asynchronously.
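The dispatch the parent describes can be sketched in a few lines (a simplified model of Express-style middleware, not its actual source): each handler gets a `next` callback, and because `next` may be invoked asynchronously, traversal has to recurse through callbacks instead of being a plain synchronous scan.

```javascript
// Express-style middleware dispatch: each handler receives `next`;
// calling it advances to the following entry in the list, possibly
// on a later tick -- which is why the search is recursive.
function dispatch(stack, req, done) {
  function step(i) {
    if (i >= stack.length) return done(null);
    stack[i](req, () => step(i + 1));       // handler decides when to continue
  }
  step(0);
}

const trace = [];
const stack = [
  (req, next) => { trace.push('logger'); next(); },
  (req, next) => { trace.push('auth'); setImmediate(next); },  // async next()
  (req, next) => { trace.push('handler'); /* responds; never calls next() */ },
];
dispatch(stack, { url: '/x' }, () => trace.push('done'));
```

Note the final `done` callback only fires if the whole list is exhausted without anyone responding, which is exactly how a duplicated fall-through handler keeps the chain walking.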


I wonder how Netflix would perform with using Dart with the DartVM. I reckon it would be faster than Node based on benchmarks I've seen. Chrome DartVM support is right around the corner ;)


Crazy talk. In 1ms, I can perspective transform a moderately big image. NodeJS can't iterate through a list.

We really need a 60 fps equivalent for web stuff. You have 16ms, that's it.


FWIW, a lot of the stuff the Chrome+Polymer team is working on explicitly has a 60fps goal and 16ms frame budgets. I remember in my last project at Google, I spent a lot of time working with the Chrome team to get various parts of websearch rendering under the 16ms budget. (I couldn't manage it given the amount of legacy code we had to work with, but various other people have continued the work since I left, so hopefully they've had more luck.)


I must admit I could enjoy just doing this type of analysis all day long. Yet I hate non computing puzzles.


wow. I love that Netflix is using Node, and I'm even more curious that they would use express.


this is why you stick to tried and true methods, folks. this is such a typical node.js fanboy mentality. "reinventing the wheel is justified because asynchronous". or "i want this trendy way to do things just because everyone else is jumping on the bandwagon".

Give me flask + uwsgi + nginx any day.


At the end of the day, Node.js is just a Reactor pattern implementation in the form of a bunch of scripts.


an unfortunate title. Ha ha "flames" ha ha "Node.js" but the article is really about express. Not so "ha ha"


They renamed it "in flame graphs."

For the audience, this was originally titled: "Node.JS in Flames"


A very good reason to go with express is TJ. He was the initial author of express and he is quite brilliant when it comes to code quality. Of course, TJ is no longer part of the community, but his legacy lives on :-)


OFFTOPIC: "Today, I want to share some recent learnings from performance tuning this new application stack."

The word you want is "lessons".


"learnings" is a perfectly cromulent word: https://books.google.com/ngrams/graph?content=learnings&year...


If you are Sacha Baron Cohen, yes.


Use of this word predates Sacha Baron Cohen's entire existence.


Nope. You see an option for a plural here? http://www.merriam-webster.com/dictionary/learning


Dictionaries are descriptive, not prescriptive. They can only tell you if something is a word, not whether something isn't a word.


Well certainly there is no law against using made-up words. If that's what you want to do, have at it.

At least it will help people get their buzzword bingo cards filled up sooner.



Dictionaries ain't arbiters of the English language. And quite fortunately so, I should note.


OFFTOPIC: "Today, I want to share some recent learnings from performance tuning this new application stack."

The word you want is "lessons".

Jeez, not only is your comment offtopic and pedantic, it's also wrong.


"it's also wrong."

I believe the word you want is "incorrect."

/sarcasm


I flagged it as Offtopic. But it is correct. "Learnings" is at best a pointless affectation, like saying "utilised" instead of "used". At worst, according to the OED, it's not even a word.

http://english.stackexchange.com/questions/19227/plural-of-l...

Also: http://www.merriam-webster.com/dictionary/learning No plural option.


> "Learnings" is at best a pointless affectation

No, at best it's a deliberately wonky usage for comic effect [1], like "Internets". It's quite possible the author is being gamesome. [2]

[1] https://simple.wikipedia.org/wiki/Borat:_Cultural_Learnings_...

[2] http://youtu.be/Wc1pxO_-5S8?t=6m43s


Continuing the OT and heedless of the downvotes: I appreciate your efforts, but they can't help themselves -- so many people love using recently invented (and redundant) words because they think it makes them sound smart and hip. Sorta like people who think JS is an amazing language and love to sell that point of view.


I think there must be a voting ring of marketers that are upset their latest buzzword-bingo entry has been called out :-)


I play tournament Scrabble and learnings is a valid word in it.


I didn't like it when people started using "learnings", but I'm beginning to appreciate some subtle shadings it seems to have over lessons. It seems more discovery oriented, and less sure it's correct, or at least, less widely correct. Sort of like "hacky, working hypotheses, that we think are worth sharing."



