
According to the article, Yelp had 7 different data sources and a similar number of targets.

If they wrote a loader for each source-target pair, they'd end up with 49 loaders. Not to mention 7 more to write every time they add an app.

With Kafka, they just need to connect each thing to Kafka: 14 connectors instead of 49.

This is pretty much the scenario Kafka was invented for, and you get stream processing for free: https://engineering.linkedin.com/distributed-systems/log-wha...
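The arithmetic behind that reduction can be sketched like so (the endpoint counts come from the article; the names are purely illustrative):

```python
from itertools import product

sources = [f"source_{i}" for i in range(7)]  # e.g. MySQL tables, log streams
targets = [f"target_{i}" for i in range(7)]  # e.g. Redshift, search indexes

# Point-to-point: one loader per (source, target) pair -> N*M.
point_to_point = len(list(product(sources, targets)))

# Hub-and-spoke through a central log: one connector per endpoint -> N+M.
via_kafka = len(sources) + len(targets)

print(point_to_point, via_kafka)  # 49 14
```

Adding an eighth source then costs one connector instead of seven new loaders.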


You only need one loader for directory-loading JSON. Yelp already has an ETL (and more transforms) for combinatorial normalization of format (from log files to events pretty much covers the spectrum).

Redshift will create columns from JSON, however you get the data into Redshift (within some restrictions around nested arrays, which generally have to be avoided). Kafka is a process- and time-wasteful step in almost every Redshift loading scenario, given the current state of AWS services. Test it for yourself over a few billion messages at message sizes from 1k to 1M, if you get the chance.

Kafka is great as a message queue for high-frequency throughput to Redshift if you can't write to S3 directly, or as a buffer to deal gracefully with S3 hiccups.
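A direct S3-to-Redshift JSON load is essentially one COPY statement; here is a minimal sketch (the table name, bucket, IAM role, and the `redshift_json_copy` helper are all placeholders invented for illustration):

```python
def redshift_json_copy(table, s3_path, iam_role):
    """Build a Redshift COPY statement that loads newline-delimited JSON
    from an S3 prefix, letting 'auto' map top-level keys to columns
    (nested arrays are where this mapping breaks down)."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto';"
    )

print(redshift_json_copy(
    "events",
    "s3://my-bucket/events/",
    "arn:aws:iam::123456789012:role/loader",
))
```

You'd hand the resulting string to any Postgres-protocol client pointed at the cluster.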


(disclaimer: I work on the Data Pipeline project @ Yelp - but these opinions are my own and not necessarily representative of Yelp)

@jack9 - If I am understanding your point correctly, it is that you could just use S3 directly as the 'unified buffer' and the Kafka part is unnecessary. I'll try to shed some light on why we made Kafka part of this infrastructure.

- Regarding "near real-time": this is our cheeky way of saying we haven't internally set an SLA or a formal definition of the maximum latency at which we'd declare the system 'real time'. In hard numbers, from the time a MySQL[0] event happens to a transformed version of it landing in Redshift is roughly 10-30 seconds (unofficial and not a guarantee, of course; just what I'm anecdotally seeing at this point, and we're certain we can bring that down).

- As ora600 points out, we not only have multiple data sources, but also multiple data targets. So it's important to us to reduce the IN*OUT down to IN+OUT. It's important to note that nearly all of these connection points more naturally 'speak' in streaming than batch processing.

- Kafka has been fantastic tech for this use case for us: aside from this S3/Redshift OUT connector, we generally deal with connectors that want to be streaming. In fact, if it were performant to do so, we would love to stream message-by-message into Redshift as well (it's not).

- We already have a lot of infrastructure for batch processing - specifically, a service for bulk loading from S3 to Redshift, Mycroft[1] (it's open source[2]), and many systems (especially older ones which wrote to S3 in order to do Map-Reduce processing with MRJob[3], also open source[4]) have been able to use it for this purpose. In many ways we previously used S3 as our 'unified buffer', but it's not a great solution for stream processing.

- Finally regarding the ETLs:

- - Yep, we have them, and we hate them. The Data Pipeline project, in a sense, almost originated as a way for us to get rid of them. That's certainly why my team has been building the Data Pipeline components (we are 'the data warehouse team' at Yelp). They are costly: each ETL is 'handwritten' (adding a new table means writing a new ETL; granted, these are very small classes since we have a decent framework, but it still takes manual effort), each requires a 'Yelp-Main' push (Yelp has talked openly about our efforts to break apart our monolith[5], but it's still a thing, and the push process isn't as painless as with our microservices), and inputs from other data sources were challenging, as our ETL framework was designed around the 'Yelp-Main' codebase (it's old, cut us slack ;)).

- - A component of the Data Pipeline is what we call the 'Application Specific Transformers'[6] (we call them ASTs for short, much to the confusion of any compiler writers out there), which allow us to apply the types of transformations we typically did in ETLs (outputting both a 'raw id' and the 'encoded id' which is served to the front end, splitting bit-flag ints into separate booleans, and stringifying int enumerations), as well as extra stuff ETLs didn't do, like extracting documentation for Watson[6].
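As a rough sketch of the kinds of row-level transforms described (the flag names, the enum, and the `transform` helper are invented for illustration, not Yelp's actual schema):

```python
# Hypothetical bit-flag layout and int enum for a row.
FLAGS = {"is_active": 0x1, "is_verified": 0x2, "is_elite": 0x4}
STATUS = {0: "pending", 1: "open", 2: "closed"}

def transform(row):
    """The kind of transform an ETL (or an AST) might apply:
    split a bit-flag int into booleans and stringify an int enum."""
    out = dict(row)
    flags = out.pop("flags")
    for name, mask in FLAGS.items():
        out[name] = bool(flags & mask)  # one boolean column per flag bit
    out["status"] = STATUS[out["status"]]  # int enum -> readable string
    return out

print(transform({"id": 7, "flags": 0b101, "status": 1}))
# {'id': 7, 'status': 'open', 'is_active': True, 'is_verified': False, 'is_elite': True}
```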

It's worth noting as well that the Data Pipeline lives in PaaSTA[7], which is a big improvement over how the legacy components it is replacing were deployed.

All this being said: we thought carefully about meeting all the needs of our systems, but our needs are of course our own. This certainly would be the wrong tool for many other use-cases.

Sorry for slow response - I've been out on PTO and just got back to work today. :)

[0] - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tab...

[1] - https://engineeringblog.yelp.com/2015/04/mycroft-load-data-i...

[2] - https://github.com/Yelp/mycroft

[3] - https://engineeringblog.yelp.com/2010/10/mrjob-distributed-c...

[4] - https://github.com/Yelp/mrjob

[5] - https://engineeringblog.yelp.com/2015/03/using-services-to-b...

[6] - https://engineeringblog.yelp.com/2016/08/more-than-just-a-sc... (Includes some info about the ASTs and Watson)

[7] - https://engineeringblog.yelp.com/2015/11/introducing-paasta-...


required.acks works together with min.insync.replicas.

Basically, "required.acks" lets you choose between "no acks", "leader only", and "all in-sync replicas", while min.insync.replicas lets you control what "all in-sync replicas" actually means.
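A toy model of that interplay, just to make the semantics concrete (this is an illustration, not Kafka's actual broker code; `write_succeeds` is an invented helper):

```python
def write_succeeds(acks, isr_count, min_insync_replicas=1):
    """Does a produce request succeed, given the producer's acks setting
    and the current size of the in-sync replica set (ISR)?"""
    if acks == 0:  # fire-and-forget: no acknowledgement requested at all
        return True
    if acks == 1:  # leader-only ack: just need a live leader
        return isr_count >= 1
    # acks="all": the broker rejects the write (NotEnoughReplicas in real
    # Kafka) if the ISR has shrunk below min.insync.replicas.
    return isr_count >= min_insync_replicas

print(write_succeeds("all", isr_count=3, min_insync_replicas=2))  # True
print(write_succeeds("all", isr_count=1, min_insync_replicas=2))  # False
print(write_succeeds(1, isr_count=1, min_insync_replicas=2))      # True
```

So "all" only buys you durability against N-1 failures when min.insync.replicas is actually set to N.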


In this case, you will enjoy the new consumer in 0.9 a lot!


I already use Houzz for something similar. Houzz lets me browse products by similar categories, but also lets me see how the product looks in actual rooms and has lots of social features.


Scala: Java has tons of boiler-plate code and is not "functional" enough.


Or, just take the Ada description from the original article -- it applies just as well to Scala as it ever did to Ada.

(N.B., I think Scala is a great language. But it is certainly one that has never seen a feature that it didn't like...)


Obviously the author doesn't have kids. I don't think anything can prevent them from breaking and changing things. Maybe they won't do it with an iPad (though I doubt it), but they'll still do it. Part of being a human child and all.


If you are currently living and creating wealth in Somalia, then I agree - society doesn't help you much and if you pay any taxes, you can stop.

If you live in a first world country, you should realize that the difference between your country and Somalia is a result of government work and is funded by taxes.

Secure borders, police guaranteeing your property rights, courts, lack of bribes and corruption (relatively!), good roads, good lines of communication, educated people to hire. They don't just show up randomly in places.


All right, but apart from the sanitation, medicine, education, wine, public order, irrigation, roads, the fresh water system and public health, what have the Romans ever done for us?

That said, this article is basically politics, and should be flagged as such.


Years ago I read an article that recommended doing this manually as a way to assist the flow of traffic.

I thought it was a nice idea and tried it. Unfortunately, I tried it in Tel Aviv, where drivers from other lanes immediately moved into the space that opened between my car and the one ahead of me. It didn't take long to figure out that I was getting nowhere.

Two years later, when I moved to the Bay Area, I tried it on the 101 during rush-hour traffic. To my shock, it worked. It was exceedingly rare for any car to move into that space, or to change lanes in general. For some reason (laziness? safety?), California drivers don't switch lanes as much as Israeli drivers do.

Moral: there's a time and place for every algorithm.


Bay Area and LA drivers are very different. LA driving is more similar to Tel Aviv driving, if you want a taste of home :)

Bay Area drivers have this weird sense of "lane ownership". They don't change lanes... and they don't let anyone in! It's not the sort of "Won't let you cut in front of me!" that you sometimes see in Israel, but a general thing. Even when I've needed to change lanes to exit, and it was clear I was going to exit right away, I've had countless drivers speed up and block me from merging to the right.

This weird quirk does make it useful for applying traffic wave theory, so at least there's one good outcome.


I've used a variant of this successfully when there's a merge coming up. Basically, if I'm in the lane that's disappearing, I slow down as if to merge, matching speed with the car next to me, but then don't merge until the last moment. When I do this I invariably get some hotshot honking behind me who wants to zip up to the merge point and slow down traffic there, but such people are essentially the cause of the problem, so I don't worry too much.

What generally happens is that from the moment I start doing this, overall traffic speeds up just a bit and the lane ahead of me clears fairly quickly. It's win-win for everyone but the hotshot behind me.


Actually, this is suboptimal.

The optimal flow is for everyone to use both lanes up until the last moment, and then to merge by alternation (one car from one lane goes, then one car from the other lane goes).

Unfortunately, this only works when a substantial majority of the drivers understand that this is how it should work. If you don't hit that threshold, then the people who use the disappearing lane are going to get glares.

Here's a link that describes the theory behind optimal merge patterns:

http://jksqr.blogspot.com/2008/09/optimal-lane-merging-part-...
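The alternating merge described above can be sketched as an interleave of the two lanes (the lane contents here are made up):

```python
from itertools import chain, zip_longest

lane_a = ["a1", "a2", "a3"]
lane_b = ["b1", "b2"]

# Zipper merge: alternate one car from each lane at the merge point;
# the longer lane simply keeps feeding once the other empties.
merged = [car for car in chain.from_iterable(zip_longest(lane_a, lane_b))
          if car is not None]
print(merged)  # ['a1', 'b1', 'a2', 'b2', 'a3']
```

Both lanes stay full right up to the merge point, so no road capacity is wasted upstream.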


Actually he may be forcing the lanes-full optimal mode to arise.

If one or a few drivers in the empty lane start pacing the cars in the full lane, cars will build up behind them. When they arrive at the end of their lane, the drivers in the full lane won't be so ready to block merges (since nobody was cheating by racing down to the end). Perhaps this could trigger an outbreak of zipper-merging.


So, what you're saying is: your solution is optimal in theory but suboptimal in practice, and the grandparent's theory is suboptimal in theory but optimal in practice. Right?


http://trafficwaves.org/

He has updated it since '98, and even bought a domain! But yeah, it does work.


I love the explanation of terms. Highly useful.

It also highlights the near-complete lack of correlation between how much a plan costs and how much benefit you get.


... and ZFS copy-on-write has been known to cause severe performance degradation for databases. Not to mention that too many SAs keep the default ZFS block size, which is far too large for OLTP databases.

