Hacker News | grammr's comments

How aggregations are performed is determined entirely by your own continuous view definitions [0]. In this case I'm guessing you'd want to include a time-based column in the aggregation GROUP BY clause.

And since PipelineDB is a PostgreSQL extension, you can use the timestamptz type (which includes timezone support), and in general you could pretty easily simply normalize your event timezones in your continuous view definitions. When you're reading aggregate data back out, you could cast the time-based column using whatever timezone the client prefers.
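As a rough sketch of what this could look like (the stream and column names here are hypothetical, and the exact DDL is approximated from the linked docs):

```sql
-- Hypothetical stream of events carrying a timestamptz column;
-- timestamptz values are normalized to UTC on input, so events
-- arriving with mixed timezones are handled uniformly
CREATE STREAM page_views (viewed_at timestamptz, url text);

-- Continuous view that buckets events by minute via the GROUP BY
CREATE CONTINUOUS VIEW views_per_minute AS
  SELECT date_trunc('minute', viewed_at) AS minute,
         count(*) AS views
  FROM page_views
  GROUP BY minute;

-- On read, cast the time column to whatever timezone the client prefers
SELECT minute AT TIME ZONE 'America/New_York' AS local_minute, views
FROM views_per_minute;
```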

Thanks for the question--I hope that was helpful!

[0] http://docs.pipelinedb.com/continuous-views.html


Thank you, yes, it was helpful for me to understand the possibilities. I'll dig more into that.


PipelineDB co-founder here--I think this is a pretty fair take! I would also like to point out that the aggregate data stored in PipelineDB can still be further aggregated, processed, JOINed on etc. on demand as well.

Since a continuous view's output is simply stored as a regular table, you are free to run arbitrary SELECT queries on it to further distill and filter your results. PipelineDB's special combine [0] aggregate allows you to combine aggregate values with no loss of information for this very purpose.

The most common pattern among our user base is to aggregate time-series data into continuous views at some base level of granularity (e.g. by minute) and then aggregate over that for final results (e.g. aggregate down to hour-level rows for the date range my frontend has selected).
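A minimal sketch of that pattern, using avg since it makes the need for weighted merging obvious (names and date range are made up for illustration):

```sql
-- Hypothetical stream of request latencies
CREATE STREAM requests (ts timestamptz, latency_ms int);

-- Base continuous view at minute granularity
CREATE CONTINUOUS VIEW latency_minute AS
  SELECT date_trunc('minute', ts) AS minute,
         avg(latency_ms) AS avg_latency
  FROM requests
  GROUP BY minute;

-- On demand, roll the minute-level rows up to hours for the selected
-- range; combine() merges the stored averages with correct weighting,
-- which a naive avg(avg_latency) would not
SELECT date_trunc('hour', minute) AS hour,
       combine(avg_latency) AS avg_latency
FROM latency_minute
WHERE minute >= '2018-10-01' AND minute < '2018-10-02'
GROUP BY hour;
```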

[0] http://docs.pipelinedb.com/aggregates.html#combine


I'm Derek, one of the co-founders--excellent question!

The former. PipelineDB performs aggregations in memory on microbatches of events, and only merges the aggregate output of each microbatch with what's on disk. This is really the core idea behind why PipelineDB is so performant for continuous time-series aggregation. Microbatch size is configurable: http://docs.pipelinedb.com/conf.html.


Can you say a bit more about "performant" or point me to some information? I haven't found any yet. I'm processing millions of protobufs per second and would love to get away from batch jobs to do some incredibly basic counting -- this seems like a fit conceptually... If it's a fit, any recommendations on the best way to get those protobufs off a Kafka stream and into PipelineDB would be great, too!


Performance depends heavily on the complexity of your continuous queries, which is why we don't really publish benchmarks. PipelineDB is different from more traditional systems in that not all writes are created equal, given that continuous queries are applied to them as they're received. This makes generic benchmarking less useful, so we always encourage users to roughly benchmark their own workloads to really understand performance.

That being said, millions of events per second should absolutely be doable, especially if your continuous queries are relatively straightforward as you've suggested. If the output of your continuous queries fits in memory, then it's extremely likely you'd be able to achieve the throughput you need relatively easily.

Many of our users use our Kafka connector [0] to consume messages into PipelineDB, although given that you're using protobufs I'm guessing your messages require a bit more processing/unpacking to get them into a format that can be written to PipelineDB (basically something you can INSERT or COPY into a stream). In that case what most users do is write a consumer that simply transforms messages into INSERT or COPY statements. These writes can be parallelized heavily and are primarily limited by CPU capacity.
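To sketch what the consumer's output side might look like (stream and column names are hypothetical; the consumer itself would live outside the database):

```sql
-- Hypothetical stream the consumer writes unpacked protobuf fields into
CREATE STREAM proto_events (
    device_id bigint,
    metric    text,
    value     double precision
);

-- Each consumed protobuf batch gets transformed into plain INSERTs...
INSERT INTO proto_events (device_id, metric, value)
VALUES (42, 'temp_c', 21.5),
       (43, 'temp_c', 19.0);

-- ...or, for maximum throughput, batched via COPY
-- (the consumer streams tab-separated rows on stdin)
COPY proto_events (device_id, metric, value) FROM STDIN;
```

Since streams have no ordering or locking constraints, these writes can be fanned out across many parallel consumer processes.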

Please feel free to reach out to me (I'm Derek) if you'd like to discuss your workload and use case further, or set up a proof-of-concept--we're always happy to help!

[0] https://github.com/pipelinedb/pipeline_kafka


That's awesome! If you don't mind - one more q.. I see that stream-stream joins are not yet supported (http://docs.pipelinedb.com/joins.html#stream-stream-joins). Can you comment on when you think this feature could land, or is it still a ways off?


Sure! So stream-stream JOINs actually haven't been requested by users as much as you'd think. Users have generally been able to get what they need by using topologies of transforms [0], output streams, and stream-table JOINs. Continuous queries can be chained together into arbitrary DAGs of computation, which turns out to be a very powerful concept when mapping out a path from raw input events to the desired output for your use case.
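To illustrate the chaining idea, here's a rough sketch of a two-stage topology (syntax approximated from the transforms and output-streams docs; all names are hypothetical, and the details may differ from the current release):

```sql
-- Raw input stream of JSON payloads
CREATE STREAM raw_events (payload json);

-- Continuous transform that filters and reshapes events as they arrive
CREATE CONTINUOUS TRANSFORM clean_events AS
  SELECT (payload->>'user_id')::bigint AS user_id
  FROM raw_events
  WHERE payload ? 'user_id';

-- Downstream continuous view consuming the transform's output stream,
-- forming a small DAG: raw_events -> clean_events -> user_counts
CREATE CONTINUOUS VIEW user_counts AS
  SELECT user_id, count(*) AS events
  FROM output_of('clean_events')
  GROUP BY user_id;
```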

The primary issue in implementing stream-stream JOINs is that we'd essentially need to preemptively store every single raw event that could be matched on at some point in the future. Conceptually this is straightforward, but on a technical level we just haven't seen the demand to optimize for it.

That being said, you could just use a regular table as one of the "streams" you wanted to JOIN on and then use a stream-table JOIN. As long as the table side of the JOIN is indexed on the JOIN condition, an STJ would probably be performant enough for a lot of use cases. With PostgreSQL's increasingly excellent partitioning support this is becoming especially practical.
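A minimal sketch of that workaround (hypothetical names; the "slower" side of the would-be stream-stream JOIN is materialized as a table):

```sql
-- The slower-moving side is kept as a regular table; the primary key
-- index covers the JOIN condition
CREATE TABLE sessions (
    session_id bigint PRIMARY KEY,
    user_id    bigint
);

-- The fast-moving side remains a stream
CREATE STREAM clicks (session_id bigint, url text);

-- Stream-table JOIN inside a continuous view
CREATE CONTINUOUS VIEW clicks_by_user AS
  SELECT s.user_id, count(*) AS clicks
  FROM clicks c
  JOIN sessions s ON c.session_id = s.session_id
  GROUP BY s.user_id;
```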

I also suspect that this is an area where integration with TimescaleDB could be really interesting!

[0] http://docs.pipelinedb.com/continuous-transforms.html


Just out of curiosity, do you have a specific use case that necessitates stream-stream JOINs, or were you just exploring the docs and wondering about this?


My use case is pretty much parallel time series alignment with several layers of aggregation. I guess I perceive stream-stream joins as an easy way for me to wrap my head around how to structure my compute graph, but it seems doable with the method mentioned by @grammr. I'd hope for an interface roughly like "CREATE join_stream from (SELECT slow_str.key AS key, sum(slow_str.val, fast_str.val) AS val FROM slow_str, fast_str INNER JOIN ON slow_str.key = fast_str.key)". I do realize there are some tough design decisions for a system like this, but I'd also like to drop my wacky zmq infrastructure ;)


This is precisely why PipelineDB has rich support for data structures such as HyperLogLog [0]. HLLs allow you to track distinct counts using fixed-size structures that only grow to about 14KB while encoding unique counts for billions of distinct values. The tradeoff is a ~0.8% margin of error, which users generally find acceptable.

Furthermore, PipelineDB has a special combine [1] aggregate that allows you to combine data structures such as HLL across multiple rows with no loss of information. A simpler example would be average: to get the actual average of multiple averages you obviously can't simply take the average of all the averages. Their weights must be taken into account, and combine handles that.
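Sketching both points together (hypothetical names; per the linked docs, count(DISTINCT ...) inside a continuous view is backed by an HLL rather than by storing raw values):

```sql
CREATE STREAM visits (ts timestamptz, user_id bigint);

-- One fixed-size HLL per minute tracking distinct users
CREATE CONTINUOUS VIEW uniques_minute AS
  SELECT date_trunc('minute', ts) AS minute,
         count(DISTINCT user_id) AS uniques
  FROM visits
  GROUP BY minute;

-- combine() merges the per-minute HLLs into a daily distinct count
-- with no loss of information beyond HLL's inherent ~0.8% error;
-- note that sum(uniques) would be wrong, since users appearing in
-- multiple minutes would be double-counted
SELECT combine(uniques) AS daily_uniques
FROM uniques_minute
WHERE minute >= now() - interval '1 day';
```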

The capability to combine aggregate values in this way generalizes to all aggregates in PipelineDB.

[0] http://docs.pipelinedb.com/aggregates.html#hyperloglog-aggre...

[1] http://docs.pipelinedb.com/aggregates.html#combine


I'm Derek, one of the co-founders--great questions!

> So you create a table, insert into it, and it's always empty. Is that right?

That is correct. Streams can only be read by continuous queries (e.g. you can't even run a SELECT on them).

> Does this work for any table in pg? How does pg know that the insert should NOT actually insert a row?

PipelineDB streams are represented as a specific kind of PostgreSQL foreign table [0], so only foreign tables created in a specific way will be considered streams. You can use triggers to write table rows and updates out to streams if you want to though.
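For example, forwarding table writes to a stream could look roughly like this (a sketch with hypothetical table and stream names):

```sql
-- A regular PostgreSQL table and a stream to mirror its inserts into
CREATE TABLE orders (order_id bigint, amount numeric);
CREATE STREAM orders_stream (order_id bigint, amount numeric);

-- Trigger function that writes each new row out to the stream
CREATE OR REPLACE FUNCTION forward_order() RETURNS trigger AS $$
BEGIN
  INSERT INTO orders_stream (order_id, amount)
    VALUES (NEW.order_id, NEW.amount);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_to_stream
  AFTER INSERT ON orders
  FOR EACH ROW EXECUTE PROCEDURE forward_order();
```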

[0] https://www.postgresql.org/docs/current/static/sql-createfor...


I'm Derek, one of the co-founders--that's an interesting way to frame it, I think that makes a lot of sense at a high level.

We're in contact with the TSDB founders (awesome and super smart guys!) and are in the early stages of figuring out an integration that makes sense. That's most likely going to happen.

To anyone interested: we'd love to hear and consider your ideas re: TSDB integration. Feel free to open an issue in either repo (or add to an existing one) and tell us more!


Can you guys join forces and convince AWS to make both of those products available on RDS? :)


So basically AWS will monetize something they have spent 0 resources building and will likely cannibalise the only viable monetization option?


Surely not, AWS has never done anything like that!


The most impactful thing you can do here is ask the RDS team for this. If enough users ask them for it they'll eventually begin seriously considering it :)


Already did!


RE integration: A docker image with both TSDB and PipelineDB extensions and PostGIS, supporting PG11 ;) which is something I will look into doing myself, but lack the time to do so..

The time-series database of a project I'm on uses timescale and it's been great for the quick inserts and the `time_bucket` function has been very useful for aggregate queries.. But moving from aggregations generated on-the-fly to ones updated continuously on data change sounds like it could be awesome for us, so I am v happy to see this article today :-)


That's great to hear, I'll be looking forward to seeing where those talks and collaborations go.


It seems like if you combine PipelineDB with TimescaleDB, you get the continuous query capability of Influx?


I'm Derek, one of the co-founders--thank you!

We're super happy with where Stride is at! We've continued to onboard customers in a few AWS regions and the infrastructure is rock solid at this point. Most users are ingesting 10k+ events/s and their analytics frontends are retrieving results in well under 100ms. We've gotten it to the point where it "just works" which has made Stride users' lives a lot easier at that scale.

And since the hard parts of Stride are powered by PipelineDB, an added benefit for us is that we now get a ton of super detailed instrumentation data about PipelineDB performance and behavior, which has helped make the open-source product quite a bit better.

We'll be moving Stride into self-service/GA next year--stay tuned!


Hi there! I'm one of the PipelineDB founders. This description is correct. The unique thing about PipelineDB is that it doesn't store granular data. Once all aggregates are incrementally updated, the raw input rows are discarded and only aggregate output is stored.

This approach dramatically limits disk IO and long-term storage requirements, and enables super high performance in most cases on modest hardware.

PipelineDB has been used in production for nearly four years now and is used by Fortune 100 companies.


So once you make it as an extension, any chance to mix PipelineDB with Citus in one cluster?

My hunch says that it's possible as far as there is some additional computation done with the future aggregate query on the coordinator in Citus.

PPDB looks interesting, but we also need to keep the underlying raw data and multiple clusters require more complex pipeline.


We haven't looked too far into integrations with any existing systems at this point, but if there was significant user demand for it on both ends we'd definitely be open to it.

One thing I will mention here is that we do have plans to add support for persistent streams [0] after version 1.0.0 is released. We've learned a lot over the years about how our users/customers interact with streams in production and persistent streams will be built atop that foundation of understanding.

Please feel free to comment on that issue with your use case, requirements, etc. and we'll see what we can do!

[0] https://github.com/pipelinedb/pipelinedb/issues/1463


Persistent streams are interesting, but we spent years refining our ETL and building it around Citus, that it would be very complicated to separate those two. I will wait for the extension and do some testing.


Hello! I'm Derek, one of the PipelineDB co-founders. The way using PipelineDB feels to users has always been a principal consideration in how we make design decisions, so a psychological benefit isn't a second-class citizen in our minds. With so many different tools to choose from nowadays, any friction at all (technical or non-technical) can be a showstopper. We've always strived to make PipelineDB as easy to use as possible, and the extension refactor is the grand finale of that continual effort as we approach 1.0.

Thank you for your input, and we hope you'll find great success with PipelineDB in the future!


When this ships as a PostgreSQL extension, it looks like it could very handily solve a problem for which we're currently pilot-testing logical decoding.

I look forward to trying it, and would happily explore whether beta testing in a large scale environment (half a million concurrent users) is something my management chain and internal customers would be open to.

Thank you for your work!


If you're open to sharing, I'd love to hear more about your potential use case. Please email me!


Your email isn't listed on your profile. Mine is, though. Please do get in touch.


His GitHub profile has it :)

  https://github.com/derekjn


> Pipelinedb is annoying though in that it is a fork and not an extension.

Hi, I'm one of the PipelineDB co-founders--thanks for using our product! Making PipelineDB an extension is the most consistent piece of feedback we've received from our users, and I promise we're listening: PipelineDB 1.0 will be a standard PostgreSQL extension, incrementally rolled out via versions 0.9.8, 0.9.9, and 1.0.0.

