> If there’s anything that’s an absolute must in my Observability to-do list, it’s getting org-wide acceptance of OpenTelemetry (OTel) for data-collection.
OpenTelemetry exists to allow CTOs to tick boxes on requirements checklists, and help to get the CNCF technology stack into the F500 space. The project is textbook design by committee, with seemingly infinite scope but very little acknowledgement of the actual technical constraints of the problem space. It claims it can solve a lot of problems, but I can say from first hand experience that it does not, and cannot, deliver on those claims. It's not good. Nobody with a choice should be using it.
At the moment, the capabilities of technology simply don't permit the level of... generic-ness? that projects like OTel (and customers, really) want. If you try to push everything through a single pipeline, as a single kind of data, you necessarily lock yourself out of most of the value you should be getting. You need a heterogeneous stack, with specific tools for specific purposes.
Opinions vary, but my experience is that Prometheus-style metrics are by far the most important thing to invest in, and deliver the majority of the "observability value" to the broadest set of architectures. Tracing systems like Lightstep can be super useful, too, but to deliver value they need a lot more end-to-end integration effort, and the cost of setting that up can very easily outweigh the benefits it provides.
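For anyone who hasn't seen what "Prometheus-style" means in practice, here's a rough sketch: the application keeps labeled counters in memory and renders them as plain text for a scraper to pull. The `LabeledCounter` class and the metric name are made up for illustration, and label values are not escaped; a real service would more likely use an official client library than hand-roll this.

```python
from collections import Counter


class LabeledCounter:
    """Hypothetical in-process counter, rendered in Prometheus text format."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = Counter()

    def inc(self, **labels):
        # Each distinct label combination is its own time series.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{self.name}{{{label_str}}} {value}")
        return "\n".join(lines) + "\n"


requests = LabeledCounter("http_requests_total", "Total HTTP requests handled.")
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="200")
requests.inc(method="POST", status="503")

# This text is what a scrape endpoint would return.
print(requests.render())
```

The point is that the cost per event is a counter increment; nothing is shipped anywhere until something asks for the current totals.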
I've come to believe that logs are a trap. Everyone understands them intuitively, so they feel comfortable using them for basically everything, without thinking very critically about the ramifications. And even when logs are structured, they have no substantive schema, so there's no backpressure, so to speak, on usage. So the signal/noise ratio almost immediately goes negative. And they just occupy enormous amounts of time and memory to manage, process, analyze, etc. Although I'm sure it's not a universal truth, I've found that everything you might normally think to log is actually much better served by in-process request tracing. In general that means maintaining the log events of the most recent N requests for all of a set of application-defined categories. This is basically a real-time view of a system, with history proportional to, I dunno, rarity? of the request class. You don't ship these anywhere, you ask the applications directly. It seems weird to describe but it's just vastly more efficient to manage, and equally if not more useful for debug and triage.
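Since it does sound weird in prose, here's a minimal sketch of the in-process request tracing idea, assuming a bounded buffer per category is an acceptable reading of "the most recent N requests". `RequestTraceStore`, `log`, and the category names are hypothetical, not from any particular library.

```python
from collections import defaultdict, deque
import time


class RequestTraceStore:
    """Keeps the log events of the most recent N requests per category, in memory."""

    def __init__(self, max_requests_per_category=3):
        self._max = max_requests_per_category
        # One bounded deque per category; the oldest request is evicted automatically.
        self._by_category = defaultdict(lambda: deque(maxlen=self._max))

    def start(self, category, request_id):
        trace = {"id": request_id, "started": time.time(), "events": []}
        self._by_category[category].append(trace)
        return trace

    def recent(self, category):
        # Oldest first; unknown categories yield an empty list.
        return list(self._by_category.get(category, ()))


def log(trace, message, **fields):
    # Events are attached to their request instead of a global log stream.
    trace["events"].append({"t": time.time(), "msg": message, **fields})


store = RequestTraceStore(max_requests_per_category=2)
for i in range(5):
    t = store.start("GET /health", f"req-{i}")
    log(t, "handled", status=200)

err = store.start("POST /orders 5xx", "req-err")
log(err, "db timeout", status=503)

print([t["id"] for t in store.recent("GET /health")])       # ['req-3', 'req-4']
print([t["id"] for t in store.recent("POST /orders 5xx")])  # ['req-err']
```

The frequent category only ever holds its last two requests, while the rare error category keeps its one request until more errors push it out, which is the "history proportional to rarity" effect. In a real service you would expose `recent()` over a debug endpoint and ask the application directly.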
I appreciate the goals and tenets of observability, but I really wish these tools didn't absolutely insist upon being integrated into my software.
Give me a log format and parse my logs and then do what you will with those logs to provide observability. All my logs are already JSON, and they're already relying upon convention for quick and easy parsing elsewhere, so adjusting the format isn't a big lift.
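To make the "convention plus parsing" approach concrete, here's a small sketch, assuming one JSON object per line with a few agreed-upon keys. The field names (`ts`, `level`, `msg`, `status`) and both helper functions are illustrative assumptions, not any tool's required schema.

```python
import io
import json
import time


def log_event(stream, level, msg, **fields):
    # The application only writes a line; it never talks to a collector.
    record = {"ts": time.time(), "level": level, "msg": msg, **fields}
    stream.write(json.dumps(record) + "\n")


def parse_lines(lines):
    # What an out-of-process shipper could do: skip anything that is not
    # a JSON object and pass the rest along.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


buf = io.StringIO()  # stands in for stdout or a log file
log_event(buf, "info", "request handled", route="/orders", status=200, duration_ms=12)
log_event(buf, "error", "request failed", route="/orders", status=503, duration_ms=950)
buf.write("not json\n")

records = list(parse_lines(buf.getvalue().splitlines()))
errors = [r for r in records if r["status"] >= 500]
print(len(records), errors[0]["msg"])
```

Everything observability-related happens on the reading side, so a slow or dead shipper cannot stall the application beyond the cost of a local write.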
The last thing I want in between the innards of my application is communication with some external system (even if it's a sidecar) - especially if there is any sort of networking involved. If that sidecar drops or slows down, so does my application, which is a variable I'd rather not have to consider, and isn't necessarily tracked by the tools themselves.
I've been finding that vector.dev keeps up well. Before I moved to containers, I was using rsyslog, which also kept up incredibly well. I've had both set up to pour directly into Elasticsearch, which, if tuned, can keep up as well.
> As the manager of the Observability team at my current company, I find myself in a rather unique position. As part of my job, I get to define the “golden path” of Observability here.
Alarm bells ringing.
Be careful when you give teams single remits, for they will execute on them to the exclusion of all else.
I feel like the world would be a better place if architects were bonused on company project completion.
> I feel like the world would be a better place if architects were bonused on company project completion.
Be careful of running into Goodhart's Law when doing this.
In my experience, absolutely nothing replaces lots of time (7-10 year time horizon granularity), and conscientiousness. Everything else is gamed into Goodhart's territory.
Where I've seen architects incentivized the way you describe, the definitions of "project" and "completion" are tightly scrutinized, and all other considerations are thrown by the wayside. I see developer experience, operations supportability, troubleshooting visibility, observability, extensibility, upgradeability, and so on, all sacrificed upon the incentives altar. Surprised Pikachu face when fantastic company project completions "suddenly" turn into an ocean of technical debt, with a particularly nasty Mariana Trench of mission-critical code that everyone is afraid to touch but that must somehow be tamed to help the company evolve strategically and meet new business needs.
Which is why I've been persuaded to the qualitative over the quantitative side for end metrics. E.g. partner team ratings of your team, surveys.
Yes, it can turn into political infighting, but on the whole I've seen it turn out better than trying to hit numbers, with everyone trying to manipulate the definition of a number to their advantage.
This is monitoring. More comprehensive monitoring does not warrant a fancy new name. Except if you want to confuse people. Or feel good for doing something "revolutionary". (You are not. You are doing monitoring.)
Fixed that for you: The capabilities described in the post do indeed offer more comprehensive TRACING/MONITORING, not just MONITORING.
Again, new monitoring solutions with fancy YAML configs do not warrant a new name, all the more so when that new name is an established concept in the control theory of dynamical systems.