The Production-First Mindset

Logz.io's Dotan Horovits - Putting The Spotlight On Observability

January 09, 2022 Liran Haimovitch Episode 24

Rookout CTO Liran Haimovitch sits down with Dotan Horovits, Principal Developer Advocate at Logz.io. They discuss his passion for open source and communities and how that led him to his current position, Open Observability Talks (his own podcast), his take on where OpenTelemetry stands now, and why users make the move from Elasticsearch to OpenSearch.



SPEAKERS
Liran Haimovitch, Dotan Horovits


Liran Haimovitch  00:02
Welcome to The Production-First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and co-founder of Rookout. Today, we're going to be discussing some of the latest and greatest open source projects in Observability. With us is Dotan Horovits, Principal Developer Advocate at Logz.io.

Dotan Horovits  00:41
Hey Liran, glad to be here. Thank you for having me.

Liran Haimovitch  00:44
Dotan, welcome to the show. Can you tell us a little bit about yourself?

Dotan Horovits  00:48
Yeah, sure. So, my name is Dotan Horovits. I'm a Principal Developer Advocate at Logz.io. I have been a developer for many years, a system architect, and a solutions architect. I've been a Product Manager for developer tools, platforms, and cloud orchestration. And I'm also very passionate about open source and communities. And today, as part of my work at Logz.io, my main focus is around Observability and DevOps-related open source in general, and the CNCF, the Cloud Native Computing Foundation, in particular. I co-organize the local CNCF chapter here in Tel Aviv. And I also run my own podcast called Open Observability Talks. So listeners who like this podcast may want to check it out; it's all about open source, DevOps, and Observability.

Liran Haimovitch  01:43
So you seem to be knee-deep in open source, at the very least. And I'm kind of wondering, how did you get into open source? There aren't that many people around here in Israel who are, you know, focusing so much on open source.

Dotan Horovits  01:56
Yeah, so I've been passionate about open source for many years, since my early days as a developer, working as a user with open-source tools and platforms. I think the main point at which I actually got involved in open source was back when I worked at a startup called Cloudify. That's a cloud orchestration platform, and as part of that, we established a local community for OpenStack, back then a very, very hyped private cloud platform. And since then, it really stuck with me and drove me later on to things such as the CNCF (for whoever doesn't know, that's the foundation that hosts Kubernetes, Prometheus, and many other open-source projects). I got involved in the DevOpsDays community here in Tel Aviv, and many others.

Liran Haimovitch  02:50
You mentioned CNCF. I think today, CNCF is fueling much of the cloud-native revolution, and open source is a big part of making cloud accessible for everyone, whether it's private cloud or public cloud. And I'm wondering what role you see open source playing in Observability?

Dotan Horovits  03:08
So I think in Observability, there's a very, very important role there. Because essentially, everyone today is running a cloud-native architecture, microservices, Kubernetes, Docker, and so on, and many of the traditional tools that have been around are less applicable. And then again, people turn to the same open-source community that brought us Prometheus, Kubernetes, and others, and it's now finding us the solutions for how to monitor these environments. So things such as Prometheus, which is under the CNCF, and others are natively integrated into Kubernetes and the cloud-native environment. This is why it's really booming. And you see that also in the numbers. You see statistics, I think I saw analyst reports from 451 Research and Gartner, talking about 50-60% of companies running open-source Observability stacks and monitoring solutions. It's not just young, you know, startups such as Rookout and others, it's also well-established enterprises. People approach me from government agencies and from banking and places that you wouldn't have heard of back in the day. And now they're running open source in their stacks. When I used to work at Amdocs, which is a large enterprise, when I joined, you wouldn't hear about open source, and by the end of my tenure there, they were already involved in ONAP and open source around the telco clouds under the Linux Foundation. So it's really everywhere and it's amazing to see that.

Liran Haimovitch  04:42
Today, what are some of the more interesting projects when it comes to open source Observability?

Dotan Horovits  04:50
Interesting projects? That's a very interesting question, because it depends on your areas of interest. As I mentioned before, Prometheus, for example, is an amazing project that gives you a way to collect metrics, and it has really become a de facto standard. There is a derivative project, OpenMetrics, which aims to standardize the exposition format. It spun out of Prometheus, but it stands on its own. That's why it's actually being submitted as its own standard to the IETF currently. So it's a way to converge the industry on how you actually report your metrics in a standard way, so that you're not tightly coupled to a specific vendor, specific formats, or a specific tool for gathering and analyzing this data. So Prometheus is very interesting. OpenTelemetry is fascinating, more about the gathering and collection. Jaeger on the distributed tracing side. Lots of interesting projects around open source and Observability.
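To give a feel for what a standardized exposition format means in practice, here is a minimal Python sketch that renders one counter in the Prometheus-style text format that OpenMetrics grew out of. The function name `render_counter` and the sample metric are made up for illustration, and the sketch skips details a real implementation would need, such as escaping label values.

```python
def render_counter(name, help_text, samples):
    """Render one counter family as Prometheus-style exposition text.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    not escaped here, which a real exporter would have to handle.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            # Sort label names so the output is deterministic.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "get", "code": "200"}, 1027)],
    ))
```

Because the output is plain, documented text, any scraper that understands the format can read it, whichever vendor or tool sits behind the scraper.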

Liran Haimovitch  05:50
Lots of stuff. So let's take OpenTelemetry. I think it's one of the highest-profile projects out there today. There has been a bit of history there, and I'm wondering, what's your take on where OpenTelemetry stands now?

Dotan Horovits  06:03
So there's lots of history. I don't know if the audience knows that it's originally a merger of two preceding projects, OpenTracing and OpenCensus. It's been extremely active. Today it's the second most active project in the CNCF, second only to Kubernetes itself. So it's very, very impressive. And you see all the major players out there gathering around it. That, for me, is the greatest testimonial that this is probably set up for success. Because you see all the major cloud vendors, AWS, and Azure, and Google, and others. You see all the monitoring and Observability vendors, like us at Logz.io, and Splunk, and New Relic, and whoever. So that's very, very encouraging. And in terms of maturity, it's still an incubation project under the CNCF, let's remember that. Just a few months ago, it graduated from sandbox to incubation. The tracing portion of OpenTelemetry-- maybe we should say a few words about that, just to align the audience. So essentially, it's a framework for generating and collecting telemetry, largely traces, logs, and metrics, from any system: application code, infrastructure components, Kafka, cloud services, whatever, and then sending it to whichever backend you want. So OpenTelemetry doesn't really care about the backend. It's a way to standardize just the portion of generating and collecting the telemetry. And each signal has its own evolution, its own path. Tracing is the most mature one. It became generally available this year, which means that it's ready for production. What we're about to see is metrics getting to GA; that's the upcoming announcement, and it should be very soon. And logs is the least mature of the signals. That's where OpenTelemetry currently stands.

Liran Haimovitch  08:08
What's the vision behind OpenTelemetry in trying to bring all three of those pillars together into one project?

Dotan Horovits  08:14
I think, let's take a step back to where we came from as an industry. So for many years, we've been in a situation where people deployed agents, and SDKs, and collectors, and whatever, that were provided by vendors. You had the Dynatrace one, the Datadog one, the Splunk one, whatever. Lots of intelligence there in the agents, no doubt about it. But you got to the point of fragmentation. So for each piece, for each signal, maybe for each source, you needed a different tool to gather, maybe even to generate, if it's your own source code. You have a Java app there on the backend, and you have the JS on the frontend. And especially in polyglot organizations and things like that, for each one, you needed its own thing. And they weren't talking the same language. So that's the challenge that we were facing. And this fragmentation is aggravated the more elaborate the architectures become. So today, a typical architecture has many, many databases, your SQL and your NoSQL and the graph database and the columnar database, and you have your API server, and you have the HTTP proxy, and each one-- It becomes so diverse, and that impacts the way that you collect telemetry. And that's essentially the problem that OpenTelemetry comes to solve: to provide one unified way of collecting that is independent of your backend. So removing this tight coupling between the backend analytics tool and the generation and collection of telemetry, so that you have one way and one specification across programming languages, across SDKs, across sources. So one API, one SDK to collect all the telemetry, traces, metrics, logs, that's the vision, of course. From your, let's say, Java and your Go and your Node.js and whatever. One way to collect it, one way to communicate it over a protocol, because a central part of OpenTelemetry is also OTLP, which is a protocol that ultimately should be one unified protocol across the different signals. And then you send it to whichever backend you choose to do your analytics.
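The decoupling described here, one instrumentation API with a swappable backend behind it, can be sketched in a few lines of Python. This is not the OpenTelemetry API; `Span`, `Exporter`, `InMemoryExporter`, and `Tracer` are hypothetical names chosen for illustration, and a real exporter would ship spans over the network rather than keep them in a list.

```python
import time
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Span:
    """One timed operation (hypothetical, simplified span type)."""
    name: str
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)


class Exporter(Protocol):
    """Anything that can receive finished spans, whatever the backend."""
    def export(self, span: Span) -> None: ...


class InMemoryExporter:
    """Stand-in backend that just stores the spans it receives."""
    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class Tracer:
    """Instrumentation API: application code only ever talks to this."""
    def __init__(self, exporter):
        self._exporter = exporter

    def record(self, name, func, **attributes):
        """Run func(), time it, and hand the finished span to the exporter."""
        span = Span(name=name, start=time.monotonic(), attributes=attributes)
        try:
            return func()
        finally:
            span.end = time.monotonic()
            self._exporter.export(span)


if __name__ == "__main__":
    exporter = InMemoryExporter()
    tracer = Tracer(exporter)
    tracer.record("checkout", lambda: sum(range(1000)), user="u1")
    print(exporter.spans)
```

Swapping `InMemoryExporter` for another class with the same `export` method changes where the data goes without touching the instrumented code, which is the property the unified API and protocol are meant to give you.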

Liran Haimovitch  10:33
Interesting. Speaking of backends, I know you're super active in the OpenSearch project, which is one of the most popular backends today for processing Observability data. And it's also one of the most recent and high-profile rifts today in open source. Can you share with us a bit about that?

Dotan Horovits  10:54
Yeah, so that's a very interesting activity, a project that emerged during 2021. It started off from a constraint. We were all heavy users, including, you know, my own company, Logz.io, of the ELK stack, Elasticsearch, for log analytics. And one day Elastic, the company behind it, just decided to relicense it to a non-open-source license. It's called SSPL, and that created a dilemma for us as an industry, for my own company and for many others. We all found ourselves asking, okay, what's the path forward? Some cared less about open source. For us, it was of major importance to keep these projects open source and to enable the community to carry on contributing and making this a successful path forward. And that ultimately brought about a fork, a simple fork of Elasticsearch and Kibana into the OpenSearch project. They're called OpenSearch and OpenSearch Dashboards, respectively. We have, luckily, very major stakeholders involved. Primarily AWS, and my company Logz.io is involved, and many others. So there's a vibrant set of vendors and community members to back it up. Since then, it's evolved beyond a mere fork of Elasticsearch and Kibana 7.10-something. It's been enriched. So first of all, AWS contributed their Open Distro for Elasticsearch, with a rich set of plugins that was originally designed for Elasticsearch and was adapted to OpenSearch. So that brought about lots of features on analytics and security and many others. Since then, there have been quite a few enhancements and expansions beyond log analytics into things such as tracing, APM-related capabilities, and broader Observability. So a very promising-looking project that we're backing.

Liran Haimovitch  13:03
What's the maturity of OpenSearch? Can people and should people use it in production environments?

Dotan Horovits  13:08
So it's already GA. It's version 1.2 at the moment. It's been GA since 1.0. It's been on a regular cadence of releasing minor versions every few months, and the roadmap currently expects 2.0, I think in August, if I remember correctly. So it's definitely ready for production. There are already quite a few players out there that moved their production environments from Elasticsearch to OpenSearch, including Logz.io and many others. So it looks like it's there. And then again, there are lots of new features that we're getting to enhance it even further and make it even more suitable for many other use cases.

Liran Haimovitch  13:55
So what are the key features you see, that you already have or that are upcoming, compared to Elasticsearch?

Dotan Horovits  14:02
There are quite a few angles that we're exploring beyond the simple-- First of all, on the basic level of managing clusters and sharding and things like that, there is lots of activity around these areas. And also, as I said before, exploring beyond the mere, let's say, log and search capabilities into Observability, for example, collecting things such as events, entities, and Observability signals. Collecting traces, being able to show and visualize things on the OpenSearch Dashboards side of things, enriching the data set with metadata. Lots of activity around that.

Liran Haimovitch  14:45
From your experience, why do users make the transition from Elasticsearch to OpenSearch? What are the key reasons?

Dotan Horovits  14:54
I think the loyal community of Elasticsearch was impacted by the relicensing move. Many felt discomfort carrying on, especially since no one guarantees that that's the last we'll hear of it. So I think it's the lack of, you know, solid ground underneath your feet. Usually, Elasticsearch is a central component of a, you know, mission-critical part of your system. So when you feel this lack of stability one time, you're cautious going forward. And by the way, there were steps that came afterwards, like changes to the client libraries to add all sorts of checks of the backend, the Elasticsearch cluster. The client libraries are also open source and, by the way, are still Apache 2.0, but these checks render them ineffective and break compatibility, including, by the way, with older versions of open-source Elasticsearch. So even users of Elasticsearch from version 7 and before got some break of compatibility. Things like that make people think about future-proofing their systems, saying, okay, how do I make sure that I keep on serving my business goals and am not being juggled around by these sorts of announcements? That, I think, coupled with seeing major vendors, like when AWS says, I'm putting an army of developers there to enhance it and take it forward, gives a sense of trust. They see a very transparent roadmap showing where we're heading. By the way, it's public, all of that, of course, on GitHub, all the roadmap going forward. I think these are the main things that drove people to consider the switch.

Liran Haimovitch  16:40
And you've mentioned that at Logz.io, you've also migrated to OpenSearch. In general, I think much of what you're doing at Logz.io is open-source focused, right?

Dotan Horovits  16:49
Yes. And I guess that's the reason that pulled me to this company, following my passion for open source: what's unique about Logz.io is that we provide a cloud-native Observability platform as a SaaS, but it's all based on popular open-source tools. So the log management part, for example, has been based for years on the ELK stack, and now we're moving to OpenSearch. The infrastructure monitoring piece is based on Prometheus, the distributed tracing piece is based on Jaeger, which is an open-source project under the CNCF, and so on. We use OpenTelemetry for the collection side of things as the best practice. So it's an underlying theme, and it's part of our mission statement, let's say, at Logz.io. That's why for us, when I talk about the transition to OpenSearch, it was not just about what's easy for us to implement, but also part of the guarantee that we give our customers: whatever open source they're accustomed to, even running their own do-it-yourself type of stack in their own systems, we'll provide a compatible thing, just without the headache of running, managing, scaling, and upgrading it themselves, but as a managed SaaS.

Liran Haimovitch  18:07
So as a company, you're essentially hosting and serving those open source products, so anybody can benefit from them without having to host them themselves. So kind of, what's your focus from an engineering perspective? What are the key values that are driving you forward?

Dotan Horovits  18:23
As I said, first of all, making open source accessible in the easiest way possible for the users, that's the baseline. And also keeping track of what the popular open-source tools are, because things change, like the change from Elasticsearch to OpenSearch, like the rise of Prometheus and Jaeger, and so on. So first, understanding what will best serve our users, because this is what they're accustomed to, and this is the best open source out there. And then the second piece is finding the gaps in the open source, because open-source projects by their nature are very focused on the core capabilities of the problem they come to solve, but less on peripheral things. Companies that run this open source in production, at scale, have other peripheral concerns. You need to make sure you have compliance, you need to take care of security issues, you need to take care of user and role management. These are things where the open-source tools are usually less mature, because it's not their core business. And then what we come to do as Logz.io is provide this wrapper around the open source that covers all the administrative aspects, the security-related aspects, the compliance, all of the things that the organization or the customer then doesn't need to take care of and build in-house around the open source to get to this maturity level. And then on top of, let's say, these amenities, we have additional layers around adding intelligence. So adding AI and machine-learning types of analysis of the telemetry so that we can, let's say, provide recommendations and personalize based on the specific account, the specific organization's way of handling incidents, and being able to correlate things intelligently between the tools. Because imagine if you have Jaeger separately from OpenSearch separately from Prometheus. That's fine that you have all three, but the purpose is to have one unified Observability platform. It also means that they need to talk between themselves very seamlessly. So being able to correlate and jump from one to the other, transitioning the context of the search. So you start from a metric that suddenly spiked, and then you want to say, okay, I want to see traces related to that metric, I want to seamlessly transition and have the timeframe and some parameters and filters switch over to the other product, to Jaeger so to speak, but without you having to copy-paste them. So the correlation between the products, creating aggregated alerts, and so on. This is, I guess, at a high level, what we try to achieve on top of the open source.
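The metric-to-trace jump described above boils down to carrying a context (service, time window, filters) from one tool's view into another's query. A rough Python sketch of that idea follows; `MetricContext`, `trace_search_query`, and the parameter names `service`, `start`, `end`, and `tag.*` are invented for illustration and do not describe Jaeger's or Logz.io's actual query interface.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class MetricContext:
    """What the user was looking at when the metric spiked (hypothetical)."""
    service: str
    start_s: int   # window start, epoch seconds
    end_s: int     # window end, epoch seconds
    labels: dict   # active filters on the metric view


def trace_search_query(ctx, padding_s=60):
    """Build a query string for a hypothetical trace-search endpoint.

    The time window is widened by `padding_s` on both sides so traces
    that started just before the spike are still included.
    """
    params = {
        "service": ctx.service,
        "start": ctx.start_s - padding_s,
        "end": ctx.end_s + padding_s,
    }
    # Carry the metric filters over as trace tag filters, in a stable order.
    for key in sorted(ctx.labels):
        params[f"tag.{key}"] = ctx.labels[key]
    return urlencode(params)


if __name__ == "__main__":
    ctx = MetricContext("frontend", 1000, 1300, {"region": "eu", "endpoint": "/login"})
    print(trace_search_query(ctx))
```

The point of the sketch is only that the user never retypes the service, timeframe, or filters; the platform derives the second query from the first view.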

Liran Haimovitch  21:03
Super interesting stuff. So before we wrap this up, there is one question I've been waiting to ask you. I have one question I ask all of my guests. So what's the single bug that you remember the most from your career?

Dotan Horovits  21:15
For me, the highest-profile bug in recent years was actually at Logz.io. This is the one I remember most, because it resonated well in terms of the importance of distributed tracing and made us realize how important this pillar is to Observability; for us, it was a new pillar back then. So we had a bug with a very low-- you know, high latency in one of our frontend services that we couldn't figure out. And we were masters of logs, obviously, at Logz.io, and of metrics, and we couldn't find the reason. And it was inconsistent: in calls that were apparently similar, some had low latency and some had high latency, and we couldn't find the reasoning behind it. And we got lost in the logs. We had lots of logs, and we were very experienced with logs, but we couldn't do it. And then we resorted to tracing. And, you know, very simple instrumentation of just three or four microservices on the frontend side, on the Node.js side, which sounds like very few, got us to the root cause very, very quickly. We ultimately found one caching issue, one single key that was not cached, which resulted in going to the backend database to fetch it. And it was a key in a context that only specific users used. So it was inconsistent, in that not every call had to go to the backend. So being able to see the trace end to end, from the frontend code all the way down to the backend database, in one trace across a request, that was what convinced us, I think, as an engineering organization, to adopt tracing as part of our standard practice in Observability. And obviously, it pushed us further to make it a solid pillar in our Observability platform, also for our customers.
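A toy Python reconstruction shows why a trace makes this kind of bug stand out where logs did not. Everything here is assumed for illustration: the `CACHE` and `DATABASE` contents, the `fetch` function, and the simulated 50 ms database round trip are stand-ins, not details of the actual Logz.io system.

```python
import time

# Hypothetical data: one key is cached, the other never was.
CACHE = {"tenant:acme": "config-a"}
DATABASE = {"tenant:acme": "config-a", "tenant:rare": "config-r"}


def fetch(key, trace):
    """Cache-aside lookup that appends (span_name, seconds) pairs to `trace`."""
    start = time.perf_counter()
    value = CACHE.get(key)
    trace.append(("cache.get", time.perf_counter() - start))
    if value is None:
        # Cache miss: fall through to the slow backend. The value is
        # deliberately not written back, so the miss repeats on every call.
        start = time.perf_counter()
        time.sleep(0.05)  # simulated database round trip
        value = DATABASE[key]
        trace.append(("db.query", time.perf_counter() - start))
    return value


if __name__ == "__main__":
    for key in ("tenant:acme", "tenant:rare"):
        trace = []
        fetch(key, trace)
        print(key, [(name, f"{secs * 1000:.1f} ms") for name, secs in trace])
```

Laid out as spans, the slow request visibly carries an extra `db.query` step that the fast one lacks, and only for the key that a subset of users ever touches, which matches the inconsistent latency described above.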

Liran Haimovitch  23:06
Cool stuff. So Dotan, thanks very much for joining us. Can you share with us more details about your podcast?

Dotan Horovits  23:13
Yeah, so the podcast is called 'Open Observability Talks'. I'm actually glad to say that, as we're recording this in December, at the end of 2021, it's among the finalists for best DevOps-related podcast of 2021 in the DevOps Dozen awards. So I hope that you hear this in time to vote. It's now open for public voting. But the idea is essentially to talk about everything that is open source, DevOps, and Observability. I have guests who are maintainers of open-source projects such as Prometheus, Jaeger, OpenTelemetry, and others. I have guests who bring the user perspective, like senior SREs from Google and others. So it's very diverse. And the idea is to learn about practices, tools, and tips for running your Observability based on open source at scale.

Liran Haimovitch  24:10
So if you're interested in hearing more about Observability, definitely go and check it out. Thank you very much, Dotan.

Dotan Horovits  24:16
Thank you very much, Liran. Glad to be here.

Liran Haimovitch  24:23
So that's a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.