Rookout CTO Liran Haimovitch sits down with Sudeeptha Jothiprakash, Product Lead at DoorDash. They discuss solving consumer issues, finding the needle in the haystack of containers, gaining a deep level of granularity without costs spiraling, the division of responsibility, and what DevOps engineers should be focusing on most.
Liran Haimovitch, Sudeeptha Jothiprakash
Liran Haimovitch 00:02
Welcome to The Production-First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and Co-Founder of Rookout. Today's episode is all about cloud-native applications and modern observability. With us is Sudeeptha Jothiprakash, previously of AWS and product lead at DoorDash. Thank you for joining us, and welcome to the show.
Sudeeptha Jothiprakash 00:44
Yeah, great to be here. Thank you so much for having me here, Liran.
Liran Haimovitch 00:47
Sudeeptha, what can you tell us about yourself?
Sudeeptha Jothiprakash 00:50
Yeah, a quick introduction. Hi, I'm Sudeeptha Jothiprakash. As you called out, I've spent six and a half years working for AWS, mostly on observability solutions. In my earliest years there, I worked on CloudWatch Logs, which was essentially the logging platform that evolved into log analytics over time. I also spent my time understanding observability requirements from customers, looking at customers' problems across the application with AWS X-Ray, and building CloudWatch ServiceLens, which brings together the three fundamentals, metrics, logs, and traces, under a single umbrella so that you can triage and troubleshoot your problems easily. Today, I'm with DoorDash, solving some of their consumer issues by offering plans and partnerships, in order to get customers onboarded and trying out DoorDash's capabilities easily. So that's a quick whirlwind tour of my experiences.
Liran Haimovitch 02:15
Pretty cool stuff, a lot of interesting projects you've taken on. Now, before we dive into observability, I want to take a step back and discuss cloud workloads in general. Today, we're hearing more and more about cloud-native workloads. And there are two big powerhouses over there: you have containers, and you have serverless. And I think recently, you're even seeing that some of the boundaries between them are getting blurrier and they're becoming even closer to each other. What's your take on that?
Sudeeptha Jothiprakash 02:44
Yeah, definitely. So, as you called out, the world has evolved towards being more sustainable in terms of the requirements that customers have on their applications. We're trying to be more effective in the way compute is used, and this is where the two powerhouses of containers and serverless come in. When it comes to more predictable workloads, containers are the most efficient way for you to serve your customers' requests. An example would be, let's say, when you have constant traffic coming to your web platform, and you know that on average you have a certain amount of traffic, containers are the right way to go about it. Serverless essentially looks at a very unique use case: when you have traffic that's unpredictable, but you still think it's highly critical for your business, you can ensure that it's serviced appropriately. An example of this can be when you have IoT sensors plugged onto, let's say, a truck, which is helping you identify the temperature of goods that are pharmaceutical-based, for example. If there is a drop in temperature, you would like to be notified very quickly, in order to make sure that all of the goods are at the temperature and of the quality that you're expecting. And this is where having the IoT device ping a Lambda function, and ensuring that message is transferred appropriately, is sort of the right mechanism to retain the quality you're hoping for in terms of your application responses. So, I was hoping to help you understand a little more about the two types of workloads, simply because they become the fundamentals of the way you're thinking about observability overall.
Liran Haimovitch 04:25
And how do you think about observability when it comes to those kinds of workloads, workloads that can scale up or down on demand? How is observability different?
Sudeeptha Jothiprakash 04:36
Yeah, so as you called out, there are a lot of workloads and applications today that essentially have very unpredictable request patterns. And these unpredictable request patterns have to be met by an efficient way of capturing these requests and responding to them. What containers help you do is scale up and scale down very quickly and be efficient about it, in terms of your responses, and not just your responses, actually, also in terms of the cost-effectiveness of your application itself. And as these resources come in and go out, you want to be able to capture the data from them in the most efficient manner, without any loss. The worst thing that could happen to you, for example, is you have some really critical responses that you need to send out to your customer, and you've not scaled your container environment appropriately. And when you're trying to triage the issue of why your customers didn't get the responses, if you do not have the right logs and metrics, that essentially puts you in a precarious situation. So, this is sort of where having these resources appropriately provisioned, and also instrumented in order to get the right telemetry from them, is very critical when you're thinking of observability.
Liran Haimovitch 05:56
So how do you figure out what's the correct way to instrument and what's the data you actually need from those workloads, from those containers?
Sudeeptha Jothiprakash 06:04
Yeah, so this is sort of where we recommend that you have an observability plan to begin with. So, think of your application as having some parts which are critical workloads, and some which are essentially running in the background and do not require that level of granularity in terms of the data you would like to capture. So, for your critical workloads, first, you ensure that they are scaled appropriately and have the right mechanisms for auto-scaling when the traffic increases. But you also have to be very mindful, when you're setting up the application, about what types of data or insights you would like to capture from these applications. So, you need to be mindful of whether you want to enable transaction monitoring, whether you want to enable logging, and the critical metrics that you'd like to capture. All of this needs to be configured well in advance, before you actually deploy an application, versus trying to figure that out after you get it into the production stage and then seeing that you are missing some areas of visibility. This happens more often than you know, as you've probably heard from others here as well. And we recommend that you be more mindful when you're getting started with application development itself. And again, I mention this because you don't want to overload yourself with data either. If you set up metrics, logs, and traces across all of your environments, there's a chance of you having an information overload, and multiple signals telling you the same issue. Rather, you would like to be able to aggregate these insights from across the critical sources and be alerted appropriately. Did that help?
Liran Haimovitch 07:40
Yeah, makes perfect sense. Now, you keep speaking of scaling up workloads, and obviously with serverless, things kind of scale up on their own. And on the other hand, you've said you don't want too much data, you don't want too many signals. So, I kind of have to wonder, as environments scale up, as you're running tens or even hundreds or thousands of containers at the same time, what happens to the observability part of the system? How does it affect cost, or performance, or signal-to-noise ratio, or other aspects?
Sudeeptha Jothiprakash 08:13
Yeah, great question there. We definitely look at it such that when an application scales up very quickly, you're ideally looking at the same set of signals that you did when the application was not scaled. So, your business telemetry and the critical observability telemetry that you would like to capture should be on a one-to-one ratio. What should happen in the backend as well is to ensure that the telemetry you're collecting is aggregated across the scaling that you do. So, an example is, let's say you have a workload environment of 10,000 containers, and that scales up to 100,000 just because it's peak Black Friday sale, for example. When that scaling happens, you want to ensure that the 100,000 containers are actually sending you the same set of key results, or key telemetry, by aggregation. And this aggregation can happen not just from your logs, but also as signals from your metrics, because ultimately you're trying to ensure that you're being alerted appropriately on the issues that matter to you most. And I also see a world in which having the data to dive deep further is very critical. Because once you have that aggregation, it's great for having that first line of response and knowing what the issue is, but when you dive deep further, you should be able to pinpoint, like a needle in the haystack among the 100,000 containers that you have, that this particular container ended up failing this particular process, and that's why your customer was impacted, for example. You should have that level of granularity when you need it the most, when you have customer issues.
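The aggregate-first, drill-down-later idea described here can be sketched in a few lines of vendor-neutral Python. This is purely illustrative; the event shape and field names are invented for the example, not tied to any particular observability product:

```python
from collections import defaultdict

def aggregate_errors(events):
    """Roll per-container error events up into one fleet-wide signal
    (the first line of response), while keeping the per-container
    breakdown for needle-in-the-haystack drill-down."""
    per_container = defaultdict(int)
    for event in events:
        per_container[event["container_id"]] += 1
    total = sum(per_container.values())
    worst = max(per_container, key=per_container.get)
    return {
        "total_errors": total,        # what you alert on
        "worst_container": worst,     # where you start digging
        "per_container": dict(per_container),
    }

# Hypothetical events from three containers in a scaled-up fleet.
events = [
    {"container_id": "c-1", "error": "timeout"},
    {"container_id": "c-2", "error": "timeout"},
    {"container_id": "c-2", "error": "500"},
]
summary = aggregate_errors(events)
```

The point of the shape is that the same aggregation works whether the fleet has 10 containers or 100,000: the alert fires on `total_errors`, and the per-container detail survives for the drill-down.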
Liran Haimovitch 09:57
How do you go about having that level of granularity, and still having that level of aggregation, without getting your costs spiraling out of control?
Sudeeptha Jothiprakash 10:08
Yeah, and this is sort of from experience I can speak to: there are two key pillars in my mind. The first recommendation is to log everything, right? You want to make sure that your logging is very, very robust, in order to ensure that you understand when things go wrong. And you can keep it in low-cost storage if you'd like, in order to keep it available for compliance and privacy purposes as well. So, ensuring that you have logging enabled across at least your critical workloads is super important. From these logs, you can then create metrics that are aggregations, and these aggregations will help you with the signal-to-noise ratio. From there, the second pillar is transaction monitoring. So, while you can have logs across various parts of your application and workloads, what transactions help you do is understand how a request cuts through all of your workloads. And being able to tie that metric back to your transaction is very important when you're trying to troubleshoot, let's say, a request coming from your customer. And this is sort of where ensuring that you have at least these fundamentals set up for your critical workloads will enable you to be more responsive and reduce your mean time to resolution, as we call it, as part of your root cause analysis. Is that sort of where you were going with this, Liran? I'm trying to make sure we're thinking of the right ways in which we capture data.
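The first pillar above, deriving aggregate metrics from raw logs, is the "metric filter" pattern. Here is a minimal Python sketch; the log line format and level names are assumptions for illustration, not any specific vendor's syntax:

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)$")

def metrics_from_logs(lines):
    """Derive a simple aggregate metric (event count per log level)
    from raw log lines, instead of alerting on individual lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group("level")] += 1
    return counts

logs = [
    "2024-01-01T00:00:00Z ERROR payment failed",
    "2024-01-01T00:00:01Z INFO request served",
    "2024-01-01T00:00:02Z ERROR payment failed",
]
counts = metrics_from_logs(logs)
```

In a real pipeline the counts would be emitted per time window and alerted on, while the raw lines stay in cheap storage for drill-down.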
Liran Haimovitch 11:38
I would love to hear a bit of your insights about the cost aspects of observability. Because personally, I've been hearing a lot of customers, Rookout customers and partners, sharing their pains of overpaying for observability. Because keeping all of that data, whether it's logs, especially logs, but also metrics, and definitely tracing, which by itself is incredibly expensive, and having a system scale up, then all of a sudden "log everything" can become extremely expensive. And at the same time, there is a huge fear of missing out on exactly those logs that are going to end up being important and may save the day.
Sudeeptha Jothiprakash 12:18
Yeah, I'm totally with you on that, Liran, and I wish there was a world in which cost wasn't the issue that created all of these nuances when it comes to application management. But I do understand that we need to be pragmatic about how we scale up and how we ensure that we're capturing the right telemetry. The way I personally would look at it is that all the critical workloads can have multiple tiers of logging. And when I say multiple tiers of logging, there's some data that's simply captured, and some that should actually be analyzed using your log analytics solution right off the get-go for the first two or three days, because of the issues that you see and triage, 90% of them actually fall into that first three-day window. So, you want to make sure that that data is readily available when you're root-causing an issue. And after that, you essentially send it into some sort of storage. If I may, this can mean making sure that you have a backup in your S3 bucket, which is also required for compliance reasons. Many of our previous customers would talk about how compliance might ask them to never purge any of their logs, or to have, say, ten-year audit timelines. So, these are real-world requirements where logging costs on an analytics platform can add up very quickly. And this is sort of where having it in storage, which is low cost, makes sense. Compressing your logs as well is a great way to ensure that the costs don't add up when you send them to your storage. And I'm speaking more about logging here just because I know that's been sort of one of the hot topics that many folks talk about in terms of observability. But as you called out, there's also transaction monitoring. So, transaction monitoring, or tracing, does end up adding up in terms of costs when you have larger traces that go across multiple parts of your application.
But more often than not, those are the ones which are most critical, because the longest-running transactions are the ones that end up creating the most impact on your application. And the way we have seen a lot of the best practices evolve in this space is for you to first map out all of your critical paths and ensure that those critical paths are at a 100% sample rate. For the non-critical paths, you can look at reducing your sample rate to even 5%, for example. And that keeps enough signals coming in, so that you know that these paths are key actions by your customers, and this is sort of the rate at which requests are coming in, and so on, but it doesn't really add up in terms of costs, as you called out. But when you're thinking of the most critical paths across your application, I have seen a lot of partners talking to other customers, telling us about not setting that at 100% in terms of transaction monitoring, and then they'll have a couple of high-profile customer cases or customer issues come in, and they don't have the traceability just yet. The final thing I want to mention is, we need to make sure that there is a good relationship between the logging that you have and the transactions as well. So, while you might have the perfect… in terms of all of the transactions, making sure that they have the connection back to your logs is very critical in order for you to get more evidence, in order to narrow down to an issue that matters to you. So, being able to identify, here's an issue that this customer ID was able to tell you about, and ensuring that you understand the same thing from the logs, is very important. And as I mentioned earlier, make sure that your logging is set up for the first three days and make sure that your critical paths are set up for transactions; they need to be mapped one-to-one for all of your critical workloads.
If not, you will be in a situation where you might have the transactions and not the logs, or the other way around.
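The sampling policy described above, 100% on critical paths and a reduced rate everywhere else, can be sketched as a simple decision function. The route names and the 5% rate here are illustrative assumptions, not a prescription:

```python
import random

# Hypothetical critical paths that should always be traced.
CRITICAL_PATHS = {"/checkout", "/payment"}
NON_CRITICAL_RATE = 0.05  # sample 5% of everything else

def should_sample(path, rng=random.random):
    """Head-based sampling decision: trace every request on a
    critical path, and a small fixed fraction of the rest.
    `rng` is injectable so the decision is testable."""
    if path in CRITICAL_PATHS:
        return True
    return rng() < NON_CRITICAL_RATE

# Example: a checkout request is always traced; a homepage
# request is traced only if the random draw falls under 5%.
```

Real tracing SDKs (OpenTelemetry and similar) express this same idea through configurable samplers rather than hand-rolled code, but the decision logic is equivalent.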
Liran Haimovitch 16:14
Makes sense. Now, earlier in our conversation, you mentioned the importance of not just relying on engineering metrics, but also bringing business metrics into the observability space and keeping track of the business in your application. That's a topic I'm very passionate about, and I would love to hear your take on it.
Sudeeptha Jothiprakash 16:34
Definitely, yeah. This is actually one of the areas I feel we do the least about in terms of observability, because each application, and the business outcomes it drives, are so bespoke and so unique that there's no one template that fits all. And what happens often is that the developers may not know what the right business signals are that we need to capture as part of the application's performance. And it often leads to asymmetry in terms of the data that you're able to capture and analyze. So, I'm 100% with you; this is an area that I think, across observability, we are trying to solve, but it's often harder for us to tackle than thinking about, say, the serverless resources as such. And going back to the point about ensuring that you have the right observability plan as you build the application: this is where, for example, product managers come in and say, here are the levels of granularity that I need in order to understand how the business is impacted by this particular application. That is very important for you to define very early on, while you're building the product specs. If you have that ingrained, and you know which events to capture and what types of signals matter to you, then when issues actually occur, you'll be able to directly relate them to, here's how it impacted the number of customers coming in, or the number of orders dropped because a certain part of your application didn't scale very well, and so on. So, it's being mindful very early on, as I called out, that really helps set up the application and the product managers for success.
Liran Haimovitch 18:15
If you're thinking of those plans, I definitely agree observability plans are super important. I wish observability was more agile, but it sometimes isn't. I think one of the most important elements that comes into play here, the same as with everything cloud-based, is the division of responsibility. You've been at DoorDash, you've been at AWS, you've seen it from multiple angles. What's your point of view on that? What can product managers and engineers rely on their cloud vendors or infrastructure providers to provide, versus where should they be spending most of their time and focus and energy?
Sudeeptha Jothiprakash 18:51
Now that's a very well put question, Liran. So, I'll sort of answer the last part of your question there. The most important thing for DevOps engineers to be doing is building really cool stuff. And I think, unfortunately, we have so many things going on around them that they more often than not spend more time actually managing their applications rather than building stuff. And a lot of that load, as you called out, has to be divided, in my mind, into three categories. One is what you can get natively from either your cloud service providers or from the open-source tooling that can enable you to monitor your applications. Second is, as you and I discussed, the business telemetry, which is most important for the team to figure out through product management conversations, ensuring that we understand what business objectives this application is trying to drive. And the third is the aggregation workflow, ensuring that you're reducing the signal-to-noise across these two segments: all the signals coming in from your applications, which can be more natively driven, and then the custom data coming from your business. How are all of them aggregated, and how are they set up in terms of correlation, in order for you to identify, ah, this drop in order rate is because we didn't do X, Y, and Z properly in a certain application? Those three pillars need to come together very well. And I'm fully acknowledging the fact that it's easier said than done when you're thinking of the observability plan. And even more so, one of the things I'll throw in as well is that when you are evolving your application so rapidly, I believe the rate of committing code right now is within a couple of days, or even at a sub-day level today. So, as you're changing code so quickly, being able to ensure that you have the right set of telemetry is also hard, and I'm fully acknowledging that aspect.
But it's almost up to the developer and the product management mindset to come together and make observability a critical aspect of their job. Because I think more often than not, and definitely in previous worlds, when we had on-prem solutions and monolithic applications, the mindset was that observability was an afterthought. And we can no longer afford that, because of how ephemeral these applications have become, and how distributed environments operate. It is very critical for observability to be sort of at the forefront of all of your development overall.
Liran Haimovitch 21:35
I couldn't agree more. I have to admit, I haven't seen that many companies actually manage to get a grip on business metrics, especially in real-time, let alone combine them with technical metrics and release logs and so on. But I wish we were all there; you're imagining a much better world than the one I know.
Sudeeptha Jothiprakash 21:55
I'm fully with you on that. I think this is a problem that, I guess, is also an opportunity for many companies to iterate on. Because ultimately, we need a framework for the way we think about the business objectives of each product update that we do, for example. So, what are we trying to drive? How are we expecting customers to react? And what are some of the key signals that help us understand success? Many companies do OKRs, and OKRs are a great way to think about some of the business objectives that we'd like to drive. And ensuring that we convert these OKRs into observability goals as well is very important, and I think that's a step that I am mindfully trying to inculcate as part of my journey now. But I can see that a lot of businesses are still evolving that mindset, just as you said.
Liran Haimovitch 22:48
Makes sense. Now, there's one last question I would love to ask you, a question I ask all of my guests. So, you were an engineer before you were a product manager, and now you've been a product manager for a long while. You've been around software for a large portion of your life. So I have to ask: what's the single bug in your career that you remember the most?
Sudeeptha Jothiprakash 23:05
Yeah, and I think, without giving a lot of details about what the bug was, there was a point in time where I pushed code which didn't have the component that converted the number of bytes to gigabytes. And this was coming from, let's say, different pipelines. So, I was calling something that was residing in a completely different pipeline, and I expected that conversion metric to be there, or the conversion package to be there. And what this was doing was also helping us bill customers on their logging, and you can imagine how this could scale up so quickly, right? This was a very minute, one-hour bug, maybe. But it could cause a lot of havoc for our customers if they actually saw bills that are a thousand times larger, because every gigabyte is now being represented in terms of the bytes that they used. And we were able to quickly catch this as part of our tests, and our tests sort of showed these crazy numbers, the billions of dollars that it would actually be charging our customers if not for those test situations. So luckily, we were able to roll back and ensure that it didn't impact any production environments. But these are sort of those minute things, right? You want to ensure it works really well, and I learned pretty early on that you should understand your dependencies very clearly and ensure that your dependencies are actually there when you call them.
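As a hedged illustration of the class of bug described here (the function names, prices, and unit convention are invented for the example), the missing conversion and the guard test that would catch it might look like this:

```python
BYTES_PER_GB = 10**9  # decimal gigabytes, common in billing

def bill_for_usage(used_bytes, price_per_gb):
    """Correct version: convert raw byte counts to gigabytes
    before pricing."""
    return (used_bytes / BYTES_PER_GB) * price_per_gb

def bill_without_conversion(used_bytes, price_per_gb):
    """Buggy version: prices raw bytes as if they were gigabytes,
    inflating the bill by a factor of a billion."""
    return used_bytes * price_per_gb
```

A test asserting the expected bill for a known usage, for example that 5 GB at $0.50 per GB comes to $2.50, immediately flags the buggy version, which would charge $2.5 billion instead. That is exactly the kind of "crazy numbers" a pre-production test run surfaces.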
Liran Haimovitch 24:38
Definitely. Interesting story.
Sudeeptha Jothiprakash 24:40
Yeah. Thank you.
Liran Haimovitch 24:42
Sudeeptha, thank you for joining us. It's been a pleasure having you on the show.
Sudeeptha Jothiprakash 24:46
I'm very happy to be here, Liran. Thank you so much for the chat and I really appreciate the conversation.
Liran Haimovitch 24:57
So that's a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show, and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.