The Production-First Mindset

OwnBackup’s Roi Dover- Even Log Lines Need Code Reviews

August 22, 2021 Liran Haimovitch Episode 4

Rookout CTO Liran Haimovitch sits down with Roi Dover, VP of Engineering at OwnBackup.  They discuss how to serve customers better, the evolution from a single product to a multi-product company (and what that means for your R&D team), and how to ensure your R&D team has the proper metrics...and that they’re useable.



SPEAKERS
Roi Dover, Liran Haimovitch

Liran Haimovitch  00:02
Welcome to The Production-First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and Co-founder of Rookout. Today, we're going to be discussing engineering for DLP and data protection. When your SaaS platform serves to back up and protect the data of your customers, you have to step up your game. How do you build and operate your code to uphold that level of trust? We are joined today by our guest, Roi Dover, executive VP of engineering at OwnBackup, a SaaS data protection software company. He is a seasoned leader with over 20 years of experience in software engineering and production. Thank you for joining us today and welcome to the show.

Roi Dover  00:59
Thank you. Thank you very much.

Liran Haimovitch  01:00
Roi, can you tell us a little bit about yourself?

Roi Dover  01:03
Sure. So, I'm Roi, I'm 44 years old, living in Petach Tikva, married plus three. As you mentioned, with over 20 years of experience in the technological realm, mostly around-- I actually started in IT in the late 90s, when it wasn't even called IT. And later, at the start of the 2000s, I actually paved my way in the startup world, from storage companies, data security companies, and cybersecurity companies. I'm very interested in technology and in people; people are, to me, an integral part of being a VP of engineering, basically. Today, I'm acting as an EVP of engineering at OwnBackup. And before that, I was actually 13 years in a cybersecurity company called Varonis, where I also managed the engineering force.

Liran Haimovitch  01:54
As you mentioned, you're working for OwnBackup, which is a company doing DLP for SaaS products, and it's become a unicorn over the past year or so. So, what can you share with us about DLP and what it means today in the world of SaaS?

Roi Dover  02:10
Okay, so basically, we are in the data protection realm, especially in SaaS data protection. Now, there is a misconception about SaaS and … applications, which is that if you're on the cloud, then it means that you're safe. But basically, the infrastructure is safe but the application is not. You're not protected against users deleting data, removing data, changing data. So basically, in the data protection realm where we live, we offer protection for the SaaS application, mostly in the application layer, for example, allowing quick restores for business continuity when users accidentally or maliciously delete data or change it. So by that, and also due to COVID, where most companies started working from home, the need for SaaS data protection increased, because the more companies invest in SaaS products, the better protection they need, and they also need business continuity to be restored very quickly. So this is basically what we offer. And due to that, actually, our sales increased, and the need became more significant. So, with all of that combined with the market requirements, we actually reached unicorn status.

Liran Haimovitch  03:35
And I mean, it sounds very sensitive. You essentially get access to SaaS products, to many core back-office systems of your customers. Isn't that right?

Roi Dover  03:47
Yes, basically, our customers trust us with their data. It's not only about accessing the SaaS applications, but it's also about us storing and handling the data. So, the emphasis on security and compliance is one of our core values, and a first-class citizen in how we approach the development and also the marketing and selling of the product. Because you probably understand that we protect the crown jewels of the customers, the most sensitive business data. Specifically, we are focused today on the Salesforce platform, and we are going to extend to other platforms as well. And these SaaS applications hold the most valuable data for the customer and it is entrusted to us to hold it properly.

Liran Haimovitch  04:39
So, how do you go about ensuring the level of reliability and security your customers expect?

Roi Dover  04:45
So, there are many layers to that. First of all, we have a very mature security team, both for info security and application security, putting an emphasis on one hand on compliance - SOC, ISO - and later also FedRAMP and others. And the other part is the product security, which means, as part of the software development lifecycle, how we ensure that new services have the proper change management, and how we minimize the attack surface, using tools to scan code, using penetration testing, using activities such as all-hands meetings and other aspects to train our developers to have security in their mindset. So, having all of that actually ensures that security is front and center. Other than that, obviously, our cloud data centers are very, very hardened and secured. And this is also a continuous effort to always secure them and improve how we access them.

Liran Haimovitch  06:00
How is it different when you monitor 12 different regions, 12 different data centers across two clouds? How do you go about monitoring that? What kind of tools are you using? How do you manage the data across those environments?

Roi Dover  06:13
Sure, so basically, we differentiate between three types of data that we monitor. The first one is monitoring exceptions and errors. So, we use a tool called Sentry, which catches exceptions and shows us where in the code the exception occurred, and we are actually using it to see how we can reduce false positives of errors and to find bugs quicker. This is one layer. Another layer is using the Elastic stack, which is basically Elasticsearch and Kibana, to monitor logs. Now logs are tricky, because they can contain PII and data information. So, since we are very sensitive to that, and since security and data locality are front and center, we actually have 12 different regions for Elasticsearch in order for us to limit the chance of data escaping data locality. So, this is a bit of a challenge, because the R&D people who are using Elasticsearch and Kibana need to go over 12 different regions to search for issues. The third level is metrics. Metrics, at the standard level, means measuring CPU and memory, and the next level is latency for the jobs, etc. And for that, since there is practically no chance of PII escaping because it's only graphs and dashboards, we use a centralized tool called Datadog to monitor the entire set of regions.
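
[Editor's note: the trade-off described above, with logs kept in per-region stores while engineers still need to search all of them, can be illustrated with a small fan-out search. This is only a sketch and not OwnBackup's tooling: the region names, the in-memory `REGION_LOGS` store, and the `search_region` and `search_all_regions` helpers are all hypothetical stand-ins for real per-region log clusters. Each query runs against one region's own store, and only the matching lines are returned, grouped by region.]

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for per-region log stores.
# In a real setup each entry would be a separate regional cluster.
REGION_LOGS = {
    "region-a": ["job 17 finished", "timeout in backup job 42"],
    "region-b": ["job 9 finished"],
    "region-c": ["timeout in restore job 7"],
}


def search_region(region, term):
    """Search one region's own store and return (region, matching lines)."""
    return region, [line for line in REGION_LOGS[region] if term in line]


def search_all_regions(term):
    """Query every region in parallel, keeping results grouped by region."""
    regions = sorted(REGION_LOGS)
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        results = list(pool.map(lambda r: search_region(r, term), regions))
    # Drop regions with no matches so the output stays readable.
    return {region: hits for region, hits in results if hits}


hits = search_all_regions("timeout")
for region, lines in hits.items():
    print(region, lines)
```

[Keeping the results keyed by region, rather than merging them into one list, preserves the information about where each log line lives, which matters when data locality is the reason for the split in the first place.]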

Liran Haimovitch  07:47
We keep hearing from our customers that agility is key to having good observability, that you're always trying to add and remove logs and metrics to, on one hand, get very good visibility into what matters, and that often changes as you're releasing new features and as the patterns of usage change. On the other hand, you're always trying to control the costs. I mean, running Elasticsearch clusters, paying Datadog by the metric. Those can get pretty expensive, pretty fast, especially at your scale. So, how do you go about continuously optimizing your observability?

Roi Dover  08:21
So, I think that the key here is continuously optimizing, because this is a never-ending story, you always need to continue reviewing the logs and modifying and optimizing. And it's also-- you know, a maturity thing, because, for example, in the beginning, the logs were not very meaningful. But since we added the engineering to be part of the reviewing of the logs and building metrics on top of it, they actually got value from it. So, one aspect is making sure that the R&D team is using logs not only when something happens, when an incident happens, but as part of the current software development lifecycle. And also, as part of the definition of done of stories or epics, to make sure that we have the proper metrics and that they are actually usable, and we are actually also reviewing this as part of the demo with the other developers. So this is one thing. The other thing is, again, you know, you can't mark all of the checkboxes. So, sometimes the Kibana index explodes and we need to investigate and see how we can reduce it, and we had a few projects to see how we can remove sort of garbage logging from our logs to make sure that we don't overwhelm the indexes and also get insights. So, it is actually again, as any-- I think any software company, this is massive, and it is a continuous effort to improve both, mostly about the usability and the fact that we should get to the actual incident before it occurs or before it manifests itself so that customers can see it, and the cost is always in second place. The first thing is making sure that the logs are readable and give insight so that we can serve our customers better.

Liran Haimovitch  10:10
Makes perfect sense. Speaking of customers, you have over 3000 customers. How have your processes changed as the company has grown, as the number of users and number of customers you're working with has grown over the past couple of years?

Roi Dover  10:25
So again, I think that as you grow in customers, you need to understand that you can't throw manpower at solving stuff. For example, usually when the customer count increases, the number of cases and tickets increases. So, the natural thing would be to throw in more manpower, more support guys, more tier four guys. One of our objectives at OwnBackup is to be able to scale to a 10x scale without 10x the people. So, as part of that, we are continuously looking into optimizing. Some of these optimizations are analyzing escalations of tickets, and making sure that the tickets are properly handled in the different tiers. The other is that we have a tier four team, which actually shields the engineering from the necessity to get on each ticket, allowing them some kind of slack for development, while tier four actually responds, either proactively or reactively, to the customers. And the last thing is obviously, as we mentioned before, investing more in monitoring, in metrics and logs, to make sure that the production is working flawlessly, and to make sure that we automate some of our past manual operations, such as restarting services or analyzing stuff, so that we can react quicker. Again, the more customers you have, the more pressure you have to fix things quicker. And, as I mentioned, one of our objectives is not to scale with more people, but to be more efficient.

Liran Haimovitch  11:55
So, how do you ensure that your tier four team is efficiently handling those issues in production?

Roi Dover  12:01
So, we measure the number of escalations, which of them are bugs, which of them are unjustified. For example, if an escalation which doesn't resolve to a bug is escalated to tier four, from my point of view, it's unjustified. We need to either automate something or train the lower tiers of support so it is easy for them to take it next time. Other than measuring the tickets, we are measuring the incidents in production, and doing root cause analysis to understand how we can improve how we handle and react to the incidents, and also prevent them next time. So our tier four, as I mentioned before, is both a proactive team working on the production, on the monitors, going through logs to see or identify potential issues, and a reactive team that reacts to actual tickets from support or customers.
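
[Editor's note: the rule stated above, that an escalation to tier four counts as justified only if it resolves to a bug, can be expressed as a simple metric. The sketch below is illustrative only and not the team's actual tooling: the `Escalation` record, its field names, and the sample tickets are hypothetical, and real data would come from the ticket tracker.]

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    """Hypothetical record of one ticket escalated to tier four."""
    ticket_id: str
    resolved_as_bug: bool


def escalation_summary(escalations):
    """Count justified (bug) and unjustified (non-bug) escalations."""
    total = len(escalations)
    justified = sum(1 for e in escalations if e.resolved_as_bug)
    unjustified = total - justified
    # Guard against division by zero when there were no escalations.
    rate = unjustified / total if total else 0.0
    return {
        "total": total,
        "justified": justified,
        "unjustified": unjustified,
        "unjustified_rate": rate,
    }


sample = [
    Escalation("T-1", True),
    Escalation("T-2", False),
    Escalation("T-3", False),
    Escalation("T-4", True),
]
summary = escalation_summary(sample)
print(summary)
```

[A rising unjustified rate would point to the follow-ups mentioned in the answer: automating a recurring fix or training the lower support tiers.]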

Liran Haimovitch  12:51
You keep mentioning that you're measuring stuff, and I feel it's super important to measure the performance of your teams to measure the performance of your services. How do you go about creating and analyzing those measurements?

Roi Dover  13:04
So, I think that the first thing about measurements is asking the right questions and having the goals. Measurement without a goal in mind is just numbers and colors of charts. So, it's first about the objectives, what we want to achieve. And from the production perspective, we want to achieve maximum uptime, the best SLA for customers, and also, in terms of the operation, we want to achieve not only the speed of solving a ticket, but also the quality of that. Another key point about measuring stuff is always measuring a thing from various or different directions. If you're measuring something from a certain angle, it tends to distort the truth. For example, if I only measure the speed of closing a ticket, then everyone will close tickets very quickly, but not with high quality. So, the first thing is obviously defining the goals and objectives. And we have about two of those, which I shared with you. The other is collecting the data. We use Jira as our source of truth for all R&D stuff. So we use Jira to track the tier four escalations, to track bugs, escaped bugs from customers, etc. And once you have that, the road to measuring is very easy, because there are a lot of tools on top of Jira that you can use to measure and, as part of the measurements, actually do a review and think about the next corrective actions. We actually use a plugin in Jira called eazyBI, which is very, very robust, and which allows us to see the data from various aspects and measure not only the tier four, but also the engineering force for productivity, quality, etc.

Liran Haimovitch  14:50
So, you've grown from a single-product company to now having multiple products. What does it look like when your team, your engineering team, this whole team, the sales teams you're working with, are no longer focused on just one thing? How is it changing the way you work?

Roi Dover  15:08
So, that's a great question. Because we continue growing, we currently have three products in production, mainly around the same ecosystem. Again, we are in the realm of data protection so we have three products around that. And as part of data protection, specifically, currently, as we discussed backup, we're still in business continuity. There are other realms to data protection, such as security and data analytics, and we are definitely looking into expanding our reach to other parts of the data protection realm. So, actually having several teams on different products may be challenging and may not. The trick here is to build the team in such a way that the level of dependency will be minimal. And we have actually been successful in limiting the dependency between each team, allowing each team to be very independent and move very quickly without reliance on the other teams. Recently, our goal is also to unify some of the products under the same suite. And we do find that in our current state of growing or hyper-growing, we need more dependencies between the teams. So, we are actually evolving our methodology of work, in order for us to first detect dependencies, and later on to mitigate dependencies, making sure that everyone is aligned, again, using the same tools such as Jira, etc. So, it's actually split between various factors. One is culture, the other is tools. So from the culture perspective, we still want to allow minimal dependency, but we understand that there are dependencies. So, we are currently looking at ways to do the dependency mapping in such a way that will be very, very lightweight. Again, we are a startup company, we want to move very quickly, deliver quality products very quickly, and deliver value to our customers. So, it's always a balance between process, and actually culture and people.
So, we are very, very keen on making sure that we keep the balance and don't lose our agility, but still being able to cope properly in an organized way.

Liran Haimovitch  17:22
Speaking of agility, speaking of culture, you're trying to keep the teams apart from an engineering perspective, allowing each of them to build and deliver on their own. But you do want them to share a set of values, you want them to share the culture, you want them to share the security practices you're kind of beholden to. How do you go about ensuring that cohesion?

Roi Dover  17:43
We have the managerial level, where we all meet and discuss; we are not working in silos. So, there are other cross-cutting initiatives such as security and process planning. For example, we are using a process taken from the SAFe methodology, which is the Scaled Agile Framework, which is called PI Planning. This allows us to have a single point in time where all the teams start planning the next program increment, which is currently a quarter, and at this point in time, everyone aligns on what each of the teams is doing. And we are also using a tracking meeting each week to make sure that we are on track for each of the goals, and also to mitigate risks as well as find dependency points. So, there are points where we define the architecture, the architecture principles, and we have a chief architect, and the points of security, either in the software development lifecycle process or inside the teams where they need to do the design review, for example. So, there are points that are the same across the organization, which is about making sure that we are working the same way. Yet, on the actual work itself, in how the teams develop, in terms of which design pattern to use or the low-level design, as long as they conform to the upper-level definitions, they are free to do whatever they or their manager want, in order for them to be agile and very quick with minimal limitation. And also, one important aspect is making sure that the team itself feels enabled and has the capability to run forward with minimal limitations.

Liran Haimovitch  19:32
I think empowering the team is so important and I think we were focusing on that. One of the key elements is to get the team to see beyond the code. The team is not just responsible for the code that they're writing, but for the service that they're building and delivering to the customers. How do you go about ensuring that your engineering team is focused on the end product or the end service and they're feeling empowered to do so, not that they're kind of detached from the code once it leaves their laptops?

Roi Dover  20:03
So again, I think that, as I mentioned previously, their involvement in the logs and monitoring actually binds them more into the production. But we do other things as well. We are using the OKR framework, the objectives and key results framework. And one of our objectives, starting from Q3 last year, was making sure that the engineers are also part of discussions with customers or meeting customers on Zoom, for example. We believe that this actually allows them not only to, you know, understand the product or how it meets the actual market, but also to get feedback and see that there is an actual person who is using the product, and to think about how to make the actual customer's life easier and better when they understand that. This is a key concept, because I agree that currently there are engineers who are disconnected, because they write the code, they deploy it and that's it. They don't understand the entire picture. We believe that when an engineer understands the entire process, the entire picture, and understands that there is a customer at the end, then his mindset is that each line of code committed to the Git repository, for example, makes its way to a customer and makes an impact. So, we believe that by that, also closing the loop when engineers actually meet customers, their mindset will change to actually create an even better product.

Liran Haimovitch  21:37
Now, that's a lot of processes and a lot of culture, a lot of things you're trying to bake into this engineering team you're building. I think that's very impressive. And over the past few months, you actually acquired two companies, and you kind of have to throw them into the mix and get them aligned with everything you're doing and growing, as you're moving. What's it like?

Roi Dover  22:00
Challenging. So, it is very challenging to … and do an M&A. Because you're actually taking a group of people who are culturally different, because each company is different, and they have been accustomed to working a certain way. And they have been accustomed to using different technologies. And not all of the companies ever had a SaaS product or went to production. So, it's actually very, very challenging. First of all, there's the human side of integration, which is the most important because otherwise it would fail. You need to make sure that the teams are well accepted, from both sides. Both from the new ones and the current employees, because, you know, there's a bunch of people that are coming to the office. And the current employees also need to feel safe and part of the process, and not that they are-- you know, getting any kind of team without their knowing and without their engagement. So, this is one part which is super, super important. And once you get that and people are aligned, then what comes next is easier. You know, it's how we do stuff. So, we start small, giving them small stories, user stories, or even bugs, so they start to feel the technology, the pipeline, the process, the quality, and then building it from there. The first thing that we did, by the way, was creating a 30-60-90 plan or a six-month plan of integration, making sure that each day is planned for the first two weeks, with buddies who would come in and join the teams for the sessions, the introductory sessions with other parts of the organization, and making sure that each of them understands the plan for their integration into the company, which is very, very important. And from that, well, you know, it's a work in progress. It takes time to integrate external parties, but I'm fairly confident that we are on our way to success.

Liran Haimovitch  24:01
Now, how is it different when working in hybrid mode, when working offshore? I mean, with COVID-19 we are all spending so much time apart. How is it changing things?

Roi Dover  24:14
So, COVID-19 actually did a few things. First of all, it showed us that work from home is possible, but also that we need to invest in face time; Zoom is not enough. So basically, I think that all R&D teams around the world actually developed the skills to work remotely. I think some are more successful, some may be less, but it taught us a lesson on how to be more effective even when we're not face to face, and how we can shorten the communication cycle. You know, people in the office, usually when they need to ask someone, they simply go to him and ask. When you're remote it's not the same. So, you need to be trained in using Slack or Zoom or whatever. So, it actually taught us how to work remotely. Now, working remotely within the same nation or region is sometimes different from working remotely in, you know, other parts of the world, mainly because, you know, people are people. So, there are cultural differences, which even if you use Zoom you need to be aware of. And also there's the language barrier. Because, you know, in Israel, we usually speak Hebrew. And now we have a team in India. So, we currently need to switch to English, not only for the team meetings, but for all the town halls and all the engineering meetups. Now, we need to move to English, and we also need to think about time zones, because the time zones are different. So it is very challenging. So, on top of us, I think, being successful in mitigating all of the challenges of COVID-19 in terms of remote work, we are also improving in how we interact remotely, across time zones, as well as different cultures. One last thing about that is that when the limitations were actually reduced by the government, it was very important for me that people actually have face time in the office, mostly working together at least one or two times a week, and also doing a bunch of fun events.
And I was very pleased to see people in the office, playing foosball rather than actually coding, because then I understand that they are actually interacting and this actually also makes the connection to the company.

Liran Haimovitch  26:45
Definitely, definitely. Before we finish up, a couple of last questions I want to ask you. So, there's this one question I ask everybody: what's the single bug you remember the most from your career?

Roi Dover  27:00
Any bug? Wow, I have a very long career. A single bug? Okay. So, the bug wasn't actually mine. But it is something that I remember and also remind people of. Back in Varonis, we released a version to production and there was actually a recall of the version because one of the products, one of the cybersecurity products, actually took all of the CPU; it actually had an endless loop. And the bug was fairly simple. One of the developers actually wrote a for loop for a logging routine, and forgot to do a code review. And the reason why I'm actually mentioning this is that you should treat any line of code as something that can break your production. It doesn't matter whether it's for logging, or for actual code. And I asked the developer, "Who did the code review?" He said: "No one. Why? Because it's only a logging line." And I always mention it to newer developers that each line of code can break your software, no matter whether it's logging or not. You always need to make sure you do the code review, you make sure the code is right. Otherwise, you'll pay a heavy price.

Liran Haimovitch  28:20
Yeah, I mean, every single line of code, no matter how innocent-looking it might be, no matter how negligible it might look at first sight, can have a huge impact on the software, and logging and observability are no exception to that.

Roi Dover  28:33
Yeah, that's right.

Liran Haimovitch  28:35
But before we wrap this up, what does the future hold?

Roi Dover  28:40
Well, there's a lot more ahead of us than behind us. As I mentioned, we are in the data protection realm, and it is a vast realm. And we have a lot of data today, because we back up a lot of data: more than 3000 customers, over 12 petabytes of data which is managed. And this is only the first step. The other steps are using and leveraging our data to give insights to customers using some sorts of data analytics, for example, expanding our reach to…systems. Other than Salesforce, for example, which are out there for the SaaS Data Protection, investing in security, for example, because security is part of data protection. So, we have a lot of stuff we want to do. And like many other companies, we're always on the lookout for talent, and for more acquisitions of companies, and it's only the beginning for us.

Liran Haimovitch  29:34
Sounds promising. Thank you again for joining us.

Roi Dover  29:38
Thank you for having me. It was a pleasure.

Liran Haimovitch  29:40
It was a pleasure having you. Everybody listening, OwnBackup is hiring. So, feel free to go check them out. So that's a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.