The Production-First Mindset

Indeni's Alon Ashkenazi - The Next-Gen of Infrastructure As Code

December 19, 2021 Liran Haimovitch Episode 21

Rookout CTO Liran Haimovitch sits down with Alon Ashkenazi, VP of Engineering at Indeni. They discuss the next generation of infrastructure-as-code security, complicated cloud architectures and how to control their different aspects, and helping your team manage the combination of new and old.

SPEAKERS
Liran Haimovitch, Alon Ashkenazi

Liran Haimovitch  00:02
Welcome to The Production-First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and co-founder of Rookout. Today, our guest is Alon Ashkenazi, who's the VP of Engineering at Indeni, which is building the next generation of infrastructure-as-code security tools that help companies shift left. He especially cares about development processes and designing the right CI/CD to increase quality and shorten delivery time. In his free time, he loves to take on projects such as building a smart home, a tree house, and whatever other projects need handcrafting. Thank you for joining us, and welcome to the show.

Alon Ashkenazi  00:57
Hey Liran, happy to be here.

Liran Haimovitch  00:58
So, Alon, can you share with us a little bit about yourself?

Alon Ashkenazi  01:02
Sure. My name is Alon Ashkenazi, from Herzliya, married to Sharon, and I have three wonderful kids, two girls and a boy. I started my career in the Air Force. After that I moved to DoubleVerify, which measures and ensures that your digital media is served as intended. I played multiple roles there; the last one was VP of Engineering. Four years ago, I joined Indeni. In the last two years, we have been building a product called Cloudrail, which is the next generation of infrastructure security.

Liran Haimovitch  01:31
Awesome. It's great having you here. So from your perspective, one of the key elements of managing cloud and production infrastructure is infrastructure-as-code. Why is that?

Alon Ashkenazi  01:42
First, let's understand what infrastructure-as-code is. So, you have cloud-native, and you have resources there. One way to create resources is to go to your console, let's say the AWS console, click all kinds of buttons, and then create your resources. This can be very complicated and cumbersome; think about an environment that has thousands of resources. Not only that, you want this environment to be duplicated to your dev, staging, or whatever environment, so you can have multiple environments that are basically the same. Using infrastructure-as-code, you can easily create the environment by just clicking a button or running a command, you can integrate it into your CI/CD processes, and you can ensure that all environments are aligned and look the same. Besides that, it helps with a lot of development processes. When we adopted infrastructure-as-code, we took the opportunity to integrate it into our CI/CD in a way that gives each developer their own environment. So when developing a feature, the developer has full ownership of that feature end-to-end. This means including the cloud infrastructure resources, using Terraform, and obviously the feature itself. So before the developer creates a pull request, they have already tested it end-to-end on an environment that looks like production. This gave us a huge improvement in our development processes.

Liran Haimovitch  03:05
From your perspective, one of the key elements of managing cloud and production environments is infrastructure-as-code. Why is that?

Alon Ashkenazi  03:13
So first, let's understand what infrastructure-as-code is. Infrastructure-as-code is the approach of managing your cloud resources using code. It can be done either by scripting or by declarative definition, rather than by manual processes. It gives you the ability to design, implement, and deploy application infrastructure with known tools and software best practices, and it applies both to software developers and to IT infrastructure administrators. For a long time, there has been an approach in startups, and in high-tech companies in general, that if you build it, you own it. What we are familiar with is that I build the feature, so I own the feature. With infrastructure-as-code, I also build the cloud infrastructure. So for me, the development process does not start with the feature; it starts with the application's cloud infrastructure. This helps developers speed up the development process. It means that they don't need to wait for some kind of special DevOps team, and beg and wait for them to create the cloud resources. They can just write the Terraform code, the infrastructure-as-code, and deploy it by themselves.

Liran Haimovitch  04:19
Now there are many tools for infrastructure-as-code, whether you are a fan of Terraform or CloudFormation, or even newer stuff such as Pulumi. But at the end of the day, infrastructure-as-code doesn't live by itself. Quite often you have other stuff in the environment that's more manual, whether it's key components that have been defined manually from the console or older systems that are more traditionally defined. How do those elements coexist?

Alon Ashkenazi  04:45
Yeah, so this is a great question. I will start by saying that we had the luxury of starting a project from scratch. So we decided from day one to start with infrastructure-as-code, as this is also our product. But even for us, there were some cases where infrastructure-as-code just wasn't the right tool at that specific time. So we are also managing some resources in the cloud manually. I'll give an example. While the entire product is built with infrastructure-as-code, some network connections between different cloud accounts and different VPCs are done manually. We did it that way just because it didn't make sense to create Terraform resources for that. Besides that, companies that already exist in the cloud and want to start working with infrastructure-as-code have a few options. First, new projects can start with infrastructure-as-code and be integrated into the existing cloud environment. But there are also tools that can help you create IaC code from your existing resources. While these tools might be very useful, as with any software architecture and cloud architecture, there are always special cases for each company. So they don't work 100% of the time, and they don't save you all the work that you need to do, but they definitely help with a lot of the repetitive work.

Liran Haimovitch  06:06
So you're saying that we have those reverse-Terraform tools that allow you to scan your AWS, Azure, or GCP account and generate pretty good Terraform code out of that. So why doesn't everybody just, you know, migrate to Terraform and be done with it?

Alon Ashkenazi  06:23
So first, you need to understand the technology. The fact that you have a tool that you can run to create IaC resources from your cloud doesn't mean that you understand the IaC tool that you're using, or its lifecycle, which is very important to understand. So there is this gap: you need to understand the technology you're using. And besides that, like I said before, because the cloud architecture can be very complicated, the existing tools don't work in 100% of the cases. Obviously, for simple cases they fit and they work; for very simple infrastructure, for example, they work. But when you have a very complicated cloud architecture and you want to control its different aspects and different cloud configurations, those tools just don't give you a good solution.

Liran Haimovitch  07:10
So how do you see teams go about managing that combination of the new and old, the manual and the automated in the same AWS account or in connected accounts?

Alon Ashkenazi  07:20
So first and foremost, it's the ownership and the tools that you have in order to solve your issues. I will separate it into two different points. The first is the speed at which you can develop and deploy cloud infrastructure. When you do not manage your cloud infrastructure with infrastructure-as-code, it's a longer turnaround. It takes a lot more coordination between different teams, the ownership is spread across different teams most of the time, and it's just a longer cycle. When you use infrastructure-as-code, everything can be built in the same pipeline, in the same CI/CD. That means you deliver your cloud infrastructure changes as part of your pull request, alongside your feature. So this is the first aspect. The second one is security. When you find a security vulnerability in the cloud environment, it's much more expensive to fix, not to mention the exposure and the risk that you already have because it's in production. Whereas when your cloud resources are built using infrastructure-as-code, you can find security vulnerabilities much earlier in your process. This is called shift left. You can find them early, at dev time or in CI/CD, and fix them. And obviously, it's much, much cheaper to find them there.

Liran Haimovitch  08:39
At Indeni, you're building a security platform just for that, a security platform for securing infrastructure-as-code. But from what I've heard, you're thinking of things a bit differently than other infrastructure-as-code companies out there?

Alon Ashkenazi  08:52
Yeah, that's correct. Most of the infrastructure-as-code security tools out there do static code analysis. That means they look at your code, the static code that you commit to GitHub or any Git repository, and scan it to find security vulnerabilities. In our solution, by contrast, we take both the existing cloud resources and the infrastructure-as-code resources, merge them together, and build a graph of dependencies between managed and unmanaged resources. This way, we get a better understanding of your cloud architecture, as well as fewer false positives. I will give you an example. Think about a scenario where some of your resources are managed in IaC, and some of them already exist, unmanaged, in the cloud. For instance, you have a server that is exposed to the internet, and you want it to be exposed to the internet for various reasons. This server is not part of your infrastructure-as-code, meaning that it exists only in the cloud. And in your infrastructure-as-code you're adding a database, and this database sits in the same subnet as the EC2 instance. What will happen is that once you deploy this database into the same subnet as the EC2 instance, you will suddenly have public access to this database, because whatever can connect to your EC2 instance can now connect to the database, since it's sitting in the same subnet. So if you do not have the full picture of your cloud resources and your infrastructure-as-code resources, you will not be able to catch this vulnerability. Most of the tools will not be able to find it, whereas Cloudrail will do the merge, stitch the two together, build the dependency graph, and then give you a vulnerability scan of the entire ecosystem.
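
As a rough illustration of the scenario Alon describes, here is a minimal Python sketch of merging unmanaged (live) resources with planned IaC resources before scanning. The `Resource` type, the `find_exposed_databases` helper, and the "same subnet as an internet-facing resource" rule are all assumptions made for this example; they simplify real network reachability and are not how Indeni's product is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str                      # e.g. "instance" or "database"
    subnet: str
    managed: bool                  # True if defined in infrastructure-as-code
    internet_facing: bool = False


def find_exposed_databases(live, planned):
    """Merge live and planned resources, then flag databases that share
    a subnet with an internet-facing resource (a simplified rule)."""
    merged = list(live) + list(planned)
    public_subnets = {r.subnet for r in merged if r.internet_facing}
    return [
        r.name
        for r in merged
        if r.kind == "database" and r.subnet in public_subnets
    ]


# Unmanaged server that already exists in the cloud account.
live = [Resource("web-server", "instance", "subnet-a",
                 managed=False, internet_facing=True)]
# Database that the IaC change is about to add.
planned = [Resource("orders-db", "database", "subnet-a", managed=True)]

# Looking only at the IaC code misses the problem...
static_view = find_exposed_databases([], planned)
# ...while the merged view surfaces it.
merged_view = find_exposed_databases(live, planned)
print(static_view, merged_view)
```

The point of the sketch is only the difference between the two calls: the same rule produces a finding once the unmanaged resource is part of the picture.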

Liran Haimovitch  10:44
Seems like you have to map every AWS resource within the account, potentially even across accounts if they're cross-connected. So kind of how do you go about that? It seems pretty tricky.

Alon Ashkenazi  10:55
So what we are doing is collecting the cloud resources from our customers' accounts. We ask for a specific role, an IAM role that has very, very low permissions, to read all the account data. Obviously, we care only about the resources and not the actual data itself. And we created an open-source project called DragonEye that does exactly this: you give it a role, you give it permission, and it just knows how to make API calls and fetch your cloud account data. What we do later on is create a context from this cloud account data: we create a graph that describes your resources and how they are connected to each other.

Liran Haimovitch  11:35
You mentioned DragonEye, that's an open-source project?

Alon Ashkenazi  11:38
Yeah, it's an open-source project. We started by using a different open-source project called CloudMapper, which did what we needed, but only for AWS. Since we are multi-cloud, and we needed this capability across all the big cloud providers, we decided to create an additional open-source project that does the exact same thing, but in a more performance-oriented way, and with support for multiple cloud providers. You can use DragonEye as a CLI, or you can integrate it as an SDK.

Liran Haimovitch  12:10
And have you seen other users adopt DragonEye as an open-source project?

Alon Ashkenazi  12:14
So DragonEye has existed for about five or six months. We did see a few users start using it, but no major usage yet.

Liran Haimovitch  12:23
So open source has to start somewhere.

Alon Ashkenazi  12:25
Yeah, definitely.

Liran Haimovitch  12:26
I mean, you mentioned that you are getting access to your customer’s account, AWS account, obviously a very limited role, but still are getting some level of access, so kind of how do you go about managing and securing that?

Alon Ashkenazi  12:39
Yeah, this is a very, very important item. We have worked with our customers to understand what is sensitive for them. Not only are we taking the cloud account data with this role, we are also taking some of the IaC code, which can also contain sensitive data. So we've worked with our customers and iterated on our solution to ensure that we do not take sensitive data. For instance, a lot of customers use tags with all kinds of names, such as their service names, and they don't want us to know what those names are. And obviously there are IPs, passwords, and things like that. So we built into our solution a process that obfuscates, removes, and eliminates sensitive resources.

Liran Haimovitch  13:23
Now, what about the Terraform plan itself? Obviously, the Terraform plan and the Terraform state include sensitive resources as well.

Alon Ashkenazi  13:31
Yeah, so Terraform definitely contains sensitive data: IPs, names, and other things. What we did in our solution is filter the Terraform plan on the customer side. We have a CLI that processes your Terraform plan, extracts whatever we need, and cleans up and obfuscates the sensitive data. After that, we send the data to our services.
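
To make the idea of client-side filtering concrete, here is a minimal Python sketch of obfuscating a resource before it leaves the customer's machine. The `SENSITIVE_KEYS` set, the `obf-` prefix, and the `obfuscate` helper are hypothetical choices for this example and are not taken from Indeni's CLI.

```python
import hashlib
import re

# Hypothetical list of attribute names whose values should never leave
# the customer's environment in clear text.
SENSITIVE_KEYS = {"password", "tags", "name"}
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _mask(value):
    """Replace a value with a short, deterministic digest so equal values
    still correlate without revealing the original text."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
    return f"obf-{digest}"


def obfuscate(node, sensitive=False):
    """Walk a JSON-like structure, masking strings under sensitive keys
    and any IPv4-looking substrings elsewhere."""
    if isinstance(node, dict):
        return {
            key: obfuscate(value, sensitive or key in SENSITIVE_KEYS)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [obfuscate(item, sensitive) for item in node]
    if isinstance(node, str):
        if sensitive:
            return _mask(node)
        return IP_PATTERN.sub(lambda match: _mask(match.group()), node)
    return node  # numbers, booleans and None pass through unchanged


resource = {
    "type": "aws_instance",
    "private_ip": "10.0.0.5",
    "tags": {"Name": "billing-service"},
    "password": "hunter2",
    "count": 2,
}
print(obfuscate(resource))
```

A plain unsalted hash is used here only to keep the sketch short; low-entropy values such as IP addresses can be brute-forced from it, so a real tool would more likely use a keyed hash or drop such fields entirely.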

Liran Haimovitch  13:56
Now, you mentioned earlier on that you're building a dependency graph between the unmanaged resources and the IaC plan. Obviously, the first time around, it's very, very easy. You can map out the cloud, you see the unmanaged resources, you have the managed plan, and you can very easily correlate them. But the next time around, some of the managed resources are already deployed, and the CI/CD is only going to make a small change to them. So you have to map out everything that's in the cloud and figure out what's part of the existing plans versus what's unmanaged. How do you go about that? How do you go about figuring out what goes where, what belongs where, so that you can build the dependency graph correctly and identify the changes that are going to take place?

Alon Ashkenazi  14:42
Yeah, so for that, you need to understand the internals of your IaC tool. You mentioned Terraform, but this is also applicable when you're using a different IaC tool such as Pulumi or CloudFormation. Specifically about Terraform: the Terraform plan actually tells you what it is about to do. If you open the Terraform plan, it's a simple JSON, and you will see that it has sections for before, after, and what type of action it is going to take. So it's telling you, in an objective way, what currently exists in your cloud account, which IaC-managed resources exist in the cloud account, and what is going to happen after you apply this Terraform. So it's actually telling you what it's about to do. After processing this information and taking the cloud data, we can correlate the two.
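
The before/after/action structure Alon mentions can be read from the JSON form of a plan (what `terraform show -json` prints for a saved plan file), where each entry under `resource_changes` carries a `change` object with `actions`, `before`, and `after`. The Python sketch below walks that structure; the `summarize_plan` helper and the sample plan contents are made up for illustration.

```python
import json


def summarize_plan(plan_text):
    """Return (address, actions, before, after) for every resource the
    plan would actually change, skipping no-op entries."""
    plan = json.loads(plan_text)
    summary = []
    for resource_change in plan.get("resource_changes", []):
        change = resource_change["change"]
        actions = change["actions"]
        if actions == ["no-op"]:
            continue
        summary.append((
            resource_change["address"],
            actions,
            change.get("before"),   # None when the resource is being created
            change.get("after"),    # None when the resource is being deleted
        ))
    return summary


# Hand-written sample in the shape of a JSON plan; values are invented.
SAMPLE_PLAN = json.dumps({
    "format_version": "1.0",
    "resource_changes": [
        {"address": "aws_db_instance.orders", "type": "aws_db_instance",
         "change": {"actions": ["create"], "before": None,
                    "after": {"engine": "postgres"}}},
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["update"],
                    "before": {"instance_type": "t3.micro"},
                    "after": {"instance_type": "t3.small"}}},
        {"address": "aws_vpc.main", "type": "aws_vpc",
         "change": {"actions": ["no-op"],
                    "before": {"cidr_block": "10.0.0.0/16"},
                    "after": {"cidr_block": "10.0.0.0/16"}}},
    ],
})

for address, actions, before, after in summarize_plan(SAMPLE_PLAN):
    print(address, actions, before, "->", after)
```

Entries with a `before` value describe resources that already exist under IaC management, which is the information needed to tell managed resources apart from unmanaged ones in the live account.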

Liran Haimovitch  15:31
But this information is not available within the Terraform layout itself. It's not available in the Terraform configuration.

Alon Ashkenazi  15:38
Exactly. And this is why our solution is based on the Terraform plan and not the code itself. From the code itself, you cannot know what it's about to change; you can only know that when you look at the plan.

Liran Haimovitch  15:49
So how do you integrate with your customers to enable that?

Alon Ashkenazi  15:52
So we are giving them a CLI. As I said before, the Terraform plan can contain sensitive data, so all the pre-processing of the Terraform plan is done on the customer side. We deliver a CLI that takes this plan, tears it apart, understands the resources that it has, obfuscates the data, and then sends it to our SaaS service. In the SaaS service, on our back-end side, we take the cloud account data and create this graph, doing the stitching and connecting the points together.

Liran Haimovitch  16:23
That sounds pretty complex. And I think the reality is that even if things start out with infrastructure-as-code, people tend to change them. Sometimes there is an error that you're trying to troubleshoot. Sometimes you're trying to scale up or down, change an IP, or whatever. And I'm sure you're encountering a lot of configuration drift. Obviously, some of it goes to the unmanaged resources, which, you know, is what it is; there's not much to do about it. But I'm guessing you're going to be seeing some drift in the managed configurations as well.

Alon Ashkenazi  16:53
Yeah, this is a whole new feature and capability, and there are tools that offer this solution: drift detection. At a high level, drift is when you have a managed resource, managed meaning by IaC, infrastructure-as-code. You apply this resource to the cloud, and then for some reason you have a production issue that you need to solve quickly, and you don't have time to wait for your entire CI/CD cycle. So you break the glass and just make the change manually. What happens is that sometimes people forget that they made this change, and they do not open a ticket to make the fix in Terraform, or they do not revert it. This is a very dangerous scenario, because the change can open up more permissions to some specific server or whatever. So what we are able to do, because we understand what your IaC looks like, is correlate the information and see that there is a change, a drift, between the IaC resource and what actually exists in the cloud.
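
Drift detection as described here boils down to comparing what the IaC says a managed resource should look like with what is actually deployed. The following Python sketch shows that comparison on plain dictionaries; the `detect_drift` helper, the resource identifiers, and the attribute names are invented for the example.

```python
def detect_drift(expected, actual):
    """Compare IaC-declared attributes with live attributes.

    Both arguments map a resource id to a dict of attributes. The result
    maps each drifted resource id to {attribute: (declared, live)}.
    """
    drift = {}
    for resource_id, declared in expected.items():
        live = actual.get(resource_id)
        if live is None:
            # The resource was deleted outside of the IaC workflow.
            drift[resource_id] = {"<resource>": ("present", "missing")}
            continue
        differences = {
            attribute: (value, live.get(attribute))
            for attribute, value in declared.items()
            if live.get(attribute) != value
        }
        if differences:
            drift[resource_id] = differences
    return drift


# What the infrastructure-as-code declares.
expected = {
    "sg-web": {"ingress_cidr": "10.0.0.0/16", "port": 443},
    "db-orders": {"publicly_accessible": False},
}
# What the cloud account reports after a manual "break the glass" change.
actual = {
    "sg-web": {"ingress_cidr": "0.0.0.0/0", "port": 443},
    "db-orders": {"publicly_accessible": False},
}

print(detect_drift(expected, actual))
```

Only attributes declared in the IaC are compared, so extra attributes reported by the cloud provider do not count as drift in this sketch.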

Liran Haimovitch  18:04
And that's super interesting. It seems like your product is quite complex. I mean, you have the CLI, you have the cloud platform, you have tons of stuff. And I was wondering what it looks like behind the scenes. How did you go about building that product?

Alon Ashkenazi  18:18
So, we decided to go with a serverless architecture for various reasons. First, scaling: we can have a lot of requests in a single moment, and we didn't want to handle all kinds of auto-scaling or build our own scaling functionality. So we decided to go with Lambda, and basically we can scale to infinity. Then there are a lot of features and capabilities that we need to build, such as user management. So we chose to go with a serverless architecture, meaning not the Serverless Framework but serverless architecture in general. We are utilizing Lambda, containers, API Gateway, and Step Functions in order to build our SaaS solution.

Liran Haimovitch  18:19
So how did you go about building the product architecture for Indeni?

Alon Ashkenazi  18:37
So we chose to go with a serverless architecture. Don't confuse it with the Serverless Framework; we chose serverless, meaning Lambda and containers. Our services are stateless, obviously besides the database, which needs persistence. So whenever we could, we chose to use Lambda, ECS, API Gateway, Cognito for user management, and Step Functions to manage all kinds of flows in our systems.

Liran Haimovitch  19:36
I know developing and debugging in a serverless environment can be quite challenging, whether it's a development environment where everything is remote and you can't connect a debugger, let alone a production environment where you have these functions and step functions and containers, and everything's kind of interconnected. How do you go about debugging and troubleshooting in those environments?

Alon Ashkenazi  19:55
Yeah, so obviously, you need to have a very, very good monitoring system and logging capabilities. And you should have a way to use your logging and monitoring to track down the root cause. So what we did was put in place a serverless monitoring system called Lumigo. Every time we have an issue in one of our serverless components, whether it's Lambda, API Gateway, or containers, we get a notification in the system. Moreover, we put together logging capabilities that we can analyze and query to track down the root cause.

Liran Haimovitch  20:35
You've mentioned earlier on that you're releasing your own CLI tools. So it's not just about managing the code running within your serverless environment, within your cloud. You also have to take care of packaging, deploying, rolling out, and troubleshooting that CLI, which is deployed all across your customer base. So what are the challenges around having your own CLI tool?

Alon Ashkenazi  20:58
So I treat the CLI as an on-prem product. It's something that you pack and deliver, and you don't have any control to change it. You need the customer to update in order to deliver a bug fix or a new capability. So the release process and the quality bar, I want to say, should be a bit different. But in our case, we're using the same CI/CD method and applying the same level of quality to our back end as well. Specifically for a CLI, what is important is to make sure that your API is backward compatible. You don't want someone to use your old CLI and then things stop working: yesterday it worked for customer A, then we released a new back end and they didn't update the CLI, so now their pipeline is broken. So not only are we ensuring that we have backward compatibility in the API, we have also integrated a lot of regression testing to ensure that this will not happen. When we are doing our CI/CD and release process, we have a set of regression tests that takes old CLIs, runs them, and ensures that the functionality is still working. So this is one part, backward compatibility. The second is that when issues do occur, you need to know about them. So you need to build statistics and reporting capabilities into the CLI. You need to understand your distribution; you need to know how many installations of a specific version exist in the wild. And if you release a new CLI and it starts to get adopted, you want to quickly know whether it has solved your issue or created a new one. So this is why we have monitoring specifically for the CLI and a whole set of regression tests, to ensure that we deliver a solid CLI.
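
One way to picture the two practices Alon describes, replaying old CLI versions against the current back end and tracking which versions are in the wild, is the Python sketch below. Everything in it is hypothetical: the `handle_upload` handler, the payload shapes, and the version numbers are invented for the example and do not describe Indeni's API.

```python
from collections import Counter


def handle_upload(payload):
    """Hypothetical back-end handler that must keep accepting every
    payload shape ever shipped in a released CLI."""
    schema = payload.get("schema", 1)   # the oldest CLIs sent no schema field
    if schema == 1:
        resources = payload["resources"]
    elif schema == 2:
        resources = payload["plan"]["resources"]
    else:
        raise ValueError(f"unsupported schema {schema}")
    return {"accepted": len(resources)}


# Payloads recorded from old CLI releases, replayed on every back-end build.
RECORDED_PAYLOADS = {
    "1.0.0": {"resources": ["aws_instance.web"]},
    "1.4.2": {"schema": 1,
              "resources": ["aws_instance.web", "aws_db_instance.db"]},
    "2.0.0": {"schema": 2, "plan": {"resources": ["aws_instance.web"]}},
}


def run_regression():
    """Return {cli_version: True/False} for whether the back end still
    accepts that version's recorded payload."""
    results = {}
    for cli_version, payload in RECORDED_PAYLOADS.items():
        try:
            handle_upload(payload)
            results[cli_version] = True
        except (KeyError, ValueError):
            results[cli_version] = False
    return results


def version_distribution(reported_versions):
    """Count how many reports came from each CLI version."""
    return Counter(reported_versions)


print(run_regression())
print(version_distribution(["1.4.2", "2.0.0", "2.0.0"]).most_common())
```

If a back-end change dropped support for the schema 1 shape, `run_regression` would report `False` for the older versions before the release reached customers whose pipelines still run them.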

Liran Haimovitch  22:36
One of the biggest benefits of infrastructure-as-code is that you can reuse existing pieces of code. There are essentially millions of templates out there that contain snippets you can reuse for how to set up an EC2 instance or how to set up a database. And I'm wondering, what's the impact of that? What's the impact of those dependencies on your security posture?

Alon Ashkenazi  22:59
Yeah, so this is actually exactly what we did. When we started with infrastructure-as-code, we used templates. I want to create a network, I want to create a VPC with subnets, some of them public, I want to have some Lambdas and spin up an EC2 instance. So we used a template. This is basically how you start programming a new project, right? You don't just start writing; in most cases you use some kind of template. But what we found out, and later on found that Palo Alto Networks had done research about, is that most of the templates for infrastructure-as-code, or to be more accurate, they found that 200,000 templates, have security vulnerabilities. So this is why it's super, super important to understand what you're doing, how you're building, and what you're using. When you're starting with infrastructure-as-code, you don't have enough knowledge of the infrastructure-as-code itself, even though you might have a lot of knowledge in security or cloud or cloud resources. So it's very important to use tools, or other people's knowledge, to understand what kind of security vulnerabilities you can have. This is why we built Cloudrail: to integrate this security check as early as you can in your development process.

Liran Haimovitch  24:14
Now, there is one last question I want to ask you. It's a question I ask all of my guests. What's the single bug that you remember the most from your career?

Alon Ashkenazi  24:22
Actually, I have two funny stories that are related to bugs. In one product that we worked on, we had email registration: you could register for our service using an email, username, and password. And what we forgot to do, again at a very early stage of the product, was validate the username. So what happened is that bug hunters and all kinds of different people found this issue, and they started to send emails on our behalf to all kinds of email lists, using the username field to do it. They just plugged a URL into it. So a lot of email lists got an email from this service that directed them to a specific website. Now, you can use your imagination to think what type of URL it was. Luckily, we found it very, very quickly. Within a few minutes we first stopped the email server and then fixed the issue, which was an easy fix; we just hadn't thought about it the first time. The second story is that at DoubleVerify we built a solution to block ads from being shown on inappropriate websites. Due to a stupid if/else statement, we started to block a lot of ads from being served, which cost us a lot of money. I don't know if it's funny, but the story is that I was so sure I knew what the bug was that I just told everybody, don't talk to me, I know what to do, I have researched it, and I know what the fix is. And after 20 minutes, I realized that I was just wrong. I had wasted 20 minutes, which is a lot of money in this example.

Liran Haimovitch  26:02
Always important to verify you know what the bug is before you move on.

Alon Ashkenazi  26:05
Of course.

Liran Haimovitch  26:06
Before you're sure that you know what is the bug.

Alon Ashkenazi  26:07
Yeah.

Liran Haimovitch  26:08
Alon, thank you very much. It's been a pleasure having you. I hope the audience has learned a bit about infrastructure-as-code and how to keep it more secure.

Alon Ashkenazi  26:15
Thank you for having me, Liran.

Liran Haimovitch  26:22
So that's a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.