The Production-First Mindset

Bridgecrew's Barak Schoster Goihman - Turning Coffee Into Code

October 03, 2021 Liran Haimovitch Episode 10

Rookout CTO Liran Haimovitch sits down with Barak Schoster Goihman, CTO & Co-founder of Bridgecrew. They discuss modern cloud architectures and challenges, why top-down decision-making doesn't work for developer tool adoption, how they went about making the open-source platform one that's fun to use, and the focus on the developer experience...and why it's necessary.



SPEAKERS
Barak Schoster Goihman, Liran Haimovitch

Liran Haimovitch  00:03
Welcome to The Production-First Mindset, a podcast where we discuss the world of building code, from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and Co-founder of Rookout. Today, we have a packed episode discussing modern cloud architectures and challenges. We've got serverless development environments, infrastructure as code, and much, much more. We're joined today by our guest, Barak Schoster Goihman. Barak is a Senior Director and Chief Architect at Palo Alto, and previously the CTO of Bridgecrew. Thank you for joining us today and welcome to the show.

Barak Schoster Goihman  00:52
Thank you very much. Glad to be here.

Liran Haimovitch  00:53
Happy to have you, Barak. So, how are you feeling? What would you like to share about yourself?

Barak Schoster Goihman  01:00
Alright. So, my name is Barak, as you mentioned, and I'm from Tel Aviv, Israel. I've been through an interesting journey, taking several systems into production with various tools. Some of them remind me of Rookout, and others are more in the monitoring space. And we can talk about open source, which is where I'm coming from, an open-source company, all the way to production monitoring and our architecture in the cloud.

Liran Haimovitch  01:26
That's awesome. I mean, we're seeing more and more open-source companies coming out of Israel these days. How did you get into open source?

Barak Schoster Goihman  01:33
Well, I think the thing that was really life-changing in the way I think of open source was being part of a company where open source was a core part of it. The name of the company was Fortscale, a company that was acquired by RSA. And we really realized that open source was a game-changer for us in the area of data engineering. Fortscale was a machine learning company, and we heavily used a piece of software that is now very popular, called Apache Airflow, for data orchestration. It really solved a lot of problems instead of us building from scratch by ourselves, gaining the power of community and a free, very easy-to-use license, and we incorporated that into our machine learning product. After that, I decided to spin up my own startup, an open-core startup named Bridgecrew, which has various open-source tools. Each one is trying to tackle a different problem in the cloud security space, and we try to make them as developer-friendly as possible. We thought that the best way for DevOps engineers and developers to consume products is through open source.

Liran Haimovitch  02:51
I mean, everybody loves open source, especially these days, especially in the cloud. But how do you go about picking something-- Picking an open-source product, deciding to build an open-source product and say, 'that's mine'?

Barak Schoster Goihman  03:03
Alright, so the question of whether you should build an open-source product, a SaaS-only product, or a commercial on-prem one is really about the end user, the persona that you want to use the product. Bridgecrew is an infrastructure as code company, an open-source company specialized in cloud security, and our consumers are DevOps engineers; the way this market is used to consuming products is through open source. Examples of that would be Terraform, which is an open-source project for DevOps engineers, and Chef, Puppet, Ansible, and others, which are already the way this persona, the DevOps engineer, is used to consuming products. That's one thing. Another thing is that the market was previously dominated, in the on-prem era, by very closed-source products, and people did not like it. They did not like the top-down approach, where a decision-maker like the CIO or the CTO chooses a product that is not necessarily the thing people like to use on a day-to-day basis. And the third aspect is, if you want to have grassroots adoption in large enterprises, open source might be a better go-to-market than a top-down approach, where it's really hard to reach those DevOps engineers because they're not used to working with enterprise sales and procurement processes.

Liran Haimovitch  04:35
But now as you're working with those enterprises from within Palo Alto, it's not just open source anymore.

Barak Schoster Goihman  04:41
It's not just open source. So Bridgecrew has basically three types of products. One is our open-source products, which, as an individual contributor in a company, you can download either as a Docker image or a PyPI package and start using. The second line of products would be our SaaS offering, which you can consume by going to bridgecrew.io, signing up, and starting to use the platform, which is much more feature-rich. And the third line of products, as part of Palo Alto, is having Bridgecrew as another module in the suite of tools called Prisma Cloud, which enables you to secure anything in your cloud, not only infrastructure as code, which is what Bridgecrew specializes in, but also containers, workloads, networking, etc.

Liran Haimovitch  05:30
Speaking of Prisma Cloud, Prisma Cloud is a SaaS offering by Palo Alto?

Barak Schoster Goihman  05:35
Exactly, yep. It's a very large suite of products that was created from various acquisitions of different startups: Twistlock, Bridgecrew, and others. And everything is being consolidated into a larger suite with homegrown features too.

Liran Haimovitch  05:56
So today, you have an open-source product, and many of its core-- Much of its core functionality also exists in your SaaS version.

Barak Schoster Goihman  06:05
Correct.

Liran Haimovitch  06:06
How do you go about managing that?

Barak Schoster Goihman  06:08
Alright. That's a good question. So, there are two aspects to managing a good open-source project. Checkov, our main open-source project, or at least our most popular one, already has around 3 million downloads and 3,000 GitHub stars, and it has become the standard for infrastructure as code security, or the place to start with. And managing it has two aspects. One is managing the community: having the project be easily accessible and easy to contribute to. The other aspect is scaling its CI/CD pipeline. So on community management, we try to have a lot of talks like this one, podcasts, webinars, tutorials, workshops, getting people started using this open-source project. We try to keep the code easily readable, with good unit and integration tests around it, and we really guide and interview people on how their contribution experience to the open source was, how their usage experience was, and ask them where we can do better. So that's community management; think of it like product management for open-source users. And as for scaling it, we already have more than 100 different contributors to this product who are not Palo Alto employees. And what we wanted is to have good testing, good security practices, and also an automated process that helps open-source contributions flow into the SaaS platform in a really seamless manner. So, we have a CI pipeline that, after a code review and good tests on each new feature or each new policy, auto-bumps the platform, getting all of the different contributions from the community into the SaaS platform. Checkov is actually a framework for policy as code, to create policies on top of your cloud resources, best practices like ensuring your S3 bucket is encrypted or not public. All of those rules are different policies that were contributed by people who have been doing cloud security for years now. And each time a community member contributes a new rule, a new policy, it is, one, audited by the entire community and contributed back to the entire community, and two, also provided as a consumption option to all of our SaaS users.
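For readers who want to see what such a community-contributed policy looks like in practice, here is a minimal sketch based on Checkov's documented custom-check interface for Python. The check ID, the attribute it inspects, and the exact module paths are illustrative and may vary across Checkov and AWS provider versions.

```python
from checkov.common.models.enums import CheckCategories, CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck


class S3ServerSideEncryption(BaseResourceCheck):
    def __init__(self):
        # A human-readable name plus a unique ID; built-in AWS checks use CKV_AWS_*,
        # so a custom check gets its own prefix to avoid collisions.
        super().__init__(
            name="Ensure the S3 bucket has server-side encryption enabled",
            id="CKV_CUSTOM_1",
            categories=[CheckCategories.ENCRYPTION],
            supported_resources=["aws_s3_bucket"],
        )

    def scan_resource_conf(self, conf):
        # 'conf' is the parsed configuration of a single aws_s3_bucket block.
        # Older AWS provider versions declare encryption inline like this; newer
        # ones move it to a separate resource, which a real policy would also handle.
        if conf.get("server_side_encryption_configuration"):
            return CheckResult.PASSED
        return CheckResult.FAILED


# Instantiating the class registers the check with Checkov's runner.
check = S3ServerSideEncryption()
```

Roughly speaking, dropping a file like this into a directory passed to Checkov via its --external-checks-dir option lets the rule run alongside the built-in policies.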

Liran Haimovitch  08:38
So, how do you go about making sure that your infrastructure as code follows these best practices you mentioned? What does it even mean when I write my configuration as infrastructure as code? What do I have to worry about?

Barak Schoster Goihman  08:52
That's a good question. You've got good ones today. So really, what the cloud has enabled us to have is an API-first approach. AWS has exposed a set of APIs and a set of SDKs on top. One of the things those APIs have not enabled us to do is manage the state of configuration, and this is where infrastructure as code comes in. Why do you need state of configuration? Let's say that you have five engineers in a team, which is a reasonable amount, and each one would like to change the configuration of a different microservice: things like backup and recovery, versioning, logging, and auditing, things that make it easily monitored in production and safe in production. So, you have a set of best practices for every single resource type. You have best practices for S3, which is an object storage, you have a set of best practices for Lambda functions, and a set of best practices for databases. And you have more than 160 different services on AWS alone, not including the other cloud providers like Azure, GCP, or Alibaba Cloud. So, if you want to have best practices on each and every service, it's really either following a set of compliance benchmarks like CIS, NIST, PCI, HIPAA, which is for the more regulated companies, or just following the best practices of a community, or a company, like the AWS foundations. All of those are generally around things like logging, encryption, backup and recovery, and the different pillars of DevOps and reliability in the cloud.
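As a purely illustrative sketch of the "set of best practices per resource type" idea, the snippet below models a few such rules as plain data and checks a parsed resource configuration against them. The resource types, rule names, and settings are made up for the example and are not an actual benchmark such as CIS or the AWS foundations.

```python
# An illustrative, made-up "best practices registry": each resource type maps to the
# settings a reviewer would otherwise verify by hand.
BEST_PRACTICES = {
    "aws_s3_bucket": ["encryption_enabled", "versioning_enabled", "public_access_blocked"],
    "aws_lambda_function": ["logging_enabled", "tracing_enabled"],
    "aws_db_instance": ["storage_encrypted", "backups_enabled", "not_publicly_accessible"],
}


def violations(resource_type, settings):
    """Return the best practices that this resource's parsed settings do not satisfy."""
    return [rule for rule in BEST_PRACTICES.get(resource_type, []) if not settings.get(rule)]


if __name__ == "__main__":
    # A bucket that is encrypted and versioned but still publicly accessible.
    print(violations("aws_s3_bucket", {
        "encryption_enabled": True,
        "versioning_enabled": True,
        "public_access_blocked": False,
    }))  # -> ['public_access_blocked']
```

Real policy engines like Checkov express the same idea with much richer parsing of Terraform, CloudFormation, and the other supported frameworks.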

Liran Haimovitch  10:31
Somewhat intimidating, not to say overwhelming.

Barak Schoster Goihman  10:34
Yeah. So previously, in a team of five, you would probably make sure you have best practices through a very smart process called code review. And in that code review process, you'll have the senior engineer telling the more junior engineer, "Hey, you forgot something here," like configuring the logs to not only be on debug, or something like that. It's a very tedious task, it ties up the senior engineer's time in something very repetitive, and it creates burnout in the more senior engineers who have to give the same review over and over. So this is where we understood, when we interviewed a lot of senior engineers in the cloud, that this testing process can utilize the power of open source: we can create a community around those best practices and create a testing framework for the cloud. It's called policy as code, and it helps those senior engineers automate the code review process. And this is how Checkov came to life.

Liran Haimovitch  11:34
So using Checkov, I can use policy as code to configure, for some of my Git repositories, the requirements for running code in production in a best-practice manner. And then all my infrastructure as code configuration is going to go through that process, and if something isn't aligned, it's going to be flagged.

Barak Schoster Goihman  11:55
Correct. So you're going to-- You have multiple options: you can either stop the build or have an automated comment in the code review process. But generally, you will take, as of today, around 1,000 different policies, 1,000 different best practices managed by the community, and you'll have your code inspected automatically against those best practices.
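As a rough sketch of the "stop the build" option, a CI step could wrap the Checkov CLI like this. It assumes Checkov is installed in the CI image (for example via pip install checkov); the -d, -o json, and --soft-fail flags are real CLI options, but the exact shape of the JSON report should be verified against the Checkov version you run.

```python
"""CI gate sketch: run Checkov over the repository and fail the build on violations."""
import json
import subprocess
import sys


def run_checkov(directory="."):
    # --soft-fail keeps Checkov's exit code at 0 so this script can decide the outcome.
    proc = subprocess.run(
        ["checkov", "-d", directory, "-o", "json", "--soft-fail"],
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(proc.stdout)


def main():
    report = run_checkov()
    # With several IaC frameworks in one repo, Checkov may emit a list of reports.
    reports = report if isinstance(report, list) else [report]
    failed = sum(r.get("summary", {}).get("failed", 0) for r in reports)
    if failed:
        print(f"{failed} policy violations found; failing the build.")
        sys.exit(1)  # or, instead, post an automated comment on the pull request
    print("No policy violations found.")


if __name__ == "__main__":
    main()
```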

Liran Haimovitch  12:15
And I can do that simply by using Checkov as a CLI tool within my CI/CD process.

Barak Schoster Goihman  12:21
Correct. The most popular consumption method is actually GitHub Actions, because it's becoming the most popular CI, but we've seen people use it in Jenkins and other CI systems too.

Liran Haimovitch  12:32
And if I were to use GitHub Actions, would I be using it with the CLI tool, with the SaaS version, or with both?

Barak Schoster Goihman  12:40
So, there is an option to use the CLI tool within the GitHub Action; there is an official GitHub Action in the GitHub Marketplace for Checkov. And you can enhance its capabilities by using an API token from the platform; you will get other features like compliance, auditing, and automated fixes, and some dashboards that are meant for larger teams. If you'd like to really collaborate at scale on the compliance status of your infrastructure as code, this is where the platform will help you.

Liran Haimovitch  13:10
Yeah. So if I'm looking at this as a larger team, and I want to go beyond just, you know, testing locally for this microservice and making sure it's fairly good, and instead get a broader picture, get some reports, and collaborate. So that's when I go with the SaaS platform. And how is the SaaS platform built on the inside? How have you gone about making this open-source project into a SaaS platform that's fun to use, that's useful for you, for the ecosystem of Palo Alto?

Barak Schoster Goihman  13:11
So, to make it fun to use, we really wanted to focus on the developer experience, on how fast onboarding is. Time to value should be two minutes tops, from the moment of onboarding, connecting a GitHub repo, to seeing the actual things that you need to fix in your infrastructure as code. But behind the scenes, it's a very complex system. So, it's very simple to use from the end-user standpoint, but on the back end there are more than 150 microservices, most of them running on AWS Lambda.

Liran Haimovitch  14:15
Wow 150 microservices, that's a lot.

Barak Schoster Goihman  14:18
It is a lot. Each microservice is responsible for a different process and analysis. We use static analysis as a data pipeline: each and every user does a commit on an infrastructure as code repository, and that spins up a pipeline orchestrated by AWS Step Functions that inspects the changed files in the infrastructure as code and reports to the users if any of those changes are not compliant with those best practices. Now, 150 services created by only 19 engineers at the time is really hard to control, so to make it easy, we had several principles when we built our backend. One, everything should be managed in infrastructure as code, also in our backend. Two, we wanted it to be easy to scale in case of spikes of usage in production; this is why we chose Lambda. We wanted everything to be version controlled, so it's very easy to roll back in case we had a code change that created any kind of outage, and we wanted to have a very fast time to resolution. We wanted every change to be auditable, and we wanted every change to be very easily solved by the right person in the right team. So, we came up with the idea of stacks. Today, our production can be deployed on multiple AWS accounts; AWS is our backbone. And we have four layers of stack. The first one is the account stack, where we configure stuff at the account level, like having CloudTrail, which is the API audit log, configured at the account level. The second one is the branch stack, where each Lambda can be deployed from a specific branch. That way an engineer can A/B test either the production branch, which is master in our Git repository, or a feature branch. So, we can test code from the feature branch with a specific set of customers in A/B testing, or we can decide to apply it to all customers by merging the feature branch into the master branch. The branch stack essentially deploys the same Lambdas as production, as the master branch, just with a different naming convention, having the branch name, and with the ability to run in parallel to the production code. The third layer is the customer stack, which is things that are deployed in infrastructure as code at the customer level, like IAM groups and roles, which are different for every customer depending on the way they choose to configure their IAM policies within the Bridgecrew application. And the fourth layer of stack is the integration stack, which is another set of Terraform code that triggers provisioning of cloud resources, like CloudWatch triggers for our Lambdas, depending on the integrations that customers have decided to implement on their own SaaS tenant. So, Bridgecrew is a multi-tenant application, and when you choose to connect your GitHub repository, it will spin up, in the back end of our SaaS, Terraform code that provisions CloudWatch triggers that will trigger the Lambdas to scan your infrastructure as code within GitHub. If you integrate your cloud environment, like Azure, GCP, or AWS, it will provision a different cloud trigger and a different Step Function that will be responsible for monitoring your runtime environment, your cloud environment.
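Purely as an illustration of the pattern described, and not Bridgecrew's actual code, one Lambda in such a Step Functions pipeline could look roughly like the sketch below. Every event field and the scan_changed_files helper are hypothetical examples.

```python
"""Illustrative only: a minimal shape for one Lambda in a Step Functions-orchestrated
scanning pipeline. It receives the tenant and the changed infrastructure-as-code files,
scans them, and passes a summary on to the next state."""


def scan_changed_files(files):
    # Placeholder for the real static analysis step (e.g. running Checkov on the
    # changed Terraform files); here it simply reports every file as compliant.
    return []


def handler(event, context):
    tenant_id = event["tenant_id"]              # which SaaS tenant triggered the scan
    repository = event["repository"]            # e.g. the connected GitHub repository
    changed_files = event.get("changed_files", [])

    findings = scan_changed_files(changed_files)

    # The returned dict becomes the input of the next Step Functions state, for
    # example a Lambda that reports findings back to the user.
    return {
        "tenant_id": tenant_id,
        "repository": repository,
        "scanned_files": len(changed_files),
        "findings": findings,
        "compliant": not findings,
    }
```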

Liran Haimovitch  18:08
So, whenever I'm registering a new account for Bridgecrew, I'm essentially getting my own specialized set of lambda functions?

Barak Schoster Goihman  18:17
So, it's the same Lambda functions, it's the same code. The thing that is different is the triggers, parameters, and configuration that are dedicated to your tenant. And we really wanted it to be reproducible for all of the different customers, so there is no little minion behind the scenes configuring everything for you, because configuring 150 different microservices takes a lot of time. We wanted the experience to be blazing fast. So, everything is managed in code for you, and the upgrade process is the same.

Liran Haimovitch  18:53
Sounds like you're somewhat of an infrastructure as code magician, kind of floating stuff around multiple layers of abstraction, 150 microservices, everything defined with a click of a button. I'm not sure everybody can do that. Isn't that something that's a bit too hard for many people?

Barak Schoster Goihman  19:15
It took a lot of iterations. And I'm sounding like a preacher, but if I had to choose a place to start, it would be to start with Terraform. Because even if your architecture is not right in the first place, and I'm not saying ours was, we've been through more than seven re-architectures. Re-architecting sounds like a very heavy-lifting task, but actually, when you're managing everything in infrastructure as code, it's like moving classes between directories, or adding new classes and new abstraction layers at the application level. Once you're managing everything in code, in DevOps, it becomes very easy to change and very easy to adapt and evolve in case of need.

Liran Haimovitch  20:01
So, you're a big believer in infrastructure as code.

Barak Schoster Goihman  20:04
I am a big believer. Yes.

Liran Haimovitch  20:06
Speaking of infrastructure as code, one of the biggest complaints we're hearing about the cloud is how poor development environments are, how little you can rely on working on your local machine, and how much work and difficulty it can be to set up a development environment in the cloud. I know that infrastructure as code, and especially Terraform, can help with that. Isn't that right?

Barak Schoster Goihman  20:30
I totally agree that it's very hard to develop Lambda functions locally. There are some workarounds, like running serverless locally, but in the end, you're always dependent on a database that runs in production, or it could be very expensive and time-intensive to provision an RDS instance, like a Postgres, on your laptop, or the different dependencies like DynamoDB or other data resources that you have only in your cloud. The way that we use Terraform is actually to provision multiple development environments alongside your production. So, you can have a dedicated database for each developer, or for each feature branch, that does not mix with your production data and does not have access to any of the production resources. You can provision it in a dev AWS account that does not have the same access or risk as production.

Liran Haimovitch  21:25
How do you go about managing those development environments in Bridgecrew?

Barak Schoster Goihman  21:29
So, we try to follow a methodology called GitOps, where each and every change that exists in the cloud environment can be mapped to a Git commit within our Git repository. So that's one way. And another way is that we rely heavily on monitoring systems, and we monitor the different environments from dev to cloud. As for debugging, we are lacking a debugging tool, and this is where we'll probably start to use Rookout somewhere in the future.

Liran Haimovitch  22:02
You're welcome. Sounds like a great idea. But speaking of GitOps, I know you guys are releasing crazy fast, right?

Barak Schoster Goihman  22:11
Yeah, we release about nine times a day to production. It can be more; it's just the cadence of coffee coming in and code coming out.

Liran Haimovitch  22:20
How do you go about releasing at that frequency, let alone with everything you've mentioned: 150 microservices, all those advanced triggers and stuff, customizable per account, open source into the mix? How do you go about releasing while ensuring quality and performance?

Barak Schoster Goihman  22:39
Alright, so every feature branch that opens a pull request to master, which represents production, obviously has unit tests and integration tests being run, and we actually provision a full environment for each branch. So, we have a full Terraform apply alongside production for that feature branch, which we can see with our own eyes before merging to production. Another thing that we have is automatic teardown once a feature is merged to master; those cloud resources are deprovisioned because they're now part of production, part of master. And in order to have the ability to track back whether something is an open-source change or a specific team's change (we have four engineering teams at the moment), we have automatic tagging, a process that tags the owner, the Git commit, and the branch that the change was sourced from on the cloud resource. So, let's say that you, as a team member, have decided to modify a specific Lambda. You're now the owner, and you have the owner mapped as a tag. So, if there is an error in production, through the monitoring systems, PagerDuty will know that you are the one to receive the next alert, in case there is one, and to nag you over phone and emails. And if you are not available, we also have a mapping in tags to the owning team, so someone else from your team can cover for you. That entire team is responsible for your code review process, and it can be escalated to the team leader, the R&D manager, or to myself.
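A toy sketch of the tag-based ownership routing described here: cloud resources carry tags identifying the owner, team, Git commit, and branch, and alerts are routed from that metadata. The tag keys, escalation order, and aliases below are hypothetical examples, not Bridgecrew's actual implementation.

```python
"""Route an alert for a failing resource based on its ownership tags."""


def resolve_recipient(tags, available):
    """Pick who should be paged: the tagged owner if they are available,
    otherwise the owning team's on-call alias, otherwise the R&D manager."""
    owner = tags.get("owner")
    if owner and owner in available:
        return owner
    team = tags.get("team")
    if team:
        return f"{team}-oncall"
    return "rnd-manager"


if __name__ == "__main__":
    resource_tags = {
        "owner": "dana",                       # engineer who last modified the Lambda
        "team": "platform",                    # owning team, used for cover
        "git_commit": "a1b2c3d",               # commit the deployment was sourced from
        "git_branch": "feature/faster-scans",  # branch the change came from
    }
    # The owner is unavailable, so the page goes to the team's on-call alias.
    print(resolve_recipient(resource_tags, available={"lior", "noa"}))  # platform-oncall
```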

Liran Haimovitch  24:27
So, lots of escalations, I hope you don't get paged too often.

Barak Schoster Goihman  24:30
No, we have a very talented team. Thank God.

Liran Haimovitch  24:33
So, you seem to be releasing quite often. How much of the build pipeline, of the CI/CD pipeline, is your own? How much are you using existing tools for that?

Barak Schoster Goihman  24:44
So, it's a very large pipeline. We heavily reuse open-source tools where possible. For example, for our own static analysis, we use Checkov on Bridgecrew itself. We're also using an image scanner to make sure that we have best practices on our Docker images, and we rely heavily on infrastructure as code frameworks, so we have a mix of Serverless and Terraform on our back end. And we have every engineer, as part of their onboarding, contribute a part to that CI/CD pipeline, because I can say that the beating heart of our application, the thing that keeps the beat running, is our CI/CD pipeline. The way GitOps is structured, your CI/CD is actually highly coupled with the business logic in the business application. You'll have your CI/CD pipeline update specific resources if required by specific business logic, and you'll have your CI/CD pipeline also integrated with your feature flag system. So there is a lot of glue there. We are not using a single product for it, but it is crucial, if you want a continuously innovating product, to have it as part of the CI/CD platform. We use mainly Jenkins, and GitHub Actions as part of the product architecture itself.

Liran Haimovitch  26:09
And how has that shifted since you've joined Palo Alto?

Barak Schoster Goihman  26:12
Great question. So in some cases, we learn from the bigger-scale company; we learn how to support thousands of customers in production, and we also learn more best practices that Palo Alto has accumulated over the years. Bridgecrew is a very young company; we had existed for only two years at the point of acquisition, and Palo Alto at that time had existed for 16 years, I believe. So, they also had a very good set of controls that we could learn from. And we also had our engineering team scale to at least twice its size in six months. So, we actually learned a lot of best practices from the larger company.

Liran Haimovitch  27:00
So there's one question I love to ask everybody who comes to this podcast: what's the biggest bug? What's the most memorable bug from your career?

Barak Schoster Goihman  27:09
Well, at the beginning of a startup, you have a process called founder-led sales. As a founder, you fly around the world, you try to learn what your product-market fit is, what features would best match your latest customers, and you have a lot of interviews with those customers, trying to understand what would help them the most, in our case, to scale their cloud adoption. And I remember that for one of our first customers, we had a new requirement for monitoring our own system and also giving them the visibility to monitor their own cloud environment. And I remember that around five minutes before our meeting, the system crashed due to a new commit that one of the developers had made. We were in different time zones; I was in the US at the time, and the engineer was in Israel, a 10-hour time difference. At around 3am, the engineer discovered via PagerDuty that the system was crashing, and about five minutes before the meeting, he performed the rollback that made everything work and let the meeting go very smoothly. So, one thing that has changed since is our test coverage; before that, in our code review process, we had two reviewers per pull request. And we were lucky to have the rollback process already in place in production, so we could easily, with a click of a button, roll back to the previous commit that did not have a crashing experience for the user.

Liran Haimovitch  28:47
It's so much easier when the bug was introduced on the last commit and you can just roll back. Sometimes when the bug goes deeper, or longer into the past, it can be much harder to find it and fix it.

Barak Schoster Goihman  29:00
Yeah, in that case, you have to hotfix and create a patch on top, because if you go too far back, you will have a lot of features removed, right?

Liran Haimovitch  29:10
And that's assuming you can find it. If the bug was recently introduced, it's very easy to roll back and see if that fixes it. But if the bug was from ages ago, then you might not even know that quickly where it's coming from, or why it's happening.

Barak Schoster Goihman  29:25
I agree, I agree. And when you have so many microservices in the cloud (at the time, about two years ago, we didn't have so many, but we still had quite a few), it's really hard to track which service is talking to what, and where the message is falling apart and causing the system to crash.

Liran Haimovitch  29:44
Exactly. So what does the future hold for Bridgecrew, for yourself?

Barak Schoster Goihman  29:49
So really, the future holds a lot of scale challenges for us, which is always interesting: supporting a larger scale as part of the architect role, and also enterprise-ready features and very interesting new stuff in the open-source arena. Since the acquisition, we released Checkov 2.0, which has graph analysis, and by now more than seven infrastructure as code frameworks are supported, from Dockerfiles, Kubernetes, Helm charts, CloudFormation, Terraform, and more. And we have a new open-source project called Yor that was recently released, and this is where I'm investing most of my time lately: the area of having good tracing on infrastructure as code, from code to cloud.

Liran Haimovitch  30:37
Sounds promising. Can't wait to keep track of your exploits. Well, Barak, thanks again for coming and joining us and sharing your knowledge about the cloud and infrastructure as code. It was great having you.

Barak Schoster Goihman  30:52
Thank you. Thank you so much. It was fun.

Liran Haimovitch  30:54
Fun for us as well, and for me as well. So, check out Bridgecrew. And that's a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.