Rookout CTO Liran Haimovitch sits down with Stefan Thorpe, Chief Engineering Officer at Cherre. They discuss how he knew the cloud was where he was going to go, the whole point of DevOps, why they monitor everything, what they're doing with Kubernetes and why they're excited about it, and why they don't use Terraform.
Stefan Thorpe, Liran Haimovitch
Liran Haimovitch 00:02
Welcome to The Production - First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and Co-founder of Rookout. Today, we're going to be discussing the complexities of data integration. With us is Stefan Thorpe, Chief Engineering Officer at Cherre. Thank you for joining us, and welcome to the show.
Stefan Thorpe 00:41
Thank you for having me. It's a pleasure to be here.
Liran Haimovitch 00:43
So Stefan, tell me a little bit about yourself, who you are.
Stefan Thorpe 00:47
So obviously, I head up the Engineering Department at Cherre. I've worked in the technology space for, wow, 24, 25 years now. I come from an ops background. I used to build networks, build servers, everything from laying cables to-- Actually, I worked on the physical building of a data center back in the UK; that was a long time ago. I was also really lucky to get involved with AWS and actually do a POC with AWS back in 2008, 2009. And I knew instantly the cloud was exactly where I was going to go. That has put me at the forefront of DevOps for just over a decade, nearly a decade and a half, scaling systems. This is what I've done and what I've focused on. I've traveled the world doing it, and that's exactly what I love doing.
Liran Haimovitch 01:42
Awesome. Now, you mentioned you started with operations, going as far back as, you know, data center operations, networking operations. And I have to wonder, when people think about, you know, the head of engineering, the first image that comes to mind is usually a software engineer, somebody who writes tons of code. And I kind of have to wonder, how do you see it? How does this different background story, so to speak, make a difference, or not make a difference, in how you see your role?
Stefan Thorpe 02:12
It makes a difference in a number of ways. I think my view of what DevOps is becomes important here. Whilst I come from an ops background, I realized, and always had, an interest in the development side as well. And especially as I started to adopt those cloud environments, the first thing I did for most companies, circa 2009, 2010, was migrate them onto the cloud; everyone was really excited about massive scale and everything else. So, I was moving these Ruby applications, or maybe PHP, and then they would tell me, great, we've got this system that scales, but it's now not scaling. And what I quickly learned was, it's because we weren't architecting the actual software. But the whole point of DevOps is it's the dev and the operations. And so, from that point on, as soon as I realized that, I realigned my training around scaling. I then spent four or five years heavily in the development space. I still did ops, but I learned everything about design patterns: how and when to apply functional patterns, which ones work, which ones don't work. I still do that. It's interesting, I now code in more languages than I care to speak. But every time I sit down and look at a system, I bring the software architecture, what I'm going to do, my code, and the application, and I can't help but view the entire system in this kind of 3D integrated model in my head.
Liran Haimovitch 03:52
Makes perfect sense.
Stefan Thorpe 03:53
So I had to work on the dev side of it. But yeah, it's made me a better engineer. It's actually one of the things I push all my engineers to do. So whilst I have a team of some 30 to 40 data engineers, many of those guys are more skilled in, say, Kubernetes and infrastructure than some DevOps guys I've met, no offense to the DevOps community. But it's all part of our continuous learning, our continuous improvement process. We'll probably talk about this more later, but we want a blameless post-mortem process. The very root of that is getting down to those five whys: what's going on? What's happened in the system? Now, that system as a whole is the people, the actual production, development, and operations. So we really have to run through that entire stack. And through that process, our team learns about all of those technologies.
Liran Haimovitch 04:47
So, you've mentioned the blameless post-mortem. So let's dive into that - how do you monitor your SaaS platform and where does the blameless post-mortem come into play?
Stefan Thorpe 04:58
We monitor everything. And the mechanism we use predominantly to assess how well we're doing is SLOs, service level objectives. Okay? To put it in the simplest terms for those who might not be aware, it's a ratio that we do something X amount of the time, over a certain percentage. And that gives us an error budget and a response time. Through those SLOs, we work out how well the system is doing. And as I said, the system includes people, not just the technical system, so: are we able to deliver and set our client expectations? When I say we monitor everything, it's phenomenal how much data we produce from just the monitoring and metrics. We literally have hundreds of metrics across hundreds of variations, and we still miss stuff. And again, that comes into the blameless post-mortem process. If something happens, as we go through those five whys, it's like, oh, well, we could have had a test here that might have alerted on that, so the first thing we do is go and add that test into the system. So, maybe we have anomaly checks. And again, you have to continuously improve the monitoring and those benchmarks; maybe the anomaly check, or the standard deviation, was too wide or too narrow, and alerts too much or doesn't alert enough. All of that plays in. So we're constantly tweaking and improving those. We're also constantly looking for additional systems to monitor in very different ways. Honestly, I think the key is the post-mortem process, and the fact that it's always, always the system, never a single person or a single thing. Frankly, if a person has been able to do something wrong, the system has been configured incorrectly. So, the first thing we do in those scenarios is sit everybody down. We get it, you get that kind of ice-cold bath feeling when somebody has done something, or maybe they felt like they could have done something or stopped it, or could have raised their hand and said X, Y, and Z.
But the point is, the system should do it. And from there, we just look at how we can improve it, add in the fix, move on to the next thing, and continue to move forward.
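The SLO mechanism Stefan describes, a ratio of good events against a target percentage that yields an error budget, can be sketched roughly like this. This is an illustrative Python sketch, not Cherre's code, and the numbers are invented:

```python
# A minimal SLO/error-budget sketch: an SLO compares the achieved ratio of
# good events to a target percentage; the gap between the target and 100%
# is the error budget you are allowed to "spend" on failures.

def slo_status(good_events: int, total_events: int, target: float) -> dict:
    """Compare the achieved ratio against an SLO target (e.g. 0.999)."""
    achieved = good_events / total_events
    error_budget = 1.0 - target                       # allowed failure fraction
    budget_spent = (total_events - good_events) / total_events
    return {
        "achieved": achieved,
        "budget_remaining": error_budget - budget_spent,
        "breached": achieved < target,
    }

# e.g. 9,985 on-time data deliveries out of 10,000, against a 99.9% objective:
status = slo_status(9_985, 10_000, target=0.999)
```

In this example the budget is overspent, so the SLO is breached and, in the spirit of the process described above, a post-mortem would follow.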
Liran Haimovitch 07:16
So I have to wonder, you've repeatedly mentioned metrics and SLOs. So I'm guessing that when it comes to the three pillars of observability, you're very metrics-driven compared to other types of observability. Can you share a bit more about that? What key metrics are you looking at? And especially how do they differ between different parts of the system?
Stefan Thorpe 07:37
For Cherre as a data integration platform, it's all about delivering, in fact, for any startup, it's all about delivering on our clients' needs, okay? And in our case, everyone's interested in uptime, because you want to make sure that you get access to the platform you're using. But that's almost not important; everyone should get that right out of the box. For us, it's around the data and getting the data to our clients in a consistent manner. So, we want to make sure that we deliver data on time. If the source has an update on Monday at 5 a.m., we might want to get it out to our clients by 8 a.m., so they've got it there for their working day moving forwards. We deal with all types of data, by the way. We pull public data sources, we pull our data partners' data sources, so we're working with the data partners because they're able to deliver quicker, and then we work with our clients' integration services. And that's how our clients are able to form their insights quicker, make their decisions better, and go through that entire process.
Liran Haimovitch 08:57
Let's take a step back for a second. Can you say a bit more about what you're doing as a company, and what you're doing with all that data you seem to be moving around?
Stefan Thorpe 09:06
So, Cherre is real estate's leading data integration and insights platform. What that means is we enable our customers to connect both their internal and external disparate data for insights and better decision making. The world knows that data has value. But for it to have value, it needs to connect and join seamlessly. A very simple way to put it is that we build data warehouses for our clients. And what I was coming to is, that can be public data, it can be data partner data. Especially in the real estate world, there are a lot of people who focus on the right comparison for, I don't know, a single-family home, or the trends over a particular sub-market, those kinds of things. So, there are companies that niche down on getting the right quality in a very particular question area. We don't. We want clients to be able to bring all of that data together, and then start to look at the entire picture as a whole. What that allows them to do is make better and quicker business decisions. And if we do that in a consistent way, it's much better. So, we focus on reducing risk, we automate the entire process, and we accelerate the time to business insights, giving clients value from their data.
Liran Haimovitch 10:36
So, if we go back to the metrics you've mentioned, I understand your key metrics are around the data. How quickly do you move the data? How accurate is the data? How well do you integrate it or process it?
Stefan Thorpe 10:51
Exactly. That's exactly what we're looking at. So, we look end to end at how long it takes us to process, and at the data delivery windows; that in itself has contractual SLAs, and then we have our own objectives to deliver that consistently. We then have performance SLAs, and I'll get on to that once we talk about how we deliver data to our clients. And then the other thing we do is check for quality, anomalies, and changes within the data. What you typically see with almost any data set is trend patterns; maybe the fill rate of a column increases by half a percent every time you do a weekly update, okay? So you've got this trend that goes up by half a percent each week. If all of a sudden it drops off by 10%, our alert system will come up and say, okay, something was different here. Not sure whether it was right or wrong, but it was different. Then we'll trigger an incident to check what's going on and work it out. It could be that the source was down; it could have been something as simple as the file we downloaded being corrupt, so we weren't able to process it. And so, we come back and we restart that process. So that's all around our data delivering what our clients want. Then we obviously have the performance of our underlying systems in every system we touch: is our ingest system working as expected? Are the packages that build up all of those still functioning the same way they were? We then have machine learning models: are they performing the same way they were previously? And again, what you do with metrics is build up this picture of what's going on in a live environment and, frankly, trigger on anything and everything to start off with. And then, you start to work out which metrics are valuable and which ones aren't. I have a set of monitors that I thought were going to be hugely important, and then I've never heard from them again at all.
But that's part of the joy of going through the system and continually learning from what's needed and what's not.
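The fill-rate trend check Stefan describes can be sketched as follows. This is a hypothetical, deliberately naive illustration: the trend estimate, the `tolerance` threshold, and the numbers are invented, not Cherre's actual anomaly system:

```python
# Sketch of a trend-based anomaly check on a column's fill rate: estimate the
# expected next value from recent week-over-week changes, then alert when the
# observed value strays too far from that trend in either direction.

def expected_next(history: list[float]) -> float:
    """Naive trend: last value plus the average week-over-week change."""
    deltas = [b - a for a, b in zip(history, history[1:])]
    return history[-1] + sum(deltas) / len(deltas)

def is_anomalous(history: list[float], observed: float,
                 tolerance: float = 0.02) -> bool:
    """Flag the new fill rate if it deviates more than `tolerance` from trend."""
    return abs(observed - expected_next(history)) > tolerance

# Fill rate climbing roughly half a percent a week, then a sudden 10-point drop:
history = [0.60, 0.605, 0.61, 0.615]
is_anomalous(history, 0.52)    # the drop trips the alert
is_anomalous(history, 0.621)   # on-trend value passes quietly
```

Note that, as in the conversation above, the check only says "something was different here"; deciding whether the source was down or a file was corrupt is left to the incident that follows.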
Liran Haimovitch 12:57
You know, it's all about trial and error. Something can seem completely reasonable and then jump out as an alert time after time after time. Or something else that might seem obvious can never, ever make an appearance.
Stefan Thorpe 13:14
Yeah, exactly. I mean, honestly, those alerts that go off all the time, are probably even more dangerous than the ones that don't.
Liran Haimovitch 13:22
Exactly. Now, I'm kind of wondering, you've mentioned a lot about the data you're ingesting. What kind of data is it? What data sources do you integrate? And what are the challenges of working with them?
Stefan Thorpe 13:35
It's any real estate data that our clients need, as tangential as that needs to be. We've connected everything from the court systems, so that people can understand when an asset is in a distressed state and there's something going on within the courts and legals. We've connected floodplain data, so that people can understand the risks; maybe an insurer is looking at a building and saying, okay, is this building within a high-risk area? It's literally any data. The challenges that come with that are varying and massive. Here's probably one of my favorites: we had to deal with an FTP server that was nearly 25 years old.
Liran Haimovitch 14:25
That sounds like a recipe for great success.
Stefan Thorpe 14:27
Oh, honestly, it was so much fun. It hadn't been updated, so obviously nowadays it's running these obscure security protocols, which took some time to work out, and getting all of that through all of our layers of security and back out was fun in itself. And then the files that had been uploaded had been manually edited and added to since '95, by Windows. So, the encoding, once this was templated, the encoding for this has been every variation of Microsoft's encodings since Windows 95. And Microsoft have had some really interesting operating systems during that period. So, we would process it, we'd get 25% of the way in, and we'd find an error. First time around, we were like, okay, what's going on here? And then, oh, it's an encoding thing. Okay. So, we then built this really long list of, does it match this encoding, does it match this encoding, those kinds of things. We then have to deal with the other end of it, high-speed data that's coming through at a much quicker scale. One of my favorite analogies around this: there's a video going around the internet of a philosopher talking about how heavy it is to hold a glass. What he goes on to say is, the glass isn't heavy, okay, but you stand there and hold the glass for 30 minutes, it gets heavy; you stand there for two hours, your arm is going to be aching and throbbing, and now that glass is really heavy. When you're managing hundreds of pipelines, you've got hundreds of those glasses that you're holding and having to keep in a working state, constantly trying to deliver. And we do that, coming back to all the metrics and just making sure that they're running smoothly, consistently, over weeks, months, years. As I said, some of our ingests run every 15 minutes, pretty high speed. And then some of them go off once a year. So you don't look at it for a year, and then it fires up and it comes out.
And then, if there's a potential issue, you're like, okay, we worked on this a year ago, like what were we doing? So, yeah, and there's a whole heap of them, but it's a lot of fun, honestly.
Liran Haimovitch 16:40
So, how do you go about building hundreds of different data pipelines? Do you have a platform for that? Do you automate some of the process? What does it look like?
Stefan Thorpe 16:50
Everything in our production system is automated, even around manual processes, and I'll come to that in a minute. As I said earlier, when I think about systems, I'm really good at picturing the high level; systems architecture is kind of my space. And for me, layering is the most important thing, okay? So our system is layered. Each area has a very particular set of responsibilities, and it must just deliver the same thing at its seam, where it's joined to the next layer. This was really a concept I took from the OSI model, the seven-layer networking model that you learn when you go and do Cisco certifications or you go and do operations. That seven-layer model has stood the test of time; it's older than most people have been working in this industry. And it works, because as long as the seam is the same, what you do in that layer doesn't matter. Now, what that gives you the ability to do is version one section at a time. So we've gone through four, five, maybe six versions of our ingest, to work out the most efficient process. And as long as it delivers into the warehouse in the same format, we can do what we like. Same in our data warehouse: we did things in SQL on Postgres, then we did things in SQL on BigQuery, and now we're doing things in dbt. Even the standard reuse within dbt is layered. So we iterate that. Again, we layer everything; that gives us huge amounts of flexibility, and that gives us our automation across the entire stack. Coming back to what I said about the manual stuff: sometimes we have to do some data crunching that's very unique, for a very particular use case that we know we're not going to repeat. It doesn't make sense to build a fully automated system to do that; that's gold plating, it doesn't fall into that 80-20 rule. So what we simply did was build a system that allows us to drop a file into a particular safe bucket, with the right approvals.
And then that file gets loaded into our database and into our data warehouse, and checked. It means it goes through the automation checks that most things go through. Obviously, it doesn't have all the same quality checks, but it goes through some safety metrics to make sure it gets into production safely. That way our entire production is hands-off. We don't have anybody doing anything. Even our code deployment is all through CI, through that entire process.
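The "seam" idea, where each layer can be re-versioned freely as long as what it hands to the next layer stays the same, might look like this in miniature. This is an illustrative Python sketch, not Cherre's code; the types and field names are invented:

```python
# Two ingest versions with completely different internals, both delivering
# rows in one agreed seam format. The warehouse layer depends only on the
# seam, so ingest implementations can be swapped or rewritten at will.

from typing import Iterable, Protocol

WarehouseRow = dict[str, str]   # the agreed seam format, kept deliberately simple

class Ingest(Protocol):
    def fetch(self) -> Iterable[WarehouseRow]: ...

class IngestV1:
    """Early version: pretend we parsed a downloaded CSV file."""
    def fetch(self) -> Iterable[WarehouseRow]:
        yield {"parcel_id": "123", "source": "csv"}

class IngestV2:
    """Rewritten internals, identical seam: pretend we stream from an API."""
    def fetch(self) -> Iterable[WarehouseRow]:
        yield {"parcel_id": "123", "source": "api"}

def load_to_warehouse(ingest: Ingest) -> list[WarehouseRow]:
    # The warehouse layer never cares which ingest version is running.
    return list(ingest.fetch())
```

This mirrors the OSI-model point above: what happens inside a layer is free to change, so long as the seam it exposes does not.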
Liran Haimovitch 19:24
This sounds like some pretty impressive stuff, and a pretty complex and robust platform for ingesting the data, integrating it and making it available for your customers. So, I have to wonder what are the future developments you have in mind?
Stefan Thorpe 19:40
Scale. We've still got hundreds of thousands of data sources to add. Our schema is already massive. We recently spoke to an API gateway company, a well-known one; I won't name them because they're a great company. But the first thing they said when they looked at our schema was, your schema is bigger than any schema we've ever seen, quite literally.
Liran Haimovitch 20:06
The size doesn't matter.
Stefan Thorpe 20:08
It doesn't matter until it breaks everything. But there's so much more data in the real estate industry out there. It's growing at an exponential rate, and there are so many different ways to look at it. So, we've got more of that. We've got a long list of data partners that we're working with, and we're adding them into the system on a very quick cadence. As we add each of those data sources in, it builds out our graph, and our machine learning and data science teams just continue to push the boundaries on what we can do and what they can find out. In a theoretical world, if you can see all the data, you're going to get the answers, right? So we're just adding into that, and we discover new spaces and new scopes of work that we can go into. So we'll continue to push the boundaries on that. For myself, as I said, I'm a DevOps engineer. So personally, one of the things I'm really interested in and looking forward to is what we're currently doing with Kubernetes and custom resource definitions. Kubernetes is so extensible. In theory, you could build a domain object model, or an API, that allows us to deploy a pipeline with, like, 10 lines of configuration. Tell us what the source is, tell us whether-- They should be really simple. And that's where we're driving right now. So, we're using the extensibility of Kubernetes to drive some of that. I've always found that deeply interesting. Yeah, we don't use Terraform. I know it's the de facto standard within the industry, and we did use Terraform, but everything is within Kubernetes manifests. Again, part of the reason, as I said earlier, that my team understands Kubernetes is because they only really need a couple of languages. How does a Kubernetes manifest work? What does it do? And can I build my own? The answer is yes. If you learn that, and you learn dbt, you can do a hell of a lot within our system.
So, that gives us growth in an area and again, just continues scalability.
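The "pipeline in about 10 lines of configuration" goal could look something like this sketch: a small spec shaped like a Kubernetes custom resource, plus the kind of validation a controller might run over it. Everything here, the API group, the field names, the values, is hypothetical, not Cherre's actual CRD:

```python
# A hypothetical pipeline spec in the shape of a Kubernetes custom resource,
# and a minimal validation pass over the few fields a controller would need.

PIPELINE_SPEC = {
    "apiVersion": "pipelines.example.com/v1",   # hypothetical CRD group
    "kind": "DataPipeline",
    "metadata": {"name": "county-assessor"},
    "spec": {
        "source": "sftp://example.com/assessor",
        "schedule": "0 5 * * 1",                # Mondays at 5 a.m.
        "destination": "warehouse.assessor_raw",
    },
}

REQUIRED_SPEC_FIELDS = {"source", "schedule", "destination"}

def validate(resource: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec is deployable."""
    problems = []
    if resource.get("kind") != "DataPipeline":
        problems.append("kind must be DataPipeline")
    missing = REQUIRED_SPEC_FIELDS - resource.get("spec", {}).keys()
    problems.extend(f"spec.{field} is required" for field in sorted(missing))
    return problems
```

In a real CRD setup this shape would live in a YAML manifest, and a custom controller, rather than this function, would watch for such resources and stand up the pipeline.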
Liran Haimovitch 22:18
I have just one more question for you before we wrap this up; it's a question I ask all of my guests. What's the single bug that you remember the most from your career, from everything, you know, from data centers to the cloud and Kubernetes?
Stefan Thorpe 22:33
I knew this question was coming, and I couldn't think of a single bug. But the one thing that just kept playing on my mind: I'm dyslexic, pretty heavily. So, it's interesting, the one consistent way it comes up in coding for me is messing up a single letter. Again, with dyslexia, when you read something, the letters are either there or not there. And I was joking with one of our employees, I had to type in a Wi-Fi password, something really simple. The password was quickow, okay? And rather than put a U in, I put a W, and, I'm not joking, I had 15, 20 minutes of, like, me and Shannon, my wife, going, I know this is the password, why is it not working? And she walks over, like, you have a W rather than a U, and I'm like, oh, okay. That has happened in coding for me so often. I've banged my head looking at some code, put it down, come back 24 hours later, looked at it, and gone, yeah, I was missing a letter, or I had too many letters in that word. Something as simple as that is probably the biggest one.
Liran Haimovitch 22:34
Dyslexia can be challenging at times.
Stefan Thorpe 22:44
It's interesting. I think it's what forms my systems thinking. So, whilst it's not great for writing, and my Slack communication is interesting, I think it's part of how I visualize everything and go through that. I recently saw that LinkedIn and Richard Branson are now doing a dyslexic mindset as a tag, you know, those things where you say, 'I'm good at this, I'm good at that'; they've started including that there.
Liran Haimovitch 24:26
Any final thoughts for our listeners?
Stefan Thorpe 24:29
I mean, for me: Cherre's always hiring. We're a culture-first company, meaning that we literally hire for culture above experience, whether someone's got one year of experience or 15 years of experience. We're growing rapidly. We're doing some very exciting stuff in lots of locations, across all of the cloud providers. Yeah, there are some great opportunities. And if not, I'm always just happy to network with people.
Liran Haimovitch 25:02
So, Stefan, how should people reach out to you to network or to learn more about Cherre?
Stefan Thorpe 25:07
They can go to firstname.lastname@example.org. That's the first place if they want to apply. If you just want to reach out to me, my LinkedIn is just Stefan Thorpe and you'll find me under Cherre. Yeah, they're probably the two best locations.
Liran Haimovitch 25:24
Awesome. Thank you very much for joining us.
Stefan Thorpe 25:27
Again, thank you for having me. This is a lot of fun.
Liran Haimovitch 25:34
So, that's a wrap on another episode of The Production - First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.