Rookout CTO Liran Haimovitch sits down with Or Hiltch, VP of Engineering at JLL Technologies. They discuss how dentistry shaped his love of tech, working with and processing large amounts of data, why serverless should always win, and the tradeoff between ease of delivery and friction.
Liran Haimovitch, Or Hiltch
Liran Haimovitch 00:02
Welcome to The Production-First Mindset, a podcast where we discuss the world of building code, from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and Co-Founder of Rookout. I'm joined today by our guest, Or Hiltch, computer programmer, inventor, and engineer. Recently, JLL acquired his latest startup, Skyline AI, where he was the CTO. He is also an open-source developer, an AWS Machine Learning Hero, and the co-founder of the Tel Aviv CTO Club. Thank you for joining us, and welcome to the show.
Or Hiltch 00:51
Thank you, Liran. Thanks for having me.
Liran Haimovitch 00:52
Or, can you tell us a little bit about yourself?
Or Hiltch 00:54
Sure. So I had the privilege of growing up in a home where my father was, actually still is, a dentist. The thing about my dad is that he really likes computers, but he didn't really know a lot about operating them. That meant that at a pretty young age, I think around elementary school, I got the job of backing up a lot of the dental program's data, kind of like sketches and stuff, between the PC we had at home and the one we had at the clinic. By the way, that was almost 30 years ago, which means that I was one of the first kids on the block with a PC at home. That was my first experience with command lines, some sort of code. I'm not sure if you can call that code, but command-line operations. Following that, I got into chat software, like mIRC, which was highly popular at the time. And like many other geeks, I liked to tweak around with scripts in mIRC. So I learned about servers for the first time, and wrote some scripts there to automatically query and download stuff from the mIRC servers, customize the user interface, and what have you. Then a couple of years later, I think around high school, I got into actual programming. I got myself some C++ books and started writing simple games, and that's how I initially kicked off my career as a computer programmer. A lot of my future was kind of predetermined by this early start. When I grew up, I wrote software in a lot of different fields: cybersecurity, aviation, finance, until I had the amazing luck of meeting the people who later became my current co-founders… in one of their first startups. That was a cybersecurity startup that ended up being acquired by AVG, the antivirus security company, and I was actually the first employee at AVG post-acquisition. After a few years there, we decided to establish a new company together.
And that's how I made the switch, after about 10 years of operating and managing R&D groups as an engineer, into an entrepreneur. Then I had this amazing experience of building something from scratch at a company we called Stream Rail, which was about video optimization, and which ended up being acquired by ironSource. Now it's a public company, back then a much smaller organization. Post-acquisition, we actually started to invest in real estate, which eventually led to the idea of Skyline AI, which is where our story begins today. It's an AI investment management platform. You can think about it as something like an AWS for real estate: if you want to develop any kind of real estate application, you need to have solid foundations, and that's what we've been building at Skyline AI. Until a couple of weeks ago, as you said, it was announced that JLL, one of the largest real estate companies in the world, has decided to acquire the company. Nowadays, I'm a VP of Engineering at JLL.
Liran Haimovitch 03:55
Now, as you mentioned, JLL has acquired Skyline AI, reportedly for about $100 million. What was it like? What can you share with us about the experience, about the process?
Or Hiltch 04:07
So it was a super interesting process, because JLL was one of our corporate investors as well. The dynamic was really interesting: imagine that you end up being acquired by someone who actually knows all the cards, because as a shareholder they get quarterly reports of what's going on with the company. I'll tell it from the beginning: the incentive for a company like JLL to invest in Skyline AI was to get a glimpse of what's happening there, right? They keep hearing about technology and AI disrupting stuff and they wanted an inside look, which is why they were one of the earliest investors in the company. And then through this investment, we kept having a dialogue throughout the years about interesting things that we do together, stuff that we do with our clients, until eventually it was interesting enough for them to decide that they wanted to keep this weapon for themselves, and they decided to acquire the company.
Liran Haimovitch 04:57
JLL wasn't just one of your earliest investors, they were also one of your biggest customers.
Or Hiltch 05:01
Right. So at Skyline AI, as I mentioned, you can think about our technology platform as something like a pyramid structure. At the base, you have the data infrastructure: the ability to acquire real estate data on a daily basis, and all sorts of data integrity algorithms, entity resolution, and all of the things that you need to set up before you can build anything in the world of real estate data. On top of that, after a couple of years of solving this challenge of having the base layer, we developed kind of the Swiss Army knife of real estate algorithms. We have a pretty big data science team, led by Dr. Zeev Wax, which has been developing algorithms for things like transactability, which detects whether a certain property is likely to sell before it goes on the market; value prediction, so predicting how much it will sell for; rent prediction, which determines what rent a property should command; and so forth. We have a couple of dozen of these. And on top of that, we ended up developing platforms, so a lot of apps: applications for sourcing and originating debt or equity, applications for quickly and efficiently underwriting properties, and so forth. Since JLL is a global real estate services company, it's really all over the place when it comes to real estate operations. They have a large investment management company called LaSalle, they have a large property management business, and they have a large origination business around debt and investment sales. I think all of these applications were highly relevant for JLL throughout the years, and certainly now, at the point that we've taken this to.
Liran Haimovitch 06:48
You're a pretty big technology platform: you have a lot of data underneath, you have lots of cool algorithms that allow you to predict the future, or at least the present, and then you're building various services on top of that, allowing real estate analysts to kind of know what's going to happen. And JLL employs more real estate analysts than pretty much anybody else.
Or Hiltch 07:09
Yeah, so JLL is a pretty huge company, I think it's around 100,000 employees, which probably means that Skyline AI, now an internal R&D group within JLL, is the prop-tech platform with the largest number of users. So indeed, rolling out this weapon to such a large group of such diverse kinds of professionals is a pretty amazing experience for us.
Liran Haimovitch 07:33
I know it's still early, but this is going to be a big shift for you, from building a platform that you sell externally to a platform that you supply to your internal customers. How do you envision this change transforming the company?
Or Hiltch 07:46
That's a really good question. I think that it has the advantage of building something when you already know that there's going to be product-market fit. Now we're going to have the privilege of having the world's top real estate professionals as product people, eventually telling us what they need to really influence the bottom line, before we start developing it. When you develop in a vacuum, when you do this outside a company like JLL, you go through the normal product-market fit search: you can interview potential clients, but eventually, until you roll something out, it's pretty hard to know if you're spot on. I think one of the greatest benefits of doing this inside a large company like JLL is that we're going to focus only on stuff that's going to influence the bottom line with a high degree of certainty. No wasting time on developing cool stuff that's not going to be used, so we're all very effective and efficient.
Liran Haimovitch 08:44
So it's going to be much easier to tie the bottom line of the products you're building to the bottom line of the company and the customers you're serving.
Liran Haimovitch 08:52
Let's say you can see how much money JLL is making off a certain field, what the potential is, and how much you can make it better.
Or Hiltch 08:59
Not only that. It's not only looking at the bottom line, which you can see in a financial statement, it's actually a public company. It's also the ability to interview the actual analysts, the actual brokers, the actual investment specialists working there, to hear their opinions on what is currently missing, right? So how we can take what it is that they're doing and give it a 10x technology boost to really help them out.
Liran Haimovitch 09:21
You've mentioned that in the early days of Skyline AI, you spent a lot of time and effort on data collection. So how do you go about that? How do you identify data sources? How do you collect them, and how do you store them?
Or Hiltch 09:34
It's a good question. There are two parts to this question. The first is how we actually identify what is interesting, and the second part, which is the more technical part, is storing and actually processing the data. That's where I think the Skyline secret sauce is. When it comes to commercial real estate, I think there are two interesting things about the type of data sources there. First, there are actually a lot of pretty good data providers out there that perform old-school-style data collection: they're taking surveys, they're making calls, and they take care of the heavy lifting of collecting the actual raw data from the market. That's one type of data source. And then the rest is all about alternative data: it could be anything from restaurants around the property to population density from something like the US Census Bureau. Demographics, obviously, are very important. Pretty much everything could actually influence a real estate property, which means that there are quite a lot of data sources to collect, hundreds of data sets that are being collected on a daily basis. But the collection is actually the easy part. The hard part about real estate data is that-- You know, I'm 36 years old now and I think I've been doing data for maybe 20 years or something, and I've never seen data as challenging as it is in real estate. When you think about it, there is no real globally unique identifier for real estate properties. There's no ID for a property. You can get close, you have parcels, you have addresses, but they never actually uniquely identify things, which means that for starters, you have a huge entity resolution problem. Say you want to fetch data about a property from 10 different sources: one is providing you with transactions, one is providing you with leasing information, one is providing occupancies, and one is providing you with the restaurants around the property, what have you.
And then you have to come up with algorithms to actually say, 'hey, you know, this asset that we fetched from this data stream is actually the same asset as the one that comes from that data stream, even if their addresses are a bit different, even if they actually have conflicting information'. And then once you connect everything and you have this golden record, where you were able to tie all of the different sources into a single entity, you have to resolve those conflicts. If one source says, 'hey, you know, the asset was built in 1985', and a second says 1986, who is right? That's the hard part: processing the information, resolving those conflicts, and understanding the data. Real estate is actually quite complex, because it's highly tied to the financials; you have to understand the underlying debt, the amortization of the loan schedule, debt service coverage ratios, which change throughout the years. It's quite a complex domain to master, and that's part of the reason why training a data scientist at Skyline is almost like one year of college work, just learning the domain, understanding the different interactions between the features when it comes to real estate. So I would say that the bulk of the work, at least for the first few years at Skyline, was processing information and understanding it, until we ended up with something like a ground truth data set that we could use to build stuff on top of.
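To make the golden-record idea concrete, here is a minimal Python sketch of resolving a conflicting field, like year built, by majority vote across sources. The record layout, field names, and the voting strategy are invented for illustration; they are not Skyline AI's actual algorithms, which the conversation only describes at a high level.

```python
from collections import Counter

# Toy records for the same property, fetched from three different sources.
records = [
    {"source": "transactions", "address": "12 Main St",     "year_built": 1985},
    {"source": "leasing",      "address": "12 Main Street", "year_built": 1986},
    {"source": "occupancy",    "address": "12 Main St",     "year_built": 1985},
]

def resolve_field(records, field):
    """Resolve a conflicting field by majority vote across sources,
    ignoring sources that did not report the field at all."""
    votes = Counter(r[field] for r in records if r.get(field) is not None)
    value, _count = votes.most_common(1)[0]
    return value

golden_year = resolve_field(records, "year_built")  # two sources say 1985
```

A real system would likely weight votes by source reliability rather than counting them equally, but the shape of the problem is the same: many observations in, one golden value out.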
Liran Haimovitch 12:41
Can you share with us some stats on the amount of data you're storing?
Or Hiltch 12:44
Yeah, so without getting too specific, there are a few hundred data sources, and it's probably millions of data points, because a lot of these data sets have quite a few data points in them. It's mostly time series information when it comes to things like weather, which obviously has an impact on real estate; weather, rents, occupancies, everything that's real estate related. There are also a lot of static, or almost static, data points: things like elevation, how high the asset is, whether the asset is renovated or not, and a lot of actual photography data. When you look at what the asset looks like, you can actually extract all sorts of interesting insights and classifications from that, like whether it's a luxury apartment, or whether it's Grade A, B, or C, all sorts of things like that. There's also a lot of data around demographics; again, demographic data is also highly complex. Actually, the best example of a data source which is easy to extract but hard to understand is the US Census Bureau data set. It's thousands and thousands of tables with demographic information, but it can get pretty tricky to process until you get to a point where you can extract insights from it. There's also a lot of geospatial data, so our main data warehouse is Google BigQuery, which is nowadays branded as the serverless data warehouse. It's got pretty advanced geospatial capabilities, to be able to associate things like demographics per one-mile radius around a property and things like that. We do cover all multifamily assets in the US, so that's roughly 400,000.
For those of you who are not familiar with the concept, multifamily is basically a residential building where the main difference from a condo building is that in the multifamily setting, one cannot acquire a single unit; if you want a unit, you have to buy the entire building or property, and therefore it's an institutional asset class. Usually, the buyers and sellers in this market are professional investors and companies, not individuals. That means we also have a lot of information about these, so we have pretty substantial data sets about funds, fund operators, who sold to whom. We have entity resolution algorithms around owners too. If you take a company like JLL, it goes by a lot of names: you might see JLL, you might see Jones Lang LaSalle, which is what it stands for, or you could see just LaSalle asset management. There are a bunch of names, so you need entity resolution there as well. And that's it in a nutshell.
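The owner-name problem described above can be sketched with a tiny canonicalization table. The alias map below is invented for illustration (a production system would need fuzzy matching over millions of names, not a hard-coded dictionary), but it shows the basic move: normalize the string, then map every variant onto one canonical entity.

```python
# Hypothetical alias table mapping normalized name variants to one entity.
ALIASES = {
    "jll": "JLL",
    "jones lang lasalle": "JLL",
    "lasalle asset management": "JLL",
}

def canonical_owner(name: str) -> str:
    """Normalize whitespace, case, and punctuation, then look up the
    canonical entity; unknown names pass through unchanged."""
    key = " ".join(name.lower().replace(",", " ").replace(".", " ").split())
    return ALIASES.get(key, name)
```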
Liran Haimovitch 15:28
Now, you mentioned BigQuery, what other technology platforms are you using to make everything possible?
Or Hiltch 15:33
So I would say the general architecture at Skyline AI is that you have to think of it as kind of two ends. On one end, you have the data warehouse; that's where the data scientists start their work. Until the data actually arrives at the data warehouse, there's a lot of data engineering to be done, and that work is basically managed on EKS, the Elastic Kubernetes Service on AWS, by running an application called Argo Workflows. It's a platform that enables you to schedule jobs with dependencies between them on top of Kubernetes. Imagine this graph of applications that extract, transform, and load data: each and every ETL is a node in the graph, and it gets scheduled on a Kubernetes pod. That's what we refer to as the raw data pipeline. It runs every night, basically collecting data from hundreds of sources, and it ends up loading the data into Google BigQuery. Once the raw data is on Google BigQuery, there is a second pipeline, what we call the processing pipeline, or the logical pipeline, which is actually written in Python. It uses an application called Prefect, which is similar to Apache Airflow, which is a bit more well known in the industry: an application for writing directed acyclic graphs, so basically graphs of ETLs. This pipeline starts running when the raw data is already there. Its job is to engineer the features from the data. Let's say, for example, you have something like a weather data set that says 'okay, it was sunny here, it was rainy there'. You want to create a feature like 'is this a sunny place?', so you have to have a method of doing that. What defines a place as a sunny place? Maybe 30 consecutive days of no rain. That's an example of a feature you would have. These types of features get calculated there, and then all of the machine learning models are also queried for inference there.
Part of this processing pipeline is about basically making HTTP calls to Databricks, which we use as something like a managed MLflow solution, on top of AWS as well. That's where we have our machine learning model repository and also live serving. So the same processing pipeline I mentioned queries the Databricks live-serving models and generates inference, again on a nightly basis. This is loaded into a new Google BigQuery project; you could call the result something like a release package of a dataset. And then finally, this entire thing is migrated to all of our different applications. Once the data set and raw data have been acquired and processed, you have new predictions, everything is fresh, and it's time to serve this information to clients. In order to do that, we have a pretty extensive workflow of continuous delivery for data. The way this works is-- Continuous delivery is actually mostly about automated tests. That's the truth. We have applications that basically migrate all of the information from Google BigQuery into various data stores. The main ones are Redis, Elasticsearch, and Aurora RDS, SQL on Amazon, the databases which are used by our user-facing applications. These user-facing applications run on AWS as well, mostly using an application platform called Amplify on AWS, which is a managed platform for running apps, where the back end for those applications is mostly AWS Lambda, using the Serverless Framework, and API Gateway, and all that good stuff. The front end is mostly Angular applications stored on S3, served via the Cloudflare CDN, and so forth. That's the general architecture.
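The "is this a sunny place?" feature mentioned above is easy to sketch in isolation. In production this would run as a task inside the Prefect pipeline against warehouse data; here it is just a standalone function, with the 30-day threshold taken from the example in the conversation and the input format (daily rainfall in millimeters) assumed for illustration.

```python
def is_sunny_place(daily_rain_mm, dry_spell_days=30):
    """Feature-engineering sketch: a place counts as 'sunny' if it has
    at least one run of `dry_spell_days` consecutive days with no rain."""
    run = longest = 0
    for rain in daily_rain_mm:
        run = run + 1 if rain == 0 else 0  # extend or reset the dry streak
        longest = max(longest, run)
    return longest >= dry_spell_days
```

For example, 35 straight dry days qualifies, while two 10-day dry spells separated by one rainy day does not.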
Liran Haimovitch 19:22
That's cool. So you have three layers of data processing: one for fetching the data, one for processing the data, and a third for exporting the data to user-facing applications. And then you have a bunch of user-facing applications for the analysts and customers to benefit from all the real estate data. That sounds like a lot of data processing being done every single day. How do you go about monitoring that?
Or Hiltch 19:47
Yes, it's definitely a lot of data to work with, and we have a lot of challenges around that. I think we are lucky to be living in an age where we have a lot of these serverless-type applications and platforms. I think that Google BigQuery is a huge win there, because you can actually use it to orchestrate a lot of the heavy-duty data calculations, as you would with something like Spark. You can transform data there, you can actually even train machine learning models right on BigQuery. We're lucky to live in an age where we don't have to manage the servers bearing all of this load; it's being taken care of by the cloud providers, be it Google with BigQuery, or Databricks on AWS for machine learning model training, and all of these things. Actually, the most challenging bit of continuously delivering this type of big data is testing. Data can get pretty wild, especially when it's collected from sources which are public. Many times you have discrepancies, you have errors in the data, you have types being changed. So a lot of effort is put into validation of this data. The majority of the code in this data pipeline is actually testing data, or just making sure that once everything has been fetched and processed, it makes sense: can we continue to deploy this automatically to production? That, I would say, is one of the main challenges when handling this type of data.
Liran Haimovitch 21:12
So for you, testing production is a day-to-day effort, because you're not just testing your code, you're also testing the data that's coming in and making sure it doesn't change. For instance, some data source you're relying on has changed the unit of measurement or the format or anything else, and your data gets messed up.
Or Hiltch 21:30
Exactly. We're using various methods for that, but one of the nice platforms that we use is called Great Expectations. It's a platform for writing expectation suites, kind of the same way you would write assertions in normal code, but for data. The cool thing about this platform is that you can really end up running a lot of pretty generic tests and connect them to any type of database driver.
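The idea behind an expectation suite can be shown without the library itself. The sketch below is a hand-rolled, dependency-free stand-in: the function name mimics Great Expectations' naming convention, but this is not the real library's API, and the batch data is invented. The point is the workflow: declare what valid data looks like, validate each nightly batch, and block the deploy when the expectation fails.

```python
def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Validate that every row's `column` value falls within a range,
    returning a result object in the spirit of an expectation suite."""
    failures = [r for r in rows if not (min_value <= r[column] <= max_value)]
    return {"success": len(failures) == 0, "unexpected_count": len(failures)}

# A toy batch of freshly ingested rows; one has a nonsensical build year.
batch = [{"year_built": 1985}, {"year_built": 1986}, {"year_built": 3025}]

result = expect_column_values_to_be_between(batch, "year_built", 1800, 2030)
# result["success"] is False, so this batch would not be promoted to production
```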
Liran Haimovitch 21:53
How often do you get alerted that something has gone wrong?
Or Hiltch 21:57
I don't think there's been a single day without an error in the data, especially at this scale. It's pretty wild. But again, it's also because it's the type of data that you gather from sources that you don't control. Let me contrast that with something-- I mentioned my previous company, Stream Rail, in the introduction. At Stream Rail, we worked with a lot more data when it comes to size; we had petabyte-scale data we used to collect, especially if you look at the time. Just prior to the acquisition by ironSource, we were at a point where we were collecting something like 2 million events per second. That's really petabyte-scale information. However, that data, even though it was much larger, was much easier to manage, because we had really tight control of what the data looked like, because we would write the actual data. A user would interact with a video player, and we would be the ones logging the events. Then you could tell that if there is something weird with the data, it's probably a bug on your end. That's a really beautiful world to be living in, compared to real estate. When it comes to real estate, you rely on public sources, let's say the state, the county assessors, or governmental sources like the US Census, a New York City governmental website, or whatever. So a human error could actually end up as an alert on your pipeline.
Liran Haimovitch 23:15
And once those alerts go off, what do you do about it?
Or Hiltch 23:18
So we basically have ways of-- When you look at the CI/CD pipeline of the data that I was mentioning, it's managed with Argo Workflows on Elastic Kubernetes Service, which means that it's resumable. Whenever we detect an error, we can pretty quickly fix it and then resume, and that's a process that we really do a lot at the company.
Liran Haimovitch 23:36
When you mentioned CI/CD of data, you meant the process of importing the data from the outside world, integrating it into your internal format, and processing it to deliver a fresh set of insights to your customers.
Or Hiltch 23:50
Exactly. Every day in real estate, there are new transactions, assets go on listing, assets go off listing, occupancies change, rents change. Actually, these days you have daily pricing systems set up on properties, so it's stuff that you need to refresh every day. Oftentimes, you do not need to actually train machine learning models every day; that you could do once a month or even once a quarter, depending on the model, but you definitely want to show really fresh data to the user.
Liran Haimovitch 24:21
Now, what do you use for CI/CD in your own application code, not just the data?
Or Hiltch 24:25
So our stack is a pretty classic CI/CD stack. We're using GitHub for managing the repositories. We're using a Git-flow-style workflow, so we have feature branches, release branches, and development branches, and all these things. And then we're using CircleCI for managing all of the integration and deployment. Most of the applications are serverless applications. Unfortunately, it's not always possible to go serverless, but when it is, we will definitely use it. We really like the Serverless Framework; it's a great enabler to easily deploy stuff to multiple cloud environments, which we need because we have our main applications on AWS but the data warehouse on Google, so having something like the Serverless Framework is very useful there. And we have a push-to-deploy style architecture: whenever you push something, it gets built on CircleCI and then deployed to the correct environment using serverless deploy mechanisms. That's a big win, because one of the greatest benefits of something like AWS Lambda, or serverless functions in general, is that they are designed in a well-architected way for continuous integration and delivery. I think part of that is that in things like Lambda, whenever you deploy new code for a function, you get a new version. Pretty quickly, you end up having multiple versions on the shelf, and then eventually you can decide which listener you want to connect to which version. Imagine you're a developer on the team and you want to test some new functionality: you can write some new code for the function and deploy it. It could actually be deployed to the actual production environment, but no one will be listening to or looking at this change, because it's a newer Lambda version.
And then you could actually use something like a load balancer, or API Gateway rules, to personally be the only one that gets the new version that you just deployed, and test it. Since Lambda also supports tagging, you can have… that's the most popular tag, and deploying new versions just means tagging them. It's amazingly nice for things like continuous delivery of code.
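The version-routing idea described above, where a gateway decides which deployed Lambda version a request hits, boils down to weighted selection. Here is a toy, cloud-free simulation of that routing decision; the version labels and the 95/5 split are invented for illustration, and real AWS canary routing is configured on an alias or gateway rather than written by hand like this.

```python
import random

def pick_version(weights, rng=random.random):
    """Pick a function version by cumulative weight.
    `weights` maps version label -> traffic fraction (summing to 1.0);
    `rng` is injectable so the routing decision can be made deterministic."""
    r = rng()
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # guard against floating-point rounding at the top end

# Canary deploy: keep 95% of traffic on the stable version "41",
# send 5% to the freshly deployed version "42".
weights = {"41": 0.95, "42": 0.05}
```

Injecting `rng` also mirrors the "be the only one who gets the new version" trick: a rule that always routes your own requests to the new version is just a deterministic weight function.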
Liran Haimovitch 26:47
You've mentioned you're using serverless whenever possible, but sometimes it's not possible. Why is that?
Or Hiltch 26:53
Yeah. So first, concerning the statement of using serverless when possible, it's definitely true. Something I noticed over my years in the industry is that computer programmers and engineers have this natural adaptation of the Dunning-Kruger effect, where the less you know about something, the more confident you are about your knowledge there, and the more you learn about it, the more you realize what you don't know. I think, as an industry, as engineers, we have to have that mentality, because being a good developer means that you need to know a lot about a little, right, and a little about a lot, kind of like using abstractions. You want to operate different libraries, you want to be able to play with these Lego bricks on the cloud without actually understanding everything. Oftentimes, when it comes to hardcore engineering, people tend to prefer the flexibility of doing things on their own. Even if you take something like Kubernetes, which has fully managed versions like EKS, AKS, and GKE, even there, I would say that the less experienced DevOps engineer would prefer managing the cluster on their own instead of using something managed like Lambda. But when you look at it from the perspective of the entire R&D organization, that's usually not a good choice, in my opinion. I think whatever can be managed for you, and whatever can be serverless, should be. In the tradeoff between ease of delivery and low friction versus flexibility, I think ease of delivery always wins.
Liran Haimovitch 28:35
Makes sense. So you've done a lot of work, from your early days working at the dentist's office making backups, through lots of other stuff, to founding Skyline AI. What bugs do you remember the most from that career?
Or Hiltch 28:52
Unfortunately, I have a lot of bugs to tell about, but two huge bugs stand out. One is from my time at a large financial institution; I used to be an R&D manager for a large financial institution here in Israel. We had this application one time that was supposed to generate leads for a call center. There was this call center, like 50 people who were eager to make calls and make offers to clients, and we had to generate these lists of phone calls for them. The phone calls would actually happen automatically, using IVR and these types of systems, and then the telemarketing guy would make the call. We had written this ETL application that scans the entire client database and then determines who needs to get a call for the specific financial product that's well suited for them. It was using SQL, and that application had the assumption that IDs were unique, which is a pretty fair assumption to make, because everyone has an ID. But it turned out we had a few contacts in the database which for some reason did not have an ID, so that field was null. What happened is that, sometime during the join operation in the SQL of the application that generated the leads, those specific contacts that had no ID were assigned thousands of calls. Within 10 minutes, these people got hundreds of calls from the telemarketers at the call center, offering them the product. Obviously, that didn't end really well. So that's one bug that stands out. Another one I remember is from my previous company, Stream Rail. At Stream Rail, in order to optimize video delivery, we used to collect and cache videos on our end. Let's say you have this video advertisement that you want to publish as an advertiser: we would basically download that video, transcode it to some encoding that is optimized for delivery on a mobile device, and then we would access this cache of videos using the URL of the video as the identifier.
That worked pretty well until one day I got a call from one of my clients that said, 'Hey, you know, I'm looking at this page we have where all the video ads are, and instead of seeing advertisements, I keep seeing this video of some guy feeding an alligator at the zoo.' That was a pretty popular website with lots of traffic. So I asked, 'Are you sure?'. And he said, 'Yes, I'm seeing this alligator'. He was pretty puzzled about what he saw. He sent me the link, and I watched it, and indeed, I saw a video of someone feeding an alligator in a zoo on all the videos on the website, instead of advertisements. After a couple of minutes of laughing, we had to understand what caused it. What ended up being the problem was that a specific video ads provider had dynamic URLs, so the same URL would return different videos. And since we used the URL as the cache key for the video, it was enough for one response to be this video, and then we basically served that video for all placements for this specific provider. That was a pretty defining bug as well.
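The alligator bug is a clean lesson about cache keys: caching by URL is only safe if a URL always returns the same content. Here is a toy reconstruction of the failure mode; the URL, filenames, and the `origin` callable standing in for the ad provider's server are all invented for illustration.

```python
cache = {}

def fetch_and_transcode(url, origin):
    """Cache transcoded videos by URL. `origin` is a callable standing in
    for the ad provider's server: it takes a URL and returns a video."""
    if url not in cache:
        cache[url] = origin(url)  # whatever comes back first wins forever
    return cache[url]

# A dynamic origin: the SAME URL returns a DIFFERENT video on each request.
responses = iter(["alligator_feeding.mp4", "actual_ad_1.mp4", "actual_ad_2.mp4"])
dynamic_origin = lambda url: next(responses)

first = fetch_and_transcode("https://ads.example/video", dynamic_origin)
second = fetch_and_transcode("https://ads.example/video", dynamic_origin)
# Both placements get "alligator_feeding.mp4": the first response was cached
# under the shared URL key, so every subsequent request is served from cache.
```

The fix is to key the cache on something that actually identifies the content, such as a hash of the video bytes, rather than the request URL.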
Liran Haimovitch 32:08
Thanks for being here, Or. It was a pleasure having you and learning a lot about prop-tech, big data, machine learning, and so on.
Or Hiltch 32:16
Pleasure is all mine.
Liran Haimovitch 32:17
Thanks for listening in. That's a wrap on another episode of The Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show, and reach out to me on LinkedIn or Twitter at production first. Thanks again for joining us.