Rookout CTO Liran Haimovitch sits down with Shir Chorev, CTO at Deepchecks. They discuss what machine learning means to her, why we’re far away from aliens taking over the world, how machine learning is different from traditional engineering, and where automated tests shine in machine learning.
SPEAKERS
Liran Haimovitch, Shir Chorev
Liran Haimovitch 00:02
Welcome to The Production - First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and Co-Founder of Rookout. Today, we're going to be discussing testing machine learning applications. With us is Shir Chorev, CTO and Co-Founder of Deepchecks. Thank you for joining us and welcome to the show.
Shir Chorev 00:41
Thank you, happy to be here.
Liran Haimovitch 00:42
So Shir, what can you tell us about yourself?
Shir Chorev 00:44
So, for the past 12 years, I've been in tech. I started out in Unit 8200, where I worked on low-level cyber research after finishing the program. For the past six years, I've been in the field of machine learning, initially working on it for various anti-terror purposes, using different types of data sources to find interesting phenomena. And for the past two years, I've been at Deepchecks, where we help machine learning models work in a better and more trusted manner.
Liran Haimovitch 00:44
Now I know everybody's talking about machine learning and I would love to know what machine learning means for you.
Shir Chorev 01:20
So, I wouldn't quite call it a popular misconception, but looking maybe 10 years back, when people thought of machine learning they would imagine robots taking over the world and having an understanding of their own, which today we would call artificial general intelligence. We're still quite far away from that. I'd say machine learning is really a simple concept: you take data and the correct answers, or as we call them, the labels, and have a computer, a model, figure out the correct relation between them using various statistical and optimization methods, so it learns the algorithm by itself. That's probably how I see it currently.
Liran Haimovitch 01:59
So essentially, a machine learning application is some sort of application that can take a piece of data and label it somehow.
Shir Chorev 02:07
Right. And the way it does that is usually based on previous data and labels. There are some exceptions to that.
Liran Haimovitch 02:13
So, we train the application to be able to tell us stuff about new data that's going to come in the pipeline, sometime in the future.
Shir Chorev 02:21
Exactly. And we don't do it based on heuristics that we state explicitly, unlike in regular algorithms, yes.
Liran Haimovitch 02:27
Which is where the machine learning part comes in.
Shir Chorev 02:30
Right.
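To make the distinction between explicit heuristics and learned rules concrete, here is a minimal sketch in Python; the loan-approval scenario, the income feature, and the threshold are all hypothetical, chosen only for illustration.

```python
# Minimal sketch contrasting an explicit heuristic with a learned model.
# The loan/income scenario and the 50,000 threshold are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Regular algorithm: a human states the rule explicitly.
def approve_loan_heuristic(income: float) -> bool:
    return income > 50_000  # threshold chosen by a person

# Machine learning: we only provide data and labels; the model finds the rule.
X = np.array([[20_000], [35_000], [60_000], [90_000]])  # feature: income
y = np.array([0, 0, 1, 1])                              # label: was the loan repaid?

model = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(model.predict([[45_000], [70_000]]))  # decisions from the learned boundary
```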
Liran Haimovitch 02:30
You'd been doing a lot of engineering before you delved into machine learning. So, how is machine learning different from traditional engineering?
Shir Chorev 02:39
So, I'd say that conceptually there are quite a few differences and challenges. One of them is that machine learning is essentially more like algorithmic research: you try to understand whether the model learned successfully or not, and it's often very hard to understand what exactly the model learned and what it was supposed to learn. So the modeling part is quite mathematical and algorithmic, while many of the actual challenges in developing and deploying machine learning are actually everything surrounding the model. You need people with a very deep understanding of algorithms and data, but the challenges are usually on the engineering side: model serving, data engineering, data cleaning. So there are lots of challenges that aren't actually in that specific area, and it's a different kind of connection between those two areas.
Liran Haimovitch 03:31
So, if I were to compare it to traditional engineering, where a product manager comes and writes a spec about what needs to be done, then in machine learning this spec has to be converted into something that can be used for learning, whether it's the input data, or the labels, or anything that comes with them. We kind of have to define whatever we want to teach the computer, and then make it accessible for the computer to learn.
Shir Chorev 03:55
Right. The way you represented it, I would split the differences into two parts. One part is everything that's around the model, so the engineering versus the modeling aspect. And the other part, which I think is what you're relating to, is that it's relatively more research-like: you may have the spec you're trying to achieve, but like many research tasks, you don't know whether the data is good enough or whether you'll be able to get there. So there are those two areas, I'd say.
Liran Haimovitch 04:21
Now as you go through this process and build those models, the next step, kind of like in software engineering, is testing: testing whether the model you've built, the application you've built, is good enough, testing it for various edge cases, testing its performance, testing everything you would test in a normal software application. But how is it different? How is testing machine learning applications different from traditional software engineering?
Shir Chorev 04:47
So first of all, machine learning is a relatively new field. Just like in software, there was quite a long process of working out how we do it: do we do it manually, do we do test-driven development and work in a methodological manner? I'd say machine learning is a few steps behind in that respect. So there are many differences in how we approach it and what should be tested. I think a big part of what makes testing machine learning specifically hard is that it's more challenging to understand what's supposed to happen. If I take a few concepts from software, let's look at something like coverage testing. In software, one of the things we want to do is go over all of the branches of the code, and then we think, okay, if we went over all the flows, that represents relatively thorough testing and gives us some peace of mind. But for machine learning, that isn't even possible. How do I cover everything? Cover all of the input space? That's theoretically infinite. And do I cover the model? Say it's a neural network, do I cover all of the paths the numbers should flow through, all of the branches? So in that aspect, it's much more challenging to understand what to test. And there are additional aspects, like the fact that it's very likely to fail silently, which is very troubling: you put in numbers, and even if those numbers don't represent what you think they represent, the model still takes a number in and puts a different number out. That's just how it's built. Sometimes you'll discover the problem only many months later, especially in production where you don't even have the labels, the correct answers, and you suddenly understand that some loans you approved a few months ago are actually performing much worse than you thought.
Liran Haimovitch 06:23
Now on the one hand, you can obviously go about manual testing, saying this is something I want to test, whether it's an edge case you find interesting or a performance test you want to run. But you can also automate things. And as we've seen in software engineering, automation is great, because it allows you to retest time and time again, for every small modification, often very quickly. But automated testing is not without its price, whether it's designing the automated tests, running them and spending compute on them, or dealing with the instability of some tests. So, from your experience, where do automated tests shine in machine learning?
Shir Chorev 07:06
So, I'll say that initially, when our own projects ran into these problems of machine learning projects having unstable or unknown behavior, we tried to understand how we could catch them as early as possible. And when you ask peers, or search the internet, or check existing tools, you basically understand that there isn't yet something in place that says, okay, this is how you do it, you have a unit test for this part. So I'd say the field is maybe a few steps away. I'll be happy to describe how we do automated testing, but conceptually this is something that isn't yet in place. I'd say that's also one of the reasons that solutions like this, or from our experience, our package, are being quite keenly adopted: people are craving it. Because what usually happens currently is that when something is about ready to go to production, you probably have the most senior member of the team go over it manually, check everything, and verify that no edge cases were missed, and so on. So let's say we're on the verge of transitioning to automated testing. What we saw is that some things that really have to do with the methodology of machine learning, the fact that we have training data and test data, that we retrain over time, and that models are built in a few common ways, let us treat it as a generic approach: let's check things about data integrity, about data distributions, about model performance. These are relevant for basically any model. Of course, what exactly counts as a problem is something you have to decide in your specific domain. But what we really like about automatically testing machine learning is that the ideas are the same, and you only have to do the customizations and adaptations for your use case. That lets you work in a more methodological manner, with a framework that helps you check things comprehensively, with less rewriting of code and less manual inspection every time.
Liran Haimovitch 08:57
So you mentioned the most senior member of the team would probably go ahead and do a bunch of manual tests. What would they be focusing on? What would be important to test? How do they go about testing it?
Shir Chorev 09:09
So by the way, I'm not sure it's a bunch of manual tests; it's rather going over it together with the researcher or data scientist who developed the model and making sure what they did or didn't test for. I think it really splits into various areas. Probably the most straightforward is evaluating the performance of the model. It's very typical to do that at a high level, okay, let's look at the overall performance. But what about things like, how do I perform on a specific demographic? Or is there a difference in the model's performance between various demographics? For obvious reasons, some use cases, for example insurance-related ones, sometimes have to have the same performance or the same kind of estimations for different genders and things like that. So I'd say one area is everything that has to do with performance evaluation. Another area, and I think every data scientist at some point in their career makes a very embarrassing mistake of not noticing something in the data, is the data itself. For example, you have a specific leakage in the way you collected the data: you didn't know it, but you have many more car accidents in the data from Europe, while the data you collected from the US has far fewer accidents. That has to do with the collection, not even with your model. But it's something you have to manually inspect: does my data represent the world of my problem? Obviously, as you get more experienced, there are more and more pitfalls you want to look at, to get an overview of your data and make sure it works as you expect, or as you understood that it does. So I'd say the senior person, someone experienced, will walk through what was or wasn't tested, and whether it represents what we expected.
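As a concrete illustration of the per-demographic evaluation described here, below is a rough sketch using pandas and scikit-learn; the column names, the `test_df` frame, and the fitted `model` are assumptions made for the example, not anything specific to Deepchecks.

```python
# Rough sketch: compare model accuracy across demographic segments.
# The "gender"/"label" columns, `test_df`, and `model` are hypothetical.
import pandas as pd
from sklearn.metrics import accuracy_score

def accuracy_by_segment(df: pd.DataFrame, preds, segment_col: str, label_col: str) -> pd.Series:
    """Return accuracy per value of `segment_col`."""
    df = df.assign(pred=preds)
    return df.groupby(segment_col).apply(
        lambda g: accuracy_score(g[label_col], g["pred"])
    )

# Usage, assuming a held-out test set and a fitted model:
# preds = model.predict(test_df[feature_cols])
# print(accuracy_by_segment(test_df, preds, segment_col="gender", label_col="label"))
```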
Liran Haimovitch 10:44
Now let's assume I have this model and I've been working on it for, I don't know, a few months now, and I've released a few versions. So I have a great checklist of everything that needs to be tested whenever I release a new version of the model. Now I want to go about automating that. What should I do?
Shir Chorev 11:00
I'll take one step back and say that you don't, or shouldn't, wait until you get to that phase. The way we look at testing in machine learning is, for example, things that have to do with data integrity can show up at any phase. They can show up in your initial data set, and they can show up when you've retrained your model and new data is coming in and suddenly some data source changed, because whoever you bought data from didn't tell you they changed something. So I think you have to do two things. One is to have a checklist and check specific things, again around data integrity, methodology, and evaluating your model. And what you can or should check for really varies with the phase you're in. Some things are very relevant in the research phase, because, for example, the case I gave of data being collected in a representative manner is something you want to check and verify much earlier.
Liran Haimovitch 11:52
Potentially as soon as you start collecting the data. As you're collecting it, you want to check that every new data source you bring into your research is aligned with what you want to train the model to do.
Shir Chorev 12:04
Exactly. And the same for data that is maybe stale, or has things that don't represent reality for various reasons, just mistakes in the data. So essentially, what you want is a very elaborate checklist, and then to understand which of those checks can be run in the same manner each time, so you don't have to rewrite them: you can have some library or framework in which you implement the relevant checks. And then you run them in the relevant phase, which I would split into a few. One is during research, both when you first get the data and when you finish the research. Another is when you retrain the model, which we usually do once a day or once every few weeks, depending on the use case. And then there is the continuous manner, though that would already be called monitoring once it's in production. So I'd say it's in research, in deployment, and in production.
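One way to picture this "checklist as a reusable library" idea is the sketch below; the individual checks and thresholds are illustrative choices, not the checks Shir's team actually runs.

```python
# Rough sketch of a checklist as code: each check takes a DataFrame and
# returns (passed, message). The checks and thresholds here are illustrative.
import pandas as pd

def check_no_nulls(df: pd.DataFrame):
    n = int(df.isna().sum().sum())
    return n == 0, f"{n} null values found"

def check_no_duplicate_rows(df: pd.DataFrame):
    n = int(df.duplicated().sum())
    return n == 0, f"{n} duplicated rows found"

def check_label_balance(df: pd.DataFrame, label_col: str = "label", min_frac: float = 0.05):
    frac = df[label_col].value_counts(normalize=True).min()
    return frac >= min_frac, f"rarest label is {frac:.1%} of the data"

CHECKS = [check_no_nulls, check_no_duplicate_rows, check_label_balance]

def run_checks(df: pd.DataFrame, checks=CHECKS) -> bool:
    ok = True
    for check in checks:
        passed, msg = check(df)
        print(f"{'PASS' if passed else 'FAIL'} {check.__name__}: {msg}")
        ok = ok and passed
    return ok

# The same run_checks(...) call can be made during research, before every
# retrain, and on batches of production data.
```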
Liran Haimovitch 12:49
And you've recently released an open-source project focusing on exactly that: automating some of the most common tests engineers and researchers should run on their data and models. Right?
Shir Chorev 13:01
Right. While working with customers, seeing many machine learning models in production, and analyzing the problems they face, we understood that many of these problems could have been identified much earlier, whether in the research phase or right before the models were actually deployed. And we were thinking: how can this be? Why do these types of things happen? How can we help wider adoption of machine learning in general, and what are the missing tools? Talking with many more people and asking them how they test their models, what the process in their teams is, whether they do peer evaluation and how it looks, we understood that on one hand it's something that's very much in people's awareness, but on the other hand there isn't yet a solution. One of the things we thought could really help advance the field, and basically every data scientist, whether they're still in their academic research phase, just entering the field, or actively deploying models to production on a daily basis, is having a way to easily and comprehensively check the things that you know might go wrong, with no reason to wait. That was our motivation for approaching this area of offline evaluation, I'd say.
Liran Haimovitch 14:12
And the open-source project is called--
Shir Chorev 14:14
Deepchecks.
Liran Haimovitch 14:15
--Deepchecks. So, how would you go about using it if you were doing research? Or if you were releasing a new model? How would you go about employing Deepchecks to improve the quality of the model?
Shir Chorev 14:28
So, if you want to run Deepchecks, well, it's a Python package. What you need is to import it and then give it your data, or your data and your model, depending on the phase you're in. If you're just starting, you have only your data; if you already have an initial version of your model, you can give it that as well. And then it runs a very long list of checks on your data, on your model, and on your performance. And of course, you can also add various custom checks or give it your own metrics and things like that. The idea is to have something that you can really--
Liran Haimovitch 14:55
Extend.
Shir Chorev 14:56
--Both extend, and also something where you don't have to put in the effort every time of thinking again about what's relevant for your domain and what's relevant in general. We do sometimes find interesting conclusions or surprises. One cute example: we had someone trying the package on something they knew had a problem, but they weren't sure whether it would give interesting insights and catch it. And a simple test that people often don't run is checking your model against various types of baselines. For example, a simple heuristic model, or what if I just guessed the most popular answer? That can be a very interesting indication of something going wrong that you're not necessarily aware of. If I see my accuracy go down from 0.9 to 0.85, okay, those things can happen. But what if a very naive model outperforms me? That's a very good indication. So many times, just having a very wide range of tests and checks can help spotlight different types of surprises that you understand shouldn't be there.
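For readers who want to see what running the package might look like, here is a minimal sketch of a Deepchecks suite run on tabular data. The import paths and argument names below may differ between Deepchecks versions, so treat them as assumptions rather than a definitive recipe; `train_df`, `test_df`, `"target"`, and `model` are placeholders.

```python
# Minimal sketch of running a Deepchecks suite on tabular data.
# Import paths and arguments are assumptions and may vary by package version.
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite

# Wrap the raw DataFrames; train_df/test_df, "target", and `model` are placeholders.
train_ds = Dataset(train_df, label="target", cat_features=["country", "gender"])
test_ds = Dataset(test_df, label="target", cat_features=["country", "gender"])

# Run the built-in checks on the data, the train/test split, and the model.
result = full_suite().run(train_dataset=train_ds, test_dataset=test_ds, model=model)
result.save_as_html("deepchecks_report.html")  # or result.show() in a notebook
```

An independent way to run the naive-baseline comparison Shir mentions is scikit-learn's `DummyClassifier(strategy="most_frequent")`, which simply predicts the most popular answer and gives you a floor your real model should clearly beat.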
Liran Haimovitch 16:01
How has the reception of the open-source project been so far?
Shir Chorev 16:03
So, it was an interesting process, because initially, when we released it, we saw a lot of traction of people seeing the repo and recommending it to other people, and it was cool to see that, mainly from people in the space. We really didn't know what to expect. When we released the package, our internal criterion was to reach a phase where, when we asked beta testers whether they would recommend it to a friend or use it themselves, their answer was yes, or usually yes. So that was the criterion, but we still didn't know how it would look. We basically open-sourced it, put out a post about it, and waited to see what would happen. And it was a really interesting process, because initially, I guess because of the topic and the fact that, as I said, awareness is quite heightened, it received lots of attention, but mainly attention of, oh, this is cool, let's check it out: stars, and traffic, and things like that. And then, a few weeks later, you suddenly start to see usage patterns, and suddenly we have issues and questions. So it was nice to see people first adopting or liking the concept and then starting to work with it and check it out. We're still in that process, but it's a really nice journey to experience and see how it evolves.
Liran Haimovitch 17:18
Now, if we go back to the software development lifecycle, or the machine learning development lifecycle, what happens after we've built out our model, we've tested it, we're happy with the performance, and we've gone through all the sanity checks out there? What do we do now, and what do we have to worry about in the next step, as we're heading into production?
Shir Chorev 17:37
I'll split it into two. One thing is when we initially deploy: we have to make sure - and these are also types of bugs that we've seen with some customers and that surprised us, but I guess these things happen - that the production pipeline actually behaves the same as the training pipeline. You may have just updated some feature engineering, say you trained the model on data that was normalized so a feature is between zero and one, but in production it's still between zero and 100. Again, by the way, it fails silently; these are the types of things that can happen there if you don't check your distributions and monitor them continuously. So I'd say the first thing is to make sure that in that initial phase, the model in production, live, really looks and behaves as it should and as it did in training. The second area is the fact that data is very dynamic and the world changes. Suddenly shoppers' behavior is different because of COVID, or there was a specific event your model wasn't trained on, Christmas, for example. There are obviously other things you have to be aware of, but sometimes there was a specific promotion you didn't know of, and suddenly the data doesn't represent reality and you're mispricing all of your offerings. So the second part is continuously making sure that the data and the models still represent reality as you think they should. And for that there are various areas, like retraining - when I retrain, how do I check that the retrained model behaves as I expect? - or in general, how do I continuously verify that the data and model are still relevant?
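The normalization mismatch described above, a feature scaled to 0-1 in training but arriving as 0-100 in production, is exactly the kind of silent failure a simple range check can catch. A rough sketch follows; the frames, columns, and margin are hypothetical.

```python
# Rough sketch: record feature ranges at training time and flag production
# batches that fall outside them. Frames, columns, and margin are hypothetical.
import pandas as pd

def record_feature_ranges(train_df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the min/max rows of describe() for the numeric columns.
    return train_df.describe().loc[["min", "max"]]

def check_production_batch(batch: pd.DataFrame, ranges: pd.DataFrame, margin: float = 0.1):
    """Return alerts for columns whose values leave the training range (plus a margin)."""
    alerts = []
    for col in ranges.columns:
        lo, hi = ranges.at["min", col], ranges.at["max", col]
        slack = margin * (hi - lo)
        bad = ((batch[col] < lo - slack) | (batch[col] > hi + slack)).mean()
        if bad > 0:
            alerts.append(f"{col}: {bad:.1%} of values outside training range [{lo}, {hi}]")
    return alerts

# Usage, assuming numeric_cols, train_df, and a production batch exist:
# ranges = record_feature_ranges(train_df[numeric_cols])
# for msg in check_production_batch(prod_batch[numeric_cols], ranges):
#     print("ALERT:", msg)
```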
Liran Haimovitch 19:08
You have any tips or recommendations about that?
Shir Chorev 19:10
So, the way I see it in machine learning is that the challenges are pretty much the same throughout the process and lifecycle, but the emphasis is a bit different. In research we want to understand and analyze the weak spots of our model and how we can improve them, or whether we have problems with the labeling and things like that. In production, I'd say the focus should again be on continuously monitoring and checking, but concentrated on a few specific areas: data distributions; performance evaluation if I can, and if not, then things like the distribution of the model's predictions and whether it's rapidly changing; and always data integrity, which stays relevant because things continuously change. I do think it's the same mindset, but the focus is just a bit different when you're monitoring. The challenges are also different, because now I have to do it continuously, over time, and I have to take note of when I retrain and how I check it. I'd say that's the way to look at it: really have a comprehensive checklist and go through it.
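A common way to quantify whether the distribution of the model's predictions is rapidly changing is the Population Stability Index. The sketch below uses quantile bins from a reference window and the conventional 0.1/0.25 rules of thumb; none of this is specific to Deepchecks, and it assumes mostly continuous scores.

```python
# Rough sketch: Population Stability Index (PSI) between a reference window of
# model scores and a recent production window. Bin count and thresholds are conventions.
import numpy as np

def psi(reference: np.ndarray, recent: np.ndarray, n_bins: int = 10) -> float:
    # Bin edges come from the reference distribution (assumes continuous scores).
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    new_frac = np.histogram(recent, bins=edges)[0] / len(recent)
    # Small floor avoids division by zero / log of zero for empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    new_frac = np.clip(new_frac, 1e-6, None)
    return float(np.sum((new_frac - ref_frac) * np.log(new_frac / ref_frac)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 worth a look, > 0.25 significant shift.
# score = psi(reference=last_month_scores, recent=this_week_scores)
```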
Liran Haimovitch 20:09
Now, I think you have some offering around that at Deepchecks.
Shir Chorev 20:13
Right. So, we look at making machine learning models work as expected over time as a wider notion, from research to production. And the area of monitoring during production is the area of our enterprise offering. It quite naturally relates to validation as well, because some of the same things I want to check before I deploy, I'll usually still want to check over time. Yeah, that's our, let's say, paid product.
Liran Haimovitch 20:39
Interesting stuff. Before we wrap this up, there's one question I ask all of my guests and I would love to hear your input on: do you want to share with me some interesting bugs you've had?
Shir Chorev 20:44
I had a project which had something like 50 features, and after some complex feature engineering, I managed to take it down to only two features, which had many advantages and made it much leaner. And just before deploying the final project, I rechecked the model performance, and suddenly I saw it was random. It took me something like a day to debug what happened there. What I understood is that, while I was trying to optimize the way the model was trained, one of the things I did was switch to a … framework, just a different framework from the one I was working with before, in which the order of the samples didn't stay the same. So eventually, what happened was that I had lots of complex, really good, sophisticated feature engineering that worked well, and after that, the model was trained on random labels, because the samples were just mixed up among themselves. Luckily, I did check it again just before production, and obviously sorted the labels and the samples before deploying. And another bug, which I'm not sure is a classical bug, it's in the area of data integrity, but it certainly was a weird or surprising one: after working a few weeks on a project, we were kind of stuck at a specific accuracy level and trying to understand how we could improve it. So, rechecking the data and joining it with additional data sources, I suddenly saw that some samples had different labels, like people both doing a certain action and not doing a certain action, which didn't make sense. And trying to understand what happened there, I understood that something like 30% of the data samples were of bad people.
Liran Haimovitch 22:24
So, I see those people.
Shir Chorev 22:26
Yeah, exactly. That did improve the model a lot eventually, but yeah, it was a nice one.
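Shir's first bug, features and labels silently reordered relative to each other after switching frameworks, is cheap to guard against when both sides carry an explicit index. A minimal, hypothetical sketch:

```python
# Minimal sketch: keep features and labels keyed by the same index and assert
# alignment right before training. Variable names are hypothetical.
import pandas as pd

def align_or_fail(X: pd.DataFrame, y: pd.Series):
    assert X.index.is_unique and y.index.is_unique, "duplicate sample ids"
    assert set(X.index) == set(y.index), "features and labels cover different samples"
    y = y.reindex(X.index)  # force the same order, regardless of upstream shuffling
    return X, y

# X, y = align_or_fail(engineered_features, labels)
# model.fit(X, y)
```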
Liran Haimovitch 22:33
Awesome. Thanks to you. It was a pleasure having you on the show.
Shir Chorev 22:36
Thank you for having me.
Liran Haimovitch 22:42
So, that's a wrap on another episode of The Production - First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show. And reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.