The Production-First Mindset

Oribi's Eynav Mass - Being Accountable From End to End

February 27, 2022 Liran Haimovitch Episode 31

Rookout CTO Liran Haimovitch sits down with Eynav Mass, VP R&D at Oribi. They discuss what Oribi is all about, adapting the SDLC to the data that works in production, the culture required to move fast and take risks, and how they know what’s going on in production.

SPEAKERS
Liran Haimovitch, Eynav Mass

Liran Haimovitch  00:02
Welcome to the Production-First Mindset, a podcast where we discuss the world of building code from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive real customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and co-founder of Rookout. Today is going to be all about data-intensive applications and operating without staging environments. With us is Eynav Mass, VP R&D at Oribi. Thank you for joining us and welcome to the show.
 
Eynav Mass  00:41
Great being here. Thank you for inviting me, Liran.
 
Liran Haimovitch  00:44
Eynav, can you tell us a little bit about yourself?

Eynav Mass  00:47
Sure. So, I've been leading Oribi’s engineering for more than two years now. It's a very talented group of people, and we're mainly working on data-related technologies.

Liran Haimovitch  00:58
So what is Oribi all about?

Eynav Mass  01:00
Oribi is a big data product. Let's start from that. We provide website event data as a service, creating deep insights on our customers' websites, mainly by collecting all website actions automatically.

Liran Haimovitch  01:16
Website events, big data, those are some big words. What do you do? What do your customers get out of the product?
 
Eynav Mass  01:23
So the core product IP is the ability to detect meaningful events on the website. There are so many actions going on in a website: we click on something, we open a page. We're doing that in real time; we're getting all these actions at high scale, and processing them to get insights on top of those actions.
 
Liran Haimovitch  01:45
So essentially, instead of having to add Google Analytics to a specific event, and then manually, kind of, adding and removing and instrumenting that, you kind of collect everything all the time, and then you make sense out of it.

Eynav Mass  01:58
Exactly.

Liran Haimovitch  01:59
That's interesting. So what challenges do you face, kind of engineering-wise, while doing that?

Eynav Mass  02:05
There are many. But if I had to pick, the most challenging thing is that Oribi faces a high increase in traffic all the time. As our customers' websites are dynamic, their traffic is eventually our traffic. So we have both scale peaks and an incremental scale-up of all this traffic coming into the system.
 
Liran Haimovitch  02:27
So you have to deal both with, you know, the system doing 10%, 50%, whatever, month over month, year over year, growing rapidly, and at the same time you also have to worry about those peaks, you know, Black Friday, or Super Bowl ads or whatever, when all of a sudden some websites are doing 2x or 10x their usual traffic.
 
Eynav Mass  02:48
Exactly. Sometimes it's just a new website installing Oribi. And then, there's more and more traffic. So it keeps on changing dynamically. It's like you're running a marathon at a constant pace, and you keep getting this pace higher and higher. And in the meantime, people are telling you now jump, now do, like, a flic-flac, and stuff like that. So that's the way it feels.

Liran Haimovitch  03:13
And how are you managing all the data coming into the system?
 
Eynav Mass  03:16
So we're really proud of the system that we built. We have a big data processing pipeline running on the cloud. Those are big words. But eventually, it's a lot of components, elastic ones: it's microservices, data solutions, storage, and more technology elements that support elasticity of incoming data. They can grow horizontally, they can grow vertically, to support the dynamically changing traffic.

Liran Haimovitch  03:46
Now, besides the scale of the data, what other challenges are you facing as you're ingesting and analyzing the data?

Eynav Mass  03:53
We want to create qualified insights. So it should not be a matter of only collecting the data; you want to bring value to your customers on top of this data. And there are others, like you mentioned, Google Analytics, and there are more. We're not in a void in this product business area. So we want to raise the bar, by doing the automatic collection, but also by bringing these valuable insights on top of this data.

Liran Haimovitch  04:26
So how do you, kind of, extract the insights out of the raw data streams?

Eynav Mass  04:32
So we have a great product, and the insights are features, part of this product. For example, say you want to know about some sequence of actions on your website: some visitors sign up, and then register for some blog, and then listen to the podcast, and you want to see what the ratio of that is. Where do you lose people? What's the bounce rate? What eventually brings people to listen to more and more podcast chapters? So, we can create a lot of those insights out of this data automatically for you. And you don't need any BI people or developers; those are great people, but you can just get into the Oribi dashboard and get those insights instantly. And if I go back to the challenge, it's computing these insights along with the incremental scale. So you also write a lot of data and query a lot of data, and when you combine that, it can be like a data party, the craziness of, you know, combining scale with performant queries.
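
The sign-up, blog registration, and podcast sequence described above is a funnel, and the "where do you lose people" question is a drop-off ratio between consecutive steps. The following is a minimal Python sketch of that calculation, not Oribi's implementation: the event names, the (visitor, timestamp, event) tuple layout, and the function names `funnel_counts` and `drop_off` are all assumptions made for illustration.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count how many visitors reached each funnel step, in order.

    events: iterable of (visitor_id, timestamp, event_name) tuples (assumed layout).
    steps:  ordered list of event names that make up the funnel.
    """
    # Group event names per visitor, ordered by timestamp.
    by_visitor = defaultdict(list)
    for visitor_id, _timestamp, name in sorted(events, key=lambda e: (e[0], e[1])):
        by_visitor[visitor_id].append(name)

    reached = [0] * len(steps)
    for names in by_visitor.values():
        step = 0  # index of the next funnel step this visitor has to hit
        for name in names:
            if step < len(steps) and name == steps[step]:
                reached[step] += 1
                step += 1
    return reached


def drop_off(reached):
    """Fraction of visitors lost between each pair of consecutive steps."""
    return [
        1 - (nxt / prev) if prev else 0.0
        for prev, nxt in zip(reached, reached[1:])
    ]


# Hypothetical sample data: three visitors, only one completes the funnel.
events = [
    ("v1", 1, "sign_up"), ("v1", 2, "subscribe_blog"), ("v1", 3, "play_podcast"),
    ("v2", 1, "sign_up"), ("v2", 2, "play_podcast"),
    ("v3", 1, "sign_up"), ("v3", 5, "subscribe_blog"),
]
steps = ["sign_up", "subscribe_blog", "play_podcast"]

reached = funnel_counts(events, steps)
losses = drop_off(reached)
```

With this sample data, `reached` holds one count per step and `losses` holds the share of visitors who did not continue from one step to the next. At the scale discussed in the conversation, the same logic would presumably run inside a data store or stream processor rather than in memory.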

Liran Haimovitch  05:45
So those features, for instance, you would automatically identify events, such as signing up for the website or registering for a blog post, and then kind of automatically extract those features from all the traffic?

Eynav Mass  05:58
Yeah, exactly. And we're doing that by using many different data solutions, so we can maximize for each feature and insight. The way we store the data is very close to the way that you will query it to get the insights. So it's also-- it's not only a data stream, it's also multiple data solutions per feature to get those insights.

Liran Haimovitch  06:23
Now, one of the challenges in working with data is that it doesn't really exist in the lab, it doesn't really exist on your laptop, at least not high-quality, high-volume data. So how do you adapt your software development lifecycle to focus on the data that's available only in production?
 
Eynav Mass  06:41
First, can I mention scale again? How many times we'll say "scale" in this, it will be interesting. So everything, every change, every improvement, or new design of part of the system, must have a notion of scale. So probably people that hear us now are thinking, what is scale? So scale is really relative, right? I can say it's not a matter of competition, who sleeps less at night or who is bigger. In every system, there is a notion of ratio, usually a matter of the amounts that the system handles. So it's funny, I searched for the definition of scale on the web, and I got many results. One of them was: the definition of large scale refers to something that is grand or that is big. An example of large scale is a wedding to which you invite 300 people.

Liran Haimovitch  07:38
Oh, so that's a large scale wedding.

Eynav Mass  07:40
Yeah, probably not the Israeli weddings. But the conclusion is, don't look for others’ definitions of scale. If you feel that your system is handling large amounts of data, keep on with that and plan for ongoing scale. And I think, if I go back to your question on the software development lifecycle, this is the most challenging thing, because you don't just plan a UI component, or some new service or refactor; everything gets another big matter of doing that at scale.

Liran Haimovitch  08:16
Now, early in the intro, I mentioned that you guys don't have staging environments. Is that scale part of the reason for that?

Eynav Mass  08:22
Yeah, it is. Actually, it's one of the assumptions. Because in the big data world, staging is not magic. In Oribi, for example, we have a large data processing pipeline, as I mentioned, and dynamically scaling it out according to traffic while in staging, that's a challenge.
 
Liran Haimovitch  08:44
At the beginning of the episode, I mentioned that you at Oribi don't have a staging environment. And I'm wondering, is that because of the scale you're always obsessing about? Because of the amount of data you're processing?

Eynav Mass  08:56
Right. It's one of the main reasons. In the big data world, staging is not magic. In Oribi, for example, we have-- we just mentioned a large data processing pipeline, which dynamically scales in and out according to the traffic. So it has functionalities that behave according to the data flows, with various data amounts. To take that to staging, you need to build something that is a breathing system on staging as well, to get such an amount of data into it. Because most of our product features are eventually calculated on top of large data sets, like turning the data into insights. Now, assume that we duplicate all of the core of it into the staging environment, to have that amount of data and get value out of the staging. Because if not, the staging will become useless, if it doesn't get, like, a real stream of data. Creating such a data stream involves high volumes and is not trivial. We're talking about cost, resources, someone needing to maintain yet another duplicated environment to make that impactful, actually.

Liran Haimovitch  10:17
So today, you're essentially doing everything in production?

Eynav Mass  10:21
Kind of. You know, I want to give great credit to the team, because we don't have a lot of failures once we get to production. So working without the staging part is still working well for us. And there's a reason for that, because we're not only working on production; we have more methodologies. For example, we have unit tests, we have integration tests between services, between the UI and the UI backend, and many other circles of confidence to eventually get to production: code review, and so on. Maybe the interesting one is that in Oribi, we're heavily using the product ourselves. So if eventually we missed something, okay, we got into production, we did all the component tests, all the test methodologies, and so on, and something got missed, we'll get a very quick note about that from our own company members, because they are using Oribi as well.

Liran Haimovitch  10:33
So you've brought up, you know, quality and testing. So kind of, where do you focus your efforts? I mean, of all the efforts, you have unit testing, integration tests, manual testing, user testing. Kind of, where do you put your chips?

Eynav Mass  11:47
It's not in one place. I think it might not be a good approach to put everything in one place. We have to, like, calculate the risk management. Unit tests are easy, so our goal is to cover the maximum that we can with the unit tests. On top of that, we add component and integration tests where it makes sense. So that has a smaller portion, only on the more critical path. And as you go to more and more components that are involved in the testing, the portions get smaller. At the end of all of it, we also have manual QA to get some, you know, quality assurance on that. And it helped us not to put all of our confidence in one place, and to mitigate the risk.
 
Liran Haimovitch  12:41
Now, it's interesting, you mentioned manual QA, because it's pretty much fallen out of favor over the past few years. So kind of, where do you use it? Where do you find it most beneficial for you?

Eynav Mass  12:53
Well, just a side note, I think what we're just-- We are recording now. My vision is not to be popular, but to make the best out of what you can for the team, product, and business. So we don't have staging, for example, because it won't bring value. And if I calculate time and cost, the value will be really small compared with the impact. Same with end-to-end tests. We're running really fast. We call these startups, right? We like to say that, but our sprints are, we're running really fast. And when you run fast on your application, and you have end-to-end, for example, everyone says, "I have fully covered automation," or QA and so on. From my perspective, on our UI, if I take that approach, my developers will spend more than 60 percent of their time doing automation. Now again, it's a matter of cost and value. So we kept the end-to-end tests on the most critical path. But all the features that evolve all the time will be covered by manual QA. And that increased our velocity by more than three, like 3x the velocity of the system, with just one manual QA, because we had got into a place where we did more automation tests and refactoring than actual coding.

Liran Haimovitch  14:18
Yeah, I mean, especially for the front end, end to end test can be very complex, right? Very, very fragile. And they can be very finicky. And so, yes we can often spend so much time maintaining them and troubleshooting them instead of actually doing value-generating work. Obviously, depends on your use case and the technologies you're using… I definitely see some areas where manual testing can carry the weight quite easily, especially if you can afford some regression.

Eynav Mass  14:47
And I think the last point that you just said, it's gold. We can't point to 100 percent and say we want to have staging, we want to have full coverage of automation, and we also want to be really fast. You need to eventually be a bit tolerant on something. So we have some tolerance measurement on regressions. And we have a KPI of what portion of regressions we can live with. And everyone knows that there might be that portion. But it's okay. Because again, if you do the cost per value, that enables us to do what we are doing today.
 
Liran Haimovitch  15:27
I know Google very much popularized the concept of SLOs, of service level objectives. And traditionally, it's around, you know, availability, around uptime or latency, kind of setting your metrics for being there. And what you're doing is kind of setting an SLO for quality. You're saying we can have up to two regressions a month or a quarter, because it's not going to make a big impact on the business and it allows us to move that much faster.
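
The "SLO for quality" idea can be read as an error budget counted in regressions rather than in downtime. Below is a minimal Python sketch under that reading. The budget of two regressions per month comes from the example in the conversation; the function name `regression_budget`, the result fields, and the sample dates are hypothetical and not taken from Oribi's actual KPI.

```python
from datetime import date


def regression_budget(regressions, budget, year, month):
    """Compare the regressions logged in one month against an agreed budget.

    regressions: list of dates on which a regression reached production.
    budget:      number of regressions per month the team agreed to tolerate.
    """
    used = sum(1 for d in regressions if (d.year, d.month) == (year, month))
    return {
        "used": used,
        "remaining": max(budget - used, 0),
        "exceeded": used > budget,
    }


# Hypothetical regression log, for illustration only.
regressions = [date(2022, 1, 4), date(2022, 1, 19), date(2022, 2, 2)]

january = regression_budget(regressions, budget=2, year=2022, month=1)
february = regression_budget(regressions, budget=2, year=2022, month=2)
```

In this sketch a month that uses its whole budget is still within the objective; only going over it sets `exceeded`, which a team might treat as a signal to slow down and invest in more testing.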
 
Eynav Mass  15:53
Yes, exactly. And again, it might be less popular, but like seeing retrospectively, what we've done in the last two years that gave us like, amazing freedom to be in the place that we are at.

Liran Haimovitch  16:06
What kind of culture is required to move so fast and take those risks?

Eynav Mass  16:11
Your questions are amazing. It's interesting, because every time I stop to think about it, mainly you need to have a very talented team with an ownership kind of approach. So I think that limited me in hiring juniors, for example, because you can't-- There's a lot of frustration in getting someone onto the team and saying, hey, you don't have end-to-end, you don't have staging, you need to be very responsible for your code, etc. And then, good luck, production in two days, we will meet you there. So it's not-- That's rough, right?

Liran Haimovitch  16:49
Yeah, walking without a safety line can be challenging when you're new to the company, to the concept, to the tech.

Eynav Mass  16:56
Yeah, it's like, jump out of the airplane and get your parachute, create your parachute while you're diving down. So, it means that we aim for seniors; we have a fully senior team. And I think seniority is not a matter of time, like how long your experience is. It's a lot about culture, and the ability to be fully accountable for your task. So it's your ownership, it's your responsibility. And this is something that I work on with my group, and it's the highest value for the group: to be accountable, from end to end, like data, UI, QA, DevOps, on the tasks, and also to be very collaborative on that and to bring the clarity to stay on time. There's a risk, we'll hit the timeline, we'll miss the timeline, I need more help, and so on.
 
Liran Haimovitch  17:59
There seems to be so much going on in your production environment, especially with so many deploys going there without going through staging or another testing environment. So kind of... how do you monitor what's going on in production?

Eynav Mass  18:12
We have multiple monitoring levels. Again, we invested a lot of time, and it gave us the safety to go like we do, and also to track problems on time. So we sleep at night, we sleep well. In the past year, we had only two meaningful kinds of downtime on the system, which were short eventually. And the monitoring helps us, and there are many levels. So, everything is monitored. It's not like CPU and memory. It's more like all the functionality of the system: it can be average query time, or the concurrency of the thread pool, everything is monitored. It helps us to react on time to scale peaks and incremental peaks as well.

Liran Haimovitch  19:04
Some pretty impressive stuff. Now, there is one question I love asking all of my guests, since I'm dealing a lot with debugging at Rookout. And I have to wonder, what's the bug that you remember the most from Oribi or from earlier on in your career?

Eynav Mass  19:19
Okay, it's interesting. I wonder if someone has ever said to you, "I don't have any." But we do, you know, we're human. And we build for that; we have the KPIs that allow us some percentage of failures. We had a human mistake, actually. We are using Elasticsearch extensively in our system; that's a very powerful data store. I think it's also more and more common in Israeli tech, and in general worldwide. And we did some change. We don't have staging, and someone from the team eventually got to delete all of our Elasticsearch clusters.

Liran Haimovitch  20:02
Mhmm, nice.

Eynav Mass  20:03
They called me on the phone. They said, hey Eynav, maybe you want to sit down, and I was really panicked, and they said we just-- We didn't know it's possible, but we just removed all the Elasticsearch clusters, all of them. And for two hours, I waited for someone to say, it's a joke, we fooled you. But it didn't come, and that was a tough one. And we handled that really quickly. Like, in six hours, we were back and everything was flowing. And the communication with the customers, all of that, I was really proud of the team. And we learned a lot from that. But we also learned that someone can accidentally remove data stores, which is bad, and that recovering from backup is not a matter of hours. So we did some, like, creative stuff there to get back on our feet. And I learned that the team can handle that, leaving aside that this is, like, a crazy mistake. But everyone took a lot of responsibility and ownership. And I watched that from the side. Again, I thought I was fooled, so I also waited for someone to say that, but I watched from the side and I was really, like, really proud of it.

Liran Haimovitch  21:22
Are you hiring at Oribi?
 
Eynav Mass  21:24
We are hiring all the time. I think everyone is hiring, right? It's weird if you're not, in Israel. But we are hiring mainly backend people that love data.

Liran Haimovitch  21:35
Amazing stuff. So come to Oribi if you're interested in big data and in working without a staging environment.

Eynav Mass  21:41
Yeah, and it will be great fun and interesting, I promise.

Liran Haimovitch  21:51
So that's a wrap on another episode of the Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.