The Production-First Mindset

Jobvite's Ron Teeter - Own Your Code With Fullstack Observability

November 22, 2021 Liran Haimovitch Episode 17

Rookout CTO Liran Haimovitch sits down with Ron Teeter, Vice President of Engineering at Jobvite. They discuss how valuable diversity is in tech, where the trigger point is on new technology, tech debt (how much you can afford to borrow vs. what you need to pay off), why your devs need full stack observability and how to get it, and how Rookout has helped their team cut down hours of waiting time.

Speaker 1:

Welcome to the Production-First Mindset, a podcast where we discuss the world of building code, from the lab all the way to production. We explore the tactics, methodologies, and metrics used to drive customer value by the engineering leaders actually doing it. I'm your host, Liran Haimovitch, CTO and co-founder of Rookout. Today, we're going to be discussing one of the most interesting HR tech companies out there: Jobvite. Joining us is Ron Teeter, VP of engineering and chief architect at Jobvite. Thank you for joining us.

Speaker 2:

Thanks Liran. Glad to be here. And I've been looking forward to it for a while.

Speaker 1:

Thanks, Ron. So what can you tell us about yourself?

Speaker 2:

I am a 30-year veteran software executive. I started as a programmer in QA and on through architect. Predominantly my focus has been on software as a service since early 2000, going from startup to startup, building new tech products, and then learning how to scale with that, you know, initial onrush of new customers on through sale. Everybody wants to see a good outcome for your company.

Speaker 1:

Definitely. It's always, you know, you're always working very hard to make sure things turn out well for whatever company you're working with. And you've been with Jobvite for quite a while now.

Speaker 2:

Yeah, it's been eight years. I joined just before the end of the year in 2013, at a really interesting time in the company's history. Uh, we were on premise at that time, you know, the team had been working on a major user interface overhaul, and so it was a really interesting time to join. We ended up moving to public cloud the following year, which was transformative for us.

Speaker 1:

Tell us a bit more about Jobvite's core business.

Speaker 2:

Sure. Jobvite builds and services a talent acquisition suite that is primarily focused on recruiting talent. There are a lot of HR companies out there. We're one of the few that only do recruiting. So that's, you know, finding candidates, enticing them to apply, qualifying them, hiring them, and then getting them onboarded. You know, we kind of call it, you know, from your very first touch to your first day at work. That's the part that we want to be excellent at. And we've got a variety of products and offerings for very small companies, all the way up to very, very large enterprises.

Speaker 1:

Now you mentioned you have an entire suite of many applications, and I know you have a pretty diverse tech stack behind the scenes powering those various applications. How did that come to be?

Speaker 2:

It's an interesting observation, and there are really two primary factors to it, right? You can think about company maturity, right? As a company starts off, you're in a very early, aggressive build-it-and-ship-it phase, and certain technologies and certain designs lend themselves to rapid iteration better than others, right? Like it's very difficult to, you know, especially in the early two thousands; you know, 2005 is when Jobvite was founded. So at that point in time, you know, there weren't public clouds, there weren't microservices, there weren't functions as a service. So you tended to build more monolithic, easy-to-test, highly reliable, you know, monoliths. Over time, that doesn't work so well, right? You need more agility as you grow an organization; it's harder and harder to work on the same code base with more developers. So you need to start breaking it down. And that's exactly what we did at Jobvite. We started breaking it down, and when you break it down, you have choices. Do I keep working in the same technology, or do I leverage the right tool for the job at the time? And so you see some of that just across a company's maturity cycle: there will be different technologies introduced at different time points. The other one is really going back to that diversity comment you made. We know diversity is valuable in human beings. Diversity is also valuable in technology, right? And I kind of alluded to it before. It's like, what's the right solution for the thing I'm trying to do? You know, I'm not going to try and do data science in C. I'm going to use Python because that's where all the libraries are. That's where all of the community support is that is going to give me the largest return on my investment. So you have to look at the problem space and say, what's the right solution here? The third factor is really what can you hire?
That's really predominant right now, where there's this colossal shortfall in available staff for the number of positions and the urgency with which companies are hiring right now. You know, it's something that we are well aware of, and we worry just like other companies: how are we going to grow by 20 or 30% in the next six months? Where are we going to get those engineers from if we're competing for the same pool of people as everybody else? And so you have to say, well, you know, do I want to be going out there and trying to hire, you know, say, a Python engineer in India? That's really hard to find right now. Trying to hire a Ruby engineer in India is nearly impossible. So you have to be cognizant of, like, where are you trying to source, and what kinds of skill sets are available in that market? Maybe that's the right thing to do, because ultimately your customers don't care what your tech stack is written on. They care about the reliability, and they care that you're delivering on a predictable cadence.

Speaker 1:

That makes perfect sense. Now I know it's something we're all struggling with, and I would love to hear your opinion on it. So, kind of, in what cases do you aim to have everybody on the same tech stack or making similar choices, versus when do you decide to allow engineers or teams to pick up their own technology?

Speaker 2:

I mean, ultimately it comes from, is it something that we can support and sustain in the future? Or is it something that's just fresh and novel? You know, it's very difficult to know where the right trigger point is on a new programming language or a new technology. I would say that Jobvite tends to be a little forward leaning in terms of adopting new tech, as you're familiar with. You know, we love innovation, but we also recognize that, you know, well, 15% of your R&D total lifecycle cost is on creation. 85% is on maintenance. So, you know, designing for supportability and designing for long-term maintainability matters. A lot of, you know, early startups don't care, right? Because they're just trying to fight for survival. But once you get past your midlife and, you know, you're in that, you know, C or D funding round, it starts mattering, because it matters to the business economics what your ratio of maintenance is to new development. Some companies are lucky. They can build a killer product and they can own the market space early. And they're pretty safe. Other companies have a lot of really valid competition. And so you have to stay fresh and you have to continually, you know, listen to your customers and deliver new innovation. And you can't do that if your prior choices are slowing you down, right? It goes back to that tech debt: how much can you afford to borrow versus what you're going to pay off? You can borrow a lot when you're early in your life cycle, but you still have to pay it back at some point. And if you get into your later life cycles, you know, a mature company that's got, let's say, 15 years, you've got to pay those tech debts back, and you can't really borrow anymore until you do. And that's the balance where it's like, we have a wide range of technologies at Jobvite. Does it mean that I don't want more? Well, if it's the right one, it's fine.
But do I want to go out and start working in something that's small or not well supported in the community just to be bleeding edge because it's interesting? No, I don't really want to do that. But I like the fact that we are not mono technology. It gives our, our developers an opportunity to be multilingual. And I think, you know, we can all appreciate that's good for your career. It's interesting. It keeps the work fresh. If you can move between teams and demonstrate some proficiency across different technologies, you know, some of this is both in how you code, but it's also in how you deploy and host. And so, you know, we try to keep things fresh by, you know, moving from on-prem to public cloud, moving from public cloud to cloud-native, moving, you know, from EC2 to Kubernetes, constantly pushing ourselves to find a better, faster, cheaper way to do things is inherent not only for the business outcomes, but for the staff to feel engaged with where the pace of technology is. You know, we like to think that we're similar to a university, right? When people leave Jobvite we call it graduating. They've graduated. Maybe they stuck around for a bachelor's degree. Okay. They stayed for four years. Maybe they stayed for six and they got a master's. We think of it very similar, like people aren't going to come here and stay for the rest of their career. So as a leader, I'm looking at what can I give the staff members that they're going to take with them into their next job as an unforgettable part of their career. And that's the best hope I can have is that they're going to be graduating and moving forward relative to their peers and other companies. And we've got to keep it interesting for them.

Speaker 1:

Definitely. One of the things I've found will often keep people interested is keeping them focused on what's happening with the customer, what's happening in production. I mean, how is the app acting in the real world rather than just on their laptops? What do you think of that?

Speaker 2:

That's a great question. And it's something I'm incredibly passionate about. You know, I think we've seen this transition from the mid nineties where, you know, engineering did everything, and then we kind of broke it into, well, there's going to be operations and development. Right. And we spent a decade treating them as separate disciplines and creating walls around it. Now that we're trying to do this DevOps and we're trying to mash them back together, we're trying to take it a little step further. It's like, you know, a lot of companies have site reliability engineering teams, right? It's like, here's a team that is responsible for watching what's going on and, and fixing things. But they're not the team that writes the software that they're trying to run, right? It's more like a traditional operations or IT team where you're running someone else's software and your expertise is in running the software, not understanding how it was designed and built. And the corollary to that is the product team that designs and builds software doesn't have the runtime experience. And so they keep making some of the same scaling mistakes over and over again, because they're not connected to that feedback loop. At my prior company, we did an experiment where we rotated engineers through operations, right? To give them some of that real-world experience. And while it was effective, it was still less effective than what we're doing right now, which is that the product teams own production support. So you own every piece of code you write, and you support and maintain it from the moment it's written to the time it's deprecated. Doesn't matter whether it's in production, development, quality, staging. If there's a problem, that alert is going to your team, and somebody on your team is going to be on call and they're going to have to deal with it. What does that mean? It creates a huge amount of distraction for the product team, right? This was the first realization. Wow.
This is a lot of distraction for the product team. And it's like, yes, but you're the team that created this distracting work. So, why would anybody else receive this? Because if they're not the expert in it, they're just going to punt it to you. So that adds delay to the customer turnaround. And it creates this motion of knowledge from the team that created the software to a team that has to support and sustain it. Just put the support where it belongs and then work on driving the noise down. What is it that's causing this? Oh, it's that we don't have enough capacity? Oh, well, this function's not optimal. We'll optimize it. Right. Little tiny spikes that you can add to a sprint can go a long way to reducing the noise. And everyone hates being on-call. I hated being on-call. The motivating factor there isn't to demoralize people, but it's to incentivize them to own it and take responsibility for changing the outcome. Right? Most of the noise that you see is just that. It's noise. Why are you firing an alert on this case? It doesn't matter. It doesn't have a negative customer impact. Why aren't you firing an alert on this when it does have a negative customer impact? It's trying to get the product teams to understand operationality and connect them back to the customer outcome, right? Yes, it's an annoying bug, but you realize that there's a human being on the other side that's having a bad day because our software doesn't work. Right. And it could be a job seeker who just missed out on an application that, you know, could have landed them their next job. Or it could have been a recruiter that just missed an ideal candidate for that hiring manager. Right. We have to own the outcome of our software. And part of that is just understanding where it can break and owning it and being proactive about it organizationally. Right? Not tactically. And that's why we kind of chose this approach. It's like, it is hard to balance the unplanned work and the planned work.
But regardless of how you spend your resources, that's the job. You have to do it. If you add an SRE team, generally, you're just creating motion within the organization. They're not really helping as much as you think. At least, you know, that was, you know, some of the results of our early experiments. It can create focus for the product team, but it also creates this overhead of trying to make sure that the support team has all the knowledge that they need.

Speaker 1:

It can often hide distractions instead of fixing them.

Speaker 2:

Yes. We saw that. You know, we had a level three triage team at one point for the support issues, and it was actually counterproductive. It added cycle time and it produced lower-quality repairs. And so, again, you know, we experiment and we look at the results and we make iterations and change.

Speaker 1:

So I've got to ask, what kind of tools do you provide your developers with so that they can understand what's going on in production and fix those issues?

Speaker 2:

It starts with full stack observability, right? You have to know how the systems are working and how it's failing. You know, I'm a big believer that, you know, happy path software has to perform well, but it's even more important that when software doesn't perform well, it behaves itself. Most of the support costs around failure are because of a lack of a good plan for handling the error. You know, how many times have we seen a stack trace that says there was a problem here? And it's like, what problem? What was the data that was in context, right? And these are the kinds of things that drive us to look for best-in-class solutions. You know, so we start with observability. As you know, we're huge fans of Rookout. We gravitated to that as soon as it was available. I've been looking for a product like this for about 10 years, because it is so essential to be able to say, "I want to see what's happening right now without having access to the box, without having to run the production system in debug mode, without having to open ports," right? It's like when you start looking at your security and your access concerns, everything you're trying to do about getting more information to the engineers in the traditional sense is counterproductive. So, you know, when we look at initiatives like zero trust, you have to find a balance, and tools like Rookout were, you know, just perfectly placed in both time and solution to meet our need there. So, you know, we look at AppDynamics and New Relic, and we've got a smattering of tools as well through merger and acquisition, which is the other way you can get mixed technologies. But once you've got these good tools in place, it's really a, okay, there is a problem, but now I need to go get more information about it. And that's where logging and profiling and data collection is so important. We believe that we should be able to get the information to the engineers without getting the engineers to the infrastructure.
And as we move into Kubernetes, we're even going a step further. And it's like, no, you can't log in to the workers. No, you can't have access to the containers. You know, it's like, you've got to think about, you know, how are you going to get your data out, not how do I get the eyeballs in?
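
Editor's sketch (not from the conversation): Ron's point about stack traces that say "there was a problem here" without the data that was in context can be illustrated with a small Python example. The names ProcessingError and parse_quantity, and the record fields, are hypothetical and chosen only for illustration. The idea, under those assumptions, is simply that the error carries the values that were in scope when it failed, so the engineer does not need access to the box to find them.

```python
class ProcessingError(Exception):
    """Hypothetical error type that records the data in context at failure time."""

    def __init__(self, message, **context):
        super().__init__(message)
        self.context = context

    def __str__(self):
        # Render the context as key=value pairs in a stable (sorted) order.
        details = ", ".join(f"{k}={v!r}" for k, v in sorted(self.context.items()))
        return f"{self.args[0]} ({details})"


def parse_quantity(record):
    """Parse record['quantity'] as an int, attaching context if it fails."""
    try:
        return int(record["quantity"])
    except (KeyError, ValueError) as exc:
        # Chain the original exception so the low-level cause is not lost.
        raise ProcessingError(
            "could not parse quantity",
            record_id=record.get("id"),
            raw_quantity=record.get("quantity"),
        ) from exc
```

Calling parse_quantity on a record with a non-numeric quantity raises an error whose message includes the record id and the raw value, rather than only the location of the failure.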

Speaker 1:

I've got to say, you guys are some of my favorite customers, and I know many of our customers, many of our prospects are always wondering, kind of, what key business priorities does Rookout support. And I'm wondering, for you, what key business priorities did Rookout help you to advance?

Speaker 2:

I mean, the major one is simply cycle time, right? If I can get an issue up in Rookout, we can cut probably an hour off of data collection just to get to the point of, okay, we think, you know, yes, we know where the problem is and here's the context. Now we can reconstruct a reproducible test case. Having an extra hour on that cycle time is amazing, because that means that, you know, we can turn things around same day in many cases for really complex issues that in the past would take us several hours or even multiple days to zero in on, right? It gives us a degree of control that is well worth the investment. You know, we've got a variety of engineers on different tech platforms, and some of them have Rookout and some of them don't, and the ones that don't have Rookout are kind of tapping the clock going, "Hey, when are we going to get this on my stack?" Right? Because it's not fair. You know, I want that too. Which is always a great selling point. If you can get the engineers to understand why the tool is valuable, you know, it's something that's saving them time. It's improving the quality of their service. It's improving their turnaround time for customer outcomes. That always is going to be preferable.

Speaker 1:

Now I'm wondering since you have this interesting case where you have some teams using Rookout and some teams haven't deployed it yet, I'm kind of wondering what changes have you witnessed since deploying it, especially how do you see the differences between the two groups?

Speaker 2:

So the difference is mainly, you know, on the teams that have Rookout, an issue comes in and it just gets triaged and there's no noise. There's no churn, there's no cross-team chatter. Something comes in for a stack that doesn't have Rookout, it's, oh, hey, I need access to this production resource. Can you get me these logs? Can I get a heap dump from this host? Right. It instantly starts becoming a multi-team effort to collect evidence so that we can do the analysis. You know, the way we have things structured, the application teams own the user functionality, but the cloud team owns the hosting infrastructure. And so access to the hosting infrastructure for an application team member requires approvals. It requires, you know, access control modifications, et cetera. Sometimes it's easier just to get on a screen and have them work together. But again, now I'm consuming resources from both teams to solve one team's problem, and ideally we want to get away from that. I want to be able to empower every engineer to be able to go and diagnose a problem in minutes without all of that extra overhead and delay, because it allows everybody to be more productive. You know, the cloud team has planned work as well. So every time there's unplanned work coming in from another team, it's disruptive for them.

Speaker 1:

You've mentioned a lot about unplanned work and earlier about maintainability. So kind of wondering if you've been measuring those over the past few years, have you seen any difference?

Speaker 2:

Yeah, I mean, you know, as a mature company, you tend to have a higher maintenance ratio than new development. It is just the nature of it: the more code you write, the more of the code isn't changing. So the more code is in maintenance, if you will. And so when you make changes, you have a higher degree of regression. You know, Jobvite and our Talemetry product lines have both been around for about 15 years. So they are very mature code bases, fairly large. You know, so when you're talking about, you know, several million lines of code in each push, you know, your regression is large, because the change is in the tens of thousands of lines out of those millions. As we break down into microservices and we get comfortable with "we're only going to push code that changes," which is kind of, you know, on the other side of our Kubernetes transition, then you get the complexity of which versions have been tested together, but at least you're focused on your regression. I only need to worry about the smaller piece that contains the changes and its interfaces. And that's where observability plays a bigger role in the upstream environments. I should be able to detect those regressions in development, quality, staging before they get to production. And if I do see them in production, I can roll back, you know, in a minute on Kubernetes, where, you know, it would take an hour or two to do on EC2, right? So we have more fine-grained control over what we release, and we also have more control over the speed of release. And I think those things lend themselves very nicely to accelerated throughput.

Speaker 1:

For the past few months, we got to chat quite a bit. And one of the things we talked a lot about is some of the new features that came out of Rookout, stuff such as Live Logger, Live Profiler, Ruby, and Golang. Kind of wondering, from your perspective, what was the most exciting?

Speaker 2:

For us it was Ruby. You know, because we've been using the Java version for a while now, and having a Ruby version available for our Ruby product line is incredibly exciting. That's the one where it's kind of mixed, you know, it's like, we've got about 90% of the fleet deployed with Rookout. And the 10% that doesn't is the one where, you know, people are unhappy and they can't wait until the next release when that's supposed to go live. It's a transformation to watch. You know, people have been hearing about it and watching other team members work with it for a while now. But when you're the one who gets to go in there and set the breakpoint and start capturing data, it's transformative, because that's the moment when it clicks for the engineer and they say, oh, I get why Ron won't let me have access to the system anymore. I don't need it. And look at this, all of the stuff I'm not supposed to see is all redacted out. So, you know, I don't have to worry about, you know, customer data or, you know, secrets being exposed, because it's been redacted before it's been displayed to the engineer. Now I can let the engineers in with, you know, a high degree of surety that it's going to be safe and that, you know, our customers are going to be pleased with the way we're approaching this. We get faster turnarounds and no data leakage.
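
Editor's sketch (not from the conversation): the idea of redacting sensitive values before a captured snapshot is displayed to an engineer can be shown in a few lines of Python. This is not Rookout's implementation or API, which the conversation does not describe; the redact function, the SENSITIVE_KEYS set, and the mask string are all hypothetical, and real tools may well match on patterns or types rather than key names.

```python
# Hypothetical set of field names treated as sensitive in this sketch.
SENSITIVE_KEYS = frozenset({"password", "ssn", "api_key", "email"})


def redact(value, sensitive_keys=SENSITIVE_KEYS, mask="[REDACTED]"):
    """Return a copy of a nested dict/list structure with sensitive fields masked.

    The input is not modified; masking is decided by key name only.
    """
    if isinstance(value, dict):
        return {
            key: mask
            if str(key).lower() in sensitive_keys
            else redact(item, sensitive_keys, mask)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Sequences are walked element by element and returned as lists.
        return [redact(item, sensitive_keys, mask) for item in value]
    return value
```

Applied to a captured variable snapshot, this would leave fields such as a candidate's name visible while masking the email address and any API key, which is the "information to the engineers, not engineers to the infrastructure" trade-off Ron describes.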

Speaker 1:

It makes perfect sense. Before we wrap this up, there is one question I like asking all of my guests and I would love to hear your answer for. So what's the single bug in your career that you will remember the most?

Speaker 2:

Oh yeah. There are two that were big, but I'll take the one that I own. I guess I created a problem for one of my former companies, but I didn't create the bug. I just discovered it. So at Supply Solution, we provided inventory management visibility. So, imagine what you use for observability in production operations; we're doing the same thing for physical material that's in the logistics supply chain, and predominantly this was for North American automotive. It really was to some degree ahead of its time, because we were trying to drive vendor-managed inventory management versus push, which is kind of the way the North American teams worked. So, you know, really taking that inspiration from the Far East and saying, we can do this here too. And it doesn't mean you have to change your processes. You just have to use a tool and let your vendors do the right thing for you, you know, and trust that they're gonna get their stuff done on time. So obviously we had, you know, very large scale. We had some of the largest ERPs in the world feeding us data on a 10-minute, just, you know, supply quantities in bin X on this part of the floor and this part of the plant across numerous plants worldwide. And so, massive ingest rate. But the number of part SKUs was more limited. You know, it's just like there were duplicate bins. And so what we were trying to do is, you know, create this aggregation, then the aggregation would feed back into suggested deliveries. And it's very important that the information be timely in order to make that feedback loop work. And we were doing an architectural change to improve our scale, and we added an aggregation step. So all of this stuff comes in. I'm going to let it queue up. And then I'm going to just run the calculation once instead of making it purely event driven. It's a great way to collapse load. But one of the things that we chose to do is, it was a database long poll.
And I said, well, it would make more sense if we move this to a queue, you know, and say, I can collapse this and put a message on the queue and then go back into the long poll loop. And then that way the code that's long polling the database isn't the one that's doing the calculation. And, I mean, timing error. And so what happened was events were going onto the queue faster than they were being consumed, because part of the throttling was the fact that the long poll thread was doing the calculation. So it couldn't long poll any faster than it could consume, but we didn't have enough capacity on the other side to do the calculations fast enough. This was on-prem again, right? Today, it would be just scale up and the problem would be gone in seconds, right? But when you're on finite hardware, you can't just scale up. You know, that's a four-month effort to procure new hardware and schedule it for installation and get it configured. So the result of all of that was that the queue got backlogged, and the queue was backlogged so far that the suggested delivery quantities were significantly off. The nutshell of all of this was it ended up costing the company almost $500,000 in error processing, because people were building expensive things and shipping them when, no, they didn't need to. It's like, simple mistakes can have real-world consequences, and you have to have a good simulation environment. And that's the point when I said, you know, there's gotta be a better way to do this than what we're doing now with the on-prem. From that moment on, you know, as I was watching the public clouds develop, it's like, that's where we need to be. Right. I want to get back to that nearly infinite compute capability that, you know, we used to be able to do things that were just amazing on the mainframe for the day, because the mainframe had so much amazing capacity compared to what we had locally. Right. And I think about that: the clouds are just huge mainframes in essence, right?
There's, we're using distributed technologies to manage them instead of mainframe technologies, but it's memory and disk and compute and network, and you can use as much of it as you're willing to spend. And that's really the goal. Right? And the nice thing about it is you don't have to buy, you can rent it, so you can build a supercomputer and run it for two or three hours and then tear it all down quite cost-effectively if you wanted to. And that's, I think, the area where, you know, I think the current state of the art is so interesting. It's a great price-performance period.
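
Editor's sketch (not from the conversation): the queue backlog Ron describes is a classic loss of backpressure. When the polling thread also did the calculation, it could not poll faster than it could consume; once an unbounded queue sat between the two, that implicit throttle disappeared. The Python below is a minimal illustration of one common remedy, a bounded queue whose put() blocks when the worker falls behind. It is not the system from the story; run_pipeline, max_backlog, and the (part, quantity) event shape are hypothetical.

```python
import queue
import threading


def run_pipeline(events, max_backlog=5):
    """Feed (part, quantity) events through a bounded queue to one worker.

    Returns (totals, peak_backlog). Because the queue is bounded, the
    producer blocks instead of letting the backlog grow without limit.
    """
    backlog = queue.Queue(maxsize=max_backlog)  # put() blocks when full
    totals = {}
    done = object()  # sentinel telling the worker to stop

    def worker():
        # Aggregation step: sum quantities per part.
        while True:
            item = backlog.get()
            if item is done:
                break
            part, qty = item
            totals[part] = totals.get(part, 0) + qty

    thread = threading.Thread(target=worker)
    thread.start()

    peak = 0
    for event in events:
        backlog.put(event)  # waits here if the worker is behind
        peak = max(peak, backlog.qsize())
    backlog.put(done)
    thread.join()
    return totals, peak
```

With an unbounded queue (maxsize=0) the same code would accept events as fast as they arrive, and stale aggregates would be the result when the worker cannot keep up, which is the failure mode in the story.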

Speaker 1:

Also great for the environment, obviously. Not everybody having to buy their own computers.

Speaker 2:

Oh yeah. I think we're getting there. You know, it's just like, depending on how they're powered and how they're cooled, I think those are the things that, you know, we need to be a little bit more ecologically conscious about. But, you know, it's like, when I start seeing people using, you know, old coal generators to power Bitcoin, I'm just kind of scratching my head going, really? That, that doesn't feel good to me.

Speaker 1:

Yeah. Not cool guys. Don't do that. Don't use coal to power your Bitcoin mining. You're destroying the world for all of us.

Speaker 2:

It's not my favorite thing. I know that people are looking at kind of restarting some nuclear plants as well. And it's like, yeah, there's a price point where Bitcoin makes sense, but I've never understood the economics of Bitcoin yet. It seems very speculative.

Speaker 1:

Maybe I'll do a podcast on that someday.

Speaker 2:

That would be fantastic. I would listen to it.

Speaker 1:

Great Ron. Thank you very much for joining us. It was great having you on the show.

Speaker 2:

Thank you, Liran. Again, I'm a huge fan and really pleased and honored to join you today and, uh, you know, keep doing cool things with Rookout. We love it.

Speaker 1:

Thanks.

Speaker 1:

So that's a wrap on another episode of the Production-First Mindset. Please remember to like, subscribe, and share this podcast. Let us know what you think of the show and reach out to me on LinkedIn or Twitter at @productionfirst. Thanks again for joining us.
