Benedikt Koller: DevOps Principles for ML

Benedikt discusses common problems faced by teams putting machine learning into production, bringing over best practices from DevOps to solve them, and building ZenML, an open source MLOps framework.

Guest Bio

Benedikt Koller is a self-professed "Ops guy", having spent over 12 years in roles such as DevOps engineer, platform engineer, and infrastructure tech lead at companies like Stylight and Talentry, in addition to running his own consultancy, KEMB. He recently dove headfirst into the world of ML, where he hopes to bring his extensive ops knowledge to the field as co-founder of Maiot, the company behind ZenML, an open source MLOps framework.

Show Notes

  • 02:15 How Ben got started in computer science
    • Fully self-taught
    • Started as a SysAdmin running a Linux datacenter
    • Has worked at a few startups since then, giving him a front-row view of how software deployment has evolved
  • 05:30 What the "DevOps revolution" was
    • Shift towards a philosophy of "You build it, you run it" from Werner Vogels at Amazon
      • Tools and practices to enable software engineers to own their code end-to-end
    • "Your responsibility does not end with writing the last semi-colon on your JavaScript code. Your responsibility ends when your JavaScript code is rendered in the browser of your customer and the behavior was successful."
    • "The value that you're trying to create is only created when you reach production, when your model reaches customers or when your code is deployed on the website. A feature that's not deployed is worthless."
  • 10:10 Bringing good Ops practices into ML projects
    • Feature stores eliminating duplication of transformation code / environment
    • Versioning data as similar to code
    • Current challenge: bringing observability to ML applications
      • "Developing your application with monitoring in mind so that [it] is communicating its state in a very coordinated way to the outside."
      • Recording standard metrics (requests/sec, resource utilization, etc.) as well as ML/data-specific ones
        • Inputs and outputs of models
        • Statistical tests to detect data drift (see the drift-detection sketch after these notes)
    • Ensure reproducibility
      • How can data scientists make sure that a model is behaving the same in production as it was in training and evaluation?
    • Abstract away runtime environments
      • Ideally would be able to write code that easily scales to larger amounts of data
        • Both vertically (GPUs) and horizontally (distributed) if needed
    • Automate as much as you can
      • Forcing function for good practices
        • Can't just use hard-coded notebooks or scripts
  • 30:50 Pivoting from vehicle predictive analytics to open source ML tooling
    • Maiot was originally using ML to predict when commercial vehicles would need to be maintained
    • Discovered that it wasn't a problem customers would actually pay enough to have solved
    • Along the way they had built out fairly good ML infrastructure that solved common problems faced by other ML teams/companies
    • Cleaned up the codebase, made it more generalizable, and recently open-sourced it
  • 34:35 Design decisions made in ZenML
    • A pipeline abstraction that provides high flexibility in what gets used at each step
      • Abstracted runtime environment
        • Can run locally or on powerful cloud instances
      • Caching intermediate results (see the caching sketch after these notes)
    • Basic building blocks of an ML process: data transformations, splitting, training, evaluation, deployment
  • 39:20 Most common problems faced by applied ML teams
    • Over-specification that makes it harder to adapt to new situations later
      • Data source
      • Execution environment
      • Dependency versions
  • 49:00 The importance of separating configurations from code
    • Should be able to run the same code in different environments, on different data, with different data splits, evaluated on different metrics, etc.
    • Same config should produce the same result every time
    • A good declarative config should tell you exactly what was executed and what actually happened (see the config sketch after these notes)
  • 55:25 Resources Ben recommends for learning Ops
    • Watch conference talks about how other companies and teams do things
  • 57:30 What to monitor in an ML pipeline
    • Don't reinvent the wheel: use open source tools if possible, only build your own if absolutely needed
    • Start by monitoring the classical app metrics, then move on to input data drift detection (see the monitoring sketch after these notes)
  • 01:00:45 Why you should run experiments in automated pipelines
    • Makes it extremely easy to scale
      • Possible to manually keep track of maybe 10 parallel experiments, but definitely not 100
    • Can resume failed steps without re-running the entire pipeline
    • Analogy: feature stores give you trust in your input data, training pipelines give you trust in your model
  • 01:10:25 Building an open source business and what's next for ZenML
    • "Open source just gives us the ability to really provide the value of our vision in as many organizations as possible."
    • Biggest roadmap item is to add more integrations
  • 01:20:20 Rapid fire questions
    • For fun: training MMA
    • Books: Something Deeply Hidden, The Expanse, The Three-Body Problem, Extreme Ownership
    • Advice: Optimize for what works, stay curious about adjacent fields
    • Recently changed mind: statistics is really important
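
Code Sketches

The sketches below are editorial illustrations added to these notes; they are not code discussed in the episode. First, a minimal example of the kind of statistical test mentioned around 10:10 for catching input data drift. It uses SciPy's two-sample Kolmogorov-Smirnov test; the feature values, sample sizes, and significance threshold are all made up for the example.

```python
# Sketch: detecting input data drift with a two-sample Kolmogorov-Smirnov test.
# Illustrative only -- the sample data and the alpha threshold are invented.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Compare a window of production inputs against the training-time reference sample."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha  # small p-value -> distributions likely differ -> drift

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference distribution
production_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted mean in production

if detect_drift(training_feature, production_feature):
    print("Data drift detected -- alert and consider retraining")
```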
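
Next, a toy pipeline runner that caches intermediate step results keyed on each step's inputs, in the spirit of the caching and resumability points at 34:35 and 01:00:45. This is a generic illustration only: it does not reflect ZenML's actual API, and the step functions, data source string, and cache layout are invented.

```python
# Sketch: a minimal pipeline runner that caches step outputs on disk, so repeated
# or resumed runs skip steps whose inputs haven't changed. Not ZenML's API.
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".pipeline_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_step(func, *inputs):
    """Run a step, reusing a cached result when the same inputs were seen before."""
    key = hashlib.sha256(pickle.dumps((func.__name__, inputs))).hexdigest()
    cache_file = CACHE_DIR / f"{key}.pkl"
    if cache_file.exists():
        print(f"[cache hit] {func.__name__}")
        return pickle.loads(cache_file.read_bytes())
    result = func(*inputs)
    cache_file.write_bytes(pickle.dumps(result))
    return result

# Hypothetical steps for a tiny training pipeline
def load_data(source):
    return list(range(100))          # stand-in for reading from `source`

def split(data, ratio):
    cut = int(len(data) * ratio)
    return data[:cut], data[cut:]

def train(train_data):
    return {"model": "weights", "n": len(train_data)}  # stand-in for a trained model

if __name__ == "__main__":
    data = cached_step(load_data, "s3://example-bucket/data.csv")
    train_split, eval_split = cached_step(split, data, 0.8)
    model = cached_step(train, train_split)
    print(model)
```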
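
The configuration/code separation discussed at 49:00 might look something like the following: a declarative document that fully describes a run, read by code that never hard-codes data sources, splits, or metrics. All keys, paths, and values here are hypothetical.

```python
# Sketch: keeping the pipeline configuration declarative and separate from code,
# so the same code can run on different data, splits, and metrics.
import yaml

CONFIG_YAML = """
data:
  source: s3://my-bucket/training-data.csv   # hypothetical path
  split: {train: 0.7, eval: 0.3}
training:
  learning_rate: 0.001
  epochs: 20
evaluation:
  metrics: [accuracy, roc_auc]
"""

def run_pipeline(config: dict) -> None:
    # The code only reads the config; changing environments, data sources, or
    # metrics means editing the YAML, not the code.
    print(f"Loading data from {config['data']['source']}")
    print(f"Splitting with ratios {config['data']['split']}")
    print(f"Training for {config['training']['epochs']} epochs "
          f"at lr={config['training']['learning_rate']}")
    print(f"Evaluating on {config['evaluation']['metrics']}")

if __name__ == "__main__":
    run_pipeline(yaml.safe_load(CONFIG_YAML))
```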
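
Finally, a rough sketch of the "classical app metrics first" advice at 57:30, using the Prometheus Python client to expose request counts, latencies, and a crude view of model outputs. The metric names, port, and the stand-in predict function are assumptions made for the example.

```python
# Sketch: exposing classical app metrics for a model-serving endpoint with the
# Prometheus Python client. Metric names, port, and the fake model are illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Number of predictions served")
LATENCY = Histogram("prediction_latency_seconds", "Time spent producing a prediction")
OUTPUT_GAUGE = Gauge("prediction_output_last", "Most recent model output value")

def predict(features):
    """Stand-in for a real model call."""
    time.sleep(random.uniform(0.01, 0.05))
    return random.random()

def handle_request(features):
    with LATENCY.time():        # records request latency
        score = predict(features)
    PREDICTIONS.inc()           # counts requests served
    OUTPUT_GAUGE.set(score)     # crude starting point for watching model outputs
    return score

if __name__ == "__main__":
    start_http_server(8000)     # metrics scraped from http://localhost:8000/metrics
    while True:
        handle_request({"example": 1})
```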