Monday, October 28
 

17:00 GMT

Badge Pickup
Monday October 28, 2024 17:00 - 19:00 GMT
Ground Floor Foyer

18:00 GMT

Welcome Get-Together
Monday October 28, 2024 18:00 - 19:00 GMT
Whether this is your first time at SREcon or your tenth, enjoy this opportunity to meet your fellow attendees over snacks and beverages before the conference program begins.
Monday October 28, 2024 18:00 - 19:00 GMT
Liffey Hall 1
 
Tuesday, October 29
 

07:30 GMT

Badge Pickup
Tuesday October 29, 2024 07:30 - 17:00 GMT
Ground Floor Foyer

08:45 GMT

Morning Coffee and Tea
Tuesday October 29, 2024 08:45 - 08:45 GMT
The Forum

08:45 GMT

Opening Remarks
Tuesday October 29, 2024 08:45 - 09:00 GMT
The Liffey

09:00 GMT

Dude, You Forgot the Feedback: How Your Open Loop Control Planes Are Causing Outages
Tuesday October 29, 2024 09:00 - 09:45 GMT
Laura de Vesine, Datadog, Inc.


It's a strong principle of good UX design that users should get feedback about the results of their actions, to help prevent errors. Experienced SREs know to build in additional observability to systems to watch our systems change as we mutate them, but these are typically out-of-band and require a conscious, deliberate action to observe -- so getting good feedback into our actions requires constant vigilance and training of new users. What if we instead built control planes that tell us exactly what we've done, and what effect that is having?
This talk explores various patterns of "fire and forget" control planes in production systems, how each one contributes to outages, and some simple solutions to build better tools for operations.


https://www.usenix.org/conference/srecon24emea/presentation/de-vesine
Speakers

Laura de Vesine

Datadog, Inc.
Laura de Vesine is a 20+ year software industry veteran. She has spent the last 8 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also...
Tuesday October 29, 2024 09:00 - 09:45 GMT
The Liffey

09:45 GMT

You Depend on Time, This Is How It Works and You Won’t Believe It
Tuesday October 29, 2024 09:45 - 10:30 GMT
Philip Rowlands, Jane Street


This is a talk about calendars, clocks, and computers. We’ll look at the metrology of the second, from candles to atoms, and consider how your phone always seems to know the right time.

If you’ve ever wondered why is today Thursday? or how was the Gregorian calendar adopted? then come and learn the mistakes to avoid the next time you are the Pope.

If you’ve ever wondered why do these two clocks disagree? then come and learn about the challenges of finding the elusive perfect tick, and why it’s not at the top of Mount Everest.

And if you’ve ever wondered how calendars and clocks work together in modern computer systems, then come and learn about protocols and APIs for keeping clocks reliable and accurate.


https://www.usenix.org/conference/srecon24emea/presentation/rowlands
Speakers

Philip Rowlands

Jane Street
Philip Rowlands has been an SRE since before he really understood what it meant. He has worked over the years on automated telephony, Google Production SRE, Mainframe Linux, and more recently for various financial firms, all of which had timekeeping challenges.
Tuesday October 29, 2024 09:45 - 10:30 GMT
The Liffey

10:30 GMT

Coffee and Tea Break
Tuesday October 29, 2024 10:30 - 11:00 GMT
The Forum

11:00 GMT

SRE Saga: The Song of Heroes and Villains
Tuesday October 29, 2024 11:00 - 11:40 GMT
Daria Barteneva, Microsoft Azure


SRE teams require a balance of technical and soft skills, creativity, and teamwork to be successful. Drawing parallels between the roles, challenges, and dynamics of a Dungeons and Dragons party and an SRE team will help us explore the SRE journey, from the team's inception to developing its ideal makeup in terms of tenure/seniority and skillset, and aligning it with the context the SRE team could be part of.

We will share practical examples that help SRE teams build resiliency and effective collaboration while dealing with challenges. We will also explore different mechanisms that can channel "super hero" energy to make the team stronger and nurture talent, helping the team keep the balance of distributed knowledge and accountability.

In this talk we will discuss:


  • Examples of functional SRE team setups

  • Common challenges an SRE team may encounter

  • Developing early-in-career SREs

  • Dealing with change and building resilience

  • Identifying red flags and avoiding long term problems



https://www.usenix.org/conference/srecon24emea/presentation/barteneva
Speakers

Daria Barteneva

Microsoft Azure
Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing...
Tuesday October 29, 2024 11:00 - 11:40 GMT
The Liffey A

11:00 GMT

I Can OIDC You Clearly Now: How We Made Static Credentials a Thing of the Past
Tuesday October 29, 2024 11:00 - 11:40 GMT
Iain Lane and Dimitris Sotirakis, Grafana Labs


At Grafana Labs, we tackled a thorny problem: managing secrets in an open-source CI/CD pipeline. Our journey from static secrets to OIDC-based access wasn't just about better security—it was about empowering our engineers. We'll walk you through how we leveraged OIDC and GitHub Actions to create a "secretless" system for accessing cloud resources, complete with shared jobs and abstractions that make secure access simple. But it wasn't all smooth sailing. We'll share war stories, including a security hiccup that taught us valuable lessons. If you're drowning in a sea of secrets or just want to sleep better at night, come and learn how we boosted security while cutting operational headaches. You'll walk away with practical strategies for implementing OIDC-based access that'll make your engineers happy and your security team even happier.


https://www.usenix.org/conference/srecon24emea/presentation/lane
Speakers

Iain Lane

Grafana Labs
Iain is a senior software engineer at Grafana Labs. A member of the Platform team, his focus is on maintaining the infrastructure - Kubernetes clusters - which runs Grafana Cloud, and helping build tools and processes for engineers to deploy their software into this environment with...

Dimitris Sotirakis

Grafana Labs
Dimitris is a Senior Software Engineer with a background in Backend, DevOps, Release and Platform Engineering. Specialized in CI/CD architecture, he has spent most of his career tackling the challenges of delivering software, tools and frameworks with quality. Currently he’s a member...
Tuesday October 29, 2024 11:00 - 11:40 GMT
The Liffey B

11:00 GMT

Discussion: Managing Cost
Tuesday October 29, 2024 11:00 - 12:30 GMT
John Looney, Reddit, and James Beal


This session is an opportunity for people to come together and discuss managing cost, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in managing cost.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-managing-cost
Speakers

James Beal

James started playing with computers with the ZX81, learned C for his A Levels, and has degrees in computer science and parallel and distributed systems. He has been using Linux, originally with MCC Interim Linux and later with other distributions. He started volunteering at the OTW...

John Looney

Reddit
John is a platform engineer who helps senior engineers tune their applications to cost less, and makes Kubernetes cost less to run. Both projects required making promises to product teams - “that the compute platform will be reliable enough that they don’t need to pad out resources...
Tuesday October 29, 2024 11:00 - 12:30 GMT
Liffey Hall 1

11:00 GMT

Workshop: Loadshedding and Isolation Using Envoy Proxy
Tuesday October 29, 2024 11:00 - 15:30 GMT
Laura Nolan; Niall Murphy, Stanza


Effective load management is a core aspect of the SRE role. In this workshop, participants will be introduced to a number of Envoy proxy features that are used for loadshedding and isolation, such as circuit breaking, adaptive concurrency, and ratelimiting. Participants will also use custom Go plugins to perform loadshedding. As part of the practical element of the workshop, participants will interact with Envoy configurations and status/control pages and endpoints, as well as Envoy’s telemetry.


https://www.usenix.org/conference/srecon24emea/presentation/nolan
Speakers

Niall Murphy

Stanza
Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable...

Laura Nolan

Laura Nolan has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know, and is currently completing her MSc in Human Factors and Systems Safety at Lund University. Laura is a member of the USENIX board...
Tuesday October 29, 2024 11:00 - 15:30 GMT
Liffey Hall 2

11:50 GMT

The Frontiers of Reliability Engineering
Tuesday October 29, 2024 11:50 - 12:30 GMT
Heinrich Hartmann, Zalando SE


We take the 10th anniversary of SREcon as an occasion to reflect on the past decade of advancements in Reliability Engineering and provide an overview of the frontiers we are facing today. Within Zalando, we followed major industry trends: outsourcing hardware provisioning to AWS, packaging applications into Docker images, fully automating deployments (CI/CD), and implementing Distributed Tracing for Microservice Observability. Despite these advances, many challenges remain in building reliable, observable software systems, and new areas have arisen that require new methods and tools. In the talk, we provide a number of conceptual views that help map out the larger Reliability Engineering landscape and zoom in on three specific frontiers that we are actively investing in at Zalando: (1) Data Operations and Monitoring Event-Based Systems, (2) Mobile Observability, and (3) Effective Management Practices for Reliability.


https://www.usenix.org/conference/srecon24emea/presentation/hartmann
Speakers

Heinrich Hartmann

Zalando SE
Heinrich Hartmann is a seasoned expert with a decade of experience in Reliability Engineering. Currently, he serves as the Senior Principal SRE at Zalando, a leading European e-commerce company, where he oversees company-wide reliability practices. Before joining Zalando, Heinrich...
Tuesday October 29, 2024 11:50 - 12:30 GMT
The Liffey A

11:50 GMT

OMG WTF SSO: A Beginner’s Guide to Single Sign-On (Mis)configuration
Tuesday October 29, 2024 11:50 - 12:30 GMT
Adina Bogert-O'Brien


SSO protocols are just ways for an identity provider to share information about an authenticated identity with another service. Me having a way to tell my vendor “yeah, that’s Bob” doesn’t tell me what the vendor does with this information, or if the vendor always asks me who’s coming in the door. A bad SSO implementation can make you think you’re safer, while hiding all the new and fun things that have gone wrong.
To get the most out of implementing SSO, I need to know what I’m trying to accomplish and what steps I need to follow to get there. To illustrate why SSO needs to be set up carefully, for each of the things you need to do right, I’ll give you some fun examples of creative ways you and your vendor can do this wrong. We all learn from failure, right???


https://www.usenix.org/conference/srecon24emea/presentation/bogert-obrien
Speakers

Adina Bogert-O'Brien

I am incessantly curious, work in renewable energy, and sometimes find vulnerabilities when I’m bored. I co-founded a hackerspace over a decade ago but have only just accepted that security is more than a hobby. At work, I’m a business architect with security leanings working...
Tuesday October 29, 2024 11:50 - 12:30 GMT
The Liffey B

12:30 GMT

Luncheon
Tuesday October 29, 2024 12:30 - 14:00 GMT
Sponsors

Cortex

Cortex helps engineering teams understand and improve their services. By aggregating data from tools like Datadog and Okta, we help teams understand their architecture at a glance – everything from ownership to runbooks. Using this data, we enable engineers to build report cards...
Tuesday October 29, 2024 12:30 - 14:00 GMT
The Forum

14:00 GMT

Sailing the Database Seas: Applying SRE Principles at Scale
Tuesday October 29, 2024 14:00 - 14:40 GMT
Ioannis Androulidakis and Martin Alderete, Booking.com


In this talk we will demonstrate how we apply core SRE principles in the field of Database Engineering. More specifically, we will talk about the challenges of operating large-scale database systems in multiple cloud environments and how adopting best SRE practices dramatically improved our daily workflows and operations.

We will share insights and concrete use cases around the following topics: Monitoring Distributed Systems, Eliminating Toil and Postmortem Culture.

This talk will equip attendees with ideas and guidelines to better understand and efficiently operate their database systems such as choosing the right SLIs and SLOs, automating capacity planning and embracing a postmortem culture after outages.


https://www.usenix.org/conference/srecon24emea/presentation/androulidakis
Speakers

Martin Alderete

Booking.com
Martin Alderete is a Principal Site Reliability Engineer with a long track record in Engineering, Distributed Systems and System Level Programming in both academia, where after getting his degree he worked as a teaching assistant, and industry, where he led different teams building...

Ioannis Androulidakis

Booking.com
Ioannis Androulidakis is a Site Reliability Engineer with a strong background and multiple years of experience in Operating Systems, Observability Tools and Cloud Platforms. He is passionate about OSS technologies and has contributed to multiple open-source projects over the years. Ioannis...
Tuesday October 29, 2024 14:00 - 14:40 GMT
The Liffey A

14:00 GMT

Achieving Excellence: SLO Thresholds That Transform Service Quality
Tuesday October 29, 2024 14:00 - 14:40 GMT
Thiara Ortiz, Netflix


At Netflix, ensuring exceptional quality for our streaming platform is crucial. Every time a Netflix member sits down, reclines in their chair, and turns on their TV, it's a moment of truth. It's our opportunity to deliver a spectacular service with amazing quality of experience. Misses, errors, or high latency—whether due to ISP configuration changes, code deployment, or catastrophic fallback—impact how our service is perceived.

In this talk, I'll share methods for defining thresholds for SLOs, ranging from intuition and industry best practices to advanced techniques like A/B experimentation. At Netflix, properly defining SLOs allows us to ensure industry-leading quality of experience for our members.


https://www.usenix.org/conference/srecon24emea/presentation/ortiz
Speakers

Thiara Ortiz

Netflix
Thiara is a Staff CDN Reliability Engineer at Netflix. Over the last four years, Thiara has been working on Open Connect, improving the resilience of the Netflix service for members around the world. Most recently, Thiara has been heavily involved with the introduction of Cloud Gaming...
Tuesday October 29, 2024 14:00 - 14:40 GMT
The Liffey B

14:00 GMT

Discussion: eBPF
Tuesday October 29, 2024 14:00 - 15:30 GMT
Cameron Howes, Goldman Sachs, and Daniel Hodges


This session is an opportunity for people to come together and discuss eBPF, facilitated by our knowledgeable hosts. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in eBPF.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-ebpf
Speakers

Cameron Howes

Goldman Sachs
Cameron Howes is an Analyst in the Market Data SRE team at Goldman Sachs, specialising in low-level development and performance instrumentation. When he's not ferociously avoiding a memory allocation, or reading about the latest CVEs, Cameron can be found writing black-box probers...

Daniel Hodges

Meta
Daniel Hodges is a software engineer who works at Meta on profiling and scheduling. He has worked as a site reliability engineer and production engineer, and has experience with observability, profiling and production deployments.
Tuesday October 29, 2024 14:00 - 15:30 GMT
Liffey Hall 1

14:45 GMT

Survivor: MySQL Island – Outwit, Outplay, Outlast Metadata Locking Challenges
Tuesday October 29, 2024 14:45 - 15:05 GMT
Julia Jablonska, Capsule CRM


Think you understand MySQL metadata locks? Join this interactive session to test your knowledge and take a deep dive into the intricacies of MySQL's locking mechanisms.

We'll explore real-world scenarios, such as creating tables with foreign key constraints and adding indexes, to see how metadata locks can impact performance and stability. Through live voting you'll gain insights into what's happening behind the scenes and learn practical tips for managing database migrations.


https://www.usenix.org/conference/srecon24emea/presentation/jablonska
Speakers

Julia Jablonska

Capsule CRM
As an Infrastructure Engineer at Capsule CRM, Julia is responsible for keeping Capsule secure, fast and reliable for thousands of our business customers around the globe.
Tuesday October 29, 2024 14:45 - 15:05 GMT
The Liffey A

14:45 GMT

Selective Reliability Engineering: There Is No Single Source of Truth
Tuesday October 29, 2024 14:45 - 15:05 GMT
Elise Burke, Datadog, Inc.


As engineers we design distributed architectures, define project scopes, and ensure that we have a single "source of truth". But what, exactly, do we mean by the phrase? Do we really have only one source of truth - and for that matter, how do we decide what it is?

We'll look at some well-known ambiguities in system design and data modeling and then consider more philosophical questions about truth, the sources of truth we accept, and why this ambiguity matters.


https://www.usenix.org/conference/srecon24emea/presentation/burke
Speakers

Elise Burke

Datadog, Inc.
Elise's sixteen-year career as a software and site reliability engineer includes supporting Google's internal distributed storage systems and Datadog's organization-wide production practices. Her interests include exploring the interconnectedness of both technology and the people...
Tuesday October 29, 2024 14:45 - 15:05 GMT
The Liffey B

15:10 GMT

Fixing Your Noisy Pager in 500 Easy Steps
Tuesday October 29, 2024 15:10 - 15:30 GMT
Chris Sinjakli, PlanetScale


You're not sure when it happened, but your pager suddenly seems noisy. You've started dreading your on-call shifts before they begin. You breathe a sigh of relief every time you sleep without interruption. Sound familiar?

Noisy on-call rotas sneak up on us one page at a time - an edge case in a new feature, an alert with too many false positives, processes that get stuck and need restarting. Each of these is easy to tolerate alone, but they quickly add up, leaving you swamped in alert noise and tired from missed sleep.

In this talk we'll explore techniques for digging ourselves out of the hole. We'll look at how to demonstrate the scale of the issue to our colleagues, what to do when the list of problems seems insurmountable, and how to get started with automated remediation in a low-risk way - I promise it's less scary than it sounds.


https://www.usenix.org/conference/srecon24emea/presentation/sinjakli
Speakers

Chris Sinjakli

PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems. All his programs are made from organic, hand-picked, artisanal keypresses.
Tuesday October 29, 2024 15:10 - 15:30 GMT
The Liffey A

15:10 GMT

Why You’re (Probably) Doing Service Catalogs Wrong
Tuesday October 29, 2024 15:10 - 15:30 GMT
Lisa Karlin Curtis, incident.io
Service catalogs promise a lot of things: powerful automations, insights into your technology estate.
But over the last few years, many of us have learned that setting up and maintaining a service catalog is really hard.
Building out a catalog from a standing start can take months, or even years. Too many people get stuck in a chicken-and-egg situation, where you can’t deliver value because you don’t have the data in your catalog, and you can’t convince anyone to spend time helping you because the catalog doesn’t do anything yet.
But there is another way...
https://www.usenix.org/conference/srecon24emea/presentation/curtis
Speakers

Lisa Karlin Curtis

incident.io
Lisa started out as a consultant working with HMRC and then smart meters, before accidentally becoming a developer. She was a founding engineer at incident.io, building tooling to help your whole organization manage incidents better. She loves building stuff, but is also really interested...
Tuesday October 29, 2024 15:10 - 15:30 GMT
The Liffey B

15:30 GMT

Coffee and Tea Break
Tuesday October 29, 2024 15:30 - 16:00 GMT
The Forum

16:00 GMT

Exploring the Unintended Consequences of Automation in Software
Tuesday October 29, 2024 16:00 - 16:40 GMT
Courtney Nash, The VOID


Automation is ubiquitous—it is entwined in our daily lives in ways that we aren’t always aware of. It has been woven into all aspects of modern software by being presented as a utopian vision: a way of making human lives easier, doing repetitive tasks faster and with fewer errors, freeing us fallible humans up to do other ostensibly more important work. But anyone who has worked directly with automated systems knows that we are still very far from such a dreamy reality.

This talk delves into detailed research about how automation is involved in software incidents. My focus on this area stems from the growing portrayal of automation as a panacea for various software incident issues, despite its limitations in effectively addressing these challenges, such as reliable detection and resolution of software issues or analyzing and disseminating learnings from these incidents back into the organization and its products and services.

Drawn directly from public incident reports (collected in the VOID), this research revealed multiple, often competing, roles that automation can play over the course of an incident, and most importantly underscored how important humans are at understanding, troubleshooting, and recovering from automated software issues. If you're struggling to convey the reality behind the hype of automation and AI to others on your team or at your organization, this is the talk for you.


https://www.usenix.org/conference/srecon24emea/presentation/nash
Speakers

Courtney Nash

The VOID
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s...
Tuesday October 29, 2024 16:00 - 16:40 GMT
The Liffey A

16:00 GMT

SRE Stakeholders: A Spotter’s Guide
Tuesday October 29, 2024 16:00 - 16:40 GMT
Dave O'Connor


For every SRE or SRE-adjacent team in any organisation, there are many kinds of stakeholders: people who care (or don't care!) about how your team operates, and the outcomes of that. They differ massively in how they view your team, and in how they, in turn, should be viewed and managed.

In a timeline that doesn't contain a canonical book setting out what SRE is here for and how it achieves that, the sad and annoying answer is that "it depends". Because of this, we need to get good (or remain good) at stakeholder management and communications about why we're here, and what we do.

While primarily useful to SRE leadership, the kinds of stakeholders you run into can be useful to know for any SRE. Learn to spot the different stakeholders in your life, what they (generally) care about, and how you can help reduce misunderstandings and tension, no matter where you're sitting.


https://www.usenix.org/conference/srecon24emea/presentation/oconnor
Speakers

Dave O'Connor

Dave is an SRE Leadership practitioner, Advisor and Coach based in Dublin. He's been working on SRE and SRE-adjacent organisations for over 20 years, primarily as an SRE Lead at Google from 2004-2021. Since then, he has spent time leading SRE, Security and Infrastructure teams at...
Tuesday October 29, 2024 16:00 - 16:40 GMT
The Liffey B

16:00 GMT

Enhancing Elasticsearch Performance: Innovative Reindexing Strategies Using Dedicated Nodes and KEDA Autoscalers
Tuesday October 29, 2024 16:00 - 16:40 GMT
Leila Vayghan, Shopify


This talk is about enhancing the search infrastructure of Shopify, a large-scale ecommerce platform that supports over 3 million merchants and handles more than two petabytes of data.

This talk explains how we leverage Kubernetes on Google Cloud Platform to ensure high availability and performance, crucial for maintaining our platform's robust search functionality. It will also elaborate on our innovative approach using dedicated reindexing nodes within existing clusters, which significantly improves indexing and reindexing performance while cutting infrastructure costs. We will explore the application of Kubernetes Event-Driven Autoscaling (KEDA) to dynamically manage resource allocation, enhancing operational efficiency and reducing on-call fatigue. This strategy not only supports seamless user experiences but also boosts Gross Merchandise Value (GMV) and revenue through improved system responsiveness.

This presentation is ideal for those involved in managing large-scale data systems or interested in advanced Elasticsearch optimizations.


https://www.usenix.org/conference/srecon24emea/presentation/vayghan
Speakers

Leila Vayghan

Shopify
Leila is an engineer at Shopify, where she spends her days enabling millions of merchants to grow by making sure buyers are able to search and find their products. She does this by running a large-scale search infrastructure on Kubernetes in many regions of the world. Leila has completed...
Tuesday October 29, 2024 16:00 - 16:40 GMT
Liffey Hall 2

16:00 GMT

Discussion: Service Level Objectives
Tuesday October 29, 2024 16:00 - 17:30 GMT
Alex Hidalgo, Nobl9, and Heinrich Hartmann, Zalando SE


This session is an opportunity for people to come together and discuss SLOs, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in SLOs.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-slos
Speakers

Heinrich Hartmann

Zalando SE
Heinrich Hartmann is a seasoned expert with a decade of experience in Reliability Engineering. Currently, he serves as the Senior Principal SRE at Zalando, a leading European e-commerce company, where he oversees company-wide reliability practices. Before joining Zalando, Heinrich...

Alex Hidalgo

Nobl9
Alex Hidalgo is the Field CTO at Nobl9 and author of "Implementing Service Level Objectives." During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included...
Tuesday October 29, 2024 16:00 - 17:30 GMT
Liffey Hall 1

16:45 GMT

Rock around the Clock (Synchronization): Improve Performance with High Precision Time!
Tuesday October 29, 2024 16:45 - 17:05 GMT
Lerna Ekmekcioglu, Clockwork Systems


Is the app slow or the network lagging? When it comes to latency in distributed systems, it can be hard to identify where exactly the issue is. As businesses increasingly adopt diverse deployment environments (on-premises, cloud, or hybrid), the complexity grows, obscuring visibility into system health. Join me to hear why clock synchronization is key for identifying the true culprit when latency is due to contention in the network. I’ll demo how network contention impacts tail latencies, followed by an overview of clock synchronization protocols to date, their pros and cons, and best practices in disciplining clocks, as well as recent algorithms from Stanford Research. With high precision clock synchronization at scale, we gain back visibility into useful one-way delay metrics, which act as an early signal for network congestion and help us prevent impact to response times for our end users!


https://www.usenix.org/conference/srecon24emea/presentation/ekmekcioglu
Speakers

Lerna Ekmekcioglu

Clockwork Systems
Lerna is a Senior Solutions Engineer at Clockwork Systems where she helps customers meet their performance goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Senior Solutions Architect serving Global Financial Services customers...
Tuesday October 29, 2024 16:45 - 17:05 GMT
The Liffey A

16:50 GMT

Panel Discussion: Is Reliability a Luxury Good?
Tuesday October 29, 2024 16:50 - 17:30 GMT
Moderator: Emil Stolarsky
Panelists: Niall Murphy, Stanza
https://www.usenix.org/conference/srecon24emea/presentation/stolarsky
Moderators

Emil Stolarsky

Increase
Emil is an engineer at Increase where he works on building modern banking infrastructure. Before that, he was at companies such as Wave Mobile Money, DigitalOcean, and Shopify, working on everything from building data centres in Sub-Saharan Africa to caching & performance optimizations...
Speakers

Niall Murphy

Stanza
Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable...
Tuesday October 29, 2024 16:50 - 17:30 GMT
The Liffey B

16:50 GMT

Multi-tier Kubernetes Cluster Auto-Scaling
Tuesday October 29, 2024 16:50 - 17:30 GMT
Moeid Heidari


This research tackles the limitations of traditional autoscaling systems, which typically operate within a single cloud provider. We propose a new Kubernetes autoscaling operator that dynamically adjusts resources across multiple cloud platforms and on-premise systems. By integrating with various provisioning systems and allowing user-defined scaling strategies, this operator addresses the inefficiencies and vendor lock-in issues of conventional solutions. Our approach not only enhances scalability and system resilience but also improves cost-efficiency, as demonstrated by a significant increase in system availability. Metrics are collected and analyzed to predict scaling needs, ensuring optimal performance and resource utilization.


https://www.usenix.org/conference/srecon24emea/presentation/heidari
Speakers
Moeid Heidari

With over 16 years of experience in the IT industry, I offer a broad and deep skill set in technology. I hold a Master’s degree in Computer Science and am currently pursuing a PhD focused on cloud computing, scalability, and high availability methods. In my current role as a Cloud...
Tuesday October 29, 2024 16:50 - 17:30 GMT
Liffey Hall 2

17:10 GMT

Mnemonic Rules for Eponymous Laws or: There’s a Law for That!
Tuesday October 29, 2024 17:10 - 17:30 GMT
Peter Burkholder, U.S. Government


As SREs, referencing named laws like Brooks’s Law, Gall’s Law, or the Jevons Paradox can help strengthen our arguments. But remembering which law applies when is challenging.

In this talk, I'll highlight the most useful tech and behavioral science laws for SRE work, offer mnemonic tips for recalling them, and share real-world examples. We'll finish with a quick quiz to ensure you're ready to apply these concepts in your role.


https://www.usenix.org/conference/srecon24emea/presentation/burkholder
Speakers
Peter Burkholder

U.S. Government
Geophysicist turned SRE. Jobs include: US Gov (18F/cloud.gov), GovReady, Chef, AARP, NCBI, NCAR, Univ. of Washington. In my own time, I make pizza, sing, and play guitar (not simultaneously).
Tuesday October 29, 2024 17:10 - 17:30 GMT
The Liffey A

17:30 GMT

Conference Reception at the Sponsor Showcase
Tuesday October 29, 2024 17:30 - 19:30 GMT
Enjoy dinner and beverages while networking with other attendees and visiting the exhibits as we close out the first day of sessions!
Tuesday October 29, 2024 17:30 - 19:30 GMT
The Forum
 
Wednesday, October 30
 

08:00 GMT

Morning Coffee and Tea
Wednesday October 30, 2024 08:00 - 09:00 GMT
The Forum

08:00 GMT

Badge Pickup
Wednesday October 30, 2024 08:00 - 17:00 GMT
Ground Floor Foyer

09:00 GMT

Lessons from Unix History
Wednesday October 30, 2024 09:00 - 09:45 GMT
Diomidis Spinellis, AUEB & TU Delft


Explore the timeless lessons of Unix’s evolution in a talk that examines its significant influence on modern computing. For over fifty years, Unix has been a cornerstone in shaping software technologies and development practices. This session will guide you through a historical narrative, illustrating key innovations from Unix's First Research Edition to modern FreeBSD releases, such as prototyping, portability, modular coding, and the importance of developer efficiency over machine time.

Discover the architectural philosophies embedded in Unix, such as aggressive partitioning, composition, layering, and convention-based extensibility, as well as the strategic use of pipelines and filters for program composition. Based on extensive research and case studies, this talk is not just a technical retrospective but also a reminder of the enduring principles that continue to inform effective system and software development today. Perfect for developers, architects, and tech enthusiasts eager to enhance their programming ethos with proven, age-old wisdom.


https://www.usenix.org/conference/srecon24emea/presentation/spinellis
Speakers
Diomidis Spinellis

AUEB & TU Delft
Diomidis Spinellis is a Professor of Software Engineering at AUEB and a Professor of Software Analytics in the Department of Software Technology at TU Delft. In previous lives he has served the Greek Government as Secretary General for Information Systems and has worked (briefly) as...
Wednesday October 30, 2024 09:00 - 09:45 GMT
The Liffey

09:45 GMT

Treat Your Code as a Crime Scene
Wednesday October 30, 2024 09:45 - 10:30 GMT
Adam Tornhill, CodeScene


We'll never be able to understand a software system from a single snapshot of the code. Instead we need to understand how the code evolved and how the people who work on it are organized. We also need strategies for finding bottlenecks and technical debt impairing our productivity, as well as uncovering hidden dependencies between code and people. Where do you find such strategies if not within the field of criminal psychology?


This session starts with a crash course in offender profiling before we quickly move on to adapt those principles to software development. You'll learn how easily obtained version-control data lets you uncover the behavior and patterns of the development organization. This language-neutral approach lets you prioritize the parts of your system that benefit the most from improvements so that you can balance short- and long-term goals guided by data. The presentation will change how you view code. Promise.


https://www.usenix.org/conference/srecon24emea/presentation/tornhill
Speakers
Adam Tornhill

CodeScene
Adam Tornhill is a programmer who combines degrees in engineering and psychology. He’s the founder of CodeScene, where he designs tools for software analysis. He’s also the author of the best-selling Your Code as a Crime Scene, and three more technical books. Adam’s other interests...
Wednesday October 30, 2024 09:45 - 10:30 GMT
The Liffey

10:30 GMT

Coffee and Tea Break
Wednesday October 30, 2024 10:30 - 11:00 GMT
The Forum

11:00 GMT

Finding the Capacity to Grieve Once More
Wednesday October 30, 2024 11:00 - 11:40 GMT
Alexandros Kosiaris, Wikimedia Foundation
At Wikipedia, we handle unpredictable traffic spikes, especially during notable deaths, which can cause severe outages. Despite believing we had mitigated this issue years ago, a major outage occurred in 2020 due to a notable death and a DDoS attack, leading to the realization that our platform needed further improvements. Over the years, we conducted investigations and implemented numerous fixes, educating new SREs about our platform's unique constraints. Two years ago, following the death of Elizabeth II, our system successfully handled unprecedented traffic without outages, demonstrating our platform's resilience. This story highlights the infrastructure improvements that allowed us to manage traffic surges and the emotional journey of regaining the capacity to properly grieve significant losses.
We heavily rely on open source, and our code is public, making our solutions accessible to everyone.
https://www.usenix.org/conference/srecon24emea/presentation/kosiaris
Speakers
Alexandros Kosiaris

Wikimedia Foundation
A Linux sysadmin, turned FreeBSD sysadmin, turned Linux sysadmin, turned systems engineer (somewhere along that path there’s a DevOps hat as well), turned SRE, Alexandros has been in the space since 1999, starting as a hobbyist, then a professional. Currently working with the Wikimedia...
Wednesday October 30, 2024 11:00 - 11:40 GMT
The Liffey A

11:00 GMT

Anomaly Detection in Time Series from Scratch Using Statistical Analysis
Wednesday October 30, 2024 11:00 - 11:40 GMT
Ivan Shubin


Implementing anomaly detection for time series can be challenging, with many techniques and tools available. But can you achieve effective results without AI or Machine Learning? In this talk, we will demonstrate how basic statistical methods can effectively detect anomalies in time series data. We'll show you how to use Grafana to visualize these anomalies on graphs and ensure past incidents do not impact future predictions. Additionally, we will explore building Grafana dashboards as code as part of the anomaly detection solution and adjusting the detection for various events.


https://www.usenix.org/conference/srecon24emea/presentation/shubin
Speakers
Ivan Shubin

Hi, my name is Ivan. I am a Senior Site Reliability Engineer at Booking.com. Before that I worked at TomTom and eBay. Throughout my career, I have explored various roles including Quality Assurance, Software Engineering, System Administration, and SRE. I have always been fascinated...
Wednesday October 30, 2024 11:00 - 11:40 GMT
The Liffey B

11:00 GMT

From PIDs to Pods: The Life Cycle of an eBPF-Autoinstrumented Application
Wednesday October 30, 2024 11:00 - 11:40 GMT
Marc Tudurí, Grafana Labs


eBPF lets you attach programs to the Linux kernel and inspect the memory of the kernel and of user programs at runtime. Join us in this session to discover how Grafana Beyla, our eBPF-based instrumentation tool, works and how it treats Kubernetes as a first-class citizen. We will describe how we match the low-level abstractions from eBPF with the Kubernetes metadata, giving Kubernetes users out-of-the-box observability for their running applications.


https://www.usenix.org/conference/srecon24emea/presentation/tudur%C3%AD
Speakers
Marc Tudurí

Grafana Labs
Marc Tudurí is a Prometheus contributor, OpenTelemetry member and Software Engineer at Grafana.
Wednesday October 30, 2024 11:00 - 11:40 GMT
Liffey Hall 2

11:00 GMT

Discussion: SRE in Small Orgs
Wednesday October 30, 2024 11:00 - 12:30 GMT
Emil Stolarsky, Increase, and Joan O’Callaghan, Udemy


This session is an opportunity for people to come together and discuss running SRE teams in small organisations, facilitated by our knowledgeable guides. This is not a prepared talk or workshop; expect a less formal session, with plenty of opportunity to ask questions and to talk to other attendees who are part of SRE teams in small organisations.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-sre-in-small-orgs
Speakers
Emil Stolarsky

Increase
Emil is an engineer at Increase where he works on building modern banking infrastructure. Before that, he was at companies such as Wave Mobile Money, DigitalOcean, and Shopify, working on everything from building data centres in Sub-Saharan Africa to caching & performance optimizations...
Wednesday October 30, 2024 11:00 - 12:30 GMT
Liffey Hall 1

11:50 GMT

Incident Groundhog Day
Wednesday October 30, 2024 11:50 - 12:30 GMT
Hamed Silatani, Uptime Labs


Learning how to respond effectively to incidents is hard. One of the reasons is that we never see the same incident twice. While we can learn vital lessons during and after an incident, we can’t hop into a time machine, and apply these lessons to the same incident to discover their impact. What if we could experience the same incident over and over again? What might we learn? This talk describes a ‘staged world’ experiment in which 20 incident managers separately experienced the same simulated incident affecting a fictitious e-commerce company. We discuss what we noticed that differentiated some incident responders from others, and some surprising things that we expected to see, but didn’t.


https://www.usenix.org/conference/srecon24emea/presentation/silatani
Speakers
Hamed Silatani

Uptime Labs
Hamed is co-founder and CEO of Uptime Labs, an incident learning & simulation platform. He has 20 years of experience in engineering leadership, reliability engineering, and IT operations. Having spent the majority of his career at the sharp end of incident response in financial services...
Wednesday October 30, 2024 11:50 - 12:30 GMT
The Liffey A

11:50 GMT

Generative AI: Beyond (Just) Hype
Wednesday October 30, 2024 11:50 - 12:30 GMT
Todd Underwood


Generative AI is one of the most hyped technologies in most of our careers. While it is driving a complete transformation of priorities at some tech organizations, many engineers remain deeply skeptical about any practical uses of Generative AI.

The skepticism is warranted and the hype is (for now) exaggerated, but not completely without merit. These technologies are not entirely useless for the kind of work we do. In this talk I will highlight a few emerging use cases that sidestep some of the weaknesses of GenAI (hallucination, errors), and still manage to provide value, specifically for production engineering.


https://www.usenix.org/conference/srecon24emea/presentation/underwood
Speakers
Todd Underwood

Todd Underwood recently led reliability for the Research Platform at OpenAI. Previously he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering and was the...
Wednesday October 30, 2024 11:50 - 12:30 GMT
The Liffey B

11:50 GMT

Scheduling at Scale: eBPF Schedulers with Sched_ext
Wednesday October 30, 2024 11:50 - 12:30 GMT
Daniel Hodges, Meta


This talk will discuss how eBPF-based schedulers can be used to enhance application performance at scale. The presentation will begin by explaining the fundamental eBPF capabilities necessary for constructing schedulers, providing a foundation for understanding their design. Following this introduction, the talk will discuss schedulers and their design. Finally, it will share some practical lessons for deploying schedulers in production environments.


https://www.usenix.org/conference/srecon24emea/presentation/hodges
Speakers
Daniel Hodges

Meta
Daniel Hodges is a software engineer who works at Meta on profiling and scheduling. He has worked as a site reliability engineer and production engineer, and has experience with observability, profiling, and production deployments.
Wednesday October 30, 2024 11:50 - 12:30 GMT
Liffey Hall 2

12:30 GMT

Luncheon
Wednesday October 30, 2024 12:30 - 14:00 GMT
The Forum

14:00 GMT

When Your SaaS Provider Goes out of Business – Lessons from an Averted Crisis
Wednesday October 30, 2024 14:00 - 14:40 GMT
Raphael Seebacher and Christof Gerber, Open Systems AG


What do you do when your SaaS provider unexpectedly goes out of business?

It's the early days of 2023 when the provider of a critical component in our Web Proxy service announces that it has just gone out of business. With services used by 100 customers with 300K daily users across 3,500 locations worldwide at stake, we knew that it was time for swift action.

Join us in this talk as Raphi, the crisis lead, and Christof, engineer on the Web Security team, recount their experience handling this crisis. We will take you from the dizziness of the initial shock to our first hour, first day, first week, and first month actions, detailing leadership, communication, and technical responses and the trade-off decisions we faced.

You'll leave this talk with another tale from production and practical ideas to make your own organisation better prepared for a similar unexpected crisis.


https://www.usenix.org/conference/srecon24emea/presentation/seebacher
Speakers
Christof Gerber

Open Systems AG
Christof is an engineer who develops, maintains, and operates Software as a Service to secure corporate web traffic worldwide. Working at the intersection of computer networks, IT security, and software engineering, he is passionate about building reliable systems for Linux servers...
Raphael Seebacher

Open Systems AG
Raphael is a systems engineer who spent the last decade exploring the Engineer/Manager pendulum at Open Systems. He holds an MSc in electrical engineering and information technology, an MAS in Management, Technology and Economics and is a captain in the Swiss Armed Forces. His interests...
Wednesday October 30, 2024 14:00 - 14:40 GMT
The Liffey A

14:00 GMT

Noisy Neighbors, through Networking
Wednesday October 30, 2024 14:00 - 14:40 GMT
René Treffer and Ben Kochie, Reddit


When operating multi-tenant environments, like in Kubernetes, you can have "noisy neighbors". Resources like CPU and network can have contention which can lead to service degradation. But the causes of contention are not always what you would think. In this talk we will look at some surprising instances of "noisy neighbors", how they unfolded, how we discovered them, and how we mitigated the effects.


https://www.usenix.org/conference/srecon24emea/presentation/treffer
Speakers
René Treffer

Reddit
René Treffer is an infrastructure software engineer at Reddit.
Ben Kochie

Reddit
Ben Kochie is a principal software engineer at Reddit.
Wednesday October 30, 2024 14:00 - 14:40 GMT
The Liffey B

14:00 GMT

NVMe/TCP Makes iSCSI Look like Fortran
Wednesday October 30, 2024 14:00 - 14:40 GMT
Chris Engelbert, simplyblock GmbH


For more than two decades, iSCSI was the go-to protocol standard for remote block storage over commodity network hardware. It runs over normal Ethernet networks, which avoids specialist hardware, saves cost, and provides a much lower entry barrier than Fibre Channel or InfiniBand.

However, the underlying storage technologies made leaps during that time, and today iSCSI is often a bottleneck for high-performance storage deployments backed by SSDs or NVMe. Therefore, the NVM Express group defined the NVMe over Fabrics protocol family, with NVMe over TCP at the forefront to replace iSCSI while offering lower latency, higher throughput, and less protocol overhead.

Let’s dive into NVMe, NVMe over TCP, and how it’s superior to iSCSI, as well as the support landscape.


https://www.usenix.org/conference/srecon24emea/presentation/engelbert
Speakers
Chris Engelbert

simplyblock GmbH
Christoph Engelbert is a developer by heart, with strong bonds to the open source world. As a seasoned speaker at international conferences, he loves to share his experience and ideas, especially in the areas of scalable system architectures and back-end technologies, as well as all...
Wednesday October 30, 2024 14:00 - 14:40 GMT
Liffey Hall 2

14:00 GMT

Discussion: Monitoring and Alerting
Wednesday October 30, 2024 14:00 - 15:30 GMT
Daria Barteneva, Microsoft Azure, and Niall Murphy, Stanza

This session is an opportunity for people to come together and discuss monitoring and alerting, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in monitoring and alerting.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-monitoring-alerting
Speakers
Daria Barteneva

Microsoft Azure
Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing...
Niall Murphy

Stanza
Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable...
Wednesday October 30, 2024 14:00 - 15:30 GMT
Liffey Hall 1

14:45 GMT

Configuration Languages Are the Bane of Our Existence
Wednesday October 30, 2024 14:45 - 15:05 GMT
Paul Komkoff


It is probably a good idea to make it possible to change some constants in your program without recompiling it. So why does it then get incredibly hard to control these configurations? At what point does configuration become a program with no tests, written in an untyped language, that requires a lot of compute to evaluate and can't be checked in advance? Is it at all possible (and enough) to get rid of these languages and go back to ini files?

If you are like me, you want to know the answers to these questions, and this is what I'm going to talk about. Plus:


  • sendmail.cf was an early sign everybody ignored

  • if you use regular expressions for matching and selecting in your configuration, start writing a premortem

  • when your configuration is more complicated than your program, who is your program now?



https://www.usenix.org/conference/srecon24emea/presentation/komkoff
Speakers
Paul Komkoff

Out of 33 years of working with computers and networks, Paul spent 17 in SRE organizations. He believes that complexity needs to be actively managed and that, to find better ways to fix things, we need to explore the depths of failure.
Wednesday October 30, 2024 14:45 - 15:05 GMT
The Liffey A

14:45 GMT

Taming Noisy Benchmark Results Using Change Point Detection
Wednesday October 30, 2024 14:45 - 15:05 GMT
Matt Fleming, Cloudflare


Modern systems are inherently nondeterministic and that leads to noisy benchmark results. Change Point Detection has emerged as a helpful technique for detecting significant changes in performance results even when those results are noisy and unstable. This talk will explain how Change Point Detection works and the open source projects available for developers to use CPD with noisy benchmark results.


https://www.usenix.org/conference/srecon24emea/presentation/fleming
Speakers
Matt Fleming

Cloudflare
Matt is co-founder of Nyrkiö and a Systems Engineer at Cloudflare. He has spent over 15 years working on low-level, high-performance systems and was previously the maintainer for the Linux kernel EFI subsystem. He has co-authored papers on performance change detection and distributed...
Wednesday October 30, 2024 14:45 - 15:05 GMT
The Liffey B

14:45 GMT

The Silent Performance Killers: BIOS and Firmware Updates
Wednesday October 30, 2024 14:45 - 15:05 GMT
Darin E. Langone


In the ever-changing landscape of CVEs, bug fixes, enhancements, etc., vendors are taking a more rigid stance when it comes to applying patches and security fixes that they have provided. If you are not careful and do as they say without implementing any pre- and post-patch testing and analysis, you open your hardware and systems up to potentially significant performance impact.


https://www.usenix.org/conference/srecon24emea/presentation/langone
Speakers
Darin E. Langone

Darin Langone is a software engineer at Bloomberg. As a member of the Compute Platform engineering team, his focus is on performance testing and benchmarking servers before and after BIOS and firmware updates have been applied. Since joining Bloomberg 25 years ago, he has worked on...
Wednesday October 30, 2024 14:45 - 15:05 GMT
Liffey Hall 2

15:10 GMT

Just Buy the Printer: Resilience in Action
Wednesday October 30, 2024 15:10 - 15:30 GMT
Cail Young, Octopus Deploy


A retelling of a recent near-miss at Octopus Deploy involving code signing certificates, multiple teams responding to an incident, and everybody's favourite piece of security hardware - the humble printer. After the story, we'll reflect on what the story says about the resilience factors already in the organisation, and what the telling of the story itself might be able to do for resilience across organisations.


https://www.usenix.org/conference/srecon24emea/presentation/young
Speakers
Cail Young

Octopus Deploy
Cail has spent the last couple of decades working at the intersection of people and technology: in the performing arts, in the motion picture industry, and now in the field of software operations. He is fascinated by learning from incidents - large and small - and will gladly trade...
Wednesday October 30, 2024 15:10 - 15:30 GMT
The Liffey A

15:10 GMT

Enabling Product Scalability through Load Testing
Wednesday October 30, 2024 15:10 - 15:30 GMT
Monica Baluna and Ehab Tawfik, Bloomberg


One of Bloomberg's flagship products, Instant Bloomberg (IB), is used by financial professionals around the globe for instant messaging. This system is powered by a multitude of microservices, databases and UIs that interact through synchronous or asynchronous API calls and queueing mechanisms.

We recently released Forums in IB. This new form of group chat introduced exciting features. With our clients needing increasingly larger group chats, we took the opportunity to ask how to make sure the new system and the existing one can scale up with the extra load without affecting the existing user workflows.

This talk explores the different load testing strategies we adopted while enabling support for chats ten times larger than before, while also migrating existing group chats to become Forums. We will focus on two elements: (i) creating a realistic representation of production traffic in a test environment, and (ii) how to efficiently gather insightful metrics.


https://www.usenix.org/conference/srecon24emea/presentation/baluna
Speakers
Monica Baluna

Bloomberg
Monica Baluna is a software engineer at Bloomberg in London, where she has worked for the past six years. Her main interests include distributed systems, as well as building reliable software and robust APIs. She has had an opportunity to explore these interests, as her team manages...
Ehab Tawfik

Bloomberg
Ehab Tawfik is a software engineer who loves problem solving, technology, and business. He works in Core Products Engineering at Bloomberg in London. He is passionate about back-end systems and distributed computing. Ehab earned a bachelor's degree in computer science and engineering...
Wednesday October 30, 2024 15:10 - 15:30 GMT
The Liffey B

15:10 GMT

How a Single API Endpoint Saved Us 3000 CPU
Wednesday October 30, 2024 15:10 - 15:30 GMT
Lasse Hels, Maersk


How do you run a time series database exclusively on spot nodes? With great difficulty!

Grafana Mimir is the centrepiece of our observability platform at Maersk. For a long time, rollouts of Mimir's most crucial component would consistently trigger significant performance degradations in the platform. Getting to the root cause of the issue proved laborious and took us deep into the internals of Mimir.

Join us as we go through the issue postmortem and reflect on how to create consistency in a chaotic environment. The talk touches on topics such as CPU throttling, hash rings, compute utilisation analysis and metric series cardinality.


https://www.usenix.org/conference/srecon24emea/presentation/hels
Speakers
Lasse Hels

Maersk
Lasse is a software engineer at Maersk. As a member of the telemetry team, he took part in building the Maersk Observability Platform, and now spends much of his time keeping it running. Outside of work, his interests include speedrunning, powerlifting, etymology, and camels.
Wednesday October 30, 2024 15:10 - 15:30 GMT
Liffey Hall 2

15:30 GMT

Coffee and Tea Break
Wednesday October 30, 2024 15:30 - 16:00 GMT
The Forum

16:00 GMT

Managing the Risk of Software Supply Chain Attacks
Wednesday October 30, 2024 16:00 - 16:40 GMT
Mark Hahn, Qualys


Open-source software (OSS) is flourishing and is used by at least 90% of companies. Modern applications are built on webs of open-source code, APIs, and third-party integrations.

Because of this, hackers are now compromising weak links in existing software supply chains. Software supply chain (SSC) threats include tampering with updates (tainted updates), compromised third-party libraries, vulnerabilities in open-source packages, and malicious code or malware in packages. Software supply chain attacks have increased by an average of 742% per year.

This talk covers ways to prevent software supply chain attacks and how to respond when the ecosystem has been tainted.


https://www.usenix.org/conference/srecon24emea/presentation/hahn
Speakers
Mark Hahn

Qualys
Mark Hahn is the Solutions Architect for Cloud and DevOps Security at Qualys. He uses DevSecOps and Site Reliability Engineering practices to ensure that software and applications are deployed with high velocity and with the utmost security. He shows clients how to build security...
Wednesday October 30, 2024 16:00 - 16:40 GMT
The Liffey A

16:00 GMT

How to Host a (Very) Popular Website for 30 Altairian Dollars a Day
Wednesday October 30, 2024 16:00 - 16:40 GMT
James Beal


For 15 years, the Archive of Our Own (AO3) has provided a safe haven for fanworks while refusing to implement paid accounts, sell user data, or restrict fans' creativity. We're completely donor-funded and volunteer-run and currently serve about 34 billion pages a year—using servers that we own in order to reduce the likelihood of deplatforming due to our commitment to creative freedom.


We know a thing or two about getting the most out of an Altairian dollar without compromising user privacy or free expression. Even if your project has different constraints, our approach might just help you stretch your project's budget.


https://www.usenix.org/conference/srecon24emea/presentation/beal
Speakers
James Beal

James started playing with computers with the ZX81, learned C for his A Levels, and has degrees in computer science and parallel and distributed systems. He has been using Linux, originally with MCC Interim Linux and later with other distributions. He started volunteering at the OTW...
Wednesday October 30, 2024 16:00 - 16:40 GMT
The Liffey B

16:00 GMT

What If We Ask Linux to Do Cryptography for Us?
Wednesday October 30, 2024 16:00 - 16:40 GMT
Oxana Kharitonova, Cloudflare


It's difficult to imagine the modern world without cryptography. We use cryptography to encrypt data before transmitting it over the Internet or storing it on a disk. But we don't think much about how it works; we just pick the most popular cryptographic user space library for our next application and let it do the work for us. What if it's not as secure as we hope? There is another way to do it: with the Linux kernel itself. It can encrypt and decrypt data in the same way as user space libraries do, but in a much more secure way. In this talk we will explore how to integrate this feature into user space applications written in Go and Rust. You don’t need to be a Linux kernel ninja to start using it.


https://www.usenix.org/conference/srecon24emea/presentation/kharitonova
Speakers
Oxana Kharitonova

Cloudflare
Oxana Kharitonova is a systems engineer at Cloudflare. Having worked mostly with high-level languages in the past, her passion for low-level programming led to a career change to work primarily on Linux.
Wednesday October 30, 2024 16:00 - 16:40 GMT
Liffey Hall 2

16:00 GMT

Discussion: Scaling Databases
Wednesday October 30, 2024 16:00 - 17:30 GMT
Chris Sinjakli, PlanetScale, and Martin Alderete, Booking.com


This session is an opportunity for people to come together and discuss the challenges inherent in scaling databases, facilitated by our knowledgeable hosts. This is not a prepared talk or workshop; expect a less formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in scaling databases.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-scaling-databases
Speakers
Chris Sinjakli

PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems. All his programs are made from organic, hand-picked, artisanal keypresses.
Martin Alderete

Booking.com
Martin Alderete is a Principal Site Reliability Engineer with a long track record in Engineering, Distributed Systems and System Level Programming in both academia, where after getting his degree he worked as a teaching assistant, and industry, where he led different teams building...
Wednesday October 30, 2024 16:00 - 17:30 GMT
Liffey Hall 1

16:50 GMT

When SRE and Security Teams Meet to Face a Crisis
Wednesday October 30, 2024 16:50 - 17:30 GMT
JR Aquino


For SREs, Security is at the same time a priority and not a priority; prioritization highly depends on the environment, the size, and the goals of each org.

This talk aims to give SREs, through real-life examples, insights into:

What to expect (and how to be good neighbors) when they are called in to work with Security teams to manage a security incident
Inter-organizational unexpected challenges that might occur
What to keep an eye on in the future


https://www.usenix.org/conference/srecon24emea/presentation/aquino
Speakers

JR Aquino

Redfin | Rent - Head of Information Security
Former Microsoft and Citrix Security Leader
Created centralized SUDO for Fedora’s FreeIPA
FreeBSD port maintainer for Metasploit and UnrealIRCD
OpenBSD port maintainer for Nmap
Wednesday October 30, 2024 16:50 - 17:30 GMT
The Liffey A

16:50 GMT

How Snowflake Migrated All Alerts and Dashboards to a Prometheus-Based Metrics System in 3 Months
Wednesday October 30, 2024 16:50 - 17:30 GMT
Carlos Mendizabal, Snowflake


This talk goes over how Snowflake migrated its alerts and dashboards in 3 months, a migration that included rewriting all alerts and dashboards used for system monitoring. We'll go over the tooling that enabled us to complete this migration successfully, which included configuration-as-code through Jsonnet and a unit testing framework, and share some important takeaways from this effort.


https://www.usenix.org/conference/srecon24emea/presentation/mendizabal
Speakers

Carlos Mendizabal

Snowflake
Carlos Mendizabal is a software engineer at Snowflake. He is part of the Observability team and loves to build things (and to ensure they're well monitored!). Previously at Meta, he's also passionate about meeting folks across the industry and keeping up with the latest and greatest... Read More →
Wednesday October 30, 2024 16:50 - 17:30 GMT
The Liffey B

16:50 GMT

Synthetic Monitoring and E2E Testing: 2 Sides of the Same Coin
Wednesday October 30, 2024 16:50 - 17:30 GMT
Carly Richmond, Elastic


Despite the emergence of DevOps to unite development, support and SRE factions using common processes, we still face cultural and tooling challenges that create Dev and SRE silos. Specifically, we often use different tools to achieve similar testing. Case in point: validating the user experience in production using Synthetic Monitoring and in development using E2E testing.

By joining forces around common tooling, we can use the same tool for both production monitoring and testing within CI. In this talk, I will discuss how Synthetic Monitoring and E2E Testing are two sides of the same coin. Furthermore, I shall show how production monitoring and development testing can be achieved using Playwright, GitHub Actions and Elastic Synthetics.


https://www.usenix.org/conference/srecon24emea/presentation/richmond
Speakers

Carly Richmond

Elastic
Carly is a Principal Developer Advocate and Manager at Elastic, based in London, UK. Before joining Elastic in 2022, she spent over 10 years working as a technologist at a large investment bank, specialising in front-end web development and agility. She is a UI developer, who occasionally... Read More →
Wednesday October 30, 2024 16:50 - 17:30 GMT
Liffey Hall 2

17:45 GMT

Lightning Talks
Wednesday October 30, 2024 17:45 - 18:45 GMT
Lightning Talks are four-minute talks by different speakers addressing a variety of SRE-relevant topics.

The lightning talks session will conclude with Slide Karaoke: a chance for any attendee to show off their improv skills by presenting a slide deck that they have never seen before.
Wednesday October 30, 2024 17:45 - 18:45 GMT
The Liffey A
 
Thursday, October 31
 

08:00 GMT

Morning Coffee and Tea
Thursday October 31, 2024 08:00 - 09:00 GMT
The Forum

08:00 GMT

Badge Pickup
Thursday October 31, 2024 08:00 - 12:00 GMT
Ground Floor Foyer

09:00 GMT

Monitoring Systems as a Service – Walking the Line between Giving Your Devs Good M&O and Setting All Your Money on Fire
Thursday October 31, 2024 09:00 - 09:40 GMT
Joan O'Callaghan, Udemy


Monitoring-as-a-Service products like Datadog and Honeycomb are amazing for implementing monitoring & observability with minimal effort, but like Anything-as-a-Service, they come at a cost.

We are a very normal company, with all the tech debt and orphaned code that any company over a certain age has. Like everyone else, we had staff that heard, "measure everything!" but they didn't know what the monitoring bill looked like and that "everything" included a lot of junk.

In the talk I'll discuss how we managed to reduce cost wastage, enable extra vendor features, improve M&O knowledge within the engineering organisation and keep the bill the same or lower, despite a 60% growth in infrastructure at our company.

A note on the vendor: I won't say who the vendor is, but I think our experience was universal enough that our fixes and techniques will be helpful to other companies.


https://www.usenix.org/conference/srecon24emea/presentation/ocallaghan
Speakers

Joan O'Callaghan

Udemy
Joan O'Callaghan is a Monitoring and Observability Director at Udemy. She has worked in SRE and Incident Management and M&O (in one form or another), for many, many years. She likes to host and write blameless incident reviews and take long walks on the beach where she has imaginary... Read More →
Thursday October 31, 2024 09:00 - 09:40 GMT
The Liffey A

09:00 GMT

Opening the Box: Diagnosing Operating-System Task-Scheduler Behavior on Highly Multicore Machines
Thursday October 31, 2024 09:00 - 09:40 GMT
Julia Lawall, Inria-Paris


Getting unexpectedly poor performance from your multicore application? Maybe the operating system task scheduler is at fault. The task scheduler is responsible for placing tasks on cores and for selecting which task is allowed to run, at what time, and for how long. As such, the scheduler is a critical component of any operating system and has a major impact on application performance. Still, scheduling decisions are buried deep within the operating system code, making it challenging to diagnose performance problems (or even performance improvements) to determine whether the scheduler is responsible and, if so, in what way. These challenges are compounded for highly multithreaded applications, running on large multicore machines, due to the huge amount of information available.

In this talk, we present some tools that we have developed for visualizing the behavior of the Linux kernel task scheduler, and illustrate how these tools can be used to help diagnose performance problems. The tools presented are freely available at https://gitlab.inria.fr/schedgraph/schedgraph


https://www.usenix.org/conference/srecon24emea/presentation/lawall
Speakers

Julia Lawall

Inria-Paris
Julia Lawall is a senior researcher at Inria Paris. Prior to joining Inria, she completed a PhD at Indiana University and was on the faculty at the University of Copenhagen. Her work focuses on issues around the correctness and performance of operating systems. She develops and maintains... Read More →
Thursday October 31, 2024 09:00 - 09:40 GMT
The Liffey B

09:00 GMT

Discussion: Wrangling your Management Chain
Thursday October 31, 2024 09:00 - 10:30 GMT
Dave O’Connor and Todd Underwood


This session is an opportunity for people to come together and discuss managing your management chain, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in wrangling your management.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-wrangling-management
Speakers

Dave O'Connor

Dave is an SRE Leadership practitioner, Advisor and Coach based in Dublin. He's been working on SRE and SRE-adjacent organisations for over 20 years, primarily as an SRE Lead at Google from 2004-2021. Since then, he has spent time leading SRE, Security and Infrastructure teams at... Read More →

Todd Underwood

Todd Underwood recently led reliability for the Research Platform at Open AI. Previously he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering and was the... Read More →
Thursday October 31, 2024 09:00 - 10:30 GMT
Liffey Hall 1

09:00 GMT

Workshop: Guided Journey into the Heart of Systemd
Thursday October 31, 2024 09:00 - 12:30 GMT
Alvaro Leiva Geisse and Anita Zhang, Meta
IMPORTANT: If you are attending this workshop, please work through the Getting Started section in order to download the image and set up your environment.
systemd (with a lowercase s and d) remains to this day both one of the most critical pieces of a system and one of the least understood. This workshop is designed to touch upon the beginner features of systemd and explain how you can use systemd to solve common problems, including some that you didn't even know you had. What problems, you ask? You’ll have to come and see.
https://www.usenix.org/conference/srecon24emea/presentation/geisse
Speakers

Alvaro Leiva Geisse

Meta
I love Python, I grew up in a small town in Chile and one weekend, over 16 years ago, I had the flu and could not go out. I decided to learn how to code in Python and that was the beginning of the road that would move us all to Northern California so that I could join the Production... Read More →

Anita Zhang

Meta
Anita Zhang is the software engineering manager of Meta's Linux Umbrella family of teams. Her teams connect Meta's low-level infrastructure with the open source community. She is known for being a part of the systemd community and continues to support systemd at Meta as part of their... Read More →
Thursday October 31, 2024 09:00 - 12:30 GMT
Liffey Hall 2

09:50 GMT

An Exploration in Storing Telemetry in Cloud Object Storage
Thursday October 31, 2024 09:50 - 10:30 GMT
Mike Heffner and Ray Jenkins, Streamfold


Modern web application architectures require extensive telemetry data to function efficiently at scale. Traditional methods for collecting, storing, and processing this data have become increasingly expensive and challenging to maintain. Conversely, the prevalence of cloud object storage has given rise to the data lake. This has led some organizations to explore telemetry data lakes, which enable cost-efficient storage of large volumes of telemetry data.

We will explore various data storage formats used in constructing telemetry data lakes and discuss the tradeoffs associated with each approach. We will delve into common formats such as JSON, Parquet, ORC, and Apache Iceberg, examining how they can be utilized to store telemetry data like logs, metrics, and traces at scale. These formats will be empirically evaluated using real-world datasets. Additionally, we will review recent literature that highlights areas for design improvements in storage formats to better align them with modern computing hardware.


https://www.usenix.org/conference/srecon24emea/presentation/heffner
Speakers

Mike Heffner

Streamfold
Mike Heffner is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to Streamfold, Mike was a backend engineer at Netlify helping scale their delivery network, and at Librato building one of the first monitoring SaaS products... Read More →

Ray Jenkins

Streamfold
Ray Jenkins is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to founding Streamfold, he led software engineering efforts at Snowflake, on the observability and performance of FoundationDB and at Segment on development of... Read More →
Thursday October 31, 2024 09:50 - 10:30 GMT
The Liffey A

09:50 GMT

Granular CPU Capacity Management at Scale with eBPF
Thursday October 31, 2024 09:50 - 10:30 GMT
George Brighton and Cameron Howes, Goldman Sachs


Real-time market data is exceptionally bursty, with update rates in the busiest seconds of the day regularly exceeding 10x the average. User experience is predicated on maintaining sufficient CPU headroom to prevent full buffers and the resulting client disconnects. Sampling cumulative CPU time at a typical scrape interval hides microbursts, and sub-second polling from user space induces unacceptable overhead, so a different approach is needed.

This talk will cover how Market Data SRE at Goldman Sachs uplifted CPU monitoring of our market data distribution infrastructure in an unintrusive way, achieving 10x the granularity with 5% of the original monitoring overhead. We will cover the journey from deciding to use eBPF, through trials using bpftrace and making the leap to BPF C, to collecting and aggregating the metrics effectively. It will be most relevant to those interested in capacity management across a heterogeneous estate, and those looking to implement eBPF for the first time in their organisations.
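As a rough illustration of why sampling cumulative CPU time at a scrape interval can hide microbursts, here is a minimal Python sketch. The per-second figures and the 15-second window are invented for illustration and are not taken from the talk.

```python
# Hypothetical per-second CPU utilisation (fraction of one core) over a
# 15-second scrape window. The numbers are invented for illustration.
per_second_busy = [0.10] * 14 + [1.00]  # one fully saturated second


def cumulative(samples):
    """Return the running total of CPU-seconds, like a cumulative CPU-time counter."""
    total = 0.0
    out = []
    for sample in samples:
        total += sample
        out.append(total)
    return out


counter = cumulative(per_second_busy)

# What a scraper sees: only the counter delta across the whole interval,
# which works out to an average utilisation for the window.
scrape_interval = len(per_second_busy)
avg_utilisation = counter[-1] / scrape_interval

# What per-second granularity would reveal: the saturated second.
peak_utilisation = max(per_second_busy)

print(f"average over window: {avg_utilisation:.2f}")
print(f"peak within window:  {peak_utilisation:.2f}")
```

In this made-up window the average looks comfortable even though one second had no headroom at all, which is the effect the abstract describes.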


https://www.usenix.org/conference/srecon24emea/presentation/brighton
Speakers

George Brighton

Goldman Sachs
George Brighton is a Vice President at Goldman Sachs, where he leads the Market Data SRE team. A Prometheus and OTel committer, he is responsible for uplifting observability and operational practices. George presented "Market Data: Applying SRE Techniques to Legacy Designs" at SREcon22... Read More →

Cameron Howes

Goldman Sachs
Cameron Howes is an Analyst in the Market Data SRE team at Goldman Sachs, specialising in low-level development and performance instrumentation. When he's not ferociously avoiding a memory allocation, or reading about the latest CVEs, Cameron can be found writing black-box probers... Read More →
Thursday October 31, 2024 09:50 - 10:30 GMT
The Liffey B

10:30 GMT

Coffee and Tea Break
Thursday October 31, 2024 10:30 - 11:00 GMT
The Forum

11:00 GMT

Embrace Fleet Reboots and Make Them Boring
Thursday October 31, 2024 11:00 - 11:40 GMT
Everton Didone Foscarini, Cloudflare


Server reboots bring up mixed sentiments. Some want to say “My kernel is stable, it does not crash with a thousand days uptime”, others understand that you are running a system with a thousand days of accumulated vulnerabilities.

At Cloudflare we believe that high uptimes are bad. While our reboot automation was being developed, we were hit by a kernel+BIOS bug that caused a high rate of node crashes and encouraged the quick adoption of reboot automation. This prompted us to implement better tooling to deploy fleet changes over reboots, creating multiple reboot queues for different workloads, load-based maintenance windows and more.

We achieved monthly reboots for our edge fleet while keeping the clusters online and serving customer-facing traffic, unlocking our ability to iterate fast on Linux Kernel versions and OS releases, ensuring we are not running outdated library versions in hosts not rebooted for a thousand days.


https://www.usenix.org/conference/srecon24emea/presentation/foscarini
Speakers

Everton Didone Foscarini

Cloudflare
Working on Internet-based services using Linux since 2003, joined Cloudflare in 2017 and helped to scale Edge location operations from 102 to 320 cities, creating tooling to manage services lifecycle and server reboots.
Thursday October 31, 2024 11:00 - 11:40 GMT
The Liffey A

11:00 GMT

Riot Games: Evolution of Observability at the Gaming Company
Thursday October 31, 2024 11:00 - 11:40 GMT
Erick Moreira and Kirill Mikhailov, Riot Games


The video game industry is growing year by year, and the market for video games is projected to double in the coming 10 years. The number of people playing video games will also grow substantially. All of this creates many challenges for tech teams, who must make sure that games are not only fun to play but also offer stable, accessible gameplay. This is even more important for online competitive games, as they demand increased stability and performance.

Our presentation is focused on a review of the Riot Games journey through observability and specifically on the latest iteration of global-scale changes we made to introduce SRE and the new observability pipeline in the company.


https://www.usenix.org/conference/srecon24emea/presentation/moreira
Speakers

Erick Moreira

Riot Games
I am Erick Moreira, a 32-year-old Brazilian from Rio, working and living for 5 years in Dublin. I grew up modding and creating simple things for games. Now, I am focused on the backend, cross-cutting concerns, and the developer experience. I still find space in my heart to build front-end... Read More →

Kirill Mikhailov

Riot Games
I started my journey as an engineer while in school, building servers for online games. I then switched to traditional software engineering, working for large tech companies. But at the end of the day, I still landed in the gaming industry, where I have worked for the LiveOps organisation... Read More →
Thursday October 31, 2024 11:00 - 11:40 GMT
The Liffey B

11:00 GMT

Discussion: Building New SRE Teams
Thursday October 31, 2024 11:00 - 12:30 GMT
Avleen Vig and Stephane Dudzinski


This session is an opportunity for people to come together and discuss building new SRE teams, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in building SRE teams from scratch.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-building-new-sre-teams
Speakers

Avleen Vig

Twilio
Avleen is one of Twilio’s Architects for SRE. Over his luminous 20+ year career he has shone a light on the importance of making reliability a core part of the work done by all software engineering teams. When he isn’t working on improving systems designs and reviewing code, you... Read More →

Stephane Dudzinski

Reddit
Stephane Dudzinski is a seasoned veteran with over 20 years of experience in the tech industry, specializing in observability, SRE, and systems. With a decade of leadership experience, he has managed and mentored high-performing teams, improving system reliability. Stephane currently... Read More →
Thursday October 31, 2024 11:00 - 12:30 GMT
Liffey Hall 1

11:45 GMT

A Brief History of Release Engineering
Thursday October 31, 2024 11:45 - 12:05 GMT
Dinah McNutt, MongoDB


TL;DR This talk is a humorous (hopefully) retrospective on release engineering. How did we get from building binaries using a command line to all the fancy CI/CD systems we have today?

Things we used to do seem ridiculous today. Can looking back help us move forward? What’s the evolution and career path of a release engineer? Has the role become diluted through overuse and misuse?

Please join in the fun and include your anecdotes and experiences in the Slack channel.


https://www.usenix.org/conference/srecon24emea/presentation/mcnutt
Speakers

Dinah McNutt

MongoDB
Dinah McNutt is a TPM for MongoDB and based in Dublin, Ireland. She has over 35 years of experience in systems administration, release engineering and software development. She has written for various publications over the years including the Daemons and Dragons column for UNIX Review... Read More →
Thursday October 31, 2024 11:45 - 12:05 GMT
The Liffey A

11:45 GMT

A Powerful Logs Management Solution We All Have and Use but We Underestimate: systemd-journal
Thursday October 31, 2024 11:45 - 12:05 GMT
Costa Tsaousis, Netdata


This talk aims to unearth the potent features of systemd-journal that have remained mostly underutilized and largely underappreciated within the SRE community. The focus will be on its ability to handle dynamically structured log entries, its inherent support for centralized logging, and its robust security features including log sealing.


Systemd-journal offers dynamic field management, allowing flexible log annotation and querying without predefined schemas, along with decentralized log management that enables seamless analysis across systems. Its sealing feature ensures log integrity, critical for incident response and forensics. There’s a tooling gap for converting plain logs into structured entries, but we will show examples of how this can be achieved.
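To make the idea of converting plain logs into structured entries concrete, here is a minimal Python sketch. It is not the speaker's tooling: the input line format, the regular expression, and the function name `to_journal_fields` are assumptions made for illustration. The output keys (MESSAGE, PRIORITY, SYSLOG_IDENTIFIER, SYSLOG_PID) are standard journal field names, and the priority numbers follow syslog severities.

```python
import re

# Syslog severity numbers, as used by the journal's PRIORITY field.
PRIORITIES = {"error": 3, "warning": 4, "info": 6, "debug": 7}

# Hypothetical plain-text format: "<program>[<pid>]: <level>: <message>"
LINE_RE = re.compile(
    r"^(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)


def to_journal_fields(line: str) -> dict:
    """Turn one plain log line into a dict of journal-style fields."""
    match = LINE_RE.match(line)
    if match is None:
        # Unrecognised lines become an entry carrying only the message text.
        return {"MESSAGE": line}
    return {
        "MESSAGE": match["msg"],
        # Unknown levels fall back to "info" (6) in this sketch.
        "PRIORITY": str(PRIORITIES.get(match["level"].lower(), 6)),
        "SYSLOG_IDENTIFIER": match["prog"],
        "SYSLOG_PID": match["pid"],
    }


fields = to_journal_fields("nginx[1234]: error: upstream timed out")
for key, value in fields.items():
    print(f"{key}={value}")
```

Because the journal does not require a predefined schema, a converter like this could attach further fields of its own to each entry without any up-front declaration.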



https://www.usenix.org/conference/srecon24emea/presentation/tsaousis
Speakers

Costa Tsaousis

Netdata
Costa Tsaousis is the Founder and CEO of Netdata. Since 1995, Costa has been actively working on internet related startups. He has been a co-founder and C-level executive of many successful projects, including Internet Service Providers, Cloud Hosting Providers and Fintech startups... Read More →
Thursday October 31, 2024 11:45 - 12:05 GMT
The Liffey B

12:10 GMT

Red Tide Revert
Thursday October 31, 2024 12:10 - 12:30 GMT
David Newman, Automattic


This talk explores the challenges of managing unexpected production errors in high-frequency deployment environments and introduces an innovative AI-driven solution for rapid error detection and resolution. The speaker will discuss how their team developed and refined an automated system that analyzes error logs, identifies problematic code commits, and streamlines the incident response process. This approach aims to reduce on-call stress, minimize user impact, and pave the way for fully automated error mitigation in complex, fast-paced development ecosystems.


https://www.usenix.org/conference/srecon24emea/presentation/newman
Speakers

David Newman

Automattic
With a diverse background in platform engineering, distributed systems, and artificial intelligence, our speaker brings a trove of experience driving innovation from startup to enterprise environments. As a technical founder in companies ranging from retail intelligence to digital... Read More →
Thursday October 31, 2024 12:10 - 12:30 GMT
The Liffey A

12:10 GMT

Blast Radius Reduction for Large-Scale Distributed Systems
Thursday October 31, 2024 12:10 - 12:30 GMT
Linhua Tang, Huawei Ireland Research Centre


The construction of large-scale distributed systems poses significant challenges due to inherent complexities and the inevitability of failures across various levels, from hardware malfunctions to software bugs. Embracing the 'design for failure' philosophy, this paper delves into advanced isolation techniques aimed at reducing the blast radius—both spatially and temporally—thereby enhancing system resilience. Spatial containment strategies, such as cell-based architecture, compartmentalize failures to localized areas, preventing cascading effects. Temporal mitigation focuses on rapid recovery and self-healing mechanisms, which aim to restore system health promptly after a failure occurs. Furthermore, the paper explores the application of formal methods in verifying the robustness of these designs, providing a rigorous approach to ensure the reliability and effectiveness of implemented solutions. This research underscores the importance of proactive architectural planning and continuous verification in maintaining the stability of complex distributed systems.
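As a small illustration of spatial containment, the following Python sketch assigns customers to cells and measures how many are affected when one cell fails. The cell count, the customer IDs, and the hash-based placement are assumptions chosen for illustration; they are not described in the abstract.

```python
import hashlib

NUM_CELLS = 8  # assumed number of independent cells


def cell_for(customer_id: str, num_cells: int = NUM_CELLS) -> int:
    """Deterministically map a customer to one cell using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


# Hypothetical customer population.
customers = [f"customer-{i}" for i in range(1000)]
placement = {customer: cell_for(customer) for customer in customers}

# Simulate the failure of a single cell and measure the blast radius.
failed_cell = 3
affected = [c for c, cell in placement.items() if cell == failed_cell]
blast_radius = len(affected) / len(customers)

print(f"customers affected by losing cell {failed_cell}: {blast_radius:.1%}")
```

With placement spread roughly evenly, the failure of one cell touches only about one in NUM_CELLS customers rather than the whole population.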


https://www.usenix.org/conference/srecon24emea/presentation/tang
Speakers

Linhua Tang

Huawei Ireland Research Centre
Linhua Tang (also known as James) is a software engineer and tech lead for global server load balancing and formal methods at Huawei Ireland Research Center. Before that, he worked at Microsoft and Amazon in different distributed systems.
Thursday October 31, 2024 12:10 - 12:30 GMT
The Liffey B

12:30 GMT

Luncheon
Thursday October 31, 2024 12:30 - 14:00 GMT
The Forum

14:00 GMT

AppStack: An Open Source Cloud Native Platform for Running Digital Public Services
Thursday October 31, 2024 14:00 - 14:40 GMT
Dimitris Mitropoulos, National Infrastructures for Research and Technology – GRNET and University of Athens; Alex Kiousis, National Infrastructures for Research and Technology – GRNET


GRNET is Greece's National Infrastructures for Research and Technology (NREN) organisation, which acts as a network and services provider for research and education communities. Since 2019, GRNET has been responsible for the development, operation and maintenance of several governmental services, thus playing an important role in Greece's digital transformation. To address the different challenges related to this role, GRNET teams developed AppStack, a cloud-native platform, based on production-ready open source software, for running government-related services such as the gov.gr portal, the electronic issuance of documents signed by the Greek state, and gov wallet, among others.

AppStack provides an environment for integrating open-source and in-house software components, where DevOps can incorporate suitable tools to tackle scalability and security issues.

Currently, AppStack hosts workloads that serve more than 8 million Greek citizens, are able to handle more than 20K requests per second, and can generate hundreds of digital documents signed by the Greek state per second.

In this talk we will present AppStack, its numerous components, and how open source made it possible. Finally, we will describe some key experiences from production.


https://www.usenix.org/conference/srecon24emea/presentation/mitropoulos
Speakers

Alex Kiousis

National Infrastructures for Research and Technology – GRNET
Alex Kiousis is a Site Reliability Engineer in GRNET in Greece. His team handles GRNET's on-premise infrastructure and services, delivering GRNET's custom Cloud service to Greece's Research and Academic communities and several user-facing Government-related Digital Transformation... Read More →

Dimitris Mitropoulos

National Infrastructures for Research and Technology – GRNET and University of Athens
Dimitris Mitropoulos is an Assistant Professor at the National and Kapodistrian University of Athens and the Head of Reliability Engineering at the Greek National Infrastructures for Research and Technology (GRNET). Previously, he has been a postdoctoral researcher at the Computer... Read More →
Thursday October 31, 2024 14:00 - 14:40 GMT
The Liffey A

14:00 GMT

Get Your Non-SREs Oncall Ready!
Thursday October 31, 2024 14:00 - 14:40 GMT
JC van Winkel and Brad Lipinski, Google


Hands-on learning is best for adults, and we've used this principle in Google SRE since 2017. However, many oncall engineers aren't SREs and haven't gone through a full week-long SRE onboarding program. How can they learn the same skills and go oncall with confidence, but without the week-long curriculum?

We cherry-picked from our SRE onboarding program to create a succinct, scalable program for this audience that includes the best of orientation: the breakage exercises. This program is called "Oncall Ready!" and is completely self-service, requiring no operational work from the SRE EDU team. In this talk we will discuss the development, the behind-the-scenes work, and the outcomes of this project. Best comment we got from a participant: "Oh wow, this is like going through a [production] escape room without having to pay for it".


https://www.usenix.org/conference/srecon24emea/presentation/van-winkel
Speakers

JC van Winkel

Google
JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the... Read More →

Brad Lipinski

Google
Brad joined Google SRE in 2013 and worked on datacenter software. He's taught for SRE EDU from the beginning and contributed to many of the team's automation efforts. In 2019, he joined SRE EDU full time and is now the team's tech lead.
Thursday October 31, 2024 14:00 - 14:40 GMT
The Liffey B

14:00 GMT

Discussion: Learning from Incidents
Thursday October 31, 2024 14:00 - 15:30 GMT
Laura de Vesine, Datadog, Inc., and Cail Young, Octopus Deploy


This session is an opportunity for people to come together and discuss getting the most out of your incident review process, facilitated by our knowledgeable guides. This is not a prepared talk or workshop—expect a less-formal session with plenty of opportunity to ask questions and to talk to other attendees who are interested in learning from incidents.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-learning-from-incidents
Speakers

Laura de Vesine

Datadog, Inc.
Laura de Vesine is a 20+ year software industry veteran. She has spent the last 8 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also... Read More →

Cail Young

Octopus Deploy
Cail has spent the last couple of decades working at the intersection of people and technology: in the performing arts, in the motion picture industry, and now in the field of software operations. He is fascinated by learning from incidents - large and small - and will gladly trade... Read More →
Thursday October 31, 2024 14:00 - 15:30 GMT
Liffey Hall 1

14:00 GMT

Discussion: System Performance and Scaling
Thursday October 31, 2024 14:00 - 15:30 GMT
Leila Vayghan, Shopify, and Abbas Soltanian, OpsGuru


Join us for an interactive Q&A session on System Performance and Scaling, where our expert panel, featuring a senior infrastructure engineer and a senior cloud solutions architect, will address your most pressing questions. This session is designed to provide practical insights and real-world solutions to help you optimize your systems for performance and scalability. Whether you're dealing with cloud architecture challenges, Kubernetes orchestration, or scaling complex infrastructures, bring your questions and engage with industry experts to enhance your understanding and capabilities.


https://www.usenix.org/conference/srecon24emea/presentation/discussion-system-performance-scaling
Speakers

Leila Vayghan

Shopify
Leila is an engineer at Shopify, where she spends her days enabling millions of merchants to grow by making sure buyers are able to search and find their products. She does this by running a large-scale search infrastructure on Kubernetes in many regions of the world. Leila has completed... Read More →

Abbas Soltanian

OpsGuru
Dr. Abbas Soltanian, a Senior Cloud Solutions Architect at OpsGuru (Canada), holds a Ph.D. in Cloud Computing and has presented his work at numerous conferences. With over thirteen years of experience in both academia and industry, he helps companies migrate to the cloud and modernize... Read More →
Thursday October 31, 2024 14:00 - 15:30 GMT
Liffey Hall 1

14:50 GMT

Science Reliability Engineering for High Performance Computing
Thursday October 31, 2024 14:50 - 15:30 GMT
Nicholas Jones, LANL


High Performance Computing (HPC) as an industry has long relied on very human-facing operational workflows. These workflows exist because HPC systems are generally purpose-built machines for small sets of code bases with very specific performance metrics. This purpose-built nature has resulted in HPC having very bespoke, one-off systems, with processes and infrastructure that serve a small set of code bases well but aren't resilient to generational churn. To combat the difficulty of generational churn, we've adopted an SRE mindset for our new administrative stack, OpenCHAMI. This lets us keep our figures of merit (exact reproducibility, parallel bandwidth, and compute time to solution) aligned with what benefits our customer base the most.


https://www.usenix.org/conference/srecon24emea/presentation/jones
Speakers

Nicholas Jones

LANL
Nick is a scientist at Los Alamos National Lab, where he works on system security architecture, CI/CD infrastructure, and shared computing environments and strategies across the National Nuclear Security Administration Laboratories.
Thursday October 31, 2024 14:50 - 15:30 GMT
The Liffey A

14:50 GMT

Transforming Production Readiness
Thursday October 31, 2024 14:50 - 15:30 GMT
Panagiotis Moustafellos, Elastic


Join Panagiotis Moustafellos, Distinguished Engineer at Elastic, as he shares Elastic's transformative journey of integrating development teams into on-call rotations.
This talk highlights the creation of an SLO observability product capable of monitoring hundreds of thousands of SLIs globally, amidst a significant infrastructure and software platform re-architecture.
Learn about the phased rollout of Elastic's new serverless offering and the delicate processes involved in getting all software engineers on call. Discover best practices in production readiness, incident management, self-service observability, and software release tools that empower teams to own their services. Gain valuable insights and actionable strategies to enhance production readiness and service reliability in your organization.


https://www.usenix.org/conference/srecon24emea/presentation/moustafellos
Speakers

Panagiotis Moustafellos

Elastic
Panagiotis Moustafellos is a Distinguished Engineer at Elastic, the Search AI company. He brings over 15 years of experience in diverse tech environments and specializes in systems architecture, observability, and security, with a focus on scaling software systems, infrastructure... Read More →
Thursday October 31, 2024 14:50 - 15:30 GMT
The Liffey B

15:30 GMT

Coffee and Tea Break
Thursday October 31, 2024 15:30 - 16:00 GMT
The Forum

16:00 GMT

Energy Consumption of Datacenters
Thursday October 31, 2024 16:00 - 16:45 GMT
Thomas Fricke


Let us have a look at the resource consumption of data centers and collect the current state of knowledge. There will be more questions than answers, but predictions can be made because all resources have their limits.

The increase has already been exponential for years. With the AI hype, the demand for energy, cooling, water and other resources has increased dramatically.

The existing GPU-based computing paradigm cuts hard into the standard design of data centers and demands other ways of cooling.


https://www.usenix.org/conference/srecon24emea/presentation/fricke
Speakers

Thomas Fricke

Thomas's main focus is cloud and Kubernetes security. He plans private clouds and delivers applications in highly critical infrastructure. His customers deliver services for transmission grids, healthcare, traffic and the German administration. He is cofounder of two companies... Read More →
Thursday October 31, 2024 16:00 - 16:45 GMT
The Liffey

16:45 GMT

Are We Really Engineers?
Thursday October 31, 2024 16:45 - 17:30 GMT
Hillel Wayne


What makes software engineering different from “traditional” engineering? To find out, I interviewed 17 “crossovers”: people who have worked professionally as both a software and a traditional engineer. In aggregate, we learn three things: we are in fact engineers, we’re not actually that different as a field, and there’s a lot we can both teach and learn.


https://www.usenix.org/conference/srecon24emea/presentation/wayne
Speakers

Hillel Wayne

Hillel is a formal methods consultant and the author of Logic for Programmers and Practical TLA+. His other work includes Computer Things, a weekly newsletter on the history and theory of software engineering, and Let's Prove Leftpad. In his free time, he juggles and makes chocolate... Read More →
Thursday October 31, 2024 16:45 - 17:30 GMT
The Liffey

17:30 GMT

Closing Remarks
Thursday October 31, 2024 17:30 - 17:40 GMT
The Liffey
 