Track 2
Tuesday, October 29
 

11:00 GMT

I Can OIDC You Clearly Now: How We Made Static Credentials a Thing of the Past
Tuesday October 29, 2024 11:00 - 11:40 GMT
Iain Lane and Dimitris Sotirakis, Grafana Labs


At Grafana Labs, we tackled a thorny problem: managing secrets in an open-source CI/CD pipeline. Our journey from static secrets to OIDC-based access wasn't just about better security—it was about empowering our engineers. We'll walk you through how we leveraged OIDC and GitHub Actions to create a "secretless" system for accessing cloud resources, complete with shared jobs and abstractions that make secure access simple. But it wasn't all smooth sailing. We'll share war stories, including a security hiccup that taught us valuable lessons. If you're drowning in a sea of secrets or just want to sleep better at night, come and learn how we boosted security while cutting operational headaches. You'll walk away with practical strategies for implementing OIDC-based access that'll make your engineers happy and your security team even happier.


https://www.usenix.org/conference/srecon24emea/presentation/lane
Speakers

Iain Lane

Grafana Labs
Iain is a senior software engineer at Grafana Labs. A member of the Platform team, he focuses on maintaining the infrastructure (Kubernetes clusters) that runs Grafana Cloud, and on helping build tools and processes for engineers to deploy their software into this environment with...

Dimitris Sotirakis

Grafana Labs
Dimitris is a Senior Software Engineer with a background in Backend, DevOps, Release and Platform Engineering. Specialized in CI/CD architecture, he has spent most of his career tackling the challenges of delivering high-quality software, tools and frameworks. Currently he’s a member...
The Liffey B

11:50 GMT

OMG WTF SSO: A Beginner’s Guide to Single Sign-On (Mis)configuration
Tuesday October 29, 2024 11:50 - 12:30 GMT
Adina Bogert-O'Brien


SSO protocols are just ways for an identity provider to share information about an authenticated identity with another service. Me having a way to tell my vendor “yeah, that’s Bob” doesn’t tell me what the vendor does with this information, or if the vendor always asks me who’s coming in the door. A bad SSO implementation can make you think you’re safer, while hiding all the new and fun things that have gone wrong.
To get the most out of implementing SSO, I need to know what I’m trying to accomplish and what steps I need to follow to get there. To illustrate why SSO needs to be set up carefully, for each of the things you need to do right, I’ll give you some fun examples of creative ways you and your vendor can do this wrong. We all learn from failure, right???


https://www.usenix.org/conference/srecon24emea/presentation/bogert-obrien
Speakers

Adina Bogert-O'Brien

I am incessantly curious, work in renewable energy, and sometimes find vulnerabilities when I’m bored. I co-founded a hackerspace over a decade ago but have only just accepted that security is more than a hobby. At work, I’m a business architect with security leanings working...
The Liffey B

14:00 GMT

Achieving Excellence: SLO Thresholds That Transform Service Quality
Tuesday October 29, 2024 14:00 - 14:40 GMT
Thiara Ortiz, Netflix


At Netflix, ensuring exceptional quality for our streaming platform is crucial. Every time a Netflix member sits down, reclines in their chair, and turns on their TV, it's a moment of truth. It's our opportunity to deliver a spectacular service with amazing quality of experience. Misses, errors, or high latency—whether due to ISP configuration changes, code deployment, or catastrophic fallback—impact how our service is perceived.

In this talk, I'll share methods for defining thresholds for SLOs, ranging from intuition and industry best practices to advanced techniques like A/B experimentation. At Netflix, properly defining SLOs allows us to ensure industry-leading quality of experience for our members.


https://www.usenix.org/conference/srecon24emea/presentation/ortiz
Speakers

Thiara Ortiz

Netflix
Thiara is a Staff CDN Reliability Engineer at Netflix. Over the last four years, Thiara has been working on Open Connect, improving the resilience of the Netflix service for members around the world. Most recently, Thiara has been heavily involved with the introduction of Cloud Gaming...
The Liffey B

14:45 GMT

Selective Reliability Engineering: There Is No Single Source of Truth
Tuesday October 29, 2024 14:45 - 15:05 GMT
Elise Burke, Datadog, Inc.


As engineers we design distributed architectures, define project scopes, and ensure that we have a single "source of truth". But what, exactly, do we mean by the phrase? Do we really have only one source of truth - and for that matter, how do we decide what it is?

We'll look at some well-known ambiguities in system design and data modeling and then consider more philosophical questions about truth, the sources of truth we accept, and why this ambiguity matters.


https://www.usenix.org/conference/srecon24emea/presentation/burke
Speakers

Elise Burke

Datadog, Inc.
Elise's sixteen-year career as a software and site reliability engineer includes supporting Google's internal distributed storage systems and Datadog's organization-wide production practices. Her interests include exploring the interconnectedness of both technology and the people...
The Liffey B

15:10 GMT

Why You’re (Probably) Doing Service Catalogs Wrong
Tuesday October 29, 2024 15:10 - 15:30 GMT
Lisa Karlin Curtis, incident.io
Service catalogs promise a lot of things: powerful automations, insights into your technology estate.
But over the last few years, many of us have learned that setting up and maintaining a service catalog is really hard.
Building out a catalog from a standing start can take months, or even years. Too many people get stuck in a chicken-and-egg situation, where you can’t deliver value because you don’t have the data in your catalog, and you can’t convince anyone to spend time helping you because the catalog doesn’t do anything yet.
But there is another way...
https://www.usenix.org/conference/srecon24emea/presentation/curtis
Speakers

Lisa Karlin Curtis

incident.io
Lisa started out as a consultant working with HMRC and then smart meters, before accidentally becoming a developer. She was a founding engineer at incident.io, building tooling to help your whole organization manage incidents better. She loves building stuff, but is also really interested...
The Liffey B

16:00 GMT

SRE Stakeholders: A Spotter’s Guide
Tuesday October 29, 2024 16:00 - 16:40 GMT
Dave O'Connor


For every SRE or SRE-adjacent team in any organisation, there are many kinds of stakeholders: people who care (or don't care!) about how your team operates, and about the outcomes of that work. They differ massively in how they view your team, and in how they, in turn, should be viewed and managed.

In a timeline that doesn't contain a canonical book setting out what SRE is here for and how it achieves that, the sad and annoying answer is that "it depends". Because of this, we need to get good (or remain good) at stakeholder management and communications about why we're here, and what we do.

While this is primarily useful to SRE leadership, knowing the kinds of stakeholders you run into can be useful for any SRE. Learn to spot the different stakeholders in your life, what they (generally) care about, and how you can help reduce misunderstandings and tension, no matter where you're sitting.


https://www.usenix.org/conference/srecon24emea/presentation/oconnor
Speakers

Dave O'Connor

Dave is an SRE Leadership practitioner, Advisor and Coach based in Dublin. He's been working in SRE and SRE-adjacent organisations for over 20 years, primarily as an SRE Lead at Google from 2004-2021. Since then, he has spent time leading SRE, Security and Infrastructure teams at...
The Liffey B

16:50 GMT

Panel Discussion: Is Reliability a Luxury Good?
Tuesday October 29, 2024 16:50 - 17:30 GMT
Moderator: Emil Stolarsky
Panelists: Niall Murphy, Stanza
https://www.usenix.org/conference/srecon24emea/presentation/stolarsky
Moderators

Emil Stolarsky

Increase
Emil is an engineer at Increase where he works on building modern banking infrastructure. Before that, he was at companies such as Wave Mobile Money, DigitalOcean, and Shopify, working on everything from building data centres in Sub-Saharan Africa to caching & performance optimizations...
Speakers

Niall Murphy

Stanza
Niall is the CEO of Stanza Systems, has occupied various engineering and leadership roles in Microsoft, Google, and Amazon, and is the instigator of the best-selling & prize-winning Site Reliability Engineering, which he hopes at some stage to live down. His most recent book is Reliable...
The Liffey B
 
Wednesday, October 30
 

11:00 GMT

Anomaly Detection in Time Series from Scratch Using Statistical Analysis
Wednesday October 30, 2024 11:00 - 11:40 GMT
Ivan Shubin


Implementing anomaly detection for time series can be challenging, with many techniques and tools available. But can you achieve effective results without AI or Machine Learning? In this talk, we will demonstrate how basic statistical methods can effectively detect anomalies in time series data. We'll show you how to use Grafana to visualize these anomalies on graphs and ensure past incidents do not impact future predictions. Additionally, we will explore building Grafana dashboards as code as part of the anomaly detection solution and adjusting the detection for various events.


https://www.usenix.org/conference/srecon24emea/presentation/shubin
Speakers

Ivan Shubin

Hi, my name is Ivan. I am a Senior Site Reliability Engineer at Booking.com. Before that I worked at TomTom and eBay. Throughout my career, I have explored various roles including Quality Assurance, Software Engineering, System Administration, and SRE. I have always been fascinated...
The Liffey B

11:50 GMT

Generative AI: Beyond (Just) Hype
Wednesday October 30, 2024 11:50 - 12:30 GMT
Todd Underwood


Generative AI is one of the most hyped technologies in most of our careers. While it is driving a complete transformation of priorities at some tech organizations, many engineers remain deeply skeptical about any practical uses of Generative AI.

The skepticism is warranted and the hype is (for now) exaggerated, but not completely without merit. These technologies are not entirely useless for the kind of work we do. In this talk I will highlight a few emerging use cases that sidestep some of the weaknesses of GenAI (hallucination, errors), and still manage to provide value, specifically for production engineering.


https://www.usenix.org/conference/srecon24emea/presentation/underwood
Speakers

Todd Underwood

Todd Underwood recently led reliability for the Research Platform at OpenAI. Previously he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering and was the...
The Liffey B

14:00 GMT

Noisy Neighbors, through Networking
Wednesday October 30, 2024 14:00 - 14:40 GMT
René Treffer and Ben Kochie, Reddit


When operating multi-tenant environments, such as Kubernetes clusters, you can have "noisy neighbors". Resources such as CPU and network are subject to contention, which can lead to service degradation. But the causes of contention are not always what you would think. In this talk we will look at some surprising instances of "noisy neighbors", how they unfolded, how we discovered them, and how we mitigated the effects.


https://www.usenix.org/conference/srecon24emea/presentation/treffer
Speakers
avatar for René Treffer

René Treffer

Reddit
René Treffer is an infrastructure software engineer at Reddit.

Ben Kochie

Reddit
Ben Kochie is a principal software engineer at Reddit.
The Liffey B

14:45 GMT

Taming Noisy Benchmark Results Using Change Point Detection
Wednesday October 30, 2024 14:45 - 15:05 GMT
Matt Fleming, Cloudflare


Modern systems are inherently nondeterministic, which leads to noisy benchmark results. Change Point Detection (CPD) has emerged as a helpful technique for detecting significant changes in performance results even when those results are noisy and unstable. This talk will explain how Change Point Detection works and the open source projects that developers can use to apply CPD to noisy benchmark results.


https://www.usenix.org/conference/srecon24emea/presentation/fleming
Speakers

Matt Fleming

Cloudflare
Matt is co-founder of Nyrkiö and a Systems Engineer at Cloudflare. He has spent over 15 years working on low-level, high-performance systems and was previously the maintainer for the Linux kernel EFI subsystem. He has co-authored papers on performance change detection and distributed...
The Liffey B

15:10 GMT

Enabling Product Scalability through Load Testing
Wednesday October 30, 2024 15:10 - 15:30 GMT
Monica Baluna and Ehab Tawfik, Bloomberg


One of Bloomberg's flagship products, Instant Bloomberg (IB), is used by financial professionals around the globe for instant messaging. This system is powered by a multitude of microservices, databases and UIs that interact through synchronous or asynchronous API calls and queueing mechanisms.

We recently released Forums in IB. This new form of group chat introduced exciting features. As our clients need ever-larger group chats, we took the opportunity to ask how to make sure that both the new system and the existing one can scale up with the extra load without affecting existing user workflows.

This talk explores the different load testing strategies we adopted while enabling support for chats ten times larger than before and migrating existing group chats to Forums. We will focus on two elements: (i) creating a realistic representation of production traffic in a test environment, and (ii) efficiently gathering insightful metrics.


https://www.usenix.org/conference/srecon24emea/presentation/baluna
Speakers

Monica Baluna

Bloomberg
Monica Baluna is a software engineer at Bloomberg in London, where she has worked for the past six years. Her main interests include distributed systems, as well as building reliable software and robust APIs. She has had an opportunity to explore these interests, as her team manages...

Ehab Tawfik

Bloomberg
Ehab Tawfik is a software engineer who loves problem solving, technology, and business. He works in Core Products Engineering at Bloomberg in London. He is passionate about back-end systems and distributed computing. Ehab earned a bachelor's degree in computer science and engineering...
The Liffey B

16:00 GMT

How to Host a (Very) Popular Website for 30 Altairian Dollars a Day
Wednesday October 30, 2024 16:00 - 16:40 GMT
James Beal


For 15 years, the Archive of Our Own (AO3) has provided a safe haven for fanworks while refusing to implement paid accounts, sell user data, or restrict fans' creativity. We're completely donor-funded and volunteer-run and currently serve about 34 billion pages a year—using servers that we own in order to reduce the likelihood of deplatforming due to our commitment to creative freedom.


We know a thing or two about getting the most out of an Altairian dollar without compromising user privacy or free expression. Even if your project has different constraints, our approach might just help you stretch your project's budget.


https://www.usenix.org/conference/srecon24emea/presentation/beal
Speakers
avatar for James Beal

James Beal

James started playing with computers with the ZX81, learned C for his A Levels, and has degrees in computer science and parallel and distributed systems. He has used Linux since MCC Interim Linux, later moving on to other distributions. He started volunteering at the OTW...
The Liffey B

16:50 GMT

How Snowflake Migrated All Alerts and Dashboards to a Prometheus-Based Metrics System in 3 Months
Wednesday October 30, 2024 16:50 - 17:30 GMT
Carlos Mendizabal, Snowflake


This talk covers how Snowflake migrated its alerts and dashboards in 3 months, a migration that included rewriting all alerts and dashboards used for system monitoring. We'll go over the tooling that enabled us to complete this migration successfully, which included configuration-as-code through Jsonnet and a unit testing framework, and share some important takeaways from this effort.


https://www.usenix.org/conference/srecon24emea/presentation/mendizabal
Speakers
avatar for Carlos Mendizabal

Carlos Mendizabal

Snowflake
Carlos Mendizabal is a software engineer at Snowflake. He is part of the Observability team and loves to build things (and to ensure they're well monitored!). He was previously at Meta, and he is also passionate about meeting folks across the industry and keeping up with the latest and greatest...
The Liffey B
 
Thursday, October 31
 

09:00 GMT

Opening the Box: Diagnosing Operating-System Task-Scheduler Behavior on Highly Multicore Machines
Thursday October 31, 2024 09:00 - 09:40 GMT
Julia Lawall, Inria-Paris


Getting unexpectedly poor performance from your multicore application? Maybe the operating system task scheduler is at fault. The task scheduler is responsible for placing tasks on cores and for selecting which task is allowed to run, at what time, and for how long. As such, the scheduler is a critical component of any operating system and has a major impact on application performance. Still, scheduling decisions are buried deep within the operating system code, which makes it challenging, when diagnosing performance problems (or even performance improvements), to determine whether the scheduler is responsible and, if so, in what way. These challenges are compounded for highly multithreaded applications running on large multicore machines, due to the huge amount of information available.

In this talk, we present some tools that we have developed for visualizing the behavior of the Linux kernel task scheduler, and illustrate how these tools can be used to help diagnose performance problems. The tools presented are freely available at https://gitlab.inria.fr/schedgraph/schedgraph


https://www.usenix.org/conference/srecon24emea/presentation/lawall
Speakers

Julia Lawall

Inria-Paris
Julia Lawall is a senior researcher at Inria Paris. Prior to joining Inria, she completed a PhD at Indiana University and was on the faculty at the University of Copenhagen. Her work focuses on issues around the correctness and performance of operating systems. She develops and maintains...
The Liffey B

09:50 GMT

Granular CPU Capacity Management at Scale with eBPF
Thursday October 31, 2024 09:50 - 10:30 GMT
George Brighton and Cameron Howes, Goldman Sachs


Real-time market data is exceptionally bursty, with update rates in the busiest seconds of the day regularly exceeding 10x the average. User experience is predicated on maintaining sufficient CPU headroom to prevent full buffers and the resulting client disconnects. Sampling cumulative CPU time at a typical scrape interval hides microbursts, and sub-second polling from user space induces unacceptable overhead, so a different approach is needed.

This talk will cover how Market Data SRE at Goldman Sachs uplifted CPU monitoring of our market data distribution infrastructure in an unintrusive way, achieving 10x the granularity with 5% of the original monitoring overhead. We will cover the journey from deciding to use eBPF, through trials using bpftrace and making the leap to BPF C, to collecting and aggregating the metrics effectively. It will be most relevant to those interested in capacity management across a heterogeneous estate, and those looking to implement eBPF for the first time in their organisations.


https://www.usenix.org/conference/srecon24emea/presentation/brighton
Speakers

George Brighton

Goldman Sachs
George Brighton is a Vice President at Goldman Sachs, where he leads the Market Data SRE team. A Prometheus and OTel committer, he is responsible for uplifting observability and operational practices. George presented "Market Data: Applying SRE Techniques to Legacy Designs" at SREcon22...

Cameron Howes

Goldman Sachs
Cameron Howes is an Analyst in the Market Data SRE team at Goldman Sachs, specialising in low-level development and performance instrumentation. When he's not ferociously avoiding a memory allocation, or reading about the latest CVEs, Cameron can be found writing black-box probers...
The Liffey B

11:00 GMT

Riot Games: Evolution of Observability at the Gaming Company
Thursday October 31, 2024 11:00 - 11:40 GMT
Erick Moreira and Kirill Mikhailov, Riot Games


The video game industry is growing year by year, and the market size for video games is projected to double in the coming 10 years. The number of people playing video games will also grow substantially. All of this creates many challenges for tech teams, who must make sure that the games are not only fun to play but also offer stable, accessible gameplay. This is even more important for online competitive games, which demand greater stability and performance.

Our presentation reviews the Riot Games journey through observability, focusing on the latest iteration of global-scale changes we made to introduce SRE and the new observability pipeline in the company.


https://www.usenix.org/conference/srecon24emea/presentation/moreira
Speakers

Erick Moreira

Riot Games
I am Erick Moreira, a 32-year-old Brazilian from Rio who has been living and working in Dublin for 5 years. I grew up modding and creating simple things for games. Now, I am focused on the backend, cross-cutting concerns, and the developer experience. I still find space in my heart to build front-end...

Kirill Mikhailov

Riot Games
I started my journey as an engineer while in school, building servers for online games. I then switched to traditional software engineering, working for large tech companies. But at the end of the day, I still landed in the gaming industry, where I have worked for the LiveOps organisation...
The Liffey B

11:45 GMT

A Powerful Logs Management Solution We All Have and Use but We Underestimate: systemd-journal
Thursday October 31, 2024 11:45 - 12:05 GMT
Costa Tsaousis, Netdata


This talk aims to unearth the potent features of systemd-journal that have remained mostly underutilized and largely underappreciated within the SRE community. The focus will be on its ability to handle dynamically structured log entries, its inherent support for centralized logging, and its robust security features including log sealing.


systemd-journal offers dynamic field management, allowing flexible log annotation and querying without predefined schemas, along with decentralized log management that enables seamless analysis across systems. Its sealing feature ensures log integrity, which is critical for incident response and forensics. There is a tooling gap for converting plain logs into structured entries; however, we will show examples of how this can be achieved.



https://www.usenix.org/conference/srecon24emea/presentation/tsaousis
Speakers

Costa Tsaousis

Netdata
Costa Tsaousis is the Founder and CEO of Netdata. Since 1995, Costa has been actively working on internet-related startups. He has been a co-founder and C-level executive of many successful projects, including Internet Service Providers, Cloud Hosting Providers and Fintech startups...
The Liffey B

12:10 GMT

Blast Radius Reduction for Large-Scale Distributed Systems
Thursday October 31, 2024 12:10 - 12:30 GMT
Linhua Tang, Huawei Ireland Research Centre


The construction of large-scale distributed systems poses significant challenges due to inherent complexities and the inevitability of failures across various levels, from hardware malfunctions to software bugs. Embracing the 'design for failure' philosophy, this paper delves into advanced isolation techniques aimed at reducing the blast radius—both spatially and temporally—thereby enhancing system resilience. Spatial containment strategies, such as cell-based architecture, compartmentalize failures to localized areas, preventing cascading effects. Temporal mitigation focuses on rapid recovery and self-healing mechanisms, which aim to restore system health promptly after a failure occurs. Furthermore, the paper explores the application of formal methods in verifying the robustness of these designs, providing a rigorous approach to ensure the reliability and effectiveness of implemented solutions. This research underscores the importance of proactive architectural planning and continuous verification in maintaining the stability of complex distributed systems.


https://www.usenix.org/conference/srecon24emea/presentation/tang
Speakers

Linhua Tang

Huawei Ireland Research Centre
Linhua Tang (also known as James) is a software engineer and tech lead for global server load balancing and formal methods at Huawei Ireland Research Center. Before that, he worked on various distributed systems at Microsoft and Amazon.
The Liffey B

14:00 GMT

Get Your Non-SREs Oncall Ready!
Thursday October 31, 2024 14:00 - 14:40 GMT
JC van Winkel and Brad Lipinski, Google


Hands-on learning is best for adults, and we've used this principle in Google SRE since 2017. However, many oncall engineers aren't SREs and haven't gone through a full week-long SRE onboarding program. How can they learn the same skills and go oncall with confidence, but without the week-long curriculum?

We cherry-picked from our SRE onboarding program to create a succinct, scalable program for this audience that includes the best part of orientation: the breakage exercises. This program is called "Oncall Ready!" and is completely self-service, requiring no operational work from the SRE EDU team. In this talk we will discuss the development, the behind-the-scenes work, and the outcomes of this project. The best comment we got from a participant: "Oh wow, this is like going through a [production] escape room without having to pay for it".


https://www.usenix.org/conference/srecon24emea/presentation/van-winkel
Speakers

JC van Winkel

Google
JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the...

Brad Lipinski

Google
Brad joined Google SRE in 2013 and worked on datacenter software. He's taught for SRE EDU from the beginning and contributed to many of the team's automation efforts. In 2019, he joined SRE EDU full time and is now the team's tech lead.
The Liffey B

14:50 GMT

Transforming Production Readiness
Thursday October 31, 2024 14:50 - 15:30 GMT
Panagiotis Moustafellos, Elastic


Join Panagiotis Moustafellos, Distinguished Engineer at Elastic, as he shares Elastic's transformative journey of integrating development teams into on-call rotations.
This talk highlights the creation of an SLO observability product capable of monitoring hundreds of thousands of SLIs globally, amidst a significant infrastructure and software platform re-architecture.
Learn about the phased rollout of Elastic's new serverless offering and the delicate processes involved in getting all software engineers on call. Discover best practices in production readiness, incident management, self-service observability, and software release tools that empower teams to own their services. Gain valuable insights and actionable strategies to enhance production readiness and service reliability in your organization.


https://www.usenix.org/conference/srecon24emea/presentation/moustafellos
Speakers

Panagiotis Moustafellos

Elastic
Panagiotis Moustafellos is a Distinguished Engineer at Elastic, the Search AI company. He brings over 15 years of experience in diverse tech environments and specializes in systems architecture, observability, and security, with a focus on scaling software systems, infrastructure...
The Liffey B
 