Track 1
Tuesday, October 29
 

11:00 GMT

SRE Saga: The Song of Heroes and Villains
Tuesday October 29, 2024 11:00 - 11:40 GMT
Daria Barteneva, Microsoft Azure


SRE teams need a balance of technical and soft skills, creativity, and teamwork to be successful. By drawing parallels between the roles, challenges, and dynamics of a Dungeons and Dragons party and those of an SRE team, we will explore the SRE journey from team inception to an ideal makeup in terms of tenure/seniority and skill set, aligned with the context the SRE team is part of.

We will share practical examples that help SRE teams build resilience and collaborate effectively while dealing with challenges. We will also explore mechanisms that can channel "superhero" energy to make the team stronger and nurture talent, helping the team keep a balance of distributed knowledge and accountability.

In this talk we will discuss:


  • Examples of functional SRE team setups

  • Common challenges SRE teams may encounter

  • Developing early-career SREs

  • Dealing with change and building resilience

  • Identifying red flags and avoiding long-term problems



https://www.usenix.org/conference/srecon24emea/presentation/barteneva
Speakers

Daria Barteneva

Microsoft Azure
Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing...
The Liffey A

11:50 GMT

The Frontiers of Reliability Engineering
Tuesday October 29, 2024 11:50 - 12:30 GMT
Heinrich Hartmann, Zalando SE


We take the 10th anniversary of SREcon as an occasion to reflect on the past decade of advancements in Reliability Engineering and to provide an overview of the frontiers we face today. At Zalando we followed the industry's major trends: outsourcing hardware provisioning to AWS, packaging applications into Docker images, fully automating deployments (CI/CD), and implementing Distributed Tracing for microservice observability. Despite these advances, many challenges remain in building reliable, observable software systems, and new areas have arisen that require new methods and tools. In the talk we present a number of conceptual views that help map out the larger Reliability Engineering landscape, and we zone in on three specific frontiers that we are actively investing in at Zalando: (1) Data Operations and Monitoring Event-Based Systems, (2) Mobile Observability, and (3) Effective Management Practices for Reliability.


https://www.usenix.org/conference/srecon24emea/presentation/hartmann
Speakers

Heinrich Hartmann

Zalando SE
Heinrich Hartmann is a seasoned expert with a decade of experience in Reliability Engineering. Currently, he serves as the Senior Principal SRE at Zalando, a leading European e-commerce company, where he oversees company-wide reliability practices. Before joining Zalando, Heinrich...
The Liffey A

14:00 GMT

Sailing the Database Seas: Applying SRE Principles at Scale
Tuesday October 29, 2024 14:00 - 14:40 GMT
Ioannis Androulidakis and Martin Alderete, Booking.com


In this talk we will demonstrate how we apply core SRE principles in the field of Database Engineering. More specifically, we will talk about the challenges of operating large-scale database systems in multiple cloud environments and how adopting SRE best practices dramatically improved our daily workflows and operations.

We will share insights and concrete use cases around the following topics: Monitoring Distributed Systems, Eliminating Toil and Postmortem Culture.

This talk will equip attendees with ideas and guidelines for better understanding and efficiently operating their database systems, such as choosing the right SLIs and SLOs, automating capacity planning, and embracing a postmortem culture after outages.


https://www.usenix.org/conference/srecon24emea/presentation/androulidakis
Speakers

Ioannis Androulidakis

Booking.com
Ioannis Androulidakis is a Site Reliability Engineer with a strong background and multiple years of experience in Operating Systems, Observability Tools and Cloud Platforms. He is passionate about OSS technologies and has contributed to multiple open-source projects over the years. Ioannis...

Martin Alderete

Booking.com
Martin Alderete is a Principal Site Reliability Engineer with a long track record in engineering, distributed systems, and systems-level programming, both in academia, where he worked as a teaching assistant after earning his degree, and in industry, where he led different teams building...
The Liffey A

14:45 GMT

Survivor: MySQL Island – Outwit, Outplay, Outlast Metadata Locking Challenges
Tuesday October 29, 2024 14:45 - 15:05 GMT
Julia Jablonska, Capsule CRM


Think you understand MySQL metadata locks? Join this interactive session to test your knowledge and take a deep dive into the intricacies of MySQL's locking mechanisms.

We'll explore real-world scenarios, such as creating tables with foreign key constraints and adding indexes, to see how metadata locks can affect performance and stability. Through live voting, you'll gain insights into what's happening behind the scenes and learn practical tips for managing database migrations.


https://www.usenix.org/conference/srecon24emea/presentation/jablonska
Speakers

Julia Jablonska

Capsule CRM
As an Infrastructure Engineer at Capsule CRM, Julia is responsible for keeping Capsule secure, fast, and reliable for thousands of its business customers around the globe.
The Liffey A

15:10 GMT

Fixing Your Noisy Pager in 500 Easy Steps
Tuesday October 29, 2024 15:10 - 15:30 GMT
Chris Sinjakli, PlanetScale


You're not sure when it happened, but your pager suddenly seems noisy. You've started dreading your on-call shifts before they begin. You breathe a sigh of relief every time you sleep without interruption. Sound familiar?

Noisy on-call rotas sneak up on us one page at a time: an edge case in a new feature, an alert with too many false positives, processes that get stuck and need restarting. Each of these is easy to tolerate alone, but together they quickly add up, leaving you swamped in alert noise and tired from missed sleep.

In this talk we'll explore techniques for digging ourselves out of the hole. We'll look at how to demonstrate the scale of the issue to our colleagues, what to do when the list of problems seems insurmountable, and how to get started with automated remediation in a low-risk way. I promise it's less scary than it sounds.


https://www.usenix.org/conference/srecon24emea/presentation/sinjakli
Speakers

Chris Sinjakli

PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems. All his programs are made from organic, hand-picked, artisanal keypresses.
The Liffey A

16:00 GMT

Exploring the Unintended Consequences of Automation in Software
Tuesday October 29, 2024 16:00 - 16:40 GMT
Courtney Nash, The VOID


Automation is ubiquitous: it is entwined in our daily lives in ways that we aren’t always aware of. It has been woven into all aspects of modern software by being presented as a utopian vision: a way of making human lives easier, doing repetitive tasks faster and with fewer errors, and freeing us fallible humans up to do other, ostensibly more important work. But anyone who has worked directly with automated systems knows that we are still very far from such a dreamy reality.

This talk delves into detailed research about how automation is involved in software incidents. My focus on this area stems from the growing portrayal of automation as a panacea for various software incident issues, despite its limited effectiveness against the underlying challenges, such as reliably detecting and resolving software issues, or analyzing and disseminating learnings from these incidents back into the organization and its products and services.

Drawn directly from public incident reports (collected in the VOID), this research revealed multiple, often competing, roles that automation can play over the course of an incident and, most importantly, underscored how important humans are in understanding, troubleshooting, and recovering from automated software issues. If you're struggling to convey the reality behind the hype of automation and AI to others on your team or at your organization, this is the talk for you.


https://www.usenix.org/conference/srecon24emea/presentation/nash
Speakers

Courtney Nash

The VOID
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s...
The Liffey A

16:45 GMT

Rock around the Clock (Synchronization): Improve Performance with High Precision Time!
Tuesday October 29, 2024 16:45 - 17:05 GMT
Lerna Ekmekcioglu, Clockwork Systems


Is the app slow or the network lagging? When it comes to latency in distributed systems, it can be hard to identify where exactly the issue is. As businesses increasingly adopt diverse deployment environments (on-premises, cloud, or hybrid), the complexity grows, obscuring visibility into system health. Join me to hear why clock synchronization is key to identifying the true culprit when latency is due to contention in the network. I’ll demo how network contention affects tail latencies, followed by an overview of clock synchronization protocols to date, their pros and cons, best practices in disciplining clocks, and recent algorithms from Stanford Research. With high-precision clock synchronization at scale, we regain visibility into useful one-way delay metrics, which act as an early signal of network congestion and help us prevent impact on response times for our end users!
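To illustrate why one-way delay depends on synchronized clocks, here is a minimal toy sketch (not material from the talk). It assumes integer nanosecond timestamps, and `Probe` and `one_way_delay_ns` are hypothetical names. The send time is stamped by the sender's clock and the receive time by the receiver's, so any residual clock offset lands directly in the computed delay.

```python
"""Toy illustration: one-way delay is only as accurate as clock sync.

Assumptions (hypothetical, not from the talk): timestamps are integer
nanoseconds, and the receiver's offset from the sender is known exactly.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    send_ts_ns: int  # stamped with the sender's clock
    recv_ts_ns: int  # stamped with the receiver's clock


def one_way_delay_ns(probe: Probe, receiver_offset_ns: int = 0) -> int:
    """Return receive time minus send time after removing the receiver's offset.

    receiver_offset_ns: how far the receiver's clock runs ahead of the
    sender's (positive = ahead). Any error in this value shifts the result
    by exactly that error.
    """
    return (probe.recv_ts_ns - receiver_offset_ns) - probe.send_ts_ns


if __name__ == "__main__":
    true_delay_ns = 40_000   # 40 microseconds on the wire (illustrative)
    offset_ns = 250_000      # receiver clock runs 250 microseconds ahead
    send = 1_000_000_000
    probe = Probe(send_ts_ns=send, recv_ts_ns=send + true_delay_ns + offset_ns)

    print(one_way_delay_ns(probe))             # naive: offset dominates the result
    print(one_way_delay_ns(probe, offset_ns))  # corrected: the true wire delay
```

In this example the uncorrected reading is 290,000 ns and the corrected one is 40,000 ns. A clock offset several times larger than the delay being measured swamps the signal, which is why one-way metrics call for high-precision synchronization.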


https://www.usenix.org/conference/srecon24emea/presentation/ekmekcioglu
Speakers

Lerna Ekmekcioglu

Clockwork Systems
Lerna is a Senior Solutions Engineer at Clockwork Systems where she helps customers meet their performance goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Senior Solutions Architect serving Global Financial Services customers...
The Liffey A

17:10 GMT

Mnemonic Rules for Eponymous Laws or: There’s a Law for That!
Tuesday October 29, 2024 17:10 - 17:30 GMT
Peter Burkholder, U.S. Government


As SREs, referencing named laws like Brooks’s Law, Gall’s Law, or the Jevons Paradox can help strengthen our arguments. But remembering which law applies when is challenging.

In this talk, I'll highlight the most useful tech and behavioral science laws for SRE work, offer mnemonic tips for recalling them, and share real-world examples. We'll finish with a quick quiz to ensure you're ready to apply these concepts in your role.


https://www.usenix.org/conference/srecon24emea/presentation/burkholder
Speakers

Peter Burkholder

U.S. Government
Geophysicist turned SRE. Jobs include: US Gov (18F/cloud.gov), GovReady, Chef, AARP, NCBI, NCAR, Univ. of Washington. In my own time, I make pizza, sing, and play guitar (not simultaneously).
The Liffey A