Track 2
Thursday, October 31
 

09:00 GMT

Opening the Box: Diagnosing Operating-System Task-Scheduler Behavior on Highly Multicore Machines
Thursday October 31, 2024 09:00 - 09:40 GMT
Julia Lawall, Inria-Paris


Getting unexpectedly poor performance from your multicore application? Maybe the operating system task scheduler is at fault. The task scheduler is responsible for placing tasks on cores and for selecting which task is allowed to run, at what time, and for how long. As such, the scheduler is a critical component of any operating system and has a major impact on application performance. Still, scheduling decisions are buried deep within the operating system code, making it challenging to diagnose performance problems (or even performance improvements) and to determine whether the scheduler is responsible and, if so, in what way. These challenges are compounded for highly multithreaded applications running on large multicore machines, due to the huge amount of information available.

In this talk, we present some tools that we have developed for visualizing the behavior of the Linux kernel task scheduler, and illustrate how these tools can be used to help diagnose performance problems. The tools presented are freely available at https://gitlab.inria.fr/schedgraph/schedgraph


https://www.usenix.org/conference/srecon24emea/presentation/lawall
Speakers

Julia Lawall

Inria-Paris
Julia Lawall is a senior researcher at Inria Paris. Prior to joining Inria, she completed a PhD at Indiana University and was on the faculty at the University of Copenhagen. Her work focuses on issues around the correctness and performance of operating systems. She develops and maintains...
The Liffey B

09:50 GMT

Granular CPU Capacity Management at Scale with eBPF
Thursday October 31, 2024 09:50 - 10:30 GMT
George Brighton and Cameron Howes, Goldman Sachs


Real-time market data is exceptionally bursty, with update rates in the busiest seconds of the day regularly exceeding 10x the average. User experience is predicated on maintaining sufficient CPU headroom to prevent full buffers and the resulting client disconnects. Sampling cumulative CPU time at a typical scrape interval hides microbursts, and sub-second polling from user space induces unacceptable overhead, so a different approach is needed.
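The effect of the scrape interval can be illustrated with a small sketch. The workload below is entirely hypothetical (a 10% busy baseline with one saturated 500 ms burst, and a 15 s scrape interval); none of these figures or names come from the talk.

```python
def average_utilisation(busy_ms_per_slot, slot_ms, window_slots):
    """Average CPU utilisation over consecutive windows of `window_slots` slots."""
    result = []
    for start in range(0, len(busy_ms_per_slot), window_slots):
        window = busy_ms_per_slot[start:start + window_slots]
        result.append(sum(window) / (len(window) * slot_ms))
    return result


# Hypothetical workload: 15 s of CPU time recorded in 100 ms slots.
SLOT_MS = 100
busy = [10] * 150            # baseline: 10 ms busy in every 100 ms slot
busy[70:75] = [100] * 5      # one 500 ms burst that saturates the core

coarse = average_utilisation(busy, SLOT_MS, 150)  # a single 15 s scrape
fine = average_utilisation(busy, SLOT_MS, 1)      # one value per 100 ms slot

print(f"15 s average: {coarse[0]:.0%}")   # the burst is averaged away
print(f"100 ms peak:  {max(fine):.0%}")   # the burst is clearly visible
```

Under these assumptions the coarse sample reports a mostly idle core, while the fine-grained series shows that the core was fully saturated for half a second.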

This talk will cover how Market Data SRE at Goldman Sachs uplifted CPU monitoring of our market data distribution infrastructure in an unintrusive way, achieving 10x the granularity with 5% of the original monitoring overhead. We will cover the journey from deciding to use eBPF, through trials using bpftrace and making the leap to BPF C, to collecting and aggregating the metrics effectively. It will be most relevant to those interested in capacity management across a heterogeneous estate, and those looking to implement eBPF for the first time in their organisations.


https://www.usenix.org/conference/srecon24emea/presentation/brighton
Speakers

George Brighton

Goldman Sachs
George Brighton is a Vice President at Goldman Sachs, where he leads the Market Data SRE team. A Prometheus and OTel committer, he is responsible for uplifting observability and operational practices. George presented "Market Data: Applying SRE Techniques to Legacy Designs" at SREcon22...

Cameron Howes

Goldman Sachs
Cameron Howes is an Analyst in the Market Data SRE team at Goldman Sachs, specialising in low-level development and performance instrumentation. When he's not ferociously avoiding a memory allocation, or reading about the latest CVEs, Cameron can be found writing black-box probers...
The Liffey B

11:00 GMT

Riot Games: Evolution of Observability at the Gaming Company
Thursday October 31, 2024 11:00 - 11:40 GMT
Erick Moreira and Kirill Mikhailov, Riot Games


The video game industry is growing year by year, and the market size for video games is projected to double in the coming 10 years. The number of people playing video games will also grow substantially. All of this creates many challenges for tech teams, who must make sure that games are not only fun to play but also offer stable, accessible gameplay. This is even more important for online competitive games, as they demand increased stability and performance.

Our presentation reviews the Riot Games journey through observability, focusing on the latest iteration of global-scale changes we made to introduce SRE and the new observability pipeline in the company.


https://www.usenix.org/conference/srecon24emea/presentation/moreira
Speakers

Erick Moreira

Riot Games
I am Erick Moreira, a 32-year-old Brazilian from Rio, working and living for 5 years in Dublin. I grew up modding and creating simple things for games. Now, I am focused on the backend, cross-cutting concerns, and the developer experience. I still find space in my heart to build front-end...

Kirill Mikhailov

Riot Games
I started my journey as an engineer while in school, building servers for online games. I then switched to traditional software engineering, working for large tech companies. But at the end of the day, I still landed in the gaming industry, where I have worked for the LiveOps organisation...
The Liffey B

11:45 GMT

A Powerful Logs Management Solution We All Have and Use but We Underestimate: systemd-journal
Thursday October 31, 2024 11:45 - 12:05 GMT
Costa Tsaousis, Netdata


This talk aims to unearth the potent features of systemd-journal that have remained mostly underutilized and largely underappreciated within the SRE community. The focus will be on its ability to handle dynamically structured log entries, its inherent support for centralized logging, and its robust security features including log sealing.


Systemd-journal offers dynamic field management, allowing flexible log annotation and querying without predefined schemas, along with decentralized log management that enables seamless analysis across systems. Its sealing feature ensures log integrity, critical for incident response and forensics. There is a tooling gap for converting plain logs into structured entries; however, we will show examples of how this can be achieved.



https://www.usenix.org/conference/srecon24emea/presentation/tsaousis
Speakers

Costa Tsaousis

Netdata
Costa Tsaousis is the Founder and CEO of Netdata. Since 1995, Costa has been actively working on internet-related startups. He has been a co-founder and C-level executive of many successful projects, including Internet Service Providers, Cloud Hosting Providers and Fintech startups...
The Liffey B

12:10 GMT

Blast Radius Reduction for Large-Scale Distributed Systems
Thursday October 31, 2024 12:10 - 12:30 GMT
Linhua Tang, Huawei Ireland Research Centre


The construction of large-scale distributed systems poses significant challenges due to inherent complexities and the inevitability of failures across various levels, from hardware malfunctions to software bugs. Embracing the 'design for failure' philosophy, this paper delves into advanced isolation techniques aimed at reducing the blast radius—both spatially and temporally—thereby enhancing system resilience. Spatial containment strategies, such as cell-based architecture, compartmentalize failures to localized areas, preventing cascading effects. Temporal mitigation focuses on rapid recovery and self-healing mechanisms, which aim to restore system health promptly after a failure occurs. Furthermore, the paper explores the application of formal methods in verifying the robustness of these designs, providing a rigorous approach to ensure the reliability and effectiveness of implemented solutions. This research underscores the importance of proactive architectural planning and continuous verification in maintaining the stability of complex distributed systems.


https://www.usenix.org/conference/srecon24emea/presentation/tang
Speakers

Linhua Tang

Huawei Ireland Research Centre
Linhua Tang (also known as James) is a software engineer and tech lead for global server load balancing and formal methods at Huawei Ireland Research Centre. Before that, he worked at Microsoft and Amazon on different distributed systems.
The Liffey B

14:00 GMT

Get Your Non-SREs Oncall Ready!
Thursday October 31, 2024 14:00 - 14:40 GMT
JC van Winkel and Brad Lipinski, Google


Hands-on learning is best for adults, and we've used this principle in Google SRE since 2017. However, many oncall engineers aren't SREs and haven't gone through a full week-long SRE onboarding program. How can they learn the same skills and go oncall with confidence, but without the week-long curriculum?

We cherry-picked our SRE onboarding program to create a succinct, scalable program for this audience that includes the best of orientation: the breakage exercises. This program is called "Oncall Ready!" and is completely self-service, requiring no operational work from the SRE EDU team. In this talk we will discuss the development, the behind-the-scenes work, and the outcomes of this project. The best comment we got from a participant: "Oh wow, this is like going through a [production] escape room without having to pay for it".


https://www.usenix.org/conference/srecon24emea/presentation/van-winkel
Speakers

JC van Winkel

Google
JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the...

Brad Lipinski

Google
Brad joined Google SRE in 2013 and worked on datacenter software. He's taught for SRE EDU from the beginning and contributed to many of the team's automation efforts. In 2019, he joined SRE EDU full time and is now the team's tech lead.
The Liffey B

14:50 GMT

Transforming Production Readiness
Thursday October 31, 2024 14:50 - 15:30 GMT
Panagiotis Moustafellos, Elastic


Join Panagiotis Moustafellos, Distinguished Engineer at Elastic, as he shares Elastic's transformative journey of integrating development teams into on-call rotations.
This talk highlights the creation of an SLO observability product capable of monitoring hundreds of thousands of SLIs globally, amidst a significant infrastructure and software platform re-architecture.
Learn about the phased rollout of Elastic's new serverless offering and the delicate processes involved in getting all software engineers on call. Discover best practices in production readiness, incident management, self-service observability, and software release tools that empower teams to own their services. Gain valuable insights and actionable strategies to enhance production readiness and service reliability in your organization.


https://www.usenix.org/conference/srecon24emea/presentation/moustafellos
Speakers

Panagiotis Moustafellos

Elastic
Panagiotis Moustafellos is a Distinguished Engineer at Elastic, the Search AI company. He brings over 15 years of experience in diverse tech environments and specializes in systems architecture, observability, and security, with a focus on scaling software systems, infrastructure...
The Liffey B
 