Track 1
Thursday, October 31
 

09:00 GMT

Monitoring Systems as a Service – Walking the Line between Giving Your Devs Good M&O and Setting All Your Money on Fire
Thursday October 31, 2024 09:00 - 09:40 GMT
Joan O'Callaghan, Udemy


Monitoring-as-a-Service products like Datadog and Honeycomb are amazing for implementing monitoring & observability with minimal effort but, like Anything-as-a-Service, they come at a cost.

We are a very normal company, with all the tech debt and orphaned code that any company over a certain age has. Like everyone else, we had staff who heard "measure everything!" but didn't know what the monitoring bill looked like, or that "everything" included a lot of junk.

In the talk I'll discuss how we managed to reduce cost wastage, enable extra vendor features, improve M&O knowledge within the engineering organisation and keep the bill the same or lower, despite a 60% growth in infrastructure at our company.

Notes re the vendor - I won't say who the Vendor is, but I think our experience was universal enough that our fixes and techniques will be helpful to other companies.


https://www.usenix.org/conference/srecon24emea/presentation/ocallaghan
Speakers
Joan O'Callaghan

Udemy
Joan O'Callaghan is a Monitoring and Observability Director at Udemy. She has worked in SRE, Incident Management, and M&O (in one form or another) for many, many years. She likes to host and write blameless incident reviews and take long walks on the beach where she has imaginary...
Thursday October 31, 2024 09:00 - 09:40 GMT
The Liffey A

09:50 GMT

An Exploration in Storing Telemetry in Cloud Object Storage
Thursday October 31, 2024 09:50 - 10:30 GMT
Mike Heffner and Ray Jenkins, Streamfold


Modern web application architectures require extensive telemetry data to function efficiently at scale. Traditional methods for collecting, storing, and processing this data have become increasingly expensive and challenging to maintain. Conversely, the prevalence of cloud object storage has given rise to the data lake. This has led some organizations to explore telemetry data lakes, which enable cost-efficient storage of large volumes of telemetry data.

We will explore various data storage formats used in constructing telemetry data lakes and discuss the tradeoffs associated with each approach. We will delve into common formats such as JSON, Parquet, ORC, and Apache Iceberg, examining how they can be utilized to store telemetry data like logs, metrics, and traces at scale. These formats will be empirically evaluated using real-world datasets. Additionally, we will review recent literature that highlights areas for design improvements in storage formats to better align them with modern computing hardware.


https://www.usenix.org/conference/srecon24emea/presentation/heffner
Speakers
Mike Heffner

Streamfold
Mike Heffner is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to Streamfold, Mike was a backend engineer at Netlify helping scale their delivery network, and at Librato building one of the first monitoring SaaS products...
Ray Jenkins

Streamfold
Ray Jenkins is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to founding Streamfold, he led software engineering efforts at Snowflake on the observability and performance of FoundationDB, and at Segment on development of...
Thursday October 31, 2024 09:50 - 10:30 GMT
The Liffey A

11:00 GMT

Embrace Fleet Reboots and Make Them Boring
Thursday October 31, 2024 11:00 - 11:40 GMT
Everton Didone Foscarini, Cloudflare


Server reboots provoke mixed feelings. Some say “My kernel is stable, it does not crash with a thousand days uptime”; others understand that such a system is running with a thousand days of accumulated vulnerabilities.

At Cloudflare we believe that high uptimes are bad. While our reboot automation was still being developed, we were hit by a kernel+BIOS bug that caused a high rate of node crashes and encouraged the quick adoption of that automation. This prompted us to implement better tooling to deploy fleet changes over reboots, creating multiple reboot queues for different workloads, load-based maintenance windows and more.

We achieved monthly reboots for our edge fleet while keeping the clusters online and serving customer-facing traffic. This unlocked our ability to iterate fast on Linux kernel versions and OS releases, and ensures we are not running outdated library versions on hosts that have not been rebooted for a thousand days.


https://www.usenix.org/conference/srecon24emea/presentation/foscarini
Speakers
Everton Didone Foscarini

Cloudflare
Working on Internet-based services using Linux since 2003, joined Cloudflare in 2017 and helped to scale Edge location operations from 102 to 320 cities, creating tooling to manage services lifecycle and server reboots.
Thursday October 31, 2024 11:00 - 11:40 GMT
The Liffey A

11:45 GMT

A Brief History of Release Engineering
Thursday October 31, 2024 11:45 - 12:05 GMT
Dinah McNutt, MongoDB


TL;DR This talk is a humorous (hopefully) retrospective on release engineering. How did we get from building binaries using a command line to all the fancy CI/CD systems we have today?

Things we used to do seem ridiculous today. Can looking back help us move forward? What’s the evolution and career path of a release engineer? Has the role become diluted through overuse and misuse?

Please join in the fun and share your anecdotes and experiences in the Slack channel.


https://www.usenix.org/conference/srecon24emea/presentation/mcnutt
Speakers
Dinah McNutt

MongoDB
Dinah McNutt is a TPM for MongoDB and based in Dublin, Ireland. She has over 35 years of experience in systems administration, release engineering and software development. She has written for various publications over the years including the Daemons and Dragons column for UNIX Review...
Thursday October 31, 2024 11:45 - 12:05 GMT
The Liffey A

12:10 GMT

Red Tide Revert
Thursday October 31, 2024 12:10 - 12:30 GMT
David Newman, Automattic


This talk explores the challenges of managing unexpected production errors in high-frequency deployment environments and introduces an innovative AI-driven solution for rapid error detection and resolution. The speaker will discuss how their team developed and refined an automated system that analyzes error logs, identifies problematic code commits, and streamlines the incident response process. This approach aims to reduce on-call stress, minimize user impact, and pave the way for fully automated error mitigation in complex, fast-paced development ecosystems.


https://www.usenix.org/conference/srecon24emea/presentation/newman
Speakers
David Newman

Automattic
With a diverse background in platform engineering, distributed systems, and artificial intelligence, our speaker brings a trove of experience driving innovation from startup to enterprise environments. As a technical founder in companies ranging from retail intelligence to digital...
Thursday October 31, 2024 12:10 - 12:30 GMT
The Liffey A

14:00 GMT

AppStack: An Open Source Cloud Native Platform for Running Digital Public Services
Thursday October 31, 2024 14:00 - 14:40 GMT
Dimitris Mitropoulos, National Infrastructures for Research and Technology – GRNET and University of Athens; Alex Kiousis, National Infrastructures for Research and Technology – GRNET


GRNET (National Infrastructures for Research and Technology) is Greece's National Research and Education Network (NREN), which acts as a network and services provider for research and education communities. Since 2019, GRNET has been responsible for the development, operation and maintenance of several governmental services, thus playing an important role in Greece's digital transformation. To address the different challenges related to this role, GRNET teams developed AppStack, a cloud-native platform, based on production-ready open source software, for running government-related services such as the gov.gr portal, the electronic issuance of documents signed by the Greek state, and gov wallet, among others.

AppStack provides an environment for integrating open-source and in-house software components, where DevOps can incorporate suitable tools to tackle scalability and security issues.

Currently, AppStack hosts workloads that serve more than 8 million Greek citizens, are able to handle more than 20K requests per second, and can generate hundreds of digital documents signed by the Greek state per second.

In this talk we will present AppStack, its numerous components, and how open source made it possible. Finally, we will describe some key experiences from production.


https://www.usenix.org/conference/srecon24emea/presentation/mitropoulos
Speakers
Dimitris Mitropoulos

National Infrastructures for Research and Technology – GRNET and University of Athens
Dimitris Mitropoulos is an Assistant Professor at the National and Kapodistrian University of Athens and the Head of Reliability Engineering at the Greek National Infrastructures for Research and Technology (GRNET). Previously, he has been a postdoctoral researcher at the Computer...
Alex Kiousis

National Infrastructures for Research and Technology – GRNET
Alex Kiousis is a Site Reliability Engineer at GRNET in Greece. His team handles GRNET's on-premises infrastructure and services, delivering GRNET's custom Cloud service to Greece's Research and Academic communities and several user-facing Government-related Digital Transformation...
Thursday October 31, 2024 14:00 - 14:40 GMT
The Liffey A

14:50 GMT

Science Reliability Engineering for High Performance Computing
Thursday October 31, 2024 14:50 - 15:30 GMT
Nicholas Jones, LANL


High Performance Computing (HPC) as an industry has long relied on very human-facing operational workflows. These workflows exist because HPC systems are generally purpose-built machines for small sets of code bases with very specific performance metrics. This purpose-built nature has left HPC with very bespoke, one-off systems, and with processes and infrastructure that serve a small set of code bases well but aren't resilient to generational churn. To combat the difficulty of generational churn, we've adopted an SRE mindset for our new administrative stack, OpenCHAMI. This lets us keep our figures of merit (exact reproducibility, parallel bandwidth, and compute time to solution) aligned with what benefits our customer base the most.


https://www.usenix.org/conference/srecon24emea/presentation/jones
Speakers
Nicholas Jones

LANL
Nick is a scientist at Los Alamos National Lab, where he works on system security architecture, CI/CD infrastructure, and shared computing environments and strategies across the National Nuclear Security Administration Laboratories.
Thursday October 31, 2024 14:50 - 15:30 GMT
The Liffey A
 