The Liffey A
Tuesday, October 29
 

11:00 GMT

SRE Saga: The Song of Heroes and Villains
Tuesday October 29, 2024 11:00 - 11:40 GMT
Daria Barteneva, Microsoft Azure


SRE teams require a balance of technical and soft skills, creativity and teamwork to be successful. Drawing parallels between the roles, challenges and dynamics of a Dungeons and Dragons party and an SRE team will help us explore the SRE journey from team inception to developing the ideal makeup in terms of tenure/seniority and skillset, and aligning it with the context the SRE team is part of.

We will share practical examples that help SRE teams build resiliency and effective collaboration while dealing with challenges. We will also explore different mechanisms that can channel "super hero" energy to make the team stronger and nurture talent, helping the team keep the balance of distributed knowledge and accountability.

In this talk we will discuss:


  • Examples of functional SRE team setups

  • Common challenges an SRE team may encounter

  • Developing early-in-career SREs

  • Dealing with change and building resilience

  • Identifying red flags and avoiding long-term problems



https://www.usenix.org/conference/srecon24emea/presentation/barteneva
Speakers

Daria Barteneva

Microsoft Azure
Daria is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing...

11:50 GMT

The Frontiers of Reliability Engineering
Tuesday October 29, 2024 11:50 - 12:30 GMT
Heinrich Hartmann, Zalando SE


We take the 10th anniversary of SREcon as an occasion to reflect on the past decade of advancements in Reliability Engineering and provide an overview of the frontiers we are facing today. Within Zalando we followed major industry trends by outsourcing hardware provisioning to AWS, packaging applications into Docker images, fully automating deployments (CI/CD), and implementing Distributed Tracing for Microservice Observability. Despite these advances, many challenges remain in building reliable, observable software systems, and new areas have arisen that require new methods and tools. In the talk we provide a number of conceptual views that help map out the larger Reliability Engineering landscape and zoom in on 3 specific frontiers that we are actively investing in at Zalando: (1) Data Operations and Monitoring Event Based Systems, (2) Mobile Observability, and (3) Effective Management Practices for Reliability.


https://www.usenix.org/conference/srecon24emea/presentation/hartmann
Speakers

Heinrich Hartmann

Zalando SE
Heinrich Hartmann is a seasoned expert with a decade of experience in Reliability Engineering. Currently, he serves as the Senior Principal SRE at Zalando, a leading European e-commerce company, where he oversees company-wide reliability practices. Before joining Zalando, Heinrich...

14:00 GMT

Sailing the Database Seas: Applying SRE Principles at Scale
Tuesday October 29, 2024 14:00 - 14:40 GMT
Ioannis Androulidakis and Martin Alderete, Booking.com


In this talk we will demonstrate how we apply core SRE principles in the field of Database Engineering. More specifically, we will talk about the challenges of operating large-scale database systems in multiple cloud environments and how adopting best SRE practices dramatically improved our daily workflows and operations.

We will share insights and concrete use cases around the following topics: Monitoring Distributed Systems, Eliminating Toil and Postmortem Culture.

This talk will equip attendees with ideas and guidelines to better understand and efficiently operate their database systems, such as choosing the right SLIs and SLOs, automating capacity planning, and embracing a postmortem culture after outages.


https://www.usenix.org/conference/srecon24emea/presentation/androulidakis
Speakers

Martin Alderete

Booking.com
Martin Alderete is a Principal Site Reliability Engineer with a long track record in Engineering, Distributed Systems and System Level Programming in both academia, where he worked as a teaching assistant after getting his degree, and industry, where he led different teams building...

Ioannis Androulidakis

Booking.com
Ioannis Androulidakis is a Site Reliability Engineer with a strong background and multiple years of experience in Operating Systems, Observability Tools and Cloud Platforms. He is passionate about OSS technologies and has contributed to multiple open-source projects over the years. Ioannis...

14:45 GMT

Survivor: MySQL Island – Outwit, Outplay, Outlast Metadata Locking Challenges
Tuesday October 29, 2024 14:45 - 15:05 GMT
Julia Jablonska, Capsule CRM


Think you understand MySQL metadata locks? Join this interactive session to test your knowledge and take a deep dive into the intricacies of MySQL's locking mechanisms.

We'll explore real-world scenarios, such as creating tables with foreign key constraints and adding indexes, to see how metadata locks can impact performance and stability. Through live voting you'll gain insights into what's happening behind the scenes and learn practical tips for managing database migrations.


https://www.usenix.org/conference/srecon24emea/presentation/jablonska
Speakers

Julia Jablonska

Capsule CRM
As an Infrastructure Engineer at Capsule CRM, Julia is responsible for keeping Capsule secure, fast and reliable for thousands of our business customers around the globe.

15:10 GMT

Fixing Your Noisy Pager in 500 Easy Steps
Tuesday October 29, 2024 15:10 - 15:30 GMT
Chris Sinjakli, PlanetScale


You're not sure when it happened, but your pager suddenly seems noisy. You've started dreading your on-call shifts before they begin. You breathe a sigh of relief every time you sleep without interruption. Sound familiar?

Noisy on-call rotas sneak up on us one page at a time - an edge case in a new feature, an alert with too many false positives, processes that get stuck and need restarting. Each of these is easy to tolerate alone, but they quickly add up, leaving you swamped in alert noise and tired from missed sleep.

In this talk we'll explore techniques for digging ourselves out of the hole. We'll look at how to demonstrate the scale of the issue to our colleagues, what to do when the list of problems seems insurmountable, and how to get started with automated remediation in a low-risk way - I promise it's less scary than it sounds.


https://www.usenix.org/conference/srecon24emea/presentation/sinjakli
Speakers

Chris Sinjakli

PlanetScale
Chris enjoys working on the strange parts of computing where software and systems meet. He especially likes the challenges of databases and distributed systems. All his programs are made from organic, hand-picked, artisanal keypresses.

16:00 GMT

Exploring the Unintended Consequences of Automation in Software
Tuesday October 29, 2024 16:00 - 16:40 GMT
Courtney Nash, The VOID


Automation is ubiquitous—it is entwined in our daily lives in ways that we aren’t always aware of. It has been woven into all aspects of modern software by being presented as a utopian vision: a way of making human lives easier, doing repetitive tasks faster and with fewer errors, freeing us fallible humans up to do other ostensibly more important work. But anyone who has worked directly with automated systems knows that we are still very far from such a dreamy reality.

This talk delves into detailed research about how automation is involved in software incidents. My focus on this area stems from the growing portrayal of automation as a panacea for various software incident issues, despite its limitations in effectively addressing these challenges, such as reliable detection and resolution of software issues or analyzing and disseminating learnings from these incidents back into the organization and its products and services.

Drawn directly from public incident reports (collected in the VOID), this research revealed multiple, often competing, roles that automation can play over the course of an incident, and most importantly underscored how important humans are in understanding, troubleshooting, and recovering from automated software issues. If you're struggling to convey the reality behind the hype of automation and AI to others on your team or at your organization, this is the talk for you.


https://www.usenix.org/conference/srecon24emea/presentation/nash
Speakers

Courtney Nash

The VOID
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s...

16:45 GMT

Rock around the Clock (Synchronization): Improve Performance with High Precision Time!
Tuesday October 29, 2024 16:45 - 17:05 GMT
Lerna Ekmekcioglu, Clockwork Systems


Is the app slow or the network lagging? When it comes to latency in distributed systems, it can be hard to identify where exactly the issue is. As businesses increasingly adopt diverse deployment environments (on-premises, cloud, or hybrid), the complexity grows, obscuring visibility into system health. Join me to hear why clock synchronization is key for identifying the true culprit when latency is due to contention in the network. I’ll demo how network contention impacts tail latencies, followed by an overview of clock synchronization protocols to date, their pros and cons, and best practices in disciplining clocks, as well as recent algorithms from Stanford Research. With high-precision clock synchronization at scale, we gain back visibility into useful one-way delay metrics, which act as an early signal of network congestion and help us prevent impact to response times for our end users!


https://www.usenix.org/conference/srecon24emea/presentation/ekmekcioglu
Speakers

Lerna Ekmekcioglu

Clockwork Systems
Lerna is a Senior Solutions Engineer at Clockwork Systems where she helps customers meet their performance goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Senior Solutions Architect serving Global Financial Services customers...

17:10 GMT

Mnemonic Rules for Eponymous Laws or: There’s a Law for That!
Tuesday October 29, 2024 17:10 - 17:30 GMT
Peter Burkholder, U.S. Government


As SREs, referencing named laws like Brooks's Law, Gall's Law, or the Jevons Paradox can help strengthen our arguments. But remembering which law applies when is challenging.

In this talk, I'll highlight the most useful tech and behavioral science laws for SRE work, offer mnemonic tips for recalling them, and share real-world examples. We'll finish with a quick quiz to ensure you're ready to apply these concepts in your role.


https://www.usenix.org/conference/srecon24emea/presentation/burkholder
Speakers

Peter Burkholder

U.S. Government
Geophysicist turned SRE. Jobs include: US Gov, (18f/cloud.gov), GovReady, Chef, AARP, NCBI, NCAR, Univ. of Washington. In my own time, I make pizza, sing, and play guitar (not simultaneously).
 
Wednesday, October 30
 

11:00 GMT

Finding the Capacity to Grieve Once More
Wednesday October 30, 2024 11:00 - 11:40 GMT
Alexandros Kosiaris, Wikimedia Foundation
At Wikipedia, we handle unpredictable traffic spikes, especially during notable deaths, which can cause severe outages. Despite believing we had mitigated this issue years ago, a major outage occurred in 2020 due to a notable death and a DDoS attack, leading to the realization that our platform needed further improvements. Over the years, we conducted investigations and implemented numerous fixes, educating new SREs about our platform's unique constraints. Two years ago, following the death of Elizabeth II, our system successfully handled unprecedented traffic without outages, demonstrating our platform's resilience. This story highlights the infrastructure improvements that allowed us to manage traffic surges and the emotional journey of regaining the capacity to properly grieve significant losses.
We heavily rely on open source, and our code is public, making our solutions accessible to everyone.
https://www.usenix.org/conference/srecon24emea/presentation/kosiaris
Speakers

Alexandros Kosiaris

Wikimedia Foundation
A Linux sysadmin, turned FreeBSD sysadmin, turned Linux sysadmin, turned systems engineer (somewhere along that path there’s a DevOps hat as well), turned SRE, Alexandros has been in the space since 1999, starting as a hobbyist, then a professional. Currently working with the Wikimedia...

11:50 GMT

Incident Groundhog Day
Wednesday October 30, 2024 11:50 - 12:30 GMT
Hamed Silatani, Uptime Labs


Learning how to respond effectively to incidents is hard. One of the reasons is that we never see the same incident twice. While we can learn vital lessons during and after an incident, we can’t hop into a time machine, and apply these lessons to the same incident to discover their impact. What if we could experience the same incident over and over again? What might we learn? This talk describes a ‘staged world’ experiment in which 20 incident managers separately experienced the same simulated incident affecting a fictitious e-commerce company. We discuss what we noticed that differentiated some incident responders from others, and some surprising things that we expected to see, but didn’t.


https://www.usenix.org/conference/srecon24emea/presentation/silatani
Speakers

Hamed Silatani

Uptime Labs
Hamed is co-founder and CEO of Uptime Labs, an incident learning & simulation platform. He has 20 years of experience in engineering leadership, reliability engineering, and IT operations. Having spent the majority of his career at the sharp end of incident response in financial services...

14:00 GMT

When Your SaaS Provider Goes out of Business – Lessons from an Averted Crisis
Wednesday October 30, 2024 14:00 - 14:40 GMT
Raphael Seebacher and Christof Gerber, Open Systems AG


What do you do when your SaaS provider unexpectedly goes out of business?

It's the early days of 2023 when the provider of a critical component in our Web Proxy service announces that it just went out of business. With services used by 100 customers with 300K daily users across 3'500 locations worldwide at stake, we knew that it was time for swift action.

Join us in this talk as Raphi, the crisis lead, and Christof, engineer on the Web Security team, recount their experience handling this crisis. We will take you from the dizziness of the initial shock to our first hour, first day, first week, and first month actions, detailing leadership, communication, and technical responses and the trade-off decisions we faced.

You'll leave this talk with another tale from production and practical ideas to make your own organisation better prepared for a similar unexpected crisis.


https://www.usenix.org/conference/srecon24emea/presentation/seebacher
Speakers

Christof Gerber

Open Systems AG
Christof is an engineer who develops, maintains, and operates Software as a Service to secure corporate web traffic worldwide. Working at the intersection of computer networks, IT security, and software engineering, he is passionate about building reliable systems for Linux servers...

Raphael Seebacher

Open Systems AG
Raphael is a systems engineer who spent the last decade exploring the Engineer/Manager pendulum at Open Systems. He holds an MSc in electrical engineering and information technology, a MAS in Management, Technology and Economics and is a captain in the Swiss Armed Forces. His interests...

14:45 GMT

Configuration Languages Are the Bane of Our Existence
Wednesday October 30, 2024 14:45 - 15:05 GMT
Paul Komkoff


It is probably a good idea to make it possible to change some constants in your program without recompiling it. So why does it then get incredibly hard to control these configurations? At which point does configuration become a program with no tests, written in an untyped language, which requires a lot of compute to evaluate and can't be checked in advance? Is it at all possible (and enough) to get rid of these languages and go back to ini files?

If you are like me, you want to know the answers to these questions, and this is what I'm going to talk about. Plus:


  • sendmail.cf was an early sign everybody ignored

  • if you use regular expressions for matching and selecting in your configuration, start writing a premortem

  • when your configuration is more complicated than your program, who is your program now?



https://www.usenix.org/conference/srecon24emea/presentation/komkoff
Speakers

Paul Komkoff

Out of 33 years of working with computers and networks, Paul spent 17 in SRE organizations. He believes that complexity needs to be actively managed and that, to find better ways to fix things, we need to explore the depths of failure.

15:10 GMT

Just Buy the Printer: Resilience in Action
Wednesday October 30, 2024 15:10 - 15:30 GMT
Cail Young, Octopus Deploy


A retelling of a recent near-miss at Octopus Deploy involving code signing certificates, multiple teams responding to an incident, and everybody's favourite piece of security hardware - the humble printer. After the story, we'll reflect on what the story says about the resilience factors already in the organisation, and what the telling of the story itself might be able to do for resilience across organisations.


https://www.usenix.org/conference/srecon24emea/presentation/young
Speakers

Cail Young

Octopus Deploy
Cail has spent the last couple of decades working at the intersection of people and technology: in the performing arts, in the motion picture industry, and now in the field of software operations. He is fascinated by learning from incidents - large and small - and will gladly trade...

16:00 GMT

Managing the Risk of Software Supply Chain Attacks
Wednesday October 30, 2024 16:00 - 16:40 GMT
Mark Hahn, Qualys


Open-source software (OSS) is flourishing and is used by at least 90% of companies. Modern applications are built on webs of open-source code, APIs, and third-party integrations.

Because of this, hackers are now compromising weak links in existing software supply chains. Software supply chain (SSC) threats include tampering with updates (tainted updates), compromised third-party libraries, vulnerabilities in open-source packages, and malicious code or malware in packages. Software supply chain attacks have increased by an average of 742% per year.

This talk covers ways to prevent software supply chain attacks and how to respond when the ecosystem has been tainted.


https://www.usenix.org/conference/srecon24emea/presentation/hahn
Speakers

Mark Hahn

Qualys
Mark Hahn is the Solutions Architect for Cloud and DevOps Security at Qualys. He uses DevSecOps and Site Reliability Engineering practices to ensure that software and applications are deployed with high velocity and with the utmost security. He shows clients how to build security...

16:50 GMT

When SRE and Security Teams Meet to Face a Crisis
Wednesday October 30, 2024 16:50 - 17:30 GMT
JR Aquino


For SREs, Security is at the same time a priority and not a priority; prioritization highly depends on the environment, the size, and the goals of each org.

This talk aims to give SREs, through real-life examples, insights into:

  • What to expect (and how to be good neighbors) when they are called in to work with Security teams to manage a security incident

  • Unexpected inter-organizational challenges that might occur

  • What to keep an eye on in the future


https://www.usenix.org/conference/srecon24emea/presentation/aquino
Speakers

JR Aquino

Redfin | Rent - Head of Information Security
Former Microsoft and Citrix Security Leader
Created centralized SUDO for Fedora’s FreeIPA
FreeBSD port maintainer for Metasploit and UnrealIRCD
OpenBSD port maintainer for Nmap

17:45 GMT

Lightning Talks
Wednesday October 30, 2024 17:45 - 18:45 GMT
Lightning Talks are four-minute talks by different speakers addressing a variety of SRE-relevant topics.

The lightning talks session will conclude with Slide Karaoke, a chance for any attendee to show off their improv skills by presenting a slide deck that they have never seen before.
 
Thursday, October 31
 

09:00 GMT

Monitoring Systems as a Service – Walking the Line between Giving Your Devs Good M&O and Setting All Your Money on Fire
Thursday October 31, 2024 09:00 - 09:40 GMT
Joan O'Callaghan, Udemy


Monitoring-as-a-Service products like Datadog and Honeycomb are amazing for implementing monitoring & observability with minimal effort, but like Anything-as-a-Service, they come at a cost.

We are a very normal company, with all the tech debt and orphaned code that any company over a certain age has. Like everyone else, we had staff that heard, "measure everything!" but they didn't know what the monitoring bill looked like and that "everything" included a lot of junk.

In the talk I'll discuss how we managed to reduce cost wastage, enable extra vendor features, improve M&O knowledge within the engineering organisation and keep the bill the same or lower, despite a 60% growth in infrastructure at our company.

A note on the vendor: I won't say who the vendor is, but I think our experience was universal enough that our fixes and techniques will be helpful to other companies.


https://www.usenix.org/conference/srecon24emea/presentation/ocallaghan
Speakers

Joan O'Callaghan

Udemy
Joan O'Callaghan is a Monitoring and Observability Director at Udemy. She has worked in SRE, Incident Management, and M&O (in one form or another) for many, many years. She likes to host and write blameless incident reviews and take long walks on the beach where she has imaginary...

09:50 GMT

An Exploration in Storing Telemetry in Cloud Object Storage
Thursday October 31, 2024 09:50 - 10:30 GMT
Mike Heffner and Ray Jenkins, Streamfold


Modern web application architectures require extensive telemetry data to function efficiently at scale. Traditional methods for collecting, storing, and processing this data have become increasingly expensive and challenging to maintain. Conversely, the prevalence of cloud object storage has given rise to the data lake. This has led some organizations to explore telemetry data lakes, which enable cost-efficient storage of large volumes of telemetry data.

We will explore various data storage formats used in constructing telemetry data lakes and discuss the tradeoffs associated with each approach. We will delve into common formats such as JSON, Parquet, ORC, and Apache Iceberg, examining how they can be utilized to store telemetry data like logs, metrics, and traces at scale. These formats will be empirically evaluated using real-world datasets. Additionally, we will review recent literature that highlights areas for design improvements in storage formats to better align them with modern computing hardware.


https://www.usenix.org/conference/srecon24emea/presentation/heffner
Speakers

Mike Heffner

Streamfold
Mike Heffner is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to Streamfold, Mike was a backend engineer at Netlify helping scale their delivery network, and at Librato building one of the first monitoring SaaS products...

Ray Jenkins

Streamfold
Ray Jenkins is co-founder of Streamfold, where they are creating the first telemetry pipeline built for developers. Prior to founding Streamfold, he led software engineering efforts at Snowflake, on the observability and performance of FoundationDB and at Segment on development of...

11:00 GMT

Embrace Fleet Reboots and Make Them Boring
Thursday October 31, 2024 11:00 - 11:40 GMT
Everton Didone Foscarini, Cloudflare


Server reboots bring up mixed sentiments. Some say, “My kernel is stable, it does not crash even with a thousand days of uptime”; others understand that this means running a system with a thousand days of accumulated vulnerabilities.

At Cloudflare we believe that high uptimes are bad. While our reboot automation was being developed, we were hit by a kernel+BIOS bug that caused a high rate of node crashes and encouraged the quick adoption of reboot automation. This prompted us to implement better tooling to deploy fleet changes over reboots, create multiple reboot queues for different workloads, add load-based maintenance windows, and more.

We achieved monthly reboots for our edge fleet while keeping the clusters online and serving customer-facing traffic, unlocking our ability to iterate fast on Linux Kernel versions and OS releases, ensuring we are not running outdated library versions in hosts not rebooted for a thousand days.


https://www.usenix.org/conference/srecon24emea/presentation/foscarini
Speakers

Everton Didone Foscarini

Cloudflare
Working on Internet-based services using Linux since 2003, joined Cloudflare in 2017 and helped to scale Edge location operations from 102 to 320 cities, creating tooling to manage services lifecycle and server reboots.

11:45 GMT

A Brief History of Release Engineering
Thursday October 31, 2024 11:45 - 12:05 GMT
Dinah McNutt, MongoDB


TL;DR This talk is a humorous (hopefully) retrospective on release engineering. How did we get from building binaries using a command line to all the fancy CI/CD systems we have today?

Things we used to do seem ridiculous today. Can looking back help us move forward? What’s the evolution and career path of a release engineer? Has the role become diluted through overuse and misuse?

Please join in the fun and share your anecdotes and experiences in the Slack channel.


https://www.usenix.org/conference/srecon24emea/presentation/mcnutt
Speakers

Dinah McNutt

MongoDB
Dinah McNutt is a TPM for MongoDB and based in Dublin, Ireland. She has over 35 years of experience in systems administration, release engineering and software development. She has written for various publications over the years including the Daemons and Dragons column for UNIX Review...

12:10 GMT

Red Tide Revert
Thursday October 31, 2024 12:10 - 12:30 GMT
David Newman, Automattic


This talk explores the challenges of managing unexpected production errors in high-frequency deployment environments and introduces an innovative AI-driven solution for rapid error detection and resolution. The speaker will discuss how their team developed and refined an automated system that analyzes error logs, identifies problematic code commits, and streamlines the incident response process. This approach aims to reduce on-call stress, minimize user impact, and pave the way for fully automated error mitigation in complex, fast-paced development ecosystems.


https://www.usenix.org/conference/srecon24emea/presentation/newman
Speakers

David Newman

Automattic
With a diverse background in platform engineering, distributed systems, and artificial intelligence, our speaker brings a trove of experience driving innovation from startup to enterprise environments. As a technical founder in companies ranging from retail intelligence to digital...

14:00 GMT

AppStack: An Open Source Cloud Native Platform for Running Digital Public Services
Thursday October 31, 2024 14:00 - 14:40 GMT
Dimitris Mitropoulos, National Infrastructures for Research and Technology – GRNET and University of Athens; Alex Kiousis, National Infrastructures for Research and Technology – GRNET


GRNET (National Infrastructures for Research and Technology) is Greece's National Research and Education Network (NREN) organisation, which acts as a network and services provider for research and education communities. Since 2019, GRNET has been responsible for the development, operation and maintenance of several governmental services, thus playing an important role in Greece's digital transformation. To address the different challenges related to this role, GRNET teams developed AppStack, a cloud-native platform, based on production-ready open source software, for running government-related services such as the gov.gr portal, the electronic issuance of documents signed by the Greek state, and gov wallet, among others.

AppStack provides an environment for integrating open-source and in-house software components, where DevOps can incorporate suitable tools to tackle scalability and security issues.

Currently, AppStack hosts workloads that serve more than 8 million Greek citizens, can handle more than 20K requests per second, and can generate hundreds of digital documents signed by the Greek state per second.

In this talk we will present AppStack, its numerous components, and how open source made it possible. Finally, we will describe some key experiences from production.


https://www.usenix.org/conference/srecon24emea/presentation/mitropoulos
Speakers

Alex Kiousis

National Infrastructures for Research and Technology – GRNET
Alex Kiousis is a Site Reliability Engineer at GRNET in Greece. His team handles GRNET's on-premises infrastructure and services, delivering GRNET's custom Cloud service to Greece's Research and Academic communities and several user-facing Government-related Digital Transformation...

Dimitris Mitropoulos

National Infrastructures for Research and Technology – GRNET and University of Athens
Dimitris Mitropoulos is an Assistant Professor at the National and Kapodistrian University of Athens and the Head of Reliability Engineering at the Greek National Infrastructures for Research and Technology (GRNET). Previously, he has been a postdoctoral researcher at the Computer...

14:50 GMT

Science Reliability Engineering for High Performance Computing
Thursday October 31, 2024 14:50 - 15:30 GMT
Nicholas Jones, LANL


High Performance Computing (HPC) as an industry has long stood on very human-facing operational workflows. These workflows exist because HPC systems are generally purpose-built machines for small sets of code bases with very specific performance metrics. This purpose-built nature has resulted in HPC having very bespoke, one-off systems, with processes and infrastructure that serve a small set of code bases well but aren't resilient to generational churn. To combat the difficulty of generational churn, we've adopted an SRE mindset for our new administrative stack, OpenCHAMI. This lets us keep our figures of merit (exact reproducibility, parallel bandwidth, and compute time to solution) aligned with what benefits our customer base the most.


https://www.usenix.org/conference/srecon24emea/presentation/jones
Speakers

Nicholas Jones

LANL
Nick is a scientist at Los Alamos National Lab, where he works on system security architecture, CI/CD infrastructure, and shared computing environments and strategies across the National Nuclear Security Administration Laboratories.
 