Andy Hay is a lead site reliability engineer (SRE) working in DWP’s hybrid cloud services group.
“I’ve worked in DWP Digital for over four years as a software engineer, a DevOps engineer and now as an SRE,” says Andy. “Prior to that I worked in the private sector in a variety of software engineering and/or pre-DevOps roles for 25 years.”
Within DWP, an SRE’s role covers three key areas: onboarding of services into cloud, run/operation of services in production, and SRE tooling/process development.
“On a daily basis, SREs may do all or just one of these – it’s quite varied. Over time, their experience encompasses all three areas.” He adds, “And they make use of their soft skills as well as their technical skills, with a strong emphasis on encouraging both self and formal learning to keep technical skills up to date.”
Onboarding of services into cloud
Part of Andy’s role is engaging with business units to onboard new services into public cloud or lift and shift services from on-premises hosting into public cloud.
“DWP runs services in its own on-premises datacentre and within AWS and Azure public clouds,” says Andy. “There is a set of standards and governance we need services to meet so they satisfy all of the required operational and security needs for running in production.”
“This is quite an interactive process that makes use of our softer skills and less of our hands-on technical skills, but more of our technical experiences.” He adds, “Not every service is the same. We need to appreciate the differences and ensure they still meet the same levels when operating in production.”
Run/operation of services in production
Andy’s team look after many production services within public cloud. This means they execute deployments using runbooks, investigate production incidents to help support teams determine root causes, and provide an on-call service to help restore service using alerting runbooks or technical experience.
“We’re always trying to improve our deployment processes. For example, we make the experience of deploying into non-production the same as production,” says Andy. “Infrastructure as code is at the heart of much of what we do, so with repeatable, idempotent deployments we have no surprises when we deploy into production.”
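As a rough illustration of what "repeatable, idempotent deployments" can mean in practice, the Python sketch below reconciles an environment against a declared desired state. It is not DWP's tooling: `Environment`, `apply` and `DESIRED` are hypothetical names, and the resources are plain dictionaries standing in for real infrastructure. The same definition is applied to non-production and production, and applying it a second time makes no further changes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Environment:
    """Hypothetical stand-in for a deployment target."""
    name: str
    resources: Dict[str, dict] = field(default_factory=dict)


def apply(desired: Dict[str, dict], env: Environment) -> List[str]:
    """Bring env in line with the desired state; return the changes made."""
    changes = []
    # Create or update anything that differs from the desired state.
    for key, spec in desired.items():
        if key not in env.resources:
            changes.append(f"create {key}")
        elif env.resources[key] != spec:
            changes.append(f"update {key}")
        else:
            continue  # already as desired, nothing to do
        env.resources[key] = dict(spec)
    # Remove anything that is no longer declared.
    for key in list(env.resources):
        if key not in desired:
            changes.append(f"delete {key}")
            del env.resources[key]
    return changes


# One definition, used for every environment (illustrative values only).
DESIRED = {
    "web": {"instances": 2, "version": "1.4.0"},
    "queue": {"retention_days": 7},
}

nonprod = Environment("non-production")
prod = Environment("production", {"web": {"instances": 2, "version": "1.3.0"}})

first_nonprod = apply(DESIRED, nonprod)  # fresh environment: everything created
first_prod = apply(DESIRED, prod)        # existing environment: only differences
second_prod = apply(DESIRED, prod)       # idempotent: no changes on a re-run
```

Because `apply` only acts on differences between the declared and actual state, re-running it is safe, and both environments end up with identical resources regardless of where they started.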
How the team respond to incidents is defined not only by getting services running again, but also by the follow-up, which is as important as what is achieved during the incident.
“We tackle this by creating problem tickets to investigate the root cause through use of the ‘five whys’ and implementing code fixes or process changes to reduce the chance it will happen again.” Andy adds, “This tests our use of persuasion and influencing skills, as business units may be less interested in implementing fixes once the service issue has passed, but it’s key to what we do.”
SRE tooling and process development
Tools are developed and processes are refined to increase team automation and reduce the time and cost spent on repetitive tasks.
Andy says, “This allows us to use our technical skills building with technologies such as Terraform, Python, Go or documenting information via Markdown so it’s shareable and searchable.” He adds, “We use GitLab through a variety of IDEs (integrated development environments) such as IntelliJ or VS Code or simply a plain terminal window. There are lots of opportunities to learn new technical skills or new or updated AWS services.”
Adapting to new ways of working
COVID-19 has changed many of the ways Andy’s colleagues work. They have few traditional telephone conferences but lots of interaction across Slack, Teams and Skype.
“In general, we have probably never been in more communication as a team than we are now,” says Andy. “There are SREs based around the country. Initially, we were quite hub-focused, but now we are more team/squad focused, regardless of where an individual is based.”
Agile ceremonies are run virtually, with video strongly encouraged to give some human context and team bonding.
“Quite often we’ll run short calls between a number of engineers to screen share problems or take longer calls to pair-program through issues,” says Andy. “There have been many positives to how we have changed our ways of working which may have taken much longer to put in place if restrictions had not been in place.
“Within the team I work in, we implemented many technical changes to increase VPN capacity to enable remote working and provided external Citrix access into DWP resources for other government departments who could no longer access through DWP offices.”
He adds, “We’ve also accelerated onboarding of services into production which were needed much sooner than planned.”
Reducing toil and increasing automation
“For me personally, I enjoy the wide variety of opportunities and challenges the SRE role presents every week,” says Andy. “I like the interaction with people from business units. I like implementing new features for them or helping to resolve incidents with them. It’s satisfying to help people achieve their goals. I like to learn about new technologies or new languages so we can create something we couldn’t previously.
“Overall, I like trying to live up to the SRE mantra of reducing toil and increasing automation.”