What is data engineering?

Everything you need to know about the role of a data engineer

What does a data engineer do?

A data engineer takes raw data, transforms it and stores it in formats appropriate to the use cases.  

An analogy is the fuel industry. Oil is extracted from a well, transported, refined into different products (diesel, jet fuel, LPG, biofuels) and stored, ready for further use. The whole process is monitored, secure and automated, with alerts in place when problems arise. Data engineering is the same concept, with data instead of oil.
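
To make the extract, refine and store idea concrete, here is a minimal Python sketch. It is an illustration only: the sample data, field names and the functions `extract`, `transform` and `load` are all assumptions made up for this example, not a description of any real pipeline.

```python
import csv
import io
import json

# Hypothetical raw input: a small CSV extract with inconsistent formatting.
RAW_CSV = """claim_id,region,amount
1, north ,120.50
2,SOUTH,80
3,north,not_available
"""

def extract(raw_text):
    """Read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Clean each row; rows that cannot be parsed become alerts."""
    clean, alerts = [], []
    for row in rows:
        try:
            clean.append({
                "claim_id": int(row["claim_id"]),
                "region": row["region"].strip().lower(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            alerts.append(f"Could not parse row: {row}")
    return clean, alerts

def load(rows):
    """Store the cleaned rows as JSON lines, one format a consumer might use."""
    return "\n".join(json.dumps(r) for r in rows)

rows, alerts = transform(extract(RAW_CSV))
stored = load(rows)
```

The third row cannot be converted to a number, so it ends up in `alerts` rather than in the stored output, which mirrors the idea of raising an alert when a problem arises.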

The goal of a data engineer is to make data accessible to a wide variety of customers, so they can make informed, data-driven decisions and optimise the performance of their organisation.

They’re the person who initially interacts with that data and funnels it through the organisation for further use – by data scientists, digital performance analysts, the analytical community or other external customers.  

They design and develop performant, robust, scalable and cost-effective data products and services.  

What skills or knowledge do I need to be a data engineer? 

A data engineer needs to be logical and thorough, but also a flexible problem solver. Traditionally, data engineers have come from an IT or science background, but with the training now available, anyone from any background who has these general skills and is willing to learn could make a good data engineer.

You’ll need to be able to think logically, solve problems, and show enthusiasm. You’ll be willing to learn new things, communicate well and work well in a team.

These are the key skills:

  1. Communicate clearly: You’ll constantly communicate with different stakeholders across the business and other technologists. Being articulate and able to convey complex technical problems at the right level to different stakeholders is important.
  2. Be a problem-solver: You’ll be able to process information, present the problem statement and be solution-orientated.
  3. Show passion and drive: You’ll invest in developing your skills and take pride in your personal development. You’ll also need to be adaptable to change, be flexible and able to pick things up quickly.
  4. Collaborate with others: You’ll be able to work with and learn from others and share your knowledge. The tech stack is so vast you can’t expect to know it all, so you’ll need to be able to cover it together as a team.
  5. Think logically: You’ll take complex problems and break them down into logical components, ensuring solutions are supportable and extensible.

What are data products and services?

Data products and services vary depending on what the customer needs. For example, a product or service could be: 

  • an optimised, intuitive analytical database that enables finance or operations to easily create reports, analyse data and draw insights
  • a data lake of raw or semi-structured data that data scientists can explore, play with and search for trends or patterns
  • an output for another government department, such as a set of caseloads, or the names and addresses of vulnerable people or carers, to support COVID-related projects
  • data streams that enable immediate interrogation and support real-time decision-making

Who is the customer?

The customer is anyone who requires access to data to help them to improve their service or products. This could be: 

  • internal: analysts, operations or data scientists 
  • external: other government departments, such as the NHS, HMRC, ONS, education authorities or councils.  

The data we provide informs their decision making and policy. In some cases, customers interact directly with data products through APIs. Prescription checking is one example: if a citizen who receives benefits visits a pharmacy, data products allow a real-time check, so a decision can be made on the spot about whether or not they need to pay for their prescription.

Ultimately, the customer is the citizen, either directly or indirectly. Supporting the most vulnerable and improving people’s lives is one of the core purposes of DWP, and data engineers provide key services that enable this.

What's exciting about data engineering as a career?

Data is phenomenally important to the way we live our lives and supports how organisations make better decisions. The days of intuition-based decision-making have gone. The prime objective in collecting, transforming, provisioning and analysing data is to support better decision-making.  

Data, when refined, leads to information, which in turn is used to drive more income or reduce costs. Examples are wide and varied. Social media companies harvest data to understand people’s patterns and habits, which can then be monetised. The sports industry increasingly uses data to improve performance by identifying areas for improvement and focus. The manufacturing industry also uses data to improve processes, or to proactively prevent problems occurring.

As we move into the era of cloud computing and advanced technologies such as ML and AI, traditional constraints on compute power, storage and deployment times are removed. This opens up opportunities to use data more effectively, moving from descriptive analytics (looking at the past), through predictive (what is likely to happen) to prescriptive, where the data provides the best course of action to affect future outcomes.
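
The three levels of analytics can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the weekly figures, the straight-line forecast and the `CLAIMS_PER_ADVISER` capacity are all invented for the example, and real predictive work would use far richer models.

```python
import math
from statistics import mean

# Hypothetical weekly counts of new claims (illustrative numbers only).
weekly_claims = [100, 110, 120, 130]

# Descriptive: summarise what has already happened.
average_so_far = mean(weekly_claims)

# Predictive: fit a straight line by least squares and project one week ahead.
weeks = list(range(len(weekly_claims)))
x_bar, y_bar = mean(weeks), mean(weekly_claims)
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, weekly_claims))
    / sum((x - x_bar) ** 2 for x in weeks)
)
intercept = y_bar - slope * x_bar
forecast_next_week = intercept + slope * len(weekly_claims)

# Prescriptive: turn the forecast into a recommended action.
CLAIMS_PER_ADVISER = 20  # assumed capacity, purely illustrative
advisers_needed = math.ceil(forecast_next_week / CLAIMS_PER_ADVISER)
```

The descriptive step only looks back, the predictive step extends the trend, and the prescriptive step converts the prediction into a decision someone could act on.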

As a data engineer, you know you’re creating data products that enable all the above. It’s an exciting and rewarding area to work in. 

But it’s challenging. With data, you can take software engineering to another level, and you get to use the latest technologies in a constantly evolving environment.

Data engineers and data scientists are currently two of the highest-paid and most sought-after roles in IT. As a data engineer, you’ll need to build your expertise over time with experience and learning. There’s a high demand for people with these skills. 

What’s the career pathway of a data engineer?

At the start of their career, a data engineer will typically: 

  • learn the basics of ingesting, transforming and provisioning data through pipelines, and the supporting architecture
  • think about how to model data to meet a use case
  • work with end customers to understand their needs along with the needs of the business. 

Mid-career, a data engineer will:  

  • Design and develop data ecosystems. Understand a variety of tools and technology and how they work together to deliver the best products for customers 
  • Understand the use of infrastructure and automation to support data pipelines, utilising the power of cloud computing to maximise benefits and opportunities for customers.  
  • Design systems that are fully automated, robust, self-documenting, highly available with built-in telemetry supporting operations and complying with data governance and legislative requirements.  
  • Work with a wide range of other data professionals to build best-of-breed, holistic data solutions.

This skillset enables data engineers to transition to other roles like infrastructure and software engineering or data science. 

How does a data engineer help drive innovation? 

A data engineer provisions the right data, at the right time, in the right format for the particular use case. This data is then used in any number of ways, with techniques including:

  • Analytics: text mining, natural language processing (NLP), SQL, R, Python and visual recognition. There is a huge variety of use cases, from basic reporting and dashboards to automated reading of health scans to detect, for example, cancers. DWP uses many of the above, for example NLP to interrogate case notes, look for risk to individuals and help caseworkers with the next best action.
  • Machine learning: teaching a computer system to learn and perform tasks without human interaction, using algorithms and statistical models. It is used in fraud detection and prevention. Cybersecurity is another strong use case: mitigating the myriad of evolving threats, tracking behaviour within networks and identifying security gaps.
  • Visualisation: BI and visualisation tools to convey complex messages.

In our work, we use data to help the department provide a better service to citizens and make a positive difference to their lives.

Data can help us to: 

  • predict who is suitable for specific benefits
  • understand what the next best course of action for a citizen is
  • prevent people going back on to benefits
  • keep citizens safe from harm
  • support government policy
  • support other government departments in meeting their objectives

How much data do we hold in DWP Digital?

We hold records or information about every UK citizen, which is essentially over 60 million people. This includes the multiple interactions citizens have with the department and its services. This granular data enables us to understand citizen journeys, and helps us improve their overall experience and the quality of the services we provide.

For example, every time a person interacts with Universal Credit, they generate an event. This could be when a customer:

  • registers a claim 
  • talks to a customer service representative about their claim  
  • is accepted or rejected for a claim 
  • receives a payment 

This helps us to understand how long it takes a customer to access the service they need, and what problems they encountered along the way.
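
As a rough sketch of how events like these could be turned into a journey measure, the Python below works out the number of days between registering a claim and receiving a first payment. The event names, field names and sample records are hypothetical, chosen only for illustration, and do not reflect the real event schema.

```python
from datetime import datetime

# Hypothetical event records; field names and event types are assumptions.
events = [
    {"customer_id": "A", "event": "claim_registered", "at": "2024-01-01T09:00:00"},
    {"customer_id": "A", "event": "claim_accepted", "at": "2024-01-08T10:00:00"},
    {"customer_id": "A", "event": "payment_received", "at": "2024-02-05T09:00:00"},
    {"customer_id": "B", "event": "claim_registered", "at": "2024-01-03T14:00:00"},
    {"customer_id": "B", "event": "claim_rejected", "at": "2024-01-10T14:00:00"},
]

def days_to_first_payment(events):
    """Return days from claim registration to first payment, per customer."""
    registered, paid = {}, {}
    # ISO 8601 timestamps sort correctly as plain strings.
    for e in sorted(events, key=lambda e: e["at"]):
        when = datetime.fromisoformat(e["at"])
        if e["event"] == "claim_registered":
            registered.setdefault(e["customer_id"], when)
        elif e["event"] == "payment_received":
            paid.setdefault(e["customer_id"], when)
    # Only customers who have both events get a journey time.
    return {
        cid: (paid[cid] - start).days
        for cid, start in registered.items()
        if cid in paid
    }

journey_days = days_to_first_payment(events)
```

Customer B never receives a payment in this sample, so they do not appear in `journey_days`. In practice, cases like that are often exactly the ones worth investigating.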

To give you a sense of the scale, there are about 18 million events per day and billions of events held in our databases.

What tech stack does a data engineer use?

New technology is released all the time, so the only constant is change. However, there are general principles and best practices for how we collect, move, store, govern and use data.

The technology used is very much down to lines of business and teams, provided it is approved and on the DWP tech radar.

These are some of the technology and tools used in data engineering: 

Platforms 

  • On premise
  • AWS
  • Microsoft Azure
  • Google Cloud Platform 

Storage 

  • Oracle
  • AWS RDS
  • AWS S3
  • MongoDB
  • DynamoDB
  • Azure Blob

Object Storage 

  • AWS S3
  • Azure Blob Storage
  • ADLS (Azure Data Lake Storage)
  • HDFS

Big data 

  • Cloudera
  • AWS EMR 

Extract, Transform and Load (ETL)

  • Informatica
  • AWS Glue
  • AWS Data Pipeline
  • AWS Step Functions
  • Azure Data Factory 

Messaging and Streaming 

  • Kafka
  • Spark Streaming 

Languages 

  • SQL
  • HQL
  • Python
  • Scala
