Senior data scientist Daniel Routledge explains how Python is helping us analyse data and improve digital services
DWP Digital depends on massive amounts of data to deliver services to 20 million people across the UK. As a Senior Data Scientist, I use Python to analyse data and help improve our services.
Python was the third most popular language on GitHub in 2019, so it has a fantastic worldwide community which we can tap into to help solve problems.
Python lets us plug into many different formats of data, from old-fashioned CSV files through to modern RESTful APIs. The Pandas library makes cleaning and reshaping that data easy and lets us work with various formats of data in a consistent way.
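As a rough illustration of working with different formats in a consistent way, here is a minimal sketch. The column names, values and the JSON payload are made up for the example; in practice the CSV would come from a file and the JSON from an API response.

```python
import io
import json

import pandas as pd

# Hypothetical CSV export (in practice this could be a file path).
csv_text = """job_title,location,salary
Data Analyst,Leeds,32000
Care Assistant,Newcastle,21000
"""

# Hypothetical JSON payload, shaped like a typical REST API response body.
api_body = json.dumps([
    {"job_title": "Data Engineer", "location": "Manchester", "salary": 45000},
    {"job_title": "Support Worker", "location": "Leeds", "salary": 22000},
])

from_csv = pd.read_csv(io.StringIO(csv_text))
from_api = pd.DataFrame(json.loads(api_body))

# Both sources are now ordinary DataFrames, so they can be combined and
# cleaned with the same code.
adverts = pd.concat([from_csv, from_api], ignore_index=True)
adverts["location"] = adverts["location"].str.strip().str.title()

# Average advertised salary per location across both sources.
mean_salary_by_location = adverts.groupby("location")["salary"].mean()
print(mean_salary_by_location)
```

Once everything is a DataFrame, the rest of the analysis does not need to know where each row came from.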
Once we’ve wrangled the data into the shape we need, we can then turn to scikit-learn (SKLearn), Python’s most popular machine learning toolkit, to get down to the real data science.
Making data more manageable
We’ve used these tools to tackle some big problems in DWP Digital. For example, we recently identified how difficult it can be for customers to find relevant jobs online. To help solve this, we looked at matching people with online job adverts by looking at their location, pay, skills and the keywords they used.
We were able to use SKLearn’s dimensionality reduction tools in Python, which are useful in making big data more manageable, to identify which job roles are similar to each other based on the language used to describe them.
We started the process by counting how often keywords appear alongside each other under different job titles, using Pandas. We then used SKLearn’s pairwise metrics, specifically cosine similarity (one minus the cosine distance), to score how similar job roles are on a scale from zero (none of the same language) to one (described identically).
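To make the idea concrete, here is a simplified sketch rather than our production code. It assumes a small, made-up table of job titles and keywords, counts keywords per job title with Pandas, and then compares job titles with SKLearn's `cosine_similarity`. That function scores identical descriptions as one and descriptions with no shared keywords as zero; the cosine distance is simply one minus this score.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical (job title, keyword) pairs extracted from adverts.
rows = [
    ("Data Analyst", "sql"),
    ("Data Analyst", "python"),
    ("Data Analyst", "excel"),
    ("Data Scientist", "python"),
    ("Data Scientist", "sql"),
    ("Data Scientist", "statistics"),
    ("Care Assistant", "compassion"),
    ("Care Assistant", "driving"),
]
pairs = pd.DataFrame(rows, columns=["job_title", "keyword"])

# Count how often each keyword appears under each job title.
# Rows are job titles, columns are keywords.
counts = pd.crosstab(pairs["job_title"], pairs["keyword"])

# Compare every job title with every other job title.
similarity = pd.DataFrame(
    cosine_similarity(counts.to_numpy()),
    index=counts.index,
    columns=counts.index,
)

print(similarity.round(2))
```

In this toy data, Data Analyst and Data Scientist share two of their three keywords, so they score about 0.67, while Care Assistant shares nothing with either and scores zero.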
This helps people looking for work on DWP’s Find a Job service, by recommending similar jobs to the one they were searching for. It might be the same kind of job with a different title, or a role in a different sector that helps someone make use of their transferable skills.
Solving problems in the cloud
There’s more we can do with this data, and with Python. Python can be used to drive Spark, a popular distributed computing framework, through PySpark, Spark’s Python API. PySpark lets us solve problems in the cloud that would be too large to solve on a developer’s laptop.
The online job advert data gives us an opportunity to try out a technique that’s very popular on online marketplaces and video streaming platforms: Market Basket Analysis. You might be used to being told “because you watched XX, you might like XXX”. We can use PySpark’s MLlib machine learning library to do something similar with skills. So for example, “Because you have Python skills, you might have (or want to learn) Linux skills”.
We do this using MLlib’s FPGrowth (frequent pattern) algorithm, which lets us use Python to run through tens of millions of job adverts and work out what common ‘baskets’ of skills employers are asking for. The frequent skill sets that FPGrowth generates let us explore how we might recommend skills in a service like Find a Job.
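The idea behind frequent 'baskets' can be shown on a handful of adverts without Spark. The sketch below is not FPGrowth: it is a brute-force stand-in in plain Python that counts single skills and pairs of skills, which only works for tiny data. The adverts, thresholds and function names are all invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical skill "baskets", one per job advert.
adverts = [
    {"python", "linux", "sql"},
    {"python", "linux"},
    {"python", "sql"},
    {"excel", "sql"},
    {"python", "linux", "docker"},
]

MIN_SUPPORT = 0.4     # an itemset must appear in at least 40% of adverts
MIN_CONFIDENCE = 0.7  # a rule must hold in at least 70% of matching adverts


def frequent_itemsets(baskets, min_support, max_size=2):
    """Count every itemset up to max_size and keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    n = len(baskets)
    return {items: c for items, c in counts.items() if c / n >= min_support}


def pair_rules(itemsets, min_confidence):
    """Turn frequent pairs into 'if A then B' rules with a confidence score."""
    rules = []
    for items, pair_count in itemsets.items():
        if len(items) != 2:
            continue
        # Try the rule in both directions: A -> B and B -> A.
        for antecedent, consequent in (items, items[::-1]):
            confidence = pair_count / itemsets[(antecedent,)]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, round(confidence, 2)))
    return sorted(rules)


itemsets = frequent_itemsets(adverts, MIN_SUPPORT)
rules = pair_rules(itemsets, MIN_CONFIDENCE)

for antecedent, consequent, confidence in rules:
    print(f"Because you have {antecedent} skills, "
          f"you might have {consequent} skills ({confidence:.0%} of adverts)")
```

FPGrowth reaches the same kind of result without enumerating every combination, which is what makes it practical across tens of millions of adverts on a Spark cluster.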
Once we’ve found opportunities like these, we’ll explore deploying working prototypes to services using tools that work well with Python, like the Flask web framework and Docker.
These are all great, open-source tools that we’re using more in DWP Digital.