The Changing Role and Importance of Data Engineering
Bruce Philp, Head of Data Engineering North America; Tom Goldenberg, Junior Principal Data Engineer; and Toby Sykes, Global Head of Data Engineering, QuantumBlack
Over the last decade, the role of the data engineer has evolved drastically. Industry shifts such as the emergence of scalable processing and cloud data warehouses mean that today’s data is collected more frequently, is far less siloed, and is more accessible to everyone across an organisation than ever before. As these systems require increasingly sophisticated infrastructure, the role of the engineer has become integral.
This growth in significance, responsibilities and available tools means that today it can sometimes be difficult to define the data engineer role, and companies differ in what they expect from their engineering talent. This creates a confusing landscape for data engineers: how can you tell what a prospective employer will require of you, or how to categorise their data engineering capabilities, when the field now encompasses such a broad swathe of work?
With this in mind, we have gathered our top pieces of advice for data engineers to help them keep pace with an ever-changing role, charting how the position has changed and how to work alongside data science teams to deploy the best standards of practice.
Data engineering, then and now
Ten years ago, data engineering consisted of data warehousing, business intelligence, and ETL (tooling to move data from place to place). There was limited development of pipelines for analytical models, and in the majority of industries these often remained in the ‘proof of concept’ phase because there was no clear path to production.
In order to remain competitive, let alone stay ahead of the game, today’s companies are eager to invest in analytics and productionise models. The sheer size and quantity of data have also increased, giving scalability added importance.
With the changing attitudes and priorities of businesses, there are several key areas that today’s data engineers need to focus on to thrive in this modern landscape:
● Best practices in the software development lifecycle (SDLC) — These include proper use of version control, release management, and automated DevOps pipelines.
● Information security — With the ever-increasing threat of hacks, data engineers need to understand cloud security best practices and be vigilant in handling data. This includes managing data privacy in an evolving regulatory landscape (e.g. GDPR).
● Data architecture principles — These have always been important in data engineering, and include separation of concerns, degree of logical groupings, traceability, lineage, and well-defined data models (see the sketch after this list).
● Business domain knowledge — Domain expertise is increasingly required to draw insights from data.
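To make the architecture point more concrete, the sketch below shows, in plain Python, how separating concerns into small, single-purpose steps with named intermediate datasets keeps lineage traceable. The function, column and dataset names are purely illustrative and not drawn from any particular project.

```python
# Minimal sketch: each step does one thing, so the flow of data (lineage)
# can be traced without reading every function body.
# All names here are illustrative, not taken from a real project.
import pandas as pd


def extract_orders(path: str) -> pd.DataFrame:
    """Extraction only, no business logic: read raw order records from CSV."""
    return pd.read_csv(path)


def clean_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    """Data-quality rules: drop duplicates and rows missing an order id."""
    return raw_orders.drop_duplicates().dropna(subset=["order_id"])


def aggregate_revenue(clean: pd.DataFrame) -> pd.DataFrame:
    """Analytical output: revenue per customer."""
    return clean.groupby("customer_id", as_index=False)["amount"].sum()


def build_revenue_report(path: str) -> pd.DataFrame:
    """Compose the steps; each intermediate dataset is named, so lineage is explicit:
    raw_orders -> clean_orders -> revenue_per_customer."""
    raw_orders = extract_orders(path)
    clean = clean_orders(raw_orders)
    return aggregate_revenue(clean)
```

Because each function owns a single concern, quality rules, business logic and outputs can be tested, documented and traced independently.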
Three trends that changed the role of the Data Engineer
Alongside the above areas that are key for all data engineers to understand, we believe it is vital for them to be aware of the industry trends that have helped shape the role over the last decade. Understanding these developments in analytics will help engineers explain the value of their work in conversations with organisations that may not be aware of these advancements.
● The rise of cloud
Cloud has finally reached a tipping point: even sectors such as finance and government, which have historically shied away from it, are now embracing it. In the last four years alone, the market for cloud computing has more than doubled, from ~$114B to ~$236B. Amazon Web Services has led the market over the past several years (currently at 33% market share), but Microsoft Azure (13%) and Google Cloud Platform (6%) are catching up.
● The expansion of open source
Data engineering used to be dominated by closed-source, proprietary tools. Now we are seeing a growth of open source tools and, in many cases, a preference for these tools in data organisations. Open source libraries such as Spark and TensorFlow have become widespread, and many organisations are seeking to minimise vendor or product lock-in. This was a driving factor in open-sourcing QuantumBlack’s very own Python library, Kedro.
● The growth of data in scale
Companies simply have more data at their disposal than ever before, which makes it more important for data engineers to understand how to scale. More than 90% of the world’s data was created in the last few years. Data engineers need proficiency in tools that can quickly organise and assess this massive amount of data, as sketched below.
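As one illustration of the kind of tooling involved, the sketch below uses PySpark (the Python API for Spark, mentioned above) to profile a dataset that would be too large to handle on a single machine. The file path and column names are hypothetical.

```python
# Minimal PySpark sketch: reading and profiling a large dataset that would not
# fit in memory on one machine. Paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-profiling").getOrCreate()

# Spark reads the data lazily and distributes the work across the cluster.
events = spark.read.parquet("s3://example-bucket/events/")

# Quick assessment: row counts and missing-value counts per day, computed in parallel.
profile = (
    events
    .groupBy("event_date")
    .agg(
        F.count(F.lit(1)).alias("rows"),
        F.sum(F.col("user_id").isNull().cast("int")).alias("missing_user_id"),
    )
)
profile.show()
```

Because Spark evaluates lazily and distributes the computation, the same few lines scale from a local sample to billions of rows on a cluster.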
Organisation archetypes
Before working closely with an organisation, data engineers need to assess the existing infrastructure for analytics to see what level of onboarding will be required. For example, how you approach a complex project with a company inexperienced in data science and data engineering will be completely different from how you approach one with an existing analytics platform.
At QuantumBlack, we find that clients often fit one of four archetypes in regard to their data engineering capabilities. These are:
● Starter — No data science or data engineering capabilities, very limited infrastructure for analytics
● Siloed — Pockets of analytics capabilities, but uncoordinated and with messy infrastructure
● Potential — Has a functioning analytics centre of excellence (CoE), but the team needs to build more capability or lacks established processes
● North Star — Mature data organisation with a working analytics platform
In all of these archetypes, data engineering plays a critical role; it is often the make-or-break factor in whether an organisation reaches its North Star in analytics. It is widely cited that 80% of data science consists of data engineering. However, even in companies that see analytics as a competitive differentiator and have it on their agenda, we often see ineffective internal data organisation. This creates a disconnect between upper management, who wish to see value from analytics, and teams on the ground, mired in a challenging technical data environment.
Unlocking Analytics through Data Engineering
For analytics teams to work efficiently with organisations and convince them to invest in data engineering, no matter the existing analytics capability, it is important to show the long-term value that engineering talent can bring to the table. Data engineers can “unlock” data science and analytics in an organisation, as well as build well-curated, accessible data foundations.
At QuantumBlack, we believe that data engineering and data science should work hand in hand. That was part of the inspiration for Kedro, our Python library that creates a shared project platform for data science and data engineering. We have seen improved performance in clients that fully integrate the two teams rather than keeping them in separate silos.
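As a rough sketch of what that shared platform looks like in practice, the example below wires one data engineering step and one data science step into a single Kedro pipeline. The function and dataset names are invented for illustration, and the exact API may vary between Kedro versions.

```python
# A minimal sketch of a shared Kedro pipeline: a data engineering node curates
# the data, and a data science node consumes it. All names are illustrative;
# the dataset names would be resolved via Kedro's Data Catalog.
import pandas as pd
from kedro.pipeline import Pipeline, node


def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Data engineering: apply data-quality rules before any modelling happens."""
    return raw.drop_duplicates().dropna(subset=["amount"])


def summarise_spend(clean: pd.DataFrame) -> dict:
    """Data science: a stand-in analytical step (here, just a summary statistic)."""
    return {"mean_spend": float(clean["amount"].mean())}


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            node(clean_transactions, inputs="raw_transactions", outputs="clean_transactions"),
            node(summarise_spend, inputs="clean_transactions", outputs="spend_summary"),
        ]
    )
```

Because both teams describe their work as nodes over the same named datasets, the hand-off between engineering and science is explicit and version-controlled rather than buried in notebooks.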
In short, companies must consciously invest in developing their data engineering capability in order to have a truly successful analytics programme. This includes setting a solid foundation with data governance — identifying gaps and quality issues while improving data collection. The organisations that thrive in the years ahead will be those that not only acknowledge the challenges of developing and productionising models, but actively invest in engineering talent in order to maximise value from data.
If you are interested in joining our data engineering team, please see here for roles currently available and contact us at careers@quantumblack.com.