Defining the role of a Data Engineer
The role of the Data Engineer has evolved rapidly over the past few years, and that has created a lot of confusion. Some people with the title Data Engineer spend most of their time writing SQL queries, while others are creating databases from scratch. This makes things difficult for Engineering and Recruiting departments: companies may not know when to shift responsibilities to Data Engineers, and recruiters may have trouble writing job postings. I’m going to try to clear that up in this post.
Data Engineering was born out of the need to have a Business Intelligence Analyst with most of the skills of a Software Engineer. The market used to be dominated by a few large companies and the scope of technologies was relatively narrow. Since then, the industry has been infused with many different types of technologies, which has greatly increased the scope of responsibilities. Today’s Data Engineer has most of the skills of a Software Engineer with additional skills to create solutions specific to data.
Examples of tasks for Data Engineers
- Administer AWS EC2 instances
- Build dimensional data marts per Kimball or Inmon
- Build tools for managing data in Python, PHP, Ruby or Go
- Containerize applications and services using Docker, Swarm or Kubernetes
- Create backup and recovery plans
- Create CI/CD processes to manage deployments
- Create data modeling diagrams using Visio, ERwin or Lucidchart
- Create data pipelines for dependency management in Airflow, Gobblin, Luigi, NiFi or Jenkins
- Create ELT jobs using Spark, Hive, Python, Scala, SQL, Java, Pig or PL/SQL
- Create ETL jobs using SSIS, Informatica, Talend or Pentaho
- Create MapReduce functions in Java
- Create REST APIs in Python or Java to ingest data from third-party tools such as Salesforce or Zuora
- Create reports in Tableau, Qlik or Looker
- Deploy streaming tools such as Kafka, Flink, StreamSets, Beam or AWS Kinesis
- Deploy and optimize… distributed databases such as Hadoop, Spark, Druid or Elasticsearch
- Deploy and optimize… MPP databases such as Redshift, Teradata, or Snowflake
- Deploy and optimize… NewSQL databases such as Rockset or TiDB
- Deploy and optimize… NoSQL databases such as MongoDB, HBase or Cassandra
- Deploy and optimize… relational databases such as MySQL, Postgres, SQL Server or Oracle
- Design and develop data security policies
- Manage AI/ML modeling data pipelines in pandas, scikit-learn or PySpark
- Migrate services between AWS, Google Cloud Platform and on-prem solutions
- Support end users with a large range of technical skills, from data scientists to management
- Track down data quality issues
- Train companies and departments on best practices related to databases and data platforms
All of the tasks above can fall under the scope of Data Engineering. This is a problem because, taken together, they involve a very wide range of skills, and it’s unrealistic to expect any Data Engineer to be an expert in all of them. Part of the solution is that Data Engineers must be very good at learning new technologies. Another part is to create categories of Data Engineers, and we’re starting to see that today: titles like Data Infrastructure Engineer and Data Modeling Engineer now exist. Here is a list of common specialties within Data Engineering:
Data Engineer
Primary purpose is to cover Data Engineering needs for the entire company. Typically comfortable with many skills but not a master of any specific one. Should understand cloud services, databases, ETL, data pipelines, SQL, at least one object-oriented language, and Unix. Must be a quick learner because this type of role has you moving in different directions. Usually found at small companies or at companies with new Data Engineering departments.
Sr Data Engineer
All the skills of a Data Engineer plus the ability to lead projects that impact a single department. Comfortable with ‘Big Data’ concepts. Can contribute to existing frameworks.
Staff Data Engineer
All the skills of a Sr Data Engineer plus the ability to lead projects that impact the entire company. Has the ability to create frameworks from scratch. Comfortable making Build vs Buy decisions plus defining and executing roadmaps for large projects.
Data Architect
Primary purpose is to act as a technical leader for all things related to data. Typically provides guidance for projects, does not have direct reports and leads data governance efforts. Should have all the skills of a Staff Data Engineer. Given a project with a dozen Software Engineers and a few Data Engineers, there will be one Data Architect who provides guidance for data-related technology decisions. This role is commonly found at large companies and consulting companies.
Data Engineer - Analytics
Primary purpose is to deliver analytical reports and analysis. Should have a good understanding of reporting tools like Tableau, Qlik and Looker. Should be able to optimize queries and databases. This role shares a lot in common with a Business Intelligence Analyst.
Data Engineer - Data Pipelines / ETL
Primary purpose is to create and manage data pipelines. Should be able to create their own tools for managing data pipelines or deploy tools like Airflow and Informatica. Has a good understanding of design patterns for batch processing and stream processing.
Data Engineer - Infrastructure
Primary purpose is to create and manage infrastructure projects. Should be able to create tools and build applications. Has a good understanding of OOP, open source projects and cloud services. Many open source projects are spawned by these types of Data Engineers.
Data Engineer - Modeling
Primary purpose is to create and manage dimensional models. Should be able to debate Kimball and Inmon design patterns. Has an in-depth understanding of database optimization.
The future of Data Engineering roles
Looking ahead, we know that data is not going away; it is only going to grow. Technology will keep adjusting to handle the challenges that come with big data, on the infrastructure side as well as on the application and tooling side. Today, we are seeing large growth in the Data Engineer - Infrastructure role to keep up with the explosion of big data applications under Apache, AWS and GCP. Data has to be accessible before it can be applied, and we need people to bridge the gap from raw data to applied data. Those roles may be classified as data engineers, data scientists, or ML engineers.
Thank you Cindy Rottinghuis for your contributions and editing.