There seems to be a disconnect around hiring data engineers.
The field has split into two distinct kinds of role:
1) Traditional data engineer roles, which require mostly SQL and orchestration
Whereas plenty of roles out there are really a better fit for:
2) Software engineers with a focus on data
What type of data engineer is best suited to helping machine learning teams set up custom data pipelines? The services involved could include Redshift, Kinesis, EMR, Glue, and EKS.
Definitely the second one.
In fact, this is how it works at Amazon Prime ML. It is mostly software engineering work with a focus on data engineering projects.
We see the same thing in data science and machine learning (job descriptions and working professionals):
1) Traditional data scientist or ML engineer roles, which require mostly SQL and Jupyter notebooks
Whereas plenty of projects out there need:
2) Software engineers with a focus on production machine learning engineering
This is exactly what I see in my team.
What we do as ML engineers is modern cloud software engineering with a focus on productionizing ML models through Well-Architected inference pipelines and an MLOps platform.
With proper domain-based feature engineering, training good models is straightforward – especially when leveraging pre-built algorithm containers with a proven track record of success, such as XGBoost.
The real work that creates value for the business is everything that supports the production deployment of trained models at scale.
And fundamentally, it’s all about software engineering – with a focus on ML:
- Scalability, Extensibility, Modularity, & Testability
- Consistency & Reproducibility
- Logging & Monitoring
- [Serverless] Microservices Architecture
- Infrastructure-As-Code & Configuration Management
- Continuous Integration & Continuous Deployment/Delivery (CI/CD)
- Versioning & Rollback
- Fault Tolerance & Failure Recovery
- Containerization & Container Orchestration
- Model Deployment
- Model Drift Monitoring
- Model [Re]Training
- Model Evaluation
- Model Explainability
- Multi-Model Management
- Pipeline Metadata Management
- Feature Store
- Experiment Tracking & Management
This short list gives you an idea of what is truly involved in successful enterprise machine learning projects. My team is composed of this type of “full-stack ML engineer” (myself included), and we work on every single point above (and much more).
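To make one item from the list concrete, here is a minimal sketch of model drift monitoring using the Population Stability Index (PSI) on a single numeric feature. It is an illustration under stated assumptions, not my team's implementation: the function name `population_stability_index`, the equal-width binning, the default of 10 bins, and the `eps` floor for empty bins are all choices made for this example.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,      # assumed default; tune per feature
    eps: float = 1e-6,     # assumed floor so empty bins do not produce log(0)
) -> float:
    """PSI between a baseline (training-time) sample and a current (serving-time) sample."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    # Equal-width bins are derived from the baseline range only.
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    # PSI = sum over bins of (actual - expected) * ln(actual / expected)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Usage: compare a feature's training distribution with recent production data.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(population_stability_index(baseline, baseline))  # identical samples
print(population_stability_index(baseline, shifted))   # distribution has moved upward
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that threshold is an assumption to validate for each feature. In a production setting, a check like this would typically run on a schedule and feed the logging, monitoring, and retraining pieces listed above.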
Many people in the DS/ML world like to say data science is not software engineering.
The more accurate statement is that most data scientists are not software engineers.
And this is a big problem, as evidenced by the high failure rate of enterprise ML projects.
Data science skills are certainly mandatory, but they should be a given for any team member. You can safely assume that anyone with enough experience has them.
However, if you want to create real business value from ML and see a return on investment, place greater emphasis on software engineering skills when building machine learning and data engineering teams.
What do you think? Comment below with your thoughts or questions.
If you need help implementing AWS Well-Architected production machine learning solutions, training/inference pipelines, or MLOps, or if you would like us to review your solution architecture and provide feedback, contact us or send me a message. We will be happy to help.
Written by Carlos Lara, Director of Data Science & Machine Learning Engineering
Follow Carlos on LinkedIn: https://www.linkedin.com/in/carloslaraai/