AI/ML Infrastructure Engineer
Content + Source + Freshness • 16 Dec 2025 • 95% confidence
Offer value
This role scores highly because it calls for in-demand AI/ML infrastructure skills, requires significant experience, and offers room to grow technical expertise.
- Opportunity to architect AI/ML solutions directly
- Competitive salary typical of in-demand technical roles
- Scope for professional development in cutting-edge technologies
- Requires significant experience and collaborative skills
Pros
- High demand for AI/ML skills in a dynamic environment
- Opportunity to work with cutting-edge cloud technologies
- Role provides significant autonomy in project implementations
Cons
- Requires extensive experience (6+ years), limiting applicant pool
- Potentially high-pressure environment with multiple stakeholders
- Limited work-life balance due to project deadlines
Who it's for
Mid to Senior • On-site with potential for hybrid
Good fit
- Experienced AI/ML engineers
- Technical architects with cloud experience
- Collaborative professionals eager for impactful roles
Not recommended for
- New graduates or inexperienced professionals
- Individuals seeking low-pressure working conditions
- Candidates uninterested in collaborative, high-stakes projects
About the job
This role is a member of the AI/ML Infrastructure Engineering team and is dedicated to implementing and supporting AI/ML infrastructure solutions in cloud and on-premise environments. The role works directly with infrastructure teams and may also interface with data scientists, machine learning engineers, application developers, and quantitative analysts. It functions both as a solutions architect, helping these customers implement their own AI/ML solutions, and as a professional services engineer, implementing solutions for them in environments such as AWS, GCP, and Kubernetes.
This is a hands-on developer role. Candidates ideally have experience deploying and supporting their own production-ready AI/ML models in cloud environments, as well as automating the build and management of a broad range of cloud infrastructure using tools like Terraform. Candidates should be familiar with developing unit and functional tests, have experience designing and implementing CI/CD tools with infrastructure-as-code pipelines, and have knowledge of Linux systems administration, containerization, networking, security, automated configuration and state management, cross-system orchestration, logging, metrics, monitoring, and alerting.
Principal Responsibilities:
• Architect, develop, and maintain internal AI/ML infrastructure components, frameworks, and offerings
• Architect, develop, and maintain AI/ML solutions for customers in cloud environments
• Help customers architect, develop, and maintain their own AI/ML solutions in cloud environments
• Implement CI/CD pipelines that include application tests, security tests, and gates
• Implement availability, security, performance monitoring, and alerting of AI/ML solutions
• Automate data resiliency and replication for AI/ML models
• Manage multiple environments and promote code between them
• Automate systems configuration and orchestration using tools such as Terraform, Chef, Ansible, or Salt
• Automate creation of machine images and containers
Required Qualifications/Skills
• 6+ years of experience designing and supporting production cloud environments
• Experience consulting with customers to develop AI/ML solutions
• Experience developing collaboratively, including infrastructure as code, preferably in Python
• Systems engineering knowledge, including understanding of Linux, security, and networking
• Experience with cloud templating tools such as Terraform
• Experience with AI/ML frameworks (e.g., TensorFlow, PyTorch)
• Experience with distributed computing tools (e.g., Ray, Dask)
• Experience with model serving tools (e.g., vLLM, KFServing)
• Experience with building, monitoring, and alerting on logs and metrics
• Knowledge of cloud networking, including connectivity, routing, DNS, VPCs, proxies, and load balancers
• Knowledge of cloud security, including IAM, certificate management, and key management
• Excellent written and verbal communication skills
• Excellent troubleshooting and analytical skills
• Self-starter able to execute independently, on a deadline, and under pressure

