Software Engineer - AI/ML, AWS Neuron

Amazon Web Services (AWS)
Cupertino, CA
$165,200 - $223,600
Full Time
Mid Level
3+ years

Posted 2 weeks ago

About This Role

This role is for a Machine Learning Engineer on the Distributed Training team for AWS Neuron, responsible for developing, enabling, and performance-tuning a wide variety of ML model families. You will help lead efforts to build distributed training support into PyTorch and JAX using the Neuron compiler and runtime stacks, and tune models for the highest performance and efficiency on AWS Trainium instances.

Responsibilities

  • Develop, enable, and performance-tune a wide variety of ML model families, including massive-scale large language models (LLMs) such as GPT and Llama, as well as Stable Diffusion and Vision Transformers (ViT)
  • Work with chip architects, compiler engineers, and runtime engineers to create, build, and tune distributed training solutions on Trainium instances
  • Lead efforts to build distributed training support into PyTorch and JAX using the Neuron compiler and runtime stacks
  • Tune ML models for the highest performance and efficiency on customer AWS Trainium instances
  • Utilize strong software development and ML knowledge to contribute to the team

Requirements

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture experience (design patterns, reliability and scaling) of new and existing systems
  • Experience programming in at least one programming language
  • Experience with training large ML models using Python

Qualifications

  • Bachelor's degree in computer science or equivalent
  • 3+ years of non-internship professional software development experience, 2+ years design or architecture experience

Nice to Have

  • 3+ years of full software development life cycle experience, including coding standards, code reviews, source control management, build processes, testing, and operations experience

Skills

Python * PyTorch * DeepSpeed * JAX * AWS Neuron * AWS Trainium * AWS Inferentia * FSDP (Fully Sharded Data Parallel) * NeMo *

* Required skills

Benefits

Paid Time Off
Flexible spending accounts
Restricted Stock Units (RSUs)
Supplemental life plans option
Parental Leave
EAP
401K Matching
Sign-on payments
Medical advice line
Health insurance (medical, dental, vision, prescription)
Mental Health Support
Basic Life & AD&D Insurance
Adoption and Surrogacy Reimbursement coverage

About Amazon Web Services (AWS)

AWS Infrastructure Services owns the design, planning, delivery, and operation of all AWS global infrastructure, powering millions of businesses and services worldwide.

Technology