Senior Platform Engineer
ROLE
Firmus Technologies is seeking a Senior Platform Engineer to join our Engineering and Technology team. You will drive the design and implementation of our MLOps capability. You will also collaborate with other engineers and make technical decisions on scaling the platform engineering capabilities of the Firmus AI Factory to planet scale, spanning IaC, container orchestration, observability, the self-service portal, and platform security. This role is ideal for a self-starter with a passion for building things from first principles. You naturally break down complex problems into their fundamental truths to uncover novel and elegant solutions rather than relying on conventional patterns.
KEY RESPONSIBILITIES
- Build MLOps capabilities from the ground up, enabling reproducible, scalable, and secure ML workflows across internal and customer-facing environments.
- Continuously improve our DevOps platform to ensure reliability, scalability, security, and seamless integration with CI/CD pipelines and infrastructure services.
- Design, implement, operate and secure Kubernetes-based production infrastructure for high reliability, performance and security, including clusters supporting NVIDIA GB300 NVL72 systems with NVIDIA Quantum-X800 InfiniBand or Spectrum-X Ethernet.
- Develop world-class observability platforms for internal and external customers to achieve ClusterMAX Platinum tier recognition from SemiAnalysis.
- Integrate Firmus central services with NVIDIA’s software stack, including Mission Control, NetQ, UFM, and NMX.
- Lead the enhancement and evangelism of internal platform products that provide cohesive, composable, secure-by-default, and low-friction self-service experiences that accelerate time to market and reduce engineers' cognitive load.
- Drive incident response efforts, participate actively in the on-call rotation, and lead detailed Root Cause Analysis (RCA) to continuously improve system reliability, operational maturity, and incident handling processes.
SKILLS AND EXPERIENCE
- Bachelor's degree in computer science or a related technical field.
- 7+ years of experience as a Platform Engineer, Site Reliability Engineer, DevOps Engineer, MLOps Engineer or Observability Engineer.
- Demonstrated strong proficiency in the following areas:
- Infrastructure-as-Code, configuration management and CI/CD (e.g., Terraform, Ansible, GitHub Actions, Jenkins, ArgoCD).
- Containerization technologies (e.g., Docker), Kubernetes networking and cluster management, including upgrades and troubleshooting.
- Observability stack design and scaling (e.g., Loki, Grafana, Tempo, Prometheus, Thanos, ClickHouse).
- Telemetry solutions using various technologies (e.g., Redfish, gNMI, SNMP, eBPF, streaming analytics).
- Unified telemetry collection with OpenTelemetry.
- Compliance automation (e.g., OPA, Kyverno).
- Competence in scripting and programming (e.g., Bash, Python, Go).
- Systems knowledge of Linux internals, networking stacks, and distributed storage.
- Clear and effective English communication, written and spoken.
- Bonus: Experience in high-growth startups or regulated industries with robust security and data privacy requirements, including SOC 2 Type 2 and ISO 27001.
Firmus Technologies is a global leader pioneering the solution to AI’s energy challenge, founded in Australia in 2019 by a visionary team of entrepreneurs and engineers passionate about sustainable computing infrastructure.
Firmus builds and operates AI infrastructure across Asia-Pacific, utilising its proprietary AI Factory platform to deliver transformative, cost-effective GPU clusters and AI cloud services for developers, enterprise, education and government users.
We are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering.




