Position Overview
We are seeking a Platform Reliability Engineer to design, build, and maintain large-scale, distributed systems that power data, machine learning, and enterprise platforms. This role combines systems reliability, automation, and infrastructure engineering to ensure seamless performance and high availability across global environments. The ideal candidate is proactive, detail-oriented, and thrives on solving complex engineering challenges with precision and innovation.
Why This Role Matters
In an environment where digital systems drive business operations, reliability isn’t just a goal; it’s the foundation. The Platform Reliability Engineer ensures critical platforms operate with consistency, scalability, and security. By building resilient infrastructure and integrating automation at scale, this role helps organizations move faster, reduce downtime, and deliver dependable digital experiences. As AI and data workloads grow, reliability engineering has become central to operational excellence.
About the Role
As a Platform Reliability Engineer, you’ll work on large-scale infrastructure powering machine learning and data-driven platforms. You’ll focus on ensuring system uptime, optimizing performance, and managing high-throughput workloads across cloud environments. You will also collaborate with cross-functional teams to develop observability, automation, and recovery mechanisms while exploring cutting-edge open-source and cloud-native technologies.
Key Responsibilities
- Design, build, and manage scalable distributed systems for data and ML workloads.
- Ensure platform reliability through proactive monitoring, fault tolerance, and incident response.
- Collaborate with engineering and operations teams to improve system performance and scalability.
- Automate deployments, monitoring, and recovery using cloud-native and open-source tools.
- Evaluate emerging technologies and incorporate best practices into reliability frameworks.
- Manage workloads across data, ML, and inference platforms to maintain optimal system health.
- Troubleshoot production issues and drive root cause analysis to prevent recurrence.
Minimum Qualifications
- Bachelor’s degree in Computer Science, Computer Engineering, or a related field.
- Strong understanding of operating systems, networking, and security principles.
- Hands-on experience with AWS, GCP, or Kubernetes-based environments.
- Proficiency in Python, Java, or Go, and the ability to work with open-source systems.
- Familiarity with distributed systems design, monitoring, and automation practices.
- Relevant internship or professional experience in reliability or systems engineering.
Preferred Qualifications
- Experience with data processing or model training workflows.
- Exposure to big data technologies such as Spark, Flink, or similar frameworks.
- Familiarity with cloud-managed AI/ML services (e.g., Amazon Bedrock, Google Cloud Vertex AI).
- Understanding of large language model infrastructure (GPUs, TPUs, and other accelerators).
- Knowledge of cloud networking (VPCs, DNS, security groups, Kubernetes networking).
- Expertise in performance tuning for JVMs and Linux-based systems.