Position Overview
We are seeking a Production Operations Engineer to support and enhance the reliability, availability, and performance of globally distributed, business-critical systems. This role blends software engineering with operational excellence: keeping production performance seamless, monitoring proactively, and improving systems continuously. The ideal candidate is a problem solver, systems thinker, and automation advocate who thrives in a fast-paced, mission-driven setting.
Why This Role Matters
In large-scale, always-on environments, operational excellence underpins business continuity. This role ensures that the backbone systems powering enterprise products remain reliable, efficient, and resilient under pressure. As organizations scale and customer expectations rise, operational engineering becomes essential, not just for uptime but for innovation. The Production Operations Engineer acts as the bridge between development, infrastructure, and support, turning complex systems into predictable, high-performing services.
About the Role
As a Production Operations Engineer, you’ll manage incident response, capacity planning, and system optimization across distributed environments. You’ll design automation frameworks to eliminate manual overhead, improve deployment safety, and strengthen monitoring practices. The role requires deep technical expertise, hands-on troubleshooting, and strong cross-functional collaboration. You’ll partner closely with software engineers, reliability teams, and support functions to build systems that are not only stable but also continuously improving.
Key Responsibilities
- Lead and coordinate incident management, including SEV1/SEV2 bridges and postmortems.
- Monitor production systems for performance, capacity, and reliability, ensuring 24×7 operational readiness.
- Develop automation solutions to streamline deployment, maintenance, and operational workflows.
- Create dashboards, metrics, and alerts to proactively identify issues and optimize performance.
- Collaborate with engineering teams to improve system design, deployment processes, and scalability.
- Apply observability tools (Splunk, ExtraHop, Hubble) to improve root cause analysis and anomaly detection.
- Integrate AI/ML-driven automation for predictive alerting and intelligent incident response.
- Maintain adherence to enterprise ITSM practices and operational compliance standards.
Minimum Qualifications
- 4+ years of experience managing mission-critical, large-scale web or mobile systems in production.
- Proven experience leading incident management, root cause analysis, and problem management initiatives.
- Strong understanding of distributed systems, Linux internals, and network fundamentals (HTTP, DNS, TCP/IP).
- Proficiency in troubleshooting with logs, metrics, and dashboards to gain operational insight.
- Familiarity with CI/CD pipelines, Infrastructure as Code, and container orchestration (Kubernetes, EKS).
- Strong scripting and automation experience using Python and Linux tools.
Preferred Qualifications
- Experience leading global or cross-functional operations teams.
- Background in Java, REST APIs, databases, and modern frontend technologies.
- Expertise in observability and monitoring platforms (Splunk, ExtraHop, Prometheus, Grafana).
- Knowledge of event-driven architectures (Kafka or similar).
- Exposure to AI/ML-based operational automation and self-healing systems.
- Understanding of ITSM frameworks and enterprise support processes.