Platform Reliability Engineer

Location TBD·Posted today
developer-tools, typescript, go, react, nest.js, node, node.js, redis, kubernetes, aws
Apify is the largest marketplace of tools for AI. Its 40,000+ Actors help people and agents get real-time web data, track competitors, generate leads, or integrate their apps. Actors are built by a global creator community that now earns more than $1.2 million every month. Join us to help people put the web to work. Apify can find missing children, protect consumers from fake discounts across the EU, and feed data to AI chatbots.

To support our mission, we're looking for a Platform Reliability Engineer with a developer's mindset. You've shipped code and you care what happens when it runs in production (speed, failures, recovery). You'll help us strengthen how Apify monitors systems, handles incidents, and routes alerts so engineering teams can ship with confidence.

You won't be on-call. This role is focused on sustainable improvement, not after-hours emergency response.

What you'll be working on:
Monitoring & signals: Operate and improve our monitoring stack (Prometheus, Grafana, OpenTelemetry) - instrument services to expose the right metrics, define what we watch in production, and shape alerting so teams get actionable signals without the noise.
When things go wrong: Help define how we run incidents - clear communication, structured learning afterward, and supporting artifacts (status page, runbooks).
With the team: Work with platform and product engineers to make reliability standards practical - help teams adopt better tooling or practices when things change, and write documentation people actually use.

Who we're looking for:

Must-haves:
You have hands-on experience choosing what to measure in production - not just reading dashboards, but picking signals that reflect the customer experience.
You're comfortable with incidents and alerts, from early detection through resolution and follow-up so similar issues are less likely to recur.
You have hands-on experience with Prometheus, Grafana, OpenTelemetry, or similar, and with alert-routing tools such as PagerDuty.
You read and write code: you can follow services and pipelines across the stack and collaborate on technical details with the teams building them.
You know what good post-incident culture looks like in practice - blame-free, learning-focused, and actually used to make things better - even if your past title never mentioned reliability.
You can write clear, concise guidance that teams adopt, and you work constructively toward sound decisions.
You're driven to automate repetitive tasks and improve developer workflows.

Nice to have:
Meaningful hands-on experience as an application or backend developer - you've built things that run in production and approach observability as someone who needs it as a "user," not just the person who sets it up.
Experience building and maintaining infrastructure on AWS (EC2, EKS, S3, CloudFormation, or similar), and hands-on experience with container technologies.
Some familiarity with CI/CD pipelines or release practices - enough to have an informed opinion on what makes deployments reliable and safe.

Don't worry if you don't meet all of the above criteria. We value diverse skills and experience and would love to hear from you.

Our tech stack
Infra: AWS Compute (Kubernetes (EKS), EC2, Lambda), Helm, ArgoCD, MongoDB, Redis, DynamoDB, S3, GitHub Actions
Monitoring: Grafana, Prometheus, OpenTelemetry, Mezmo, PagerDuty
Frontend: React.js, styled-components, Storybook, Chromatic, Cypress, Playwright
Backend: TypeScript/Node.js, Nest.js, Next.js, Express.js, Docusaurus, Vitest
Tools: GitHub, Notion, Google Workspace
Editor and AI assistant of your choice (GH Copilot, Cursor, Claude, Gemini, or JetBrains AI)
Process: two-week sprints, code reviews, tests, automating whatever we can, and deploying multiple times per day.

By the end of the first 3 months, we expect you to:
Have completed the general onboarding process.
Have built working relationships with platform engineers, engineering leads, and others involved in production response, and aligned on how you'll collaborate.
Understand, in principle, how the Apify platform works, and be able to handle smaller problems, incidents, or bugs on the infrastructure you work with most.
Have mapped how we handle monitoring, incidents, and alerts today - where the friction is and where a focused improvement would help.
Have published initial monitoring, observability, and alerting guidelines - covering signals, naming, key dashboards, and alerting principles (severity, routing, and noise reduction) - aligned with existing tooling.
Be participating in incident reviews and translating patterns into improved playbooks.
Be contributing actively to team ceremonies (planning, grooming) and technical discussions, and be in touch with other teams to support their infrastructure needs.

By the end of the first 6 months, we expect you to:
Be working on bigger tasks mostly independently (while staying fearless about asking for help).
Have built a network across engineering, staying in touch with other teams on infrastructure initiatives and gathering feedback to find ways to help them in their daily work.
Have teams referencing your guidance when planning higher-risk changes, with measurably less alert noise and duplicate paging.
Have incident documentation (communication, roles, lessons learned) that's easy to find and actually used during real incidents.
Own the monitoring and alerting improvement roadmap end-to-end.
Have agreed with leadership on priorities for monitoring and alerting - tooling, training, and the metrics that actually matter.

Why should you work at Apify?
Space, support, and autonomy for personal growth, with a direct impact on Apify's success
Full-time position in Prague (Lucerna Palace) or Brno (Titanium) 🏰
Option to work remotely 🛋️
Flexible working hours (perfect for both night owls 🦉 and early birds 🐥)
Nobody counts holidays as long as the work gets done 💪
Unlimited Claude for every Apifier. We don't count tokens. Just use them well 🤖
Stock options and profit sharing 💰
We welcome pets, kids, and bikes at the office 🐕👨‍👧
Epic team buildings and offsites 🚢 with biking, canoeing, and other adventures 🪂
Solid education and training budget, conference tickets, internal "Eat & Learn" sessions, and the possibility to work across teams 👩🏼‍💻👨🏽‍💻
Generous hardware budget 💻
Free lunches every day when you're in the office 🌮🍱🍜🍕🥡
Unlimited supply of ☕ & 🍺 and snacks
Free entry to the wonderful Prague Zoo 🐘
Free Multisport card 🏋
Ping-pong, chess, PS5, lightsabers, foosball league after lunch.

For more details about Apify and what it's like to work with us, see our Careers page.