Affirm’s Infrastructure Platform team is building a large-scale, massively distributed, fault-tolerant global infrastructure shared across multiple financial products, merchants and vendors. Ensuring that our infrastructure is openly available to engineers is a critical part of Affirm’s success story. We pride ourselves on a culture of careful engineering design and architecture: we write detailed tech specs and capture feedback before making large changes to systems.
We are looking for a Staff Site Reliability Engineer to join our Site Reliability Engineering team: someone with deep technical knowledge, a passion for Linux, networking, microservices and distributed architectures, and experience running large-scale services. Our goal is to make Affirm's global, service-oriented product and infrastructure stack observable, highly resilient, scalable and fault-tolerant while meeting our high SLA uptime expectations. You will excel if you have a passion for digging deep and a flair for sharp technical communication, prioritization, and organization. You will work directly with our Platform / Infrastructure and Product Development teams to build our next-generation “always up” cloud-based platform.
Our work spans Observability/Telemetry Engineering, Reliability and Scalability Engineering, Chaos Engineering, Performance Engineering, Capacity Engineering and Disaster Recovery Engineering, and includes working closely with the security team on managing application-level security.
Site Reliability Engineers are hybrid System, Software, Data and Network Engineers who are responsible and accountable for building and scaling reliable systems that impress our customers.
What you'll do
Own the end-to-end availability, reliability and performance of mission-critical services
Troubleshoot reliability, resiliency, scalability and availability issues
Define and measure SLIs, SLOs and SLAs
Augment instrumentation to build a cohesive dependency mapping with special attention to points of failure
Build command-and-control automation to fail away quickly, reducing TTR and eliminating manual work and toil
Assist with on-call and triage rotations
What we look for
Linux, Networking and AWS experience
Experience with containerization and container platforms (e.g., Docker, Kubernetes)
Familiarity with Elasticsearch, Kibana/Grafana, Logstash and Kafka, and with ways to scale these systems
Experience with automation systems (Ansible, Puppet, Terraform) is a plus, SaltStack preferred
Experience with open source systems is a plus
Software development experience in Python/Kotlin/Go is a plus
Experience with high-performance networking (QUIC, network-layer optimization) or real-time transaction protocols/methods (HTTP/2, Server-Sent Events, MQTT, WebSockets)
Ability to recommend or help architect an entire system; expertise in capturing and interpreting TCP dumps with tcpdump, snoop, and other network sniffers; working knowledge of most common protocols (TCP/IP, HTTP, UDP, etc.)
USA Pacific base pay range (CA, WA, NY, NJ, CT): $190,000-$284,900
Sapphire base pay range (all other U.S. states): $171,000-$256,500