Grab is looking for experienced site reliability engineers to help us operate, troubleshoot, and improve our real-time, on-demand transport service. Our platform is written mainly in Go and hosted on AWS. We are heavy users of MySQL, Redis, and Kinesis.
The Role
- Work with engineering teams to design and write code for systems that are highly available and scale seamlessly.
- Plan for and eliminate any potential threats to stability, availability or security.
- Improve monitoring, alerting and resilience of systems.
- Write tools that support work such as capacity planning and debugging production issues across distributed systems.
- Contribute to a culture of learning and responsibility by writing detailed postmortem reports.
- Tackle live production issues when on call, with assistance from the other teams.
Requirements
- Experience in designing and writing software for production systems.
- Knowledge of Unix fundamentals.
- Experience crafting, analyzing, and troubleshooting distributed systems.
- Knowledge of TCP/IP networking.
- Fluency in spoken Mandarin and English.
Really Nice to Haves
- Experience with AWS.
- Experience with a configuration management system such as Ansible.
- Experience with enterprise security.
- Experience with Go, Kinesis, Redis, and MySQL is very beneficial.
- Bash scripting experience is strongly desirable.