Company: TransferWise. London, UK.
About this job
Wise is one of the fastest-growing companies in Europe and we’re on a mission: to make money without borders the new normal. We’ve got 10 million customers across the globe and we’re growing. Fast.
Current banking systems don’t let us send, spend or receive money across borders easily. Or quickly. Or cheaply.
So, we’re building a new one.
And we’re looking for a Software Engineer – Reliability Engineering to work closely with product teams in London, solving some of our hardest scale-up and reliability challenges.
For our customers, using Wise should feel as simple as sending a text message. Yet behind our app and website lies a complex, one-of-a-kind engine of currencies and routes that’s being designed, built and powered by our talented teams in cities around the world. With new capabilities being built every day, there’s still a lot to figure out, and we can’t do it alone. This role is a unique opportunity to have an impact on Wise’s mission, grow as a product leader and help save millions more people money.
The Site Reliability Engineering team is responsible for deeply understanding the sources of unreliability and how we can build more resilient systems that cope well with them. Reliability is a partnership between product and platform teams, and SRE is responsible for leading the adoption of industry best practices to build the best possible product for our customers. SREs are expected to have a healthy dose of paranoia, knowing how complex, distributed systems can fail.
Here’s how you’ll be contributing to the Engineering Team
- You will be working hands-on as a Software Engineer – Reliability Engineering, closely with one or several of our product teams. After onboarding, you will identify the challenges the team(s) face when it comes to reliability and scaling to meet our customers’ demand. This means understanding the tradeoffs when it comes to the product, from what the product expectations are to the failure modes and fallbacks of distributed systems.
- You are keen to build scalable solutions, able to think about complex software and systems engineering problems, and curious about how things work under the hood. You are a problem solver and are able to deliver iterative, evolving solutions in a collaborative and open way.
Is that you?
- Strong experience in Java (or other languages), ideally with knowledge of Spring Boot/Framework and experience designing and implementing libraries and frameworks
- Systems debugging skills, e.g. for issues with disk, network and app/JVM performance, and the ability to think in logging/metrics/alerting terms
- Understanding of and curiosity about distributed systems, how they can fail, and the best way to cope with those scenarios (perhaps chaos engineering, automated canary analysis, etc.)
- Experience with AWS (or other public cloud), Docker containerisation and Kubernetes
- Good knowledge of relational (RDBMS) and NoSQL databases, and of how best to utilise them
- You will not settle for unexplained downtime and outages, and you do not want to be woken up in the middle of the night
Nice to Have
- Experience building out scalable, automated cloud platforms, preferably on AWS
- Knowledge of, and an eye on, newer architectural concepts such as microservices, service mesh and observability
- A clear understanding of the test pyramid, including end-to-end functional and/or load tests at scale
- Security-first mindset – keeping up to date, and possibly having worked on remediating vulnerabilities reported by a bug bounty program
- Experience with advanced release and change management processes
Key Areas of the Role
- Reliability is not solely owned by SREs – it’s a partnership with product teams to mature their understanding of SRE principles
- Define and create standard operating procedures that are compliant and auditable – emphasis on automation and tooling where possible
- Plan out the reliability roadmap for product teams, making trade-offs and having discussions about SLIs/SLOs where necessary
- See failover and DR events as something that needs to happen regularly and should be seamless
- Failing is an opportunity and a lesson! Engage regularly with our blameless postmortem culture, always focused on continuous improvement and on preventing similar problems from happening again