Qualification: Relevant Bachelor of Engineering, Bachelor of Science, or equivalent or higher qualification (an IT-related qualification may be required).
Company: Zensar is a leading digital solutions and technology services company that specialises in partnering with global organisations across industries in their Digital Transformation journey. Zensar’s Digital strategy has enabled customers to look beyond current investments towards realising visible business benefits in their digital transformation journey.
If you’re looking for a workplace where associates realise and contribute to their full potential, are recognised for the impact they make, and enjoy the company of the people they work with, then you’ve come to the right place!
As a Site Reliability Engineer, you will be responsible for overseeing the maintenance of applications. You’ll work closely with engineers to advocate for and participate in sensible, scalable systems design, and share responsibility with them for diagnosing, resolving, and preventing issues.
Duties and Responsibilities
* Design, document and share specialist knowledge, including delivering training sessions when required, and take responsibility for all relevant documentation (updates, storage and roll-out).
* Ensure a high level of security by design, and architect a platform that supports monthly patching and vulnerability management in line with company-approved information security policies and procedures.
* Support management of IT assets to ensure they are fully supported, including planning upgrades or replacements prior to end of life, to avoid increased risk or service interruption.
* Achieve SLAs by building and maintaining services with no Single Points of Failure, identifying weak or failing components for replacement before they cause incidents.
* Monitor infrastructure usage over time and configure alerts to stay ahead of demand.
* Configure and respond to monitoring alerts for issues with any devices, supporting incidents and escalating when required.
* Provide recommendations to avoid future incidents, including timely delivery of agreed solutions.
* Maintain configuration repositories, including network diagrams, IT asset management system and agreed documentation.
* Support the wider project and change programme by designing and delivering agreed improvements in line with governance processes and industry best practices, including documentation.
* Ensure all changes are released or made into controlled environments following agreed and repeatable processes, including roll-back to a known working state.
* Provide agreed reporting and updates to the CTO and wider team, including accurate status of tickets being worked on.
* Be aware of relevant new technologies, security threats and regulatory changes to support the Site Reliability strategy.
* Be aware of industry trends, best practices, and emerging technologies in data engineering, analytics, and data management to suggest improvements and innovations.
Technical Skills and Experience Required
* Proven experience in a senior SRE role, with a strong track record of building and maintaining highly reliable infrastructure and services.
* Proven expertise in incident management, including incident response and resolution
* GCP delivery and support using IaC (Terraform)
* Proficiency in monitoring, alerting, and observability tools such as Prometheus and Grafana
* Experience in IaC (Terraform)
* Strong, proven scripting and automation skills, with proficiency in languages such as Python
* Nice to have: experience with Helm charts
* Experience with Windows and UNIX
* Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams in a remote environment.
* Demonstrated leadership capabilities, with a passion for mentoring and developing team members.
* GCP certification preferred.