About the role:
Loftware is expanding its worldwide 24x7 Cloud Operations Team and we are looking for a technically motivated, English-speaking Cloud Operations Site Reliability Engineer with strong knowledge of cloud-based Linux and Windows systems. The Cloud Operations Site Reliability Engineer will be hands-on in building, maintaining, and troubleshooting customer environments for mission-critical applications across the range of cloud platforms used by Loftware, including AWS and Azure. The Cloud Operations Site Reliability Engineer is a team player with a passion for modern technology who is keen to take on large-scale responsibility for the cloud environment.
The Cloud Operations Site Reliability Engineer will work with the rest of the Cloud Operations team, alongside QA and Development, to continually improve automated infrastructure and application deployment, to build and maintain reliable cloud infrastructure and services, and to manage the highly available and scalable solutions that Loftware customers rely on.
This is an excellent opportunity to be part of a team helping to evolve our solutions for different cloud platforms as well as expand your skills in the cloud.
Key Roles & Responsibilities:
1. Help continually improve monitoring systems in AWS, Azure, and our other cloud environments to track the health and performance of cloud-based applications and infrastructure. Develop cloud-based alerts to proactively identify and address issues before they impact users.
2. Develop and maintain automation tools with Terraform and Ansible to streamline operational tasks.
3. Implement security best practices and compliance standards for our AWS, Azure, and other cloud environments. Continuously assess and mitigate security risks and vulnerabilities. Create, maintain, and execute disaster recovery plans and backup strategies to ensure data and service continuity.
4. Collaborate with software engineers to improve the reliability and resilience of applications through code and architecture changes and help identify performance bottlenecks to optimize applications and infrastructure.
5. Help define and configure cloud-based networking to customer devices and data systems that sit outside of our cloud environments (VPN, direct connect, transit gateways).
6. Respond to and resolve incidents quickly to minimize service disruptions and conduct post-incident analysis to identify the root causes and prevent similar issues in the future.
7. Participate in an on-call rotation to address critical incidents outside of regular business hours.
Required Qualifications:
8. Cloud Platform: AWS and/or Azure
9. OS: Linux and/or Windows
Preferred Experience:
10. Database: PostgreSQL, Microsoft SQL Server
11. Scripting: Python, Java, Bash, .NET/C#, PowerShell
12. IaC and Automation: Terraform, Terragrunt, Ansible, Rundeck, Jenkins
13. Cloud networking concepts: VPN, direct connect, transit gateways
14. Container Technologies: Docker, Kubernetes
15. Cloud-native technologies: RDS, Microservices, Serverless computing