Job Summary:
The Site Reliability Analyst (SRA) will develop an in-depth understanding of the hosted application platforms at both the hardware and application level, and will diagnose and resolve issues efficiently using defined playbooks. The SRA is responsible for monitoring the application platforms to identify performance issues, unexpected increases in load, application errors (through log analysis), capacity concerns, and risks from single points of failure. The SRA will work closely with IT, DBAs, Developers, client support teams, and internal Client Service Delivery Managers to investigate recurring issues, client performance problems, and outages. As part of the Site Reliability Team, the SRA will also research and integrate alerts to provide proactive monitoring and awareness for other teams.
Duties and Responsibilities of the job:
* Actively monitor the performance and availability of the hosted application, investigating common issues across clients or platform versions. Develop performance baselines against which clients can be measured, both to identify areas that need investigation and to alert development teams to potential issues in regular reports at Development and Operations meetings.
* Investigate issues across different types of servers and gateways, including Web and Database, using built-in tools to run performance diagnostics in support of investigations into client-reported performance issues.
* Work as part of the wider Site Reliability Team: use communication channels to update others on active issues, investigate overnight alerts, and respond to client-reported issues as a priority.
* Work with IT Infrastructure and IT Security to understand data flow, service dependencies, permissions, data throughput, and security.
* Work with Database Administrators, Development, and client Service Operations Managers (SOM) to provide data analysis that evidences an issue or demonstrates that a problem has been resolved.
* During outages and incidents, provide regular analysis reports to key business leads, the Support desk, and the Technology team.
* Enhance monitoring and logging of key client-impacting areas. Implement alerts and review them to assess their effectiveness.
* Provide regular reporting to the business on common trends, improvements made, or areas which have degraded.
* Understand the application logging process, its events, and what to look for in different situations, including before and after hotfixes and upgrades.
* Develop automation through scripting for common tasks.
* Build wiki articles around new processes as well as update existing wiki articles for internal use. Maintain Service Catalogues for platform services.
* Respond to requests assigned through the ticketing service in a timely and efficient manner, work with the Support desk, and take ownership of running critical outage events and engaging with Escalation Managers.
* Seek to learn and apply new technologies, analyze new situations, and design solutions using a variety of technologies.
* Assist with the IT Disaster Recovery plans for the hosted application, including testing and review.
Education:
* A minimum of 4 years of professional application administration or equivalent experience is required.
* A combination of experience and education may be considered.
Experience and Training:
* Experience with Windows and Linux operating system logs.
* Understanding of how web services and databases operate, specifically Microsoft IIS and SQL.
* 3 years in a client support, IT support, or related role that required log data analysis and research to identify issues.
* 3 years of experience with monitoring tools such as Azure Monitor, Application Insights, SolarWinds, or similar cloud or Application Performance Monitoring tools.