Overview
How You’ll Make an Impact
A subsidiary of Publicis Groupe, Epsilon is a leading provider of multi-channel marketing services, technologies, and database solutions. Join our team for your chance to work in the digital marketing space and solve meaningful problems on a massive scale—and have fun doing it.
The System and Platform Operations Manager is a technical leadership role responsible for the support, reliability, and stability of Epsilon Retail Media production systems, environments, and offerings. This position has solid-line responsibility for operations, including the deployment, management, monitoring, reporting, troubleshooting, and repair of production systems. Core to success in the role is providing a premium customer support experience built around a “center of excellence” that enables a full-service delivery support cycle.
This role is responsible for managing the Platform Operations team centralized within a single geo-region, orchestrating regional teamwork, providing both technical and professional support, and championing the company values. The role also works closely with the Engineering team to ensure ongoing system stability and supports the Technical Account Managers from an environments perspective.
The Platform Operations team is responsible for supporting all retailers once they are live, collaborating with other teams such as Customer Support, Technical Account Management, Engineering, and Customer Success teams.
What you’ll do:
* Operational Practices
o Establish and manage operational practices and ensure we design, implement, and operate a support model that is fit for purpose for our future.
o Implement proactive solutions for incident and problem detection, response, remediation, and continuous improvement.
o Own the operational integrity of all production environments.
* Production Monitoring and Operational Reporting
o Adopt a “Measure Everything” approach to ensure that internal service level objectives and customer service level agreements are exceeded.
* Customer Support & Incident Management
o Own incident management processes and on-call response.
o Take ownership of complex issues related to performance, reliability, and scalability, leading resolution of serious incidents and events.
* Change Management
o Uphold processes and procedures to manage change across production platforms.
o Provide insight into how customers will perceive changes, to drive change management and communication within customer organizations.
o Empower the Delivery teams to release new products, features, updates, and fixes quickly.
* System Reliability
o Work with the wider Engineering, Product, Delivery, and Security teams to ensure appropriate attention is given to production/system reliability.
o Establish Operational Practices in conjunction with the Product and Engineering teams.
o Provide delivery status information on System Reliability initiatives to the IT Leadership Team.
* IT Service Management
o Execute Service Management processes including Change, Config, Service Level, Performance, Incident, and Problem Management.
o Leverage industry standards and best practices for improving service levels and performance.
o Ensure SLAs and KPIs are met to the best of your ability.
* Organizational Capability
o Identify the capabilities needed to meet the current and emerging business needs.
o Evaluate current capabilities, identify gaps, and prioritize development activities.
* Technical Developments, Process Improvement and Simplification
o Discuss and recommend more complex or innovative technical developments to improve the quality of software and supporting infrastructure.
o Maintain understanding of current technology, database management, reliability practices, and future trends.
o Ensure all processes and procedures are documented for ease of continuous improvement activities.
* Personal Capability Building
o Develop own capabilities by participating in assessment and development planning activities.
Who You Are
* What you’ll bring with you:
o At least 5 years of hands-on experience in Site Reliability focused positions.
o Strong knowledge of containerization technologies (Docker, Kubernetes).
o Experience with infrastructure as code (Terraform).
o Solid understanding of networking, security, and system architecture.
o Proficiency in programming and scripting languages (Java, Golang, Python, Bash, or similar).
o Experience with monitoring and observability tools (DataDog, Prometheus, Grafana).
o Knowledge of database management systems (PostgreSQL, Bigtable).
o Understanding of API and microservices architecture.
o Strong people leadership skills, with at least one year leading high-performance technical teams.
o Experience implementing and managing Logging, Monitoring, and Alerting frameworks.
o Expertise in ITSM principles gained in previous positions.
o Excellent verbal and written communication skills.
* Why you might stand out from other talent:
o Google Cloud Architect or Engineer certification preferred.
o Bachelor’s degree or equivalent.
Additional Information
When You Join Us, We’ll Create Something EPIC Together
Epsilon is a global data, technology, and services company that powers the marketing and advertising ecosystem. We process 400+ billion consumer actions each day using advanced AI and hold many patents on proprietary technology. Epsilon has been consistently recognized as industry-leading by Forrester, Adweek, and the MRC.
Epsilon is committed to equal access to opportunity for people without regard to race, age, sex, disability, neurodiversity, sexual orientation, gender identity, pregnancy and maternity, marriage and civil partnership or religion or belief.