Team Building, Leadership, and Development: Build and scale an effective SRE team responsible for platform operations across the organization, fostering a culture of continuous improvement and collaboration. Lead and mentor the team, holding regular one-on-one meetings to give feedback and conduct performance evaluations. Set measurable goals for team and individual growth that support leadership development and operational excellence.
SRE Roadmap: Develop and maintain a clear SRE roadmap aligned with business and technology strategy. Execute the roadmap by planning resources and initiating strategic projects to enhance infrastructure reliability and operational efficiency.
Consolidate Operations: Take ownership of operations across all product areas, including the integration of operations from acquired companies, to ensure consistent performance and reliability.
Cloud Strategy and Operations: Oversee the strategic direction and operational management of cloud infrastructure, ensuring scalable, secure, and efficient operations. Lead the planning and execution of phased migrations to Azure, aligning with long-term cloud strategies and ensuring minimal disruption.
BCDR Planning: Lead the improvement and ongoing maintenance of the Business Continuity and Disaster Recovery (BCDR) plan, ensuring robust mechanisms are in place to mitigate service disruptions and data loss.
Monitoring and Observability: Optimize existing monitoring solutions to enhance visibility and ensure proactive incident management. Take full accountability for the end-to-end monitoring lifecycle.
Deployment Process Improvement: Streamline and enhance deployment pipelines for greater efficiency, reliability, and speed. Ensure adherence to best practices and automation across environments.
SRE Metrics & Performance Indicators: Implement and track key SRE metrics, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, to quantitatively measure and improve system reliability.
Incident Management: Establish, refine, and oversee the incident management framework, ensuring a systematic approach to on-call rotation, incident detection, response, and post-mortem analysis.
Risk Management: Identify, assess, and mitigate operational risks, including security vulnerabilities, ensuring infrastructure is secure, compliant, and resilient to threats.
Security Management: Oversee the security posture of all infrastructure, ensuring that best practices and protective measures are implemented and maintained across systems.
Ops Help Desk (OHD) Oversight: Improve and manage the SRE Ops Help Desk process, establishing clear Service Level Agreements (SLAs) to ensure internal customer expectations are met. Continuously monitor and enhance help desk performance.
Team Process Management: Lead regular team planning sessions, retrospectives, and process refinements to foster continuous improvement, transparency, and the ability to adapt to organizational needs.
On-Call Schedule Management: Build and maintain the on-call schedule, ensuring adequate coverage, effective incident management, and balanced workloads for the team.
Reporting: Provide regular reports to leadership, giving visibility into SRE performance, incident trends, and team progress against strategic goals.
Stakeholder and Partner Collaboration: Serve as the primary point of contact with internal stakeholders, partners, and hardware vendors. Manage relationships and expectations, ensuring service delivery aligns with organizational needs.
Documentation: Ensure comprehensive documentation of systems, processes, and procedures to foster knowledge sharing and maintain operational consistency.