Title: SRE – Site Reliability Engineer
Location – RTP, North Carolina or San Jose, CA
Hybrid
Visa sponsorship is not available.
Due to the nature of the job, citizenship is mandatory.
Job Summary
As a Sr. Site Reliability Engineer, you operate seamlessly between development and operations. You will engage in and improve the lifecycle of cloud services – from design to deployment, operation, and refinement. You will maintain services by measuring and monitoring availability, latency, and overall system health. You will play a key role in scaling systems sustainably through automation and evolving them by pushing for changes that improve reliability and velocity. You will administer cloud-based environments that support our SaaS (Software as a Service) / IaaS (Infrastructure as a Service) offerings, implemented on a container-based microservices architecture (Kubernetes). To be successful in this role, you must be a motivated self-starter and self-learner, possess strong problem-solving skills, and embrace challenges.
Key Responsibilities
- Manage production environments by monitoring availability and taking a holistic view of platform and product health.
- Build software and systems to manage platform infrastructure and applications.
- Identify stability and reliability issues in product code and devise strategies to address them.
- Mentor Site Reliability Engineering (SRE) engineers and coach an automation-first mindset.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Identify and balance infrastructure feature acceleration against a well-deserved pause to fix issues.
- Debug and troubleshoot service bottlenecks throughout the whole software stack.
- Measure and monitor availability, latency, and overall system health. Develop and improve instrumentation for monitoring and logging the health and availability of services.
- Conduct CI/CD operations to deploy an assortment of software deliverables across a global production environment.
- Provide architectural guidance to optimize the observability stack across NetApp’s cloud services.
- Be hands-on in the implementation of our observability stack. You have driven the deployment of these tools at scale and have experience working with a rapidly growing infrastructure.
- Build dashboards that give a variety of audiences across engineering and SRE teams insight into critical business metrics.
Job Requirements
- 10 to 12 years of experience is required.
- Experience writing, troubleshooting, and fixing bugs in product code.
- Scripting and infrastructure automation using, for example, Ansible, Python, Go, Perl, or Ruby.
- Deep working knowledge of containers, Kubernetes, and serverless computing implementations.
- Understanding of the SDLC and DevOps development methodologies.
- Experience with one of the three major hyperscalers (AWS, Azure, GCP).
- Experience defining, applying, and managing SLAs, SLOs, and SLIs for the product.
- Good interpersonal communication and customer service skills are needed to work successfully with stakeholders in high-stress and/or ambiguous situations.
- This role includes on-call work and occasional travel.
Education
- A Bachelor of Science degree in Computer Science, a master’s degree, or equivalent experience is required.
Thank you,
Shobana Prabhakar
From:
Shobana,
Dizercorp
shobana@dizercorp.com
Reply to: shobana@dizercorp.com