Grafana SRE Architect in Basking Ridge, NJ (Onsite)

Contract Type: C2C

Job Title: Grafana SRE Architect

Location: Basking Ridge, NJ (Onsite)
Duration: 6+ month contract

 

Job Summary

The Grafana SRE Architect will lead the design, implementation, and management of scalable, reliable, and performant Grafana-based observability solutions. This role bridges Site Reliability Engineering (SRE) practices with Grafana’s ecosystem (Loki, Mimir, Tempo, etc.) to ensure robust monitoring, logging, tracing, and alerting for mission-critical systems. You will collaborate with DevOps, engineering, and infrastructure teams to align technical strategies with business objectives, driving automation, resilience, and cost efficiency across cloud and on-premises environments.

 

Key Responsibilities

  1. Architecture & Design
    1. Design end-to-end Grafana solutions for metrics, logs, traces, and dashboards, ensuring scalability, security, and compliance.
    2. Architect integrations with Prometheus, Loki, Mimir, Tempo, and third-party tools (e.g., AWS CloudWatch, Datadog).
    3. Define best practices for Grafana deployment (self-managed vs. Grafana Cloud) and optimize data storage/retention strategies.
  2. SRE Leadership
    1. Implement SRE principles: SLAs/SLOs/SLIs, error budgets, and blameless post-mortems.
    2. Build automated monitoring and alerting systems that detect bottlenecks and failures before they affect users.
    3. Lead incident response, root cause analysis, and remediation for observability-related outages.
  3. Collaboration & Integration
    1. Partner with DevOps teams to embed Grafana into CI/CD pipelines and automate provisioning via IaC (Terraform, Ansible).
    2. Work with developers to instrument applications for observability (OpenTelemetry, custom exporters).
    3. Advise stakeholders on cost-effective monitoring strategies and resource optimization.
  4. Performance Optimization
    1. Tune Grafana dashboards, queries, and data sources for high-performance environments.
    2. Optimize PromQL and LogQL (Loki) queries and manage large-scale time-series databases (Mimir).
    3. Conduct capacity planning and disaster recovery testing for Grafana ecosystems.
  5. Governance & Security
    1. Ensure compliance with security policies (RBAC, SSO, encryption) and audit requirements.
    2. Monitor Grafana stack health, perform upgrades, and enforce version control.
  6. Mentorship & Innovation
    1. Mentor SRE/engineering teams on Grafana best practices and SRE culture.
    2. Stay ahead of Grafana and observability trends and pilot new tools (e.g., AI-driven anomaly detection).

 

Education & Experience

  • Bachelor’s/Master’s in Computer Science, Engineering, or a related field.
  • 10+ years in SRE/DevOps roles, including 5+ years of hands-on Grafana experience.
  • Proven track record in designing large-scale observability solutions.
  • Experience managing offshore teams.
  • Willingness to work hours that overlap with offshore teams.

 

Technical Skills

  • Expertise in Grafana: Dashboards, plugins, alerting, and integrations (Prometheus, Loki, Mimir, Tempo).
  • Cloud Platforms: AWS/GCP/Azure, Kubernetes, and serverless architectures.
  • Automation: Terraform, Ansible, Python/Go scripting.
  • Monitoring Tools: Thanos, Cortex, Jaeger, OpenTelemetry.
  • Database Optimization: Time-series data (Mimir), log management (Loki).

 

Certifications (Preferred)

  • Grafana Certified: Observability Engineer/Administrator.
  • AWS/GCP/Azure Architect or DevOps certifications.

 

Soft Skills

  • Leadership in cross-functional teams and crisis management.
  • Strong communication for technical and non-technical audiences.
  • Analytical problem-solving and strategic thinking.

 

Preferred Qualifications

  • Contributions to Grafana/Prometheus open-source projects.
  • Experience with AI/ML model monitoring.
  • Knowledge of regulatory frameworks (GDPR, HIPAA). 

 


From:
Prashant
Veridian Tech Solutions, Inc.
d.prashant@veridiants.com
Reply to: d.prashant@veridiants.com