Job Description
Job Summary:
We are seeking a highly skilled and motivated Grafana Observability Architect with experience designing, implementing, and optimizing observability solutions using the Grafana ecosystem. The ideal candidate will work closely with platform engineers, SREs, developers, and business stakeholders to ensure end-to-end visibility into system performance, reliability, and user experience across distributed systems.
Key Responsibilities:
Grafana & Observability:
- Architect and implement observability platforms using Grafana, Tempo, Loki, Mimir, and Prometheus.
- Design and maintain scalable telemetry pipelines using OpenTelemetry and Grafana Agent.
- Define and enforce observability standards, SLIs/SLOs, and alerting strategies.
- Collaborate with application and infrastructure teams to instrument services for metrics, logs, and traces.
- Develop reusable dashboards and templates for performance monitoring and incident response.
- Design and implement visually compelling, data-rich Grafana dashboards for observability.
- Integrate Grafana Cloud with data sources and services such as Prometheus, Loki, ServiceNow, PagerDuty, Snowflake, and AWS.
- Integrate telemetry data sources such as Tomcat, Liberty, Ping, Linux, Windows, databases (Oracle, PostgreSQL), and REST APIs.
- Create alerting mechanisms for SLA breaches, latency spikes, and transaction anomalies.
- Develop custom panels and alerts to monitor infrastructure, applications, and business metrics.
- Collaborate with stakeholders to understand monitoring needs and translate them into KPIs and visualization requirements.
- Optimize dashboard performance and usability across teams.
- Implement and manage OpenTelemetry instrumentation across services to collect distributed traces, metrics, and logs.
- Integrate OpenTelemetry data pipelines with Grafana and other observability platforms.
- Develop and maintain OpenTelemetry collectors and exporters for various environments.
- Develop and implement monitoring solutions for applications and infrastructure to ensure high availability and performance.
- Collaborate with development, operations, and other IT teams to ensure monitoring solutions are integrated and aligned with business needs.
DevOps & Automation:
- Architect, design, and maintain CI/CD pipelines using tools such as Jenkins, Bitbucket, and Nexus.
- Implement Infrastructure as Code (IaC) using Terraform and Ansible.
- Automate deployment, scaling, and monitoring of both cloud-native and on-premises environments.
- Ensure system reliability, scalability, and security through automated processes.
- Collaborate with development and operations teams to streamline workflows and reduce manual intervention.
SME Responsibilities:
- Act as a technical advisor on automation and observability best practices.
- Lead initiatives to improve system performance, reliability, and developer productivity.
- Conduct training sessions and create documentation for internal teams.
- Stay current with industry trends and emerging technologies in DevOps and observability.
- Advocate for and guide the adoption of OpenTelemetry standards and practices across engineering teams.
- Optimize monitoring processes and tools to enhance efficiency and effectiveness.
Required Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Experience in DevOps, SRE, or infrastructure automation roles.
- Hands-on experience with Grafana and dashboard development.
- Strong proficiency in scripting languages (Python, Bash, Go).
- Experience with monitoring tools (Grafana Cloud, Prometheus, Loki, Dynatrace, Splunk, etc.).
- Deep understanding of CI/CD and cloud platforms (AWS and Azure).
- Expertise in Kubernetes, Docker, and container orchestration.
- Familiarity with security and compliance in automated environments.
- Hands-on experience with OpenTelemetry instrumentation and data collection.
Preferred Qualifications:
- Grafana certification or equivalent experience.
- Experience with custom Grafana plugins or panel development.
- Knowledge of business intelligence tools and data visualization principles.
- Contributions to open-source DevOps or observability projects.
- Strong communication and stakeholder management skills.
- Experience with OpenTelemetry Collector configuration and integration.
- Familiarity with distributed tracing concepts.