Key Responsibilities
· Design, deploy, and manage a highly available, distributed LGTM stack on Kubernetes.
· Operate and optimize Grafana, Loki, Tempo, and Mimir/Prometheus at scale.
· Implement observability best practices (metrics, logs, traces) for microservices and cloud-native applications.
· Build and maintain Helm charts and GitOps/Azure DevOps workflows for observability services.
· Ensure high availability, disaster recovery, scaling strategies, and performance tuning of observability components.
· Manage storage backends (object storage such as S3/GCS/Azure Blob) for logs, metrics, and traces.
· Configure alerting strategies using Prometheus Alertmanager and Grafana Alerting.
· Define and enforce SLOs, SLIs, and monitoring standards across engineering teams.
· Support onboarding of application teams to observability tooling.
· Troubleshoot complex distributed systems issues across Kubernetes and observability pipelines.
· Implement security best practices (RBAC, TLS, network policies, secrets management).
· Automate operational processes using Infrastructure as Code (Terraform, Helm, etc.).
· Monitor system capacity, optimize cost, and ensure efficient resource utilization.
· Maintain documentation and provide knowledge sharing sessions for internal teams.
Required Skills & Experience
Technical Skills
· Strong hands-on experience managing the LGTM stack:
o Grafana (dashboards, alerting, RBAC, multi-tenancy)
o Loki (log ingestion, indexing, retention, scaling)
o Tempo (distributed tracing, sampling strategies)
o Mimir or Prometheus (remote write, federation, scaling, HA)
· Solid experience with Kubernetes (cluster operations, networking, storage, RBAC).
· Experience deploying distributed systems using Helm and GitOps/Azure DevOps tools.
· Knowledge of PromQL and LogQL.
· Experience with object storage systems (S3-compatible, GCS, Azure Blob).
· Familiarity with OpenTelemetry and instrumentation standards.
· Experience configuring and tuning Alertmanager.
· Understanding of microservices architecture and cloud-native patterns.
· Experience with CI/CD pipelines.
· Scripting skills (Bash, Python, or Go).
· Familiarity with cloud platforms (AWS, GCP, and Azure).
Soft Skills
· Strong problem-solving and analytical skills.
· Ability to work cross-functionally with engineering and platform teams.
· Clear communication and documentation skills.
· Proactive mindset with a focus on reliability and automation.
Experience: 3-5 years.
Reinvent your world. We are building a modern Wipro. We are an end-to-end digital transformation partner with the boldest ambitions. To realize them, we need people inspired by reinvention. Of yourself, your career, and your skills. We want to see the constant evolution of our business and our industry. It has always been in our DNA - as the world around us changes, so do we. Join a business powered by purpose and a place that empowers you to design your own reinvention.