With over 300,000 Spark jobs running daily, Nubank’s innovative observability platform, powered by Apache Pinot, delivers speed, simplicity, and savings—backed by real-time insights and proactive tuning.
Nubank, one of the world’s largest digital banking platforms, recently crossed the milestone of 100 million customers. While remarkable, that scale presents substantial engineering challenges, especially regarding data observability and performance optimization across petabyte-scale infrastructure. At the heart of Nubank’s solution is Apache Pinot, enabling a new class of real-time observability and cost optimization products built on its data platform.
In a recent technical session, Felipe, a Senior Analytics Engineer at Nubank, described how the company built a powerful observability layer for Spark-based ETL pipelines and how the adoption of Apache Pinot helped transform operational visibility into measurable cost savings—$1 million annually in cloud costs, to be precise.
Why Observability Matters at Nubank’s Scale
Nubank’s data stack processes hundreds of terabytes daily across more than 300,000 Spark datasets, driven by internally built orchestration, metadata, and ETL tooling. In a distributed pipeline environment of this size, with 300,000+ Spark jobs executed daily, observability and optimization are mission-critical.
While Apache Spark provides distributed processing at scale, out-of-the-box observability remains limited, and traditional Spark UI tools fall short when managing thousands of concurrent jobs. Nubank addressed this by engineering a custom Spark listener that aggregates task-level metrics to reduce telemetry overhead and pushes the resulting observability data into Kafka for downstream consumption.
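The aggregation step described above can be sketched in Python. This is a simplified illustration only: the `TaskMetric` record and its field names are assumptions, and the real listener runs on the JVM via Spark's `SparkListener` API rather than in Python.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical task-level metric record; field names are illustrative,
# not Nubank's actual telemetry schema.
@dataclass
class TaskMetric:
    stage_id: int
    run_time_ms: int
    shuffle_bytes: int

def aggregate_stage_metrics(tasks):
    """Roll task-level metrics up into one summary per stage, so only
    the aggregate (not every individual task) is pushed downstream."""
    stages = defaultdict(lambda: {"tasks": 0, "run_time_ms": 0, "shuffle_bytes": 0})
    for t in tasks:
        s = stages[t.stage_id]
        s["tasks"] += 1
        s["run_time_ms"] += t.run_time_ms
        s["shuffle_bytes"] += t.shuffle_bytes
    return dict(stages)

tasks = [TaskMetric(1, 120, 1024), TaskMetric(1, 80, 512), TaskMetric(2, 300, 4096)]
summary = aggregate_stage_metrics(tasks)
# Each stage summary would then be serialized and sent to a Kafka topic
# (e.g. via a Kafka producer) for Pinot to ingest -- omitted here.
print(summary[1])  # → {'tasks': 2, 'run_time_ms': 200, 'shuffle_bytes': 1536}
```

The point of aggregating before publishing is volume: emitting one record per stage rather than one per task keeps telemetry overhead manageable at 300,000+ jobs per day.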
Pinot Powers Three Core Data Products
Nubank leveraged Apache Pinot to power three foundational observability solutions that significantly improved data pipeline performance visibility. The first, Platform Metrics, aggregates Spark job metrics and highlights anomalies such as slow commit rates, enabling proactive issue detection. For example, a spike in commit rates observed on March 23 signaled a backend problem; the Pinot-powered dashboards flagged it in real time, allowing engineers to take swift corrective action.
The second solution, Dataset Metrics, simplifies Spark’s complex performance data into intuitive, user-friendly dashboards. These dashboards monitor key metrics such as execution time, memory usage, data skew, and spill events. By benchmarking each job’s performance against a rolling 14-day average, the system provides early warnings for performance degradation, helping teams maintain optimal data pipeline efficiency with minimal guesswork.
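A rolling-baseline check of this kind can be sketched as follows. This is a minimal illustration: the metric names, the 1.5x degradation threshold, and the alert shape are assumptions for the sketch, not Nubank's actual logic.

```python
def degradation_alerts(history, today, window=14, threshold=1.5):
    """Flag any metric whose value today exceeds `threshold` times the
    rolling mean of the past `window` days (illustrative heuristic)."""
    alerts = {}
    for metric, values in history.items():
        recent = values[-window:]                  # rolling 14-day window
        baseline = sum(recent) / len(recent)
        if today[metric] > threshold * baseline:
            alerts[metric] = (today[metric], baseline)
    return alerts

# A job that normally runs in ~60 minutes suddenly takes 150:
history = {"execution_minutes": [60] * 14, "spill_gb": [2] * 14}
today = {"execution_minutes": 150, "spill_gb": 2}
print(degradation_alerts(history, today))  # → {'execution_minutes': (150, 60.0)}
```

Benchmarking against a recent rolling window rather than a fixed threshold means the alerting adapts to each dataset's own normal behavior, which is what removes the guesswork the article describes.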
The third and most popular tool is the Tuning Recommender, which uses rule-based heuristics and statistical models to suggest actionable code-level optimizations. It can automatically recommend changes to configurations such as spark.sql.shuffle.partitions based on factors such as data volume and stage behavior. This dramatically reduces the complexity of Spark performance tuning, empowering analytics engineers to implement improvements with confidence and speed.
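A partition-sizing heuristic of this kind, sizing spark.sql.shuffle.partitions so that each partition handles a bounded amount of shuffle data, might look like the sketch below. The 128 MB target and the 200-partition floor are illustrative assumptions, not Nubank's actual rules.

```python
import math

def recommend_shuffle_partitions(shuffle_bytes, target_partition_mb=128, min_partitions=200):
    """Rule of thumb: aim for ~target_partition_mb of shuffle data per
    partition, never dropping below a floor (both values are assumed)."""
    needed = math.ceil(shuffle_bytes / (target_partition_mb * 1024 * 1024))
    return max(min_partitions, needed)

# A job shuffling ~100 GiB would be sized at 800 partitions:
print(recommend_shuffle_partitions(100 * 1024**3))  # → 800
```

The recommendation would then be surfaced to the dataset owner as a suggested value for spark.sql.shuffle.partitions, turning a trial-and-error tuning exercise into a one-line configuration change.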
Analyst Insight: Pinot’s Real-Time Advantage
From an industry analyst perspective, the choice of Apache Pinot over other real-time frameworks like Apache Flink reflects a growing trend in the market: reducing operational overhead without sacrificing performance.
According to theCUBE Research, Apache Pinot stands out for its ability to serve as a consolidated platform for real-time analytics, thanks to its OLAP engine and flexible indexing capabilities such as star-tree indexes. This design allows organizations to streamline their data architecture into a single source of truth, unlike systems such as Flink, which often require deploying a separate application for each product or use case, an approach that leads to infrastructure sprawl and slower time to insight.
Pinot also plays a key role in democratizing access to real-time data. Its support for standard SQL empowers a broader range of users, including analysts and developers without deep streaming expertise, to contribute meaningfully to performance tuning and observability efforts. “Real-time insights are only valuable if they’re accessible,” said Dave Vellante, Chief Analyst at theCUBE Research. “Pinot’s query-first model and native Kafka ingestion provide the agility modern data teams demand.”
Cost Savings and Reliability Gains
Adopting Apache Pinot at Nubank has delivered substantial performance and cost benefits. After automated tuning recommendations were applied, one critical dataset saw an 80% reduction in processing time, from 300 minutes to roughly 60, while its failure rate dropped to zero. These improvements translated directly into measurable cloud cost savings, with partition tuning cutting infrastructure expenses by approximately $1 million annually.
Beyond raw efficiency gains, Pinot has driven a culture shift in observability and performance optimization. More than 300 engineers now actively use real-time Spark stage metrics to monitor and fine-tune their workloads, a dramatic increase from the pre-Pinot era, when engagement was minimal. With a 96% success rate across tested jobs, the tuning recommender has proven scalable and reliable, delivering performance improvements in roughly 19 of every 20 cases.
Final Thoughts on Real-Time Observability
Nubank’s case study is a leading example of how modern data platforms can evolve from batch-centric architectures into real-time, intelligent observability systems. Their strategic use of Apache Pinot shows how a well-designed, low-latency analytical database can be at the core of operational excellence and cost efficiency. For any organization running large-scale Spark jobs or facing complexity in managing thousands of ETL pipelines, the Nubank blueprint offers both inspiration and a tangible path to better outcomes.

