By Mariya Giy, 21 December, 2020
The Challenge
Interactive kids’ products generate billions of micro-events every day: every button press, retry attempt, and creative build sequence gets logged. Centralizing these high-volume event streams constrains scale and lengthens the path to ML, a pressure that has pushed researchers toward more distributed, federated designs (Dispersed Federated Learning, 2020). Big data traditionally flowed through ETL pipelines that centralized everything into a single warehouse:
- Terabytes of logs bottlenecked ingestion.
- Schema mismatches between apps, toys, and services bottlenecked transformations.
- ML training cycles idled, waiting for data preparation to complete.

If we were going to keep up at a global scale, we needed a new approach to big data prep for ML.
Hands-On Optimizations
In my role, I made data pipelines more performant and reliable. Much of the heavy lifting for multi-source integrations came down to SQL queries I wrote, which cut test-data setup time by 50%. I also reorganized our data ingestion workflow so that repeated processes, such as routine ETL steps, ran fully automated with no manual intervention. That saved around 45 minutes of manual work per build, a saving that compounds quickly across builds.
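To make the multi-source integration work concrete, here is a minimal sketch using SQLite. The table and column names (app_events, toy_events) are hypothetical stand-ins for the kinds of logs described above, not our actual schema; the point is that one UNION query normalizes several sources into a shared shape, so test-data setup no longer stitches exports together by hand.

```python
import sqlite3

# Illustrative only: app_events and toy_events are hypothetical stand-ins
# for the multi-source logs described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_events (user_id TEXT, event TEXT, ts INTEGER);
    CREATE TABLE toy_events (user_id TEXT, action TEXT, ts INTEGER);
    INSERT INTO app_events VALUES ('u1', 'button_press', 100), ('u2', 'retry', 120);
    INSERT INTO toy_events VALUES ('u1', 'build_step', 110);
""")

# One query maps both sources onto a shared schema and tags provenance,
# producing a single unified event stream for downstream ML prep.
unified = conn.execute("""
    SELECT user_id, event AS action, ts, 'app' AS source FROM app_events
    UNION ALL
    SELECT user_id, action, ts, 'toy' AS source FROM toy_events
    ORDER BY ts
""").fetchall()

for row in unified:
    print(row)
```

Versioning queries like this one (e.g., in dbt, as suggested below) is what turns a one-off time saving into a repeatable one.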
If I were to suggest tools for automating this sort of effort, with what was available to us at the time, I’d say:
- Apache Airflow to standardize against ad-hoc scheduling and facilitate reliable automation.
- dbt (data build tool) to transform data in SQL, consistently across teams, while making the queries I manually wrote reusable and versioned.
- Great Expectations for baked-in data quality checks as part of the pipeline itself, to catch outliers and other edge cases that can otherwise impede ML training speed.
In short, the takeaway is that manual workarounds buy short-term efficiency, but only automation, backed by robust orchestration, transformation, and validation layers, holds up at scale.
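As a small illustration of the validation layer, here is a plain-Python sketch of the kind of in-pipeline checks a tool like Great Expectations formalizes. The field name (completion_time_s) and the thresholds are hypothetical assumptions for illustration, not values from our pipeline.

```python
# Hypothetical sketch: screen a batch of event records before they
# reach ML training, the way a baked-in quality gate would.
def validate_batch(rows):
    """Split rows into (valid, issues) so bad records never reach training."""
    valid, issues = [], []
    for row in rows:
        if row.get("completion_time_s") is None:
            issues.append(("missing_completion_time", row))
        elif not (0 < row["completion_time_s"] < 3600):
            issues.append(("out_of_range", row))  # outlier/edge-case guard
        else:
            valid.append(row)
    return valid, issues

batch = [
    {"user_id": "u1", "completion_time_s": 42.0},
    {"user_id": "u2", "completion_time_s": -5.0},  # impossible value
    {"user_id": "u3"},                             # missing field
]
valid, issues = validate_batch(batch)
print(len(valid), len(issues))  # prints "1 2"
```

Running checks like these inside the pipeline, rather than after a failed training run, is what keeps the ML training cycles from idling on dirty data.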
Federated Approach: Ship Code, Not Data
A direction I found most fascinating at the time was a federated-learning-style approach: instead of shuffling terabytes of raw logs upstream, move the code down to the data. In a proof of concept, we explored local ETL jobs that normalized schemas and extracted features (completion times, retry frequency, etc.) while containerized ML workloads ran close to the source. Only aggregated updates flowed back upstream, minimizing both network strain and retraining latency.
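The flow above can be sketched in a few lines. This is a toy FedAvg-style illustration, not our POC code: the two-parameter model, the retry-count feature, and the 0.1 step size are all assumptions made for the example. The structural point is that each site derives features and an update locally, and only the small aggregate travels upstream.

```python
# Toy sketch of "ship code, not data": local ETL + a local update per
# site, then a weighted average upstream. All names are illustrative.
def local_update(site_events, global_model):
    # Local ETL: derive a feature (mean retry count) without exporting logs.
    mean_retries = sum(e["retries"] for e in site_events) / len(site_events)
    # Pretend one training step nudges the first parameter toward the
    # local statistic; only this small vector leaves the site.
    return [global_model[0] + 0.1 * (mean_retries - global_model[0]),
            global_model[1]]

def federated_average(updates, weights):
    # FedAvg-style aggregation: weighted average of site updates.
    total = sum(weights)
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total
            for i in range(len(updates[0]))]

global_model = [1.0, 0.0]
sites = {
    "site_a": [{"retries": 2}, {"retries": 4}],
    "site_b": [{"retries": 0}, {"retries": 2}],
}
updates = [local_update(events, global_model) for events in sites.values()]
weights = [len(events) for events in sites.values()]
new_model = federated_average(updates, weights)
print(new_model)
```

In the real setting the "update" would be model gradients or weights from a containerized training job, but the bandwidth math is the same: a few floats upstream instead of terabytes of raw events.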

Diagram: Example Kubeflow Architecture
Academic research pointed in the same direction. Some researchers showed how workload distribution in adaptive federated learning improved scalability in IoT and digital-twin systems (Adaptive Federated Learning and Digital Twin for IIoT, 2020). Others described how federated learning for anomaly detection in smart buildings avoided centralization entirely while still producing accurate models (Federated Learning for Anomaly Detection in Smart Buildings, 2020).
Both the POC and the literature struck at the heart of the scale problems in kid-oriented products, where data is prolific and personalization matters. In hindsight, if I were to point to pieces of the stack that would deliver that power more easily today, it would be:
- Ray or Kubeflow for cleaner orchestration of distributed ML workloads than the Kubernetes-only solutions I worked with back then.
- Redpanda or Apache Pulsar as newer streaming layers that are simpler to operate than Kafka.
- Flower and FedML as more mature federated learning frameworks that build on the work of TF Federated and PySyft.
- Edge data validation/monitoring so you know you can trust the model updates before you aggregate them.
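On the last bullet, a minimal sketch of what "validate before you aggregate" can mean in practice: reject edge updates that are non-finite or implausibly large before they enter the average. The norm threshold here is an illustrative assumption, not a tuned value.

```python
import math

# Hedged sketch: screen incoming edge updates so one corrupted or
# anomalous site cannot skew the aggregated model. MAX_NORM is an
# illustrative assumption, not a tuned production value.
MAX_NORM = 10.0

def is_trustworthy(update):
    norm = math.sqrt(sum(x * x for x in update))
    return math.isfinite(norm) and norm <= MAX_NORM

updates = [[0.5, -0.2], [1e6, 0.0], [float("nan"), 1.0], [0.1, 0.3]]
accepted = [u for u in updates if is_trustworthy(u)]
print(len(accepted))  # the two well-behaved updates survive
```

Production systems would monitor rejection rates per site as well, since a site that suddenly starts failing this gate is itself a signal worth investigating.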
The idea itself holds up as a powerful concept: ship the logic, not the logs. We only scratched the surface of it experimentally, but it was a valuable proof point for building scalable, adaptive pipelines where both big data and personalization are in play.
Cross-Team Synergy
It wasn’t all data work, either. I also led a weekly focus meeting with the product, sales, and support teams. The format was straightforward: each group shared what it was learning about where pipelines were slowing down, which features needed prioritizing, and how data flow had to improve to unlock new results. Collaborating this way not only let non-technical stakeholders see ETL as the backbone of product intelligence, but also surfaced bottlenecks engineers might not have caught on their own.
References
- Bao, W., Chen, J., & Wang, Y. (2020). Dispersed Federated Learning: Vision, taxonomy, and future directions. arXiv. https://arxiv.org/abs/2008.05189
- Chen, X., Xu, H., & Zhao, D. (2020). Adaptive federated learning and digital twin for industrial internet of things. arXiv. https://arxiv.org/abs/2010.13058
- Nguyen, D. C., Ding, M., Pham, Q. V., Pathirana, P. N., Le, L. B., Seneviratne, A., Li, J., & Niyato, D. (2020). Federated learning for anomaly detection in smart buildings. arXiv. https://arxiv.org/abs/2010.10293