From Chaos to Clarity: How Immutable Data Simplifies Machine Learning Data Storage
Karpathy’s data engine concept is a data management strategy designed to simplify the process of working with large and complex datasets in machine learning applications.
Potentially nitpicky but competitive advantage in AI goes not so much to those with data but those with a data engine: iterated data aquisition, re-training, evaluation, deployment, telemetry. And whoever can spin it fastest. Slide from Tesla to ~illustrate but concept is general pic.twitter.com/6O2KxZBg17
— Andrej Karpathy (@karpathy) December 5, 2022
Developed by Andrej Karpathy, former Director of AI at Tesla, the data engine concept aims to streamline the process of data collection, preprocessing, and transformation, allowing machine learning practitioners to focus on the development and deployment of their models. The approach is based on the principles of modularity, scalability, and reproducibility, with a focus on enabling end-to-end data workflows that are efficient, flexible, and easy to manage. In this article, we will delve into the details of Karpathy’s data engine concept, exploring its benefits, challenges, and real-world applications.
I have experienced and dealt with various data storage paradigms being used in both machine learning research and industry and noticed subtle differences between how data is treated in research vs industry. Data storage is a crucial aspect of any machine learning system. The type of data and the use case can greatly influence the storage strategy. Traditionally, databases or data lakes have been used for structured data, while object stores like AWS S3 and Google Storage have been used for unstructured data. However, storing data as files raises challenges when it comes to maintaining versions and lineage, especially when data may mutate and evolve.
With these challenges in mind, having an immutable data storage paradigm can simplify your system design, especially when dealing with a heterogeneous and evolving dataset with both structured and unstructured data. Immutable data storage refers to a data storage technique that allows for the creation of versions of data that can never be modified. This means that you can always access a particular version of the data and track the changes that occurred over time.
Immutable data storage has several benefits, including data consistency, reproducibility, versioning, and lineage. With immutable data storage, you can easily reproduce experiments and analyses as the data is always the same. Additionally, you can track changes and roll back to a previous version if needed. This allows for easy versioning of data and provides a clear lineage of data, allowing you to trace back to the source and understand how the data was transformed over time.
Traditionally, machine learning practitioners have used databases or data lakes for storing data. While these approaches work well for structured data, unstructured data poses unique challenges. Unstructured data such as images, audio, or text may have different formats, schema, or versions. The data could be scattered across multiple sources, or it may be too big to fit into a single database or data warehouse. In these scenarios, using files on object stores like AWS S3 and Google Storage can be an ideal choice. Files are a simple and flexible data storage paradigm that can accommodate diverse data types and sizes. They can also be easily replicated, distributed, and accessed via applications such as labelling tools and visual explorers, etc.
However, using files as data storage raises challenges when it comes to maintaining versions and lineage, especially when data such as labels may mutate and evolve. For example, imagine you have a machine-learning model that is trained for the semantic segmentation of areal imagery. You store the images and shape data as files on an S3 bucket and use a microservice to manage the metadata associated with the images and labels. In such scenarios, it becomes increasingly difficult to maintain dataset reproducibility and tracking as the number of classes for labels increases over time, the images get updated over time, labels may be corrected and updated and the complete bundle of image label pairs may need to be classified and tagged for different use cases and properties.
Maintaining a complicated knowledge graph with complete traceability as well as reproducibility, lineage and introspection of training sets becomes an absolute necessity.
With these principles in mind, and the reality of all types of mutations in data over time, having an immutable data storage paradigm can be a game-changer. Immutable data storage refers to a data storage technique that allows for the creation of versions of data that can never be modified. This means that you can always access a particular version of the data and track the changes that occurred over time.
When it comes to implementing an immutable data storage system, there are several practical approaches you can take. One common I have found to be quite effective is to use a microservice to manage CRUD operations on individual data items, allowing systems, services and applications to contribute to the data over time and with ease. Meanwhile managing lineage and search in active storage and full data in form of object files. This allows easy and inexpensive maintenance of an S3-based data warehouse for access and introspection at scale. The microservice can use an immutable and idempotent data versioning and id generation process to ensure that any changes to the data are tracked and versioned.
Many companies have implemented immutable data storage systems to streamline their machine learning workflows and ensure high-quality data. For example, Uber has built an immutable data platform called Hudi (Hadoop Upserts Deletes and Incrementals) that allows for easy versioning, lineage, and data integrity. Hudi uses Apache Spark and Hadoop to enable large-scale data processing and storage. Another example is Netflix, which has built an immutable data pipeline called Genie that allows for easy versioning and reproducibility of machine learning workflows. Genie uses Amazon S3 for data storage and Spark for data processing.
However, the principles to keep in mind remain the same, i.e.
Multi-modal data access pattern — Applications, label generation, and system access of individual data items require fast and search-optimised storage, while aggregate introspection and dataset scale operations for data science and machine learning would rely on analytical data for bulk operations. Lineage is important — As we maintain versions, tracking the relationship graph between data items and across versions. Immutable history — This enables the ability to easily reproduce and provide the same level of access to older versions and relationships as we mutate data over time. In conclusion, an immutable data storage paradigm can simplify your system design, especially when dealing with a heterogeneous and evolving dataset with both structured and unstructured data. Implementing an immutable data storage system through a microservice approach can provide several benefits, including data consistency, reproducibility, versioning, and lineage. Many companies have implemented immutable data storage systems to streamline their machine learning workflows and ensure high-quality data, proving the effectiveness of this strategy.