“Data is the new oil” was the slogan of the last decade. Companies were told how valuable their data was (or could be). They rushed to invest in a modern data stack and store terabytes of data in data warehouses. Data science teams crunched the numbers, and the analyses were supposed to be used to inform product decisions (or even, in some cases, customer-facing features like recommendation feeds).
There were success stories, but many organizations failed to execute: data (and data teams) were siloed, cloud data warehouses and rogue queries proved expensive (and are now being downsized), and clean data pipelines were missing (getting data into a refined state takes significant ops work).
Now, with generative AI, is data still a moat? Is data more or less valuable when synthetic datasets account for a non-zero part of training and inference pipelines?
On the one hand, quality data still matters. Much of the focus on LLM improvement has been on model and dataset size, but there is early evidence that LLMs are strongly influenced by the quality of the data they are trained on; WizardLM, TinyStories, and phi-1 are some examples. The same goes for RLHF datasets.
On the other hand, roughly 100 data points can be enough for a significant improvement when fine-tuning for output format and custom style. LLM researchers at Databricks, Meta, Spark, and Audible have done empirical analyses of how much data is needed to fine-tune. That amount of data is easy to create or curate manually.
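To make this concrete, here is a minimal sketch of what such a small style/format fine-tuning set could look like. The file name, the example records, and the chat-style JSONL layout are illustrative assumptions, not details from the analyses above.

```python
import json

# Hypothetical examples pairing raw inputs with the desired output format and style.
# In practice you would hand-curate ~100 of these; only a couple are shown here.
examples = [
    {"prompt": "Summarize: The Q3 report shows revenue up 12%...",
     "completion": "- Revenue: +12% QoQ\n- Driver: enterprise upsells\n- Risk: churn in SMB"},
    {"prompt": "Summarize: The new onboarding flow cut sign-up time...",
     "completion": "- Sign-up time: -40%\n- Driver: fewer form fields\n- Risk: less profile data"},
]

# Write chat-style records to a JSONL file, a common input format for
# supervised fine-tuning APIs and libraries.
with open("style_finetune.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Answer as a three-item bullet list."},
                {"role": "user", "content": ex["prompt"]},
                {"role": "assistant", "content": ex["completion"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The point is less the exact schema than the scale: a hundred records like these can be written by hand in an afternoon.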
Model distillation is real and simple to do: you can use an LLM to generate synthetic data to train or fine-tune your own LLM, and some of the knowledge transfers over. This is mainly an issue if you expose the raw LLM to a counterparty (less so if it is used internally), but it means that any data that isn't especially unique can effectively be copied.
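As a rough illustration of how low the barrier is, the sketch below generates synthetic training pairs from a stronger "teacher" model. `call_teacher_model` is a hypothetical placeholder for whatever API or local model you would actually call; everything else is generic Python.

```python
import json

def call_teacher_model(prompt: str) -> str:
    """Placeholder for a call to a stronger 'teacher' LLM (API or local model).
    Swap in your own client here; this stub just echoes a canned answer."""
    return f"(teacher answer for: {prompt})"

# Seed prompts drawn from your domain; the teacher's answers become
# synthetic (prompt, completion) pairs for fine-tuning a smaller student model.
seed_prompts = [
    "Explain our refund policy in plain language.",
    "Draft a status update for a delayed shipment.",
]

with open("distilled_pairs.jsonl", "w") as f:
    for prompt in seed_prompts:
        completion = call_teacher_model(prompt)
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```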
So what does this mean for companies that want to leverage their data as a competitive advantage? Here are some possible implications:
Data quality is more important than quantity. Having large amounts of noisy or irrelevant data can hurt your performance and increase your costs. Focus on collecting and cleaning the data that matters for your use case.
Data uniqueness is more important than commonality. Having data that is easily replicable or available to others can reduce your edge. Focus on creating or acquiring data that is rare or proprietary for your domain.
Data augmentation is more important than accumulation. Having more data can help you improve your models, but only up to a point. Focus on using generative AI techniques to create synthetic data that can enhance your existing data and fill in the gaps (see the sketch after this list).
Data protection is more important than exposure. Having your data stolen or leaked can damage your reputation and competitiveness. Focus on securing your data and limiting access to authorized parties.
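Returning to the augmentation point above, here is a minimal sketch of filling gaps by generating synthetic variants of existing examples. `call_generator_model` is a hypothetical stand-in for whichever LLM you use for augmentation; the texts and labels are made up for illustration.

```python
import json

def call_generator_model(instruction: str) -> str:
    """Placeholder for the LLM used to paraphrase examples; swap in a real client."""
    return f"(paraphrase of: {instruction})"

# A few real labeled examples; augmentation creates paraphrased variants
# so under-represented cases are better covered without collecting more raw data.
labeled = [
    {"text": "Where is my package?", "label": "shipping_status"},
    {"text": "I want my money back.", "label": "refund_request"},
]

augmented = []
for ex in labeled:
    for _ in range(3):  # three synthetic variants per real example
        variant = call_generator_model(f"Rewrite differently: {ex['text']}")
        augmented.append({"text": variant, "label": ex["label"]})

# Combine real and synthetic examples into one training file.
with open("augmented.jsonl", "w") as f:
    for ex in labeled + augmented:
        f.write(json.dumps(ex) + "\n")
```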
Generative AI is changing the game for data-driven businesses. Data is still valuable, but not in the same way as before. Companies need to adapt their strategies and practices to make the most of their data assets in the new era of generative AI.