# Data Engine

<figure><img src="https://3980091207-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxwiB3tV6oLM7g7SThvsv%2Fuploads%2FIIwgCFC9lHEFWuSGfuhy%2FData%20Engine.jpg?alt=media&#x26;token=05e49eca-2c61-42b3-8273-b7961fb02709" alt=""><figcaption><p>High-level design (HLD) of the Data Engine used internally at metaforms.ai</p></figcaption></figure>

Hypotheses:

* Unknown unknowns: the dataset is always imperfect. Some scenarios are not yet well represented, and the data can always be made more diverse.
* Capable base model/architecture: given a capable base model, improving the dataset improves the AI/product guarantees.
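
The two hypotheses above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the pipeline in the diagram: `oracle_label`, `train`, and `run_data_engine` are hypothetical names, the "model" is a 1-nearest-neighbour lookup, and failures are found by comparing against the oracle, whereas a real system would surface them through monitoring or user feedback before paying for labels.

```python
from typing import Callable, List, Tuple

Example = Tuple[float, int]  # (feature, label)


def oracle_label(x: float) -> int:
    """Hypothetical ground truth; stands in for paid human labelling."""
    return 1 if (2.0 <= x < 3.0) or x >= 8.0 else 0


def train(dataset: List[Example]) -> Callable[[float], int]:
    """Toy 'model': 1-nearest-neighbour over the labelled examples."""
    def predict(x: float) -> int:
        nearest = min(dataset, key=lambda ex: abs(ex[0] - x))
        return nearest[1]
    return predict


def run_data_engine(
    seed: List[Example], stream: List[float], budget: int
) -> Tuple[List[Example], List[int]]:
    """Repeat: train, find failures on incoming data, label some, grow the dataset.

    Assumes budget >= 1, the labelling capacity per round.
    """
    dataset = list(seed)
    failures_per_round: List[int] = []
    # Every round with failures adds at least one new stream point, so
    # len(stream) + 1 rounds are enough to reach zero failures in this toy setup.
    for _ in range(len(stream) + 1):
        model = train(dataset)
        failures = [x for x in stream if model(x) != oracle_label(x)]
        failures_per_round.append(len(failures))
        if not failures:
            break
        # Pay the labelling cost for up to `budget` failures this round.
        dataset += [(x, oracle_label(x)) for x in failures[:budget]]
    return dataset, failures_per_round


# The seed set misses the rare region 2.0 <= x < 3.0 (an "unknown unknown").
seed = [(0.0, 0), (1.0, 0), (9.0, 1)]
stream = [i * 0.5 for i in range(21)]  # 0.0, 0.5, ..., 10.0
final_dataset, failures_per_round = run_data_engine(seed, stream, budget=2)
print(failures_per_round)
```

The model code never changes between rounds; only the dataset grows. The failure count can fluctuate from round to round, because a newly added point may flip a neighbour's prediction, but in this sketch it always ends at zero once the under-represented regions are covered.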

**Inspirations**:

{% embed url="<https://karpathy.github.io/2019/04/25/recipe/>" %}
**1. Become one with the data**\
**2. Set up the end-to-end training/evaluation skeleton + get dumb baselines**\
**3. Overfit**\
**4. Regularize**\
**5. Tune**\
**6. Squeeze out the juice**
{% endembed %}

{% embed url="<https://youtu.be/g2R2T631x7k?t=390>" %}
Watch from the 6th to the 15th minute.
{% endembed %}

{% embed url="<https://www.youtube.com/watch?v=zPH5O8hRfMA>" %}

"The only sure certain way I have seen of making progress on any task is, you curate the dataset that is clean and varied and you grow it and you pay the labeling cost and I know that works."

"Potentially nitpicky but competitive advantage in AI goes not so much to those with data but those with a data engine. And whoever can spin it fastest. Slide from Tesla to \~illustrate but concept is general"

[QualEval: Qualitative Evaluation for Model Improvement](https://qualeval.org)

{% embed url="<https://x.com/georgejrjrjr/status/1729996423457091731?s=20>" %}

<figure><img src="https://3980091207-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxwiB3tV6oLM7g7SThvsv%2Fuploads%2FZLViJesUWNL9cgvA8d0d%2Fimage.png?alt=media&#x26;token=4a2c2c93-32f8-423b-800c-72ea83b34f97" alt=""><figcaption><p><a href="https://medium.com/swlh/about-the-long-tail-113e98ce8717">https://medium.com/swlh/about-the-long-tail-113e98ce8717</a></p></figcaption></figure>
