# Data Engine

<figure><img src="/files/gkU9ALXopZBTq34RmNAA" alt=""><figcaption><p>Internal Data Engine high-level design (HLD) at metaforms.ai</p></figcaption></figure>

**Hypotheses**:

* Unknown unknowns: The dataset is always imperfect. Not every scenario is well represented yet, and the data can always be made more diverse.
* Capable base model/architecture: Given a capable base model, improving the dataset improves the AI/product guarantees.

**Inspirations**:

{% embed url="<https://karpathy.github.io/2019/04/25/recipe/>" %}
**1. Become one with the data**\
**2. Set up the end-to-end training/evaluation skeleton + get dumb baselines**\
**3. Overfit**\
**4. Regularize**\
**5. Tune**\
**6. Squeeze out the juice**
{% endembed %}

{% embed url="<https://youtu.be/g2R2T631x7k?t=390>" %}
From the 6th to the 15th minute
{% endembed %}

{% embed url="<https://www.youtube.com/watch?v=zPH5O8hRfMA>" %}

"The only sure certain way I have seen of making progress on any task is, you curate the dataset that is clean and varied and you grow it and you pay the labeling cost and I know that works."

"Potentially nitpicky but competitive advantage in AI goes not so much to those with data but those with a data engine. And whoever can spin it fastest. Slide from Tesla to \~illustrate but concept is general"
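One turn of such a data engine might be sketched as follows: evaluate the current model per scenario, then queue the scenarios that are under-represented or poorly handled for more collection and labeling. This is only an illustrative sketch, not the design in the figure above. The names `Example`, `slice_report`, `labeling_queue`, the scenario labels, and the thresholds `min_count` and `min_accuracy` are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    scenario: str  # hypothetical slice label, e.g. "refund"
    correct: bool  # whether the current model handled this example correctly


def slice_report(examples):
    """Return {scenario: (count, accuracy)} over the evaluated examples."""
    totals, hits = Counter(), Counter()
    for ex in examples:
        totals[ex.scenario] += 1
        hits[ex.scenario] += ex.correct  # True counts as 1, False as 0
    return {s: (totals[s], hits[s] / totals[s]) for s in totals}


def labeling_queue(examples, min_count=3, min_accuracy=0.8):
    """Scenarios that need more labeled data, lowest accuracy first.

    A scenario is queued when it is rare (fewer than min_count examples)
    or weak (accuracy below min_accuracy). Both thresholds are assumed values.
    """
    report = slice_report(examples)
    weak = [s for s, (n, acc) in report.items()
            if n < min_count or acc < min_accuracy]
    # Sort by accuracy, then by example count, so the worst slices come first.
    return sorted(weak, key=lambda s: (report[s][1], report[s][0]))


# Toy evaluation results, invented for illustration.
evaluated = [
    Example("greeting", True), Example("greeting", True),
    Example("greeting", True), Example("greeting", True),
    Example("refund", True), Example("refund", False), Example("refund", False),
    Example("legal question", True),
]
print(labeling_queue(evaluated))  # ['refund', 'legal question']
```

Here "refund" is queued because the model is often wrong on it, and "legal question" because a single example is too little evidence either way. After labeling and retraining, the same report is run again; how quickly that loop repeats is the "spin" the quote refers to.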

[QualEval: Qualitative Evaluation for Model Improvement](https://qualeval.org/#:~:text=QualEval%3A%20Qualitative%20Evaluation%20for%20Model%20Improvement)

{% embed url="<https://x.com/georgejrjrjr/status/1729996423457091731?s=20>" %}

<figure><img src="/files/EDahYlY5RLef3IX1lA43" alt=""><figcaption><p><a href="https://medium.com/swlh/about-the-long-tail-113e98ce8717">https://medium.com/swlh/about-the-long-tail-113e98ce8717</a></p></figcaption></figure>

