
# Data Engine

<figure><img src="/files/gkU9ALXopZBTq34RmNAA" alt=""><figcaption><p>Data Engine high-level design (HLD), used internally at metaforms.ai</p></figcaption></figure>

Hypotheses:

* Unknown unknowns: the dataset is always imperfect. Not every scenario is well represented yet, and the data can always be more diverse.
* Capable base model/architecture: given a capable base model, improving the dataset improves the AI/product guarantees.
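
Read together, the two hypotheses suggest a loop: find the scenarios the dataset covers poorly or the model handles badly, then grow the dataset there. Below is a minimal sketch of the "find" step. It assumes each example carries a hypothetical `scenario` tag and a `correct` flag from the current model. The names and thresholds are illustrative assumptions, not the metaforms.ai implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    scenario: str   # hypothetical slice tag, e.g. "night" or "rain"
    correct: bool   # did the current model handle this example correctly?


def scenarios_to_grow(examples, min_count=3, max_error=0.2):
    """Return (scenario, count, error_rate) for slices that are rare or failure-prone.

    min_count and max_error are illustrative thresholds, not tuned values.
    """
    counts = Counter(ex.scenario for ex in examples)
    errors = Counter(ex.scenario for ex in examples if not ex.correct)
    flagged = []
    for scenario, n in counts.items():
        error_rate = errors[scenario] / n
        if n < min_count or error_rate > max_error:
            flagged.append((scenario, n, error_rate))
    # Highest error rate first; ties broken by the rarest slice.
    return sorted(flagged, key=lambda row: (-row[2], row[1]))


# Toy data: "day" is well covered, "night" is rare, "rain" always fails.
examples = (
    [Example("day", True)] * 10
    + [Example("night", True), Example("night", False)]
    + [Example("rain", False)] * 4
)
for scenario, n, error_rate in scenarios_to_grow(examples):
    print(f"{scenario}: {n} examples, error rate {error_rate:.2f}")
```

The flagged slices are where new data collection and labeling effort would go first. After retraining, the same check can be run again on fresh data.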

**Inspirations**:

{% embed url="<https://karpathy.github.io/2019/04/25/recipe/>" %}
**1. Become one with the data**\
**2. Set up the end-to-end training/evaluation skeleton + get dumb baselines**\
**3. Overfit**\
**4. Regularize**\
**5. Tune**\
**6. Squeeze out the juice**
{% endembed %}

{% embed url="<https://youtu.be/g2R2T631x7k?t=390>" %}
Watch from the 6th to the 15th minute.
{% endembed %}

{% embed url="<https://www.youtube.com/watch?v=zPH5O8hRfMA>" %}

"The only sure certain way I have seen of making progress on any task is, you curate the dataset that is clean and varied and you grow it and you pay the labeling cost and I know that works."

"Potentially nitpicky but competitive advantage in AI goes not so much to those with data but those with a data engine. And whoever can spin it fastest. Slide from Tesla to \~illustrate but concept is general"

[QualEval: Qualitative Evaluation for Model Improvement](https://qualeval.org/#:~:text=QualEval%3A%20Qualitative%20Evaluation%20for%20Model%20Improvement)

{% embed url="<https://x.com/georgejrjrjr/status/1729996423457091731?s=20>" %}

<figure><img src="/files/EDahYlY5RLef3IX1lA43" alt=""><figcaption><p><a href="https://medium.com/swlh/about-the-long-tail-113e98ce8717">https://medium.com/swlh/about-the-long-tail-113e98ce8717</a></p></figcaption></figure>

