ADSX
AI FUNDAMENTALS

Training Data

The large dataset of text used to teach an LLM language patterns and knowledge.

DEFINITION

What is Training Data?

Training data is the massive collection of text used to train large language models. It typically includes books, websites, articles, forums, and other text sources, often totaling trillions of words. The quality, recency, and composition of training data directly affect what an LLM knows about any given topic, including your brand. Most LLMs have a knowledge cutoff date: they only know information that appeared in their training data up to that point.
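To make the knowledge cutoff idea concrete, here is a minimal sketch in Python. The cutoff date, page names, and the helper function are hypothetical, chosen only for illustration. Being published before the cutoff makes content eligible for a training set; it does not guarantee the content was actually included.

```python
from datetime import date


def could_be_in_training_data(published: date, knowledge_cutoff: date) -> bool:
    """Return True if content was published on or before the cutoff.

    This only checks eligibility by date; whether a page was actually
    collected and kept in a training set is a separate question.
    """
    return published <= knowledge_cutoff


# Hypothetical cutoff and publication dates, for illustration only.
cutoff = date(2024, 6, 1)
pages = {
    "press-release": date(2024, 3, 15),
    "product-update": date(2024, 9, 2),
}

eligible = {name: could_be_in_training_data(d, cutoff) for name, d in pages.items()}
print(eligible)
```

In this sketch the press release predates the cutoff and so could have been seen during training, while the product update could only reach the model through a later training run or a retrieval system.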

IN PRACTICE

We help ensure the information about your brand that exists in training data is accurate, positive, and comprehensive.

WHY IT MATTERS

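Your brand's presence in training data affects how LLMs understand and recommend you. Content published before an LLM's knowledge cutoff can become part of its foundational knowledge if it is included in the training set; content published afterward is unknown to the model unless it is retrieved at answer time.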

EXAMPLES

01  Information from your website being included in training data

02  Reviews and mentions from third-party sites

03  News articles and press coverage about your brand

FREQUENTLY ASKED QUESTIONS

Can I add my content to LLM training data?

You can't add content directly, but quality content that is widely published and linked to is more likely to be included in future training runs.

What if incorrect information is in training data?

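This is challenging but manageable. Incorrect information stays in a model until it is retrained, but newer content and retrieval systems, which pull in current sources at answer time, can partially override what the model learned from training data. Optimization strategies can also improve how your brand is represented.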
