Training Data

The large dataset of text used to teach an LLM language patterns and knowledge.

DEFINITION

What is Training Data?

Training data is the massive collection of text used to train large language models. It typically includes books, websites, articles, forums, and other text sources, often comprising trillions of words. The quality, recency, and composition of training data directly affect what an LLM knows about any given topic, including your brand. Most LLMs have a knowledge cutoff date: they know only what appeared in their training data up to that point.

IN PRACTICE

We help ensure the information about your brand that exists in training data is accurate, positive, and comprehensive.

WHY IT MATTERS

Your brand's presence in training data affects how LLMs understand and recommend you. Content published before an LLM's knowledge cutoff can become part of its foundational knowledge, provided it was included in the data the model was trained on.
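To make the cutoff idea concrete, here is a minimal sketch that checks which pieces of content were published early enough to be candidates for a model's training data. The cutoff date, page names, and the `published_before_cutoff` helper are all hypothetical, chosen only for illustration; real cutoff dates vary by model.

```python
from datetime import date

# Hypothetical cutoff date, assumed for illustration only.
KNOWLEDGE_CUTOFF = date(2024, 6, 1)


def published_before_cutoff(published: date, cutoff: date = KNOWLEDGE_CUTOFF) -> bool:
    """Return True if the content was published on or before the cutoff."""
    return published <= cutoff


# Hypothetical pages and their publication dates.
pages = {
    "product-launch": date(2023, 11, 15),
    "rebrand-announcement": date(2024, 9, 2),
}

# Keep only the pages that could have been seen during training.
eligible = [name for name, published in pages.items() if published_before_cutoff(published)]
print(eligible)  # ['product-launch']
```

Being published before the cutoff only makes content eligible. Whether it was actually collected and used for training is a separate question that this check cannot answer.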

EXAMPLES

Information from your website

Reviews and mentions from third-party sites

News articles and press coverage about your brand

FREQUENTLY ASKED QUESTIONS

Can I add my content to LLM training data?

You can't add content directly, but quality content that is widely published and linked is more likely to be included in future training runs.

What if incorrect information is in training data?

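This is challenging but manageable. Newer content can correct the record in future training runs, retrieval systems that pull in current sources at answer time can partially override what a model learned in training, and optimization strategies can improve how your brand is represented.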

MORE AI FUNDAMENTALS TERMS

Large Language Model (LLM)

An AI system trained on massive text datasets to understand and generate human-like text.


AI Assistant

A software application powered by AI that helps users complete tasks through natural conversation.


Natural Language Processing (NLP)

The branch of AI focused on enabling computers to understand, interpret, and generate human language.


Knowledge Cutoff

The date after which an LLM has no information from its training data.