Is data scale fuelling your AI? Or just your AWS bill?
The AI win built on only 176,000 ancient tweets.
“When is a dataset big enough?” - this question comes up in nearly every one of my AI workshops.
Many think AI learns by the gigabyte, when it actually learns by the example.
Call-centre logs repeating the same scripted greeting offer repetition, not learning.
(That takes up space but adds no depth.)
AI works best when data is focused - a small dataset can outperform a giant one if it's dialled into the use case.
Take Aeneas,
(not the Trojan hero, but still well-trained)
an AI tool created by Google DeepMind to piece together damaged Latin inscriptions.
The Roman Empire doesn’t post much new data, so the archive is pretty thin:
176,000 inscriptions - most shorter than a tweet.
GPT-4, trained on roughly the equivalent of 15M books, still can't reliably unravel these texts.
Aeneas, meanwhile,
with the equivalent of just 40 books,
goes beyond recovering the words - it also maps where fragments came from and when they were carved:
- Restores missing text with 73% accuracy.
- Identifies the home province 72% of the time.
- Pins the date to within a 13-year window.
One historian who worked with Aeneas described the impact:
“Aeneas’ parallels completely changed my perception of the inscription. It noticed details that made all the difference for restoring and chronologically attributing the text.”
That level of precision came from feeding Aeneas the right examples - you don’t need the entire internet to solve every problem.
You need focused, representative data aligned to the use case.
Without that, you’re just training on noise, wasting money, and filling up hard drives.
Find it - and you might just conquer your own empire.
See the amazing work by Google on Aeneas here:
Aeneas AI: Restoring the Roman Empire
P.S. You’ll also find a bonus link there to Ithaca - Aeneas’ AI “brother”, which restores ancient Greek inscriptions.