Most Meaningful Dates on the Web and for an LLM
May 14 2026
I recently read this blog post by David Hagen, who figured out why the 11th of every month in ordinal form (‘February 11th’) occurs so much less often than every other date (except September, of course) in xkcd #1140 (reproduced below):
This graphic was generated using data from the Google N-grams corpus. But we have much bigger corpora now thanks to all the language models everyone is training (and deploying as products with no regard for copyright, fair use, or possible harms). These corpora are mostly scraped from the web, so we can ask: What does a similar “Calendar of Meaningful Dates” on the web look like?
Firstly, I should have stuck to just ordinal dates as Randall Munroe did, because there are a lot of ways of representing a date. In the end I focused on 4 forms:
- Full month and date: June 12
- Full month and ordinal date: June 12th
- Abbreviated month and date: Jun 12 [1]
- Abbreviated month and ordinal date: Jun 12th
Within each there are a couple of variations: the date could come before the month, there could be a period after the abbreviated months, and single-digit dates could have a zero before them. Using the nifty infini-gram mini API, querying the DCLM corpus containing over 4 billion tokens (≈1.5 billion words, mostly filtered from Common Crawl), and sizing dates by their count rank, we end up with the following calendar.
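To give a feel for how quickly these variations multiply, here is a minimal sketch that enumerates the query strings for one date. It is not the code from the GitHub repository: the function names (`ordinal`, `date_variants`) are hypothetical, the three-letter abbreviations are an assumption, and so is the choice to zero-pad only the plain numeric form (`June 05`, but not `June 05th`). Each resulting string would then be sent to the count API, and the counts summed per date.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def abbreviations(month: int) -> list[str]:
    """Short forms assumed for a month (1-12)."""
    if month == 5:   # May has no abbreviated form
        return []
    if month == 9:   # September has at least two
        return ["Sep", "Sept"]
    return [MONTHS[month - 1][:3]]


def ordinal(day: int) -> str:
    """Return the day with its English ordinal suffix, e.g. 12 -> '12th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def date_variants(month: int, day: int) -> set[str]:
    """Enumerate the string forms of one calendar date to look up."""
    names = [MONTHS[month - 1]]
    for abbr in abbreviations(month):
        names += [abbr, abbr + "."]        # with and without a period

    day_forms = [str(day), ordinal(day)]
    if day < 10:
        day_forms.append(f"0{day}")         # zero-padded, numeric form only

    variants = set()
    for name in names:
        for d in day_forms:
            variants.add(f"{name} {d}")     # month first: "June 12th"
            variants.add(f"{d} {name}")     # day first: "12th June"
    return variants
```

Under these assumptions, June 12 alone expands to 12 distinct strings and September 11 to 20, which is why enumerating every possible form gets out of hand quickly.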
One caveat: sizing by rank tends to amplify differences, but that’s what’s necessary to notice patterns in the data. I couldn’t find the algorithm Randall Munroe used for the xkcd comic, but after trying a bunch of different ways to scale font sizes, I decided that rank-based scaling of date counts was the best.
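A minimal sketch of what rank-based scaling can look like, under my own assumptions: the function name `rank_scaled_sizes`, the 8 to 40 point range, and the linear mapping from rank to size are all illustrative choices, not necessarily what the repository does. Dates with equal counts share a rank.

```python
def rank_scaled_sizes(counts: dict[str, int],
                      min_size: float = 8.0,
                      max_size: float = 40.0) -> dict[str, float]:
    """Map each date to a font size based on the rank of its count.

    The rarest date gets min_size, the most common gets max_size, and
    everything in between is spaced evenly by rank rather than by raw count.
    """
    distinct = sorted(set(counts.values()))
    if len(distinct) <= 1:
        # All counts equal (or empty input): use the middle of the range.
        return {date: (min_size + max_size) / 2 for date in counts}

    rank_of = {count: rank for rank, count in enumerate(distinct)}
    top_rank = len(distinct) - 1
    return {
        date: min_size + (max_size - min_size) * rank_of[count] / top_rank
        for date, count in counts.items()
    }


# Made-up counts, purely for illustration.
example = {"January 1": 1_000_000, "June 12": 500, "February 29": 10}
sizes = rank_scaled_sizes(example)
```

With these made-up counts, June 12 lands exactly halfway between the smallest and largest font, even though its count is much closer to February 29's than to January 1's. That is the amplification mentioned above: scaling linearly by raw count would make June 12 and February 29 nearly indistinguishable.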
The source code and raw count data are up on GitHub if you want to look through them. A few observations and unanswered questions:
- January 1, September 11 and July 1 are the most common dates. I think July 1 is artificially high though because it marks the halfway point of the year. A lot of half-yearly reports, announcements, and articles posted online are probably posted on that date.
- February 29th is the least frequent date on the web, which isn’t too surprising. But the 3 rarest dates after? December 24th, 25th and 26th. This surprised me initially, but I think it’s because everyone just refers to them as Christmas Eve, Christmas and Boxing Day, rather than the actual date [2].
- The US and Western centricity of the web is apparent from the scarcity of dates around Thanksgiving weekend and Christmas Day. A lot of dates on the web are bylines to (or mentioned in) articles, blogs, and social media posts, of which there are fewer during the holidays.
- There is a peak around the 15th in every month. My only hypothesis is there is a lot of stuff that happens on a bi-weekly schedule that gets published on that date.
- October has the fewest dates mentioned on the web 🤔 August and May dates are also relatively rare. Not sure why.
- Relatedly, why are the first 10 days in November and December much more frequent than in most other months?
What are ‘meaningful’ dates for a language model?
DCLM is a pretty big corpus of web text derived mostly from Common Crawl data, but to build a useful language model you need more diverse data. Luckily infini-gram mini lets us also query The Pile, a smaller, but more diverse open source language modeling dataset from EleutherAI. The Pile contains more code, research papers, and (controversially) copyrighted books as well. Querying this dataset for all date variations leads to this interesting calendar:
Again, the relative sizes of dates are exaggerated due to my rank ordering, but the differences (and similarities) from the DCLM calendar are fascinating. 9/11 is the 4th most common date, and December 31 leaps into 3rd place. There are more dates from October and May, but August dates are still rare. The weeks around Christmas and Thanksgiving are still low in count, and the peaks around the 15th remain, but January is no longer full of highly frequent dates. March dates are more frequent in The Pile than in DCLM.
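One way to make this kind of comparison systematic is to compute how far each date moves in the ranking between the two corpora. The sketch below assumes per-date counts are available as plain dictionaries; the function names and the example numbers are made up for illustration and are not the real DCLM or Pile counts.

```python
def rank_positions(counts: dict[str, int]) -> dict[str, int]:
    """Position of each date when sorted by count (1 = most common)."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {date: pos for pos, date in enumerate(ordered, start=1)}


def rank_shifts(counts_a: dict[str, int],
                counts_b: dict[str, int]) -> list[tuple[str, int]]:
    """How many places each date climbs (positive) or drops (negative)
    going from corpus A to corpus B, largest movers first."""
    pos_a = rank_positions(counts_a)
    pos_b = rank_positions(counts_b)
    shared = pos_a.keys() & pos_b.keys()
    shifts = [(date, pos_a[date] - pos_b[date]) for date in shared]
    return sorted(shifts, key=lambda item: abs(item[1]), reverse=True)


# Made-up counts, purely for illustration.
corpus_a = {"January 1": 900, "September 11": 800, "July 1": 700, "December 31": 100}
corpus_b = {"January 1": 900, "December 31": 600, "September 11": 500, "July 1": 400}
shifts = rank_shifts(corpus_a, corpus_b)
```

In this toy example December 31 climbs two places while September 11 and July 1 each drop one, so it comes out first in the list of movers.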
There are probably more interesting questions and patterns in the data. Let me know if you find anything, or see any issues with my code!
[1] May doesn’t have an abbreviated form. September has at least two: Sept and Sep. The other months might have more, but I realized I would never get this done if I tried to enumerate all variations and date forms.
[2] July 1 is bigger than July 4 for a similar reason: Americans refer to it overwhelmingly as ‘4th of July’.