Friday, August 7, 2009

Finding the Next Dot-Com

It seems like everyone and their mom has a domain name nowadays, and domain squatters have the rest.

The land grab is speeding up. VeriSign says that more than 2.4m new .coms are registered every month. Virtual real estate is just a name, but domain auctions sometimes make actual real estate look cheap--and even if you aren't paying $7.4m for "", a lot of domain names are expensive. If it's pronounceable, paying a few thousand bucks is apparently normal.

All this squatting and grabbing makes life harder for people who actually want to build a web presence. Just look at the names of some of the newer companies around here: Ooyala, anyone? DealKat? SaaSure? That first one reads like a misspelling of a French exclamation. The second is a misspelling. The third is a pun on SaaS, which is usually pronounced "sass", so it sounds like a lisp ("sass-sure"?).

These guys seem to have gotten what lots of people want: an unregistered domain name. Those sell for about $10, renewed yearly; the details depend on the specific TLD you want. In any case, they're vastly cheaper than domains that are already registered, whether by previous legitimate users or by squatters.

After spending half an hour recently typing stuff into whois, looking for the coveted "not found" that would tell me a name was unregistered, I found nothing. I thought: there's got to be a better way...

So I wrote a few quick-and-dirty programs that generate domain names, then used a shell script to test them automatically and pick out the unregistered ones. I wanted domains that were reasonably short and looked and sounded like English words. Since all the tricks I tried to that end rely on a large English word list, and since I didn't feel like waiting, I wrote these mini-programs in C++.

Attempt #1 goes through a list of English words and builds a map of each 'syllable' along with how often that syllable occurs. For syllables, I just used all the substrings of 2-5 letters in each word that had at least one vowel. The most common were es, in, er, ed, ing...

Then I combined two syllables at a time, starting with the most common and the second most common.

The names got a bit more complicated after that, but they weren't very wordlike. Worse, the first 1000 were all four or five letters long, so despite sounding like gibberish, almost all of them were taken.

Attempt #1 results:

500 domains
53 of which happen to be in the original wordlist. None of those was available.
Of the remaining 447, 2 were available: 0.4%

I needed something better...

Attempt #2, after a bit of experimenting, is a Markov model of the words in the dictionary. For every three-character sequence in the input word list, I kept track of all the characters that can come next. I included the null character at the end of each word, to model the length of English words as well as the characters they're made of.

To generate names, I sampled the most common three-character sequences at the beginning of dictionary words. For each sequence, I randomly picked from the characters that could come next, and repeated that until I got to a null character.

This one worked a lot better:

I especially like one of them. Maybe a communications or PR agency? It's a sweet domain.

These names were generally a bit longer, but pronounceable. A lot more of them were available--it seems like raw length is what the squatters really care about.

Attempt #2 results:

500 domains
122 of which happen to be in the original wordlist. None of those were available.
Of the remaining 378, 61 were available: 16.4%

Possibilities abound. I could...
  • Artificially crank up the probability of the null character for every seed in the Markov model, to make it produce shorter domain names while still trying to keep them wordlike.
  • Exclude long words from the wordlist and then build the Markov model, also in an effort to produce shorter domain names.
  • Use the Markov model to generate two words at a time and simply concatenate them, to get compound-word domain names.
  • Repeat the experiment for languages other than English.
  • Run a bunch of popular domains and less popular ones through the Markov model to see how they score. Are domains that look like words more popular than those that don't?
  • Compare some sites with a lot of direct traffic (people typing the domain into their URL bars, instead of going through a link or a search engine) and see if those tend to be more wordlike than their less typeable counterparts.

I might try some of this stuff. If I do, I'll keep you posted. Peace...

1 comment:

Feross said...

This is AWESOME, Daniel.

I'd like to see the results you'd get if you used the Markov model to generate two words and concatenate them together. I bet you'd find some really interesting domains.

This is the coolest thing I've seen in a long while.