Sunday, January 31, 2010

Seattle Startup Companies, What Are They Running?

Take a list of companies, courtesy of Seattle 2.0, and break it down to a list of URLs that looks like this:
...
http://www.zillow.com
http://www.Picnik.com
http://www.buddytv.com
http://www.Smilebox.com
http://failblog.org/
http://icanhascheezburger.com/
http://www.wetpaint.com
http://www.feedjit.com
http://www.questionpro.com/
...
then feed that list (saved as startup-index.txt) through a shell script that looks like this:
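#!/bin/bash
# Fetch the Netcraft report for every URL listed in startup-index.txt
while IFS= read -r line
do
    # Quote the URL so the shell does not treat "?" as a glob character
    wget "http://uptime.netcraft.com/up/graph?site=$line"
done < startup-index.txt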
to get a few hundred files from Netcraft.com's "What's That Site Running?" page in the local directory.
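If you want tidier file names and a gentler request rate, the same loop can be wrapped in a few small functions. This is only a sketch: the function names, the graph-HOST.html naming scheme, and the one-second pause are my assumptions, not part of the original script. The names still start with "graph", so the filters further down keep working.

```shell
#!/bin/bash
# Sketch only: function names, the output file naming, and the one-second
# pause are assumptions, not part of the original script.

# Build the Netcraft report URL for one site URL.
netcraft_url() {
  printf 'http://uptime.netcraft.com/up/graph?site=%s\n' "$1"
}

# Turn a site URL into a local file name that still starts with "graph".
graph_file() {
  local host=${1#*://}   # drop the scheme
  host=${host%%/*}       # drop any path or trailing slash
  printf 'graph-%s.html\n' "$host"
}

# Fetch one report per non-empty line of the given list file.
fetch_all() {
  local site
  while IFS= read -r site; do
    [ -n "$site" ] || continue
    wget -q -O "$(graph_file "$site")" "$(netcraft_url "$site")"
    sleep 1              # be polite to the remote server
  done < "$1"
}
```

With those defined, `fetch_all startup-index.txt` does the same job as the loop above.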

Now the fun really begins: the filtering!  We want to see what each company is using to put their best face forward to the public.  It all starts at the server.  No server, no nothing.  Every one of the 'graph' files is an HTML-formatted page with lots of extraneous information.  This filter collects the list of ALL server platforms that appear in those pages, past or present.
grep -h '<span style="white-space: nowrap">' graph* | sort | uniq
and the result of that is:
F5 Big-IP
FreeBSD
Linux
NetBSD/OpenBSD
Solaris 9/10
unknown
Windows Server 2003
Windows Server 2008
Now save that list as servers.txt and feed it through this script, which narrows the search and gives us our server numbers:
#!/bin/bash
# For each platform name, count the matching lines across all graph files
while IFS= read -r line
do
    echo "$line," $(grep -h "$line.*<span" graph* | wc -l)
done < servers.txt
In this case (December 2009) we get the following:
F5 Big-IP, 21
FreeBSD, 8
Linux, 289
NetBSD/OpenBSD, 1
Solaris 9/10, 3
unknown, 11
Windows Server 2003, 71
Windows Server 2008, 22
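Raw counts are easier to compare as shares of the total. Here is a minimal sketch, assuming the tally is available as "name, count" lines on standard input; the function name `shares` is made up for illustration. Keep in mind that these counts are matching lines in the graph files, past and present, not a count of distinct companies.

```shell
#!/bin/bash
# Sketch only: the function name is an assumption made for illustration.

# shares: read "name, count" lines on stdin and print each line again
# with its percentage of the overall total appended.
shares() {
  awk -F', *' '
    NF >= 2 { name[++n] = $1; count[n] = $NF; total += $NF }
    END {
      if (total == 0) exit
      for (i = 1; i <= n; i++)
        printf "%s, %d, %.1f%%\n", name[i], count[i], 100 * count[i] / total
    }'
}
```

Pipe the output of the tally script into `shares` to get a third column with the percentages.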
On to web servers.  Using the same process, distill the list of web servers from the "graph" files with a filter like this:
grep -h ' was running ' graph* | sort | uniq > webservers.txt
which needs some editing and final refinement with
grep -h '.*' webservers.txt | sort | uniq > webservers
then we are ready to apply the script that will tally the numbers:
#!/bin/bash
# For each web server name, count the matching lines across all graph files
while IFS= read -r line
do
    echo "$line," $(grep -h "$line on" graph* | wc -l)
done < webservers
And the final score is:
Apache, 223
Apache-Coyote, 13
Google, 5
Jetty(6.1.5), 1
KWS, 1
lighttpd, 1
Microsoft-IIS, 103
Mongrel, 13
nginx, 51
Ning, 2
Resin, 1
Server, 2
SSWS, 3
Sun-ONE-Web-Server, 1
thin, 2
Thrivesmart, 1
unknown, 2
WWW, 2
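A tally like this can also be collapsed into a two-way split, one name against everything else. This is a rough sketch, assuming the same "name, count" lines on standard input; the function name `split_tally` and the "Everything else" label are my own inventions.

```shell
#!/bin/bash
# Sketch only: the function name and the "Everything else" label are
# assumptions made for illustration.

# split_tally NAME: read "name, count" lines on stdin and print two lines,
# the count for NAME and the combined count for all other names.
split_tally() {
  awk -F', *' -v pick="$1" '
    NF >= 2 { if ($1 == pick) hit += $NF; else rest += $NF }
    END {
      printf "%s, %d\n", pick, hit
      printf "Everything else, %d\n", rest
    }'
}
```

For example, piping the web server tally into `split_tally Microsoft-IIS` prints one line for IIS and one for all the rest combined.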
The whole point of this exercise is to expose what really happens behind the curtain, and the obvious rivalry is Microsoft against Open Source and everybody else.  The following charts are mashups from the results above.
Any questions?