<< June 2009 | Home | August 2009 >>

Stack Overflow: Simple Data Dump Download Archive

7-Zip or BitTorrent an inconvenience for you? This page is for you!

The folks over at Stack Overflow publish their cc-wiki licensed data dumps as a .7z file via BitTorrent.  For some, this poses a small logistical problem.  Thanks to xtendx's large pipe, I've elected to host these data dump archives on one of our servers. These downloads won't even be noticeable for us.

Stack Overflow Data Dump

Because the data dumps are a snapshot, and do not contain anything like a change log or history, this page will contain all the dumps over time as they are released. So far, that is only two dumps, but the intent seems to be one a month.

The data dump files have also been "trans compressed" from 7-zip to bzip2'd tar files. Those folks on a *nix system will probably appreciate this.
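
For the curious, a trans-compression like this could be done along the following lines. This is a minimal sketch, assuming the `7z` command from p7zip and a `tar` with bzip2 support are installed. The function names and the archive name in the usage comment are made up for illustration, and this is not necessarily how the files on this page were produced.

```shell
#!/bin/sh
# Sketch: repack a .7z archive as a bzip2-compressed tarball.
# Assumes the p7zip command-line tool `7z` and a tar that supports -j.

# Derive the output name: foo.7z -> foo.tar.bz2
tarball_name() {
    printf '%s.tar.bz2\n' "${1%.7z}"
}

# Extract the archive into a scratch directory, then tar + bzip2 the contents
# into the current directory.
transcompress() {
    archive=$1
    out=$(tarball_name "$(basename "$archive")")
    workdir=$(mktemp -d) || return 1
    7z x -o"$workdir" "$archive" >/dev/null || { rm -rf "$workdir"; return 1; }
    tar -cjf "$out" -C "$workdir" .
    status=$?
    rm -rf "$workdir"
    return $status
}

# Usage (the file name is hypothetical):
#   transcompress so-export-2009-07.7z
```

Extracting to a scratch directory first costs some disk space, but it keeps the sketch simple and works no matter how many files the archive holds.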

I hope this makes someone's life a little bit easier. If so, let me know!

 

My first Erlang program: Hello World!

Also, 'Hello Erlang!' and 'Hello YAWS!'

Sigh... It's not every day that a programmer tries out a new language, but that is exactly what I did this evening!

A few months ago some of the hype around Erlang caught my eye.  Specifically, Erlang's approach to parallel processing and multi-threading is interesting to me because 1) multiple cores are becoming the norm and increasing in number, 2) the existing approaches by popular platforms and architectures, including my beloved Java, may not scale out as well with more cores, and 3) streaming media servers, which butter my bread, seem like a great application for Erlang.

The first thing I did was download the source to Erlang, compile it, and then install.  Tim Dysinger has a wonderfully concise blog post (Compiling Erlang on Mac OS X Leopard from Scratch) which I have reproduced below, slightly altered to my way of doing things:
cd /tmp
wget http://www.erlang.org/download/otp_src_R12B-5.tar.gz
tar xzf otp_src_R12B-5.tar.gz
cd otp_src_R12B-5
./configure --enable-hipe --enable-smp-support --enable-threads
make
sudo make install
cd ~
And with that Erlang is ready to go.  Next I found a great "Hello World!" blog post from Edward Garson's blog: The *Real* Erlang "Hello, World!".  Garson has an opinion on what should be a learner's first Erlang program, that it should use the Actor Model, a core approach of the language.  So, the code:
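-module(hello).
-export([start/0]).

%% Spawn the actor process and return its pid.
start() ->
    spawn(fun() -> loop() end).

%% Wait for messages: greet on hello and keep going, stop on goodbye.
loop() ->
    receive
        hello ->
            io:format("Hello, World!~n"),
            loop();
        goodbye ->
            ok
    end.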
With a quick copy from Garson's blog into a new text file, hello.erl, in my home directory, I was ready to compile and run in the shell:
manoa:~ stu$ erl
Erlang (BEAM) emulator version 5.6.5 [source] [smp:2] [async-threads:0]
[hipe] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> c(hello).
{ok,hello}
2>
2>
2> Pid = hello:start().
<0.37.0>
3> Pid ! hello.
Hello, World!
hello
4> Pid ! goodbye.
goodbye
5> halt()
5> .
manoa:~ stu$
Fantastic!  Afterward, I went through the official "Getting Started with Erlang" page at erlang.org, which touches lightly on deeper topics, such as the Erlang shell, job control, distributed Erlang, and debugging. 

Also on my googles this evening I found criticisms of Erlang, which is healthy to see in the community.  An example would be Damien Katz's (creator of CouchDB, which is written in Erlang) "What Sucks About Erlang".  It was good to see the ugly side of Erlang up front before I become too enamored. 

Next steps?  I need to go through the Hello World and Getting Started pages again, write some little programs, and then try to do something with YAWS, a web server written in Erlang.  I have a project in mind already, which is on one hand very motivating, but on the other probably means I will skip or miss some basic concepts and learn them the hard way.

What is the project, you ask? Well, I'd like to write an FLV pseudo-streaming module for YAWS and potentially an accompanying FLV indexing tool. I've written one from scratch in Java for xtendx, and there are plenty of other implementations in PHP, C, etc. for other web servers.
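
For readers unfamiliar with the term: in the common form of FLV pseudo-streaming, the player asks for the file starting at the byte offset of a keyframe (often passed as a `start` query parameter), and the server answers with a fresh FLV header followed by the file's bytes from that offset onward. Below is a rough shell sketch of just that server-side step. It is not Erlang and not the xtendx implementation; the function name is hypothetical, and the 13-byte prefix follows the published FLV file layout as I understand it.

```shell
#!/bin/sh
# Sketch of the server-side step of FLV pseudo-streaming.
# flv_from_offset FILE OFFSET writes a playable stream to stdout.
# OFFSET is a zero-based byte position and is assumed to point at the
# start of a keyframe tag (an indexing tool would supply such offsets).
flv_from_offset() {
    file=$1
    offset=${2:-0}

    if [ "$offset" -le 0 ]; then
        # No seek requested: the file already starts with its own header.
        cat "$file"
        return
    fi

    # Fresh 13-byte prefix: "FLV", version 1, flags 0x05 (audio + video),
    # header length 9, then PreviousTagSize0 = 0.
    printf 'FLV\001\005\000\000\000\011\000\000\000\000'

    # tail -c +N counts from 1, so add one to the zero-based offset.
    tail -c +"$((offset + 1))" "$file"
}
```

A real module would also validate the offset against an index of keyframes and set the HTTP headers; the sketch leaves both out.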

Hello Erlang!

Our DL380 G5 Servers Turn Two Years Old

Happy Birthday, My Babies!

It was two years ago that the team at xtendx set up a pair of new servers at a 'proper' hosting company, Aspectra, with quality infrastructure, engineers and network peering in support of new, demanding customer requirements.  (Our previous hosting company was second rate, at best, and our entry level servers were getting old.)  After shopping for a new hosting company, and selecting Aspectra, we then began analysis/negotiations regarding the hardware.  Our requirements boiled down to these critical items:
  • Hardware Vendor must be an industry leader and well supported by hosting company
  • Support from the hardware vendor for a modern flavor of Linux
  • The hardware itself should have a solid degree of internal redundancy
  • Expandable/upgradable in the future
  • There should be a 'hot backup' server
  • A large, shared, performant, fully backed-up file system 
This turned out to be a pretty straightforward decision, as Aspectra was very familiar with HP's product line and had a large spare parts inventory on site.  The HP 300-series servers met our redundancy requirements.  And a SAS-based storage solution was more than adequate.  In the end, we...
  • (2x) HP DL380 G5 - purchased
    • A single E5320 "Clovertown" 4-core 1.86GHz Intel Xeon CPU (other CPU socket is open)
    • 4GB of PC2-5300 DDR2 RAM in 2 modules (6 slots free)
    • A pair of 10k RPM 72GB 2.5" disks in a RAID 1 (mirrored) configuration
    • Red Hat Enterprise Linux 4
  • (1x) HP MSA 1500 - shared, leased
    • (4) 300GB 3.5" drives in a RAID 1+0 (mirrored + striping) configuration
    • Fiber Optic SAS connection

Front View

xtendx HP DL380 G5 streaming servers

Note that on both server chassis the handle of one hard disk drive is ajar.  This picture was taken during the initial setup and configuration, and the drive is disconnected to test that a) the RAID 1 array does not fail, and b) the monitoring system reports a failure to the engineers.

Rear View

xtendx HP DL380 G5 streaming servers

Again, note that one of the power supplies is disconnected and has no green power light.  This photograph was also taken while the servers were undergoing initial setup and configuration, and the power cable is disconnected to test that a) the server continues to function with only one power supply, and b) the monitoring system reports a failure to the administrators.

(Original) Storage Array

HP MSA 1500 in action

Four of these 300GB drives were dedicated to our servers, giving us a usable ~575GB of space.  Since the system was built up we have expanded our storage twice.  Now the storage is in another chassis and we have ~1.5TB to work with.

We made a solid choice with these servers and they should last us another two or three years, even with current customer growth and additional features accounted for.  While we have not upgraded the memory or CPUs yet, that is sure to happen in the next year.  That said, the current platform's limitations have forced me to continuously tune my application in multiple dimensions: reduce memory consumption, reduce CPU load, and keep response times low.  This has been a good thing for both my application and my programming skills.

Anyway, "Happy Birthday", my babies!

The Personalities of Stack Overflow by the Numbers

A look for patterns in reputation, posts and profile views of the 'personalities' on Stack Overflow


After reading some commentary on meta.stackoverflow.com about the famous users of Stack Overflow, and the 'star power' of the two founders, I decided to have a look at the one metric that can easily be interpreted as a measure of name recognition: Profile Views.  The first resulting graph is below and contains the Top 1000 users by reputation score.
 Reputation (x-axis) vs. Votes (y-axis) vs. Profile Views (z-axis)

Stack Overflow: Reputation Score versus Total Votes Cast versus Profile Views


While much of the graph meets my expectations (Jon Skeet has by far the most profile views, the top 10 users generally have larger bubbles), there are some interesting data points:
  • Rich B, the most prolific down voter and a bit of a rabble rouser among SOpedians both on and off the site, has a huge bubble for someone not in the Top 10 or even Top 100 of users
  • Two other Top 1000 users, but not Top 10 or 100, have relatively large Profile View bubbles: Darron and Ates Goral.  I am not familiar with them, although they are apparently popular by this measure.  
What was not clear was what the characteristics of the Top 10/100/1000 users were, when ranked by Profile Views.  Do they just post volumes of answers and questions?  Are profile views tied to reputation?  So I whipped up the graph below looking for answers. Results are limited to users with reputation of more than 100.

 Reputation (x-axis) vs. Posts (y-axis) vs. Profile Views (z-axis)

Stack Overflow: Reputation Score versus Posts versus Profile Views


This graph turns out to be much more of a surprise. While some users' data points are very much predictably prominent (Atwood, Spolsky, Skeet, Gravell, de Lizard, and Rich B), others strike me as unusual:
Why have these three users had so many profile views?  I've never heard of these folks and their reputation scores are relatively low.

Beyond these three users, the correlation between both a) reputation and profile views, and b) posts and profile views is weaker than I expected. Note the density of Top 100 user data points (red) with fewer than 500 posts and reputation scores well under 10,000. Interesting...
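
One way to put a number on "weaker than I expected" is a Pearson correlation coefficient over the same user data. The sketch below is an illustration, not the analysis behind the graphs above. It assumes a users file in which each user is a single `<row ... />` line carrying `Reputation` and `Views` attributes; the function names are made up, and the default cutoff of 100 mirrors the reputation filter used for the second graph.

```shell
#!/bin/sh
# Sketch: correlation between reputation and profile views.

# Print "reputation views" pairs for users above a reputation cutoff.
# Assumption: one <row .../> per line with Reputation="N" and Views="N".
extract_pairs() {
    awk -v min="${MIN_REP:-100}" '{
        if (!match($0, / Reputation="[0-9]+"/)) next
        rep = substr($0, RSTART + 13, RLENGTH - 14)
        if (!match($0, / Views="[0-9]+"/)) next
        views = substr($0, RSTART + 8, RLENGTH - 9)
        if (rep + 0 > min + 0) print rep, views
    }' "$@"
}

# Read "x y" pairs on stdin and print the Pearson correlation coefficient.
pearson() {
    awk '
        { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
        END {
            den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
            if (n < 2 || den == 0) { print "undefined"; exit }
            printf "%.3f\n", (n * sxy - sx * sy) / den
        }'
}

# Usage (the file name is an assumption):
#   extract_pairs users.xml | pearson
```

A value near 1 would mean profile views rise almost in lockstep with reputation, while a value near 0 would back up the impression that the two are only loosely related.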