Although fixing the set of possible search results could be advantageous, since we would have a predetermined list of websites to examine, I decided that this approach was too contrived and not representative of reality. Additionally, users often refine their queries by running multiple searches, and taking this ability away from them could adversely affect their actual search behavior.
Instead, I created Zoogle, which is more or less just a wrapper for Google. The live (and ever-changing) implementation can be seen at http://stanford.edu/~nchen11/cgi-bin/cs303/index.py.
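The core trick in a wrapper like this is to rewrite each outbound result link so it passes through a local redirect script before the user leaves the page. A minimal sketch of that rewriting step is below; the endpoint path and function names are my own assumptions for illustration, not the actual Zoogle code.

```python
from urllib.parse import urlencode

# Hypothetical logging-redirect endpoint on the same domain as the
# search page; the real Zoogle layout may differ.
REDIRECT_ENDPOINT = "/cgi-bin/cs303/redirect.py"

def rewrite_link(result_url, rank):
    """Wrap a raw search-result URL in a logging redirect.

    The destination and the result's rank are carried as query
    parameters so the redirect script can record which result
    (and at what position) the user clicked.
    """
    params = urlencode({"dest": result_url, "rank": rank})
    return "%s?%s" % (REDIRECT_ENDPOINT, params)

# Example: the third result of some query
link = rewrite_link("http://example.com/page?a=1", 3)
```

The search page then emits `link` in place of the raw result URL, so every click is observable on the experimenter's own domain.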
As in the Joachims paper, users would be asked to search for answers to questions I have prepared in advance. I would probably ask users to rate each visited site as relevant or not relevant, while also rating the relevance myself.
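Collecting those judgments amounts to appending one record per (participant, question, site, rating). A minimal sketch, assuming a CSV log and hypothetical field names:

```python
import csv
import datetime

def record_rating(path, participant, question_id, url, rating):
    """Append one relevance judgment as a CSV row.

    `rating` would be "relevant" or "not relevant"; a timestamp is
    stored so judgments can later be ordered within a session.
    """
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.datetime.now().isoformat(),
             participant, question_id, url, rating]
        )
```

Keeping both the participants' ratings and my own in the same format makes the planned agreement comparison a simple join on (question, url).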
While writing the python script that generates Zoogle, I also considered other experiment possibilities, like testing how users respond to security warnings or error messages. For example, do different kinds of error messages prompt different kinds of user behavior? Or do users generally do the same thing (ignore the warning and continue to the malicious site, or heed the warning and leave) regardless of the type of message they see? This was something I threw into the code very quickly as a rough prototype, and it is not necessarily what I will run the experiment on in the end. It is, however, an interesting question, although my ability to stay focused seems to be slightly lacking. :)
As implementation took longer than I expected, I do not yet have preliminary results. Ideally, I would be able to track user behavior even when users navigate to websites outside my domain, eliminating my current need to observe participants in person on a prepared machine. If this is possible, I could test many more users online, making for a more statistically robust experiment overall.
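One standard partial answer is the logging redirect mentioned above: the click itself can be captured server-side before the browser is handed off, even though behavior on the third-party page afterward remains invisible without instrumenting the user's machine. A sketch of what such a redirect script could do, with assumed file paths and function names:

```python
import time

def log_click(dest, rank, log_path):
    """Append a timestamped record of a clicked search result."""
    with open(log_path, "a") as f:
        f.write("%f\t%s\t%s\n" % (time.time(), rank, dest))

def redirect_response(dest):
    """Minimal CGI response forwarding the browser to the real result."""
    return "Status: 302 Found\r\nLocation: %s\r\n\r\n" % dest
```

A CGI script would read `dest` and `rank` from the query string, call `log_click`, and print `redirect_response(dest)`. This covers clicks and repeat visits through the wrapper, but dwell time and navigation within off-domain sites would still require in-person observation or client-side instrumentation.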