Thursday, May 26, 2011

Take 2

Hooray!  Thanks to a speedy response from the cs303 staff, I will be running the Turk experiment again with slight modifications.  Hopefully it doesn't get flagged again, but all is not lost!

Take 2 of the experiment is identical to the original Turk experiment except for the following:
  • I will be using different questions, as some of them (e.g. "Find a website that gives a tutorial of how to type Chinese characters on a Windows computer.") now have this blog as the first Google hit!  I find this highly amusing, if a bit annoying to deal with, in its slightly Heisenberg-Uncertainty-Principle-esque nature.  So, I will hold off on posting the modified questions publicly until the experiment is over.
  • I will separate out any Turkers who participated in the first run of the experiment (using their unique Turk ID), since they are familiar with how the experiment works, which could bias their behavior.
  • Logging has been improved even further!  Each user now has their own log file (thanks to cookies) instead of everything being listed in a single giant file, which will make parsing and reading much easier.

Now to wait again for results to slowly roll in...
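The per-user logging described above could look something like this minimal sketch. Note that the function names, cookie key, and log directory here are all hypothetical illustrations, not the actual experiment code:

```python
# Sketch: per-user log files keyed by a unique cookie ID.
# All names (LOG_DIR, "user_id", log_event) are made up for illustration.
import json
import os
import time
import uuid

LOG_DIR = "logs"

def get_or_create_user_id(cookies: dict) -> str:
    """Return the user's unique ID from the cookie jar, minting one if absent."""
    if "user_id" not in cookies:
        cookies["user_id"] = uuid.uuid4().hex
    return cookies["user_id"]

def log_event(cookies: dict, event: dict) -> str:
    """Append an event to that user's own log file; return the file's path."""
    user_id = get_or_create_user_id(cookies)
    os.makedirs(LOG_DIR, exist_ok=True)
    path = os.path.join(LOG_DIR, user_id + ".log")
    record = {"ts": time.time(), **event}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return path

# Each request from the same user carries the same cookie jar,
# so all of that user's events land in one file.
cookies = {}
log_event(cookies, {"action": "visit", "url": "http://example.com"})
path = log_event(cookies, {"action": "query", "q": "chinese input tutorial"})
print(path)
```

Because the cookie persists across page loads, one file per Turker falls out for free, and parsing becomes a per-file loop instead of untangling interleaved lines.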


  1. Separating out prior test users is a great idea! I have had to do that myself for other projects! Perhaps if you could block whoever flagged your Turk project (by ID) or show them a different (non-lagging) page, it would prevent you from being flagged again! Can you log the IDs of anyone who hits your page?

  2. Flagging is anonymous, and for good reason, I think. However, it is so anonymous that I initially wasn't even sure whether my HIT was taken down due to automated flagging or user flagging. I only found out that it had to have been user flagging thanks to Scott and his Turk-knowledgeable contacts.

    Given the little information that I had, I am assuming that maybe someone maliciously flagged the HIT because it took me a few days to approve their task so that they could receive payment. That would be annoying and unfortunate, but I am approving tasks more consistently now, just in case.

    Thanks for the comments!

  3. To answer some of the questions you asked in your previous blog post:

    "How should I ask users whether or not they judged a website to be relevant? If I ask them right after they finish viewing the site, it could be too distracting and possibly affect their search behavior. However, if I ask them after they have completed the experiment, it is very likely that they have forgotten which websites they looked at, much less whether or not they considered the website to be relevant.
    Alternatively, if I ask simple enough questions, would it be a good idea to make the simplifying assumption that the last site visited is most relevant, and the previously visited sites were less or not relevant?
    Other feedback is also welcome!"

    I think you can log all their navigation, and at the end, in your survey, you can ask them questions like "You visited site x when searching for query y. How relevant was it (some set of choices)?"

    How are you planning to handle conflicts (some users will mark site x as very relevant while others will mark it as less relevant for the same query)? I think you may want to use Cohen's kappa measure instead of a simple percentage agreement measure.

    This is a really nice experiment.

  4. Yay! I'm very relieved. My fingers are crossed.

  5. Hahaha I can't believe your study got flagged; that's more than a bit unfathomable. Hopefully it goes smoothly this time! Also, it's pretty funny that Google already indexed your questions from before; that's both more than a little bit impressive and also strangely unhelpful...

    Separating out the prior participants is definitely a good idea. It'll also be interesting, since you do have a data series per turker, to see what the performance characteristics of individual turkers are, and whether they agree with the trends we had been talking about before in class (e.g. whether there are some turkers who basically always do good work, and others who are basically always worthless).
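Regarding the Cohen's kappa suggestion in comment 3: kappa corrects raw percent agreement for the agreement two raters would reach by chance. A minimal sketch, using made-up relevance labels rather than real experiment data:

```python
# Cohen's kappa for two raters: kappa = (p_o - p_e) / (1 - p_e),
# where p_o is observed agreement and p_e is chance agreement.
# The labels below are invented for illustration only.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and len(rater_a) > 0
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters independently pick
    # the same label, based on each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((counts_a[lbl] / n) * (counts_b[lbl] / n) for lbl in labels)
    if p_e == 1.0:  # degenerate case: both raters always use one label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two turkers labeling the same five sites as relevant (1) or not (0):
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]
print(round(cohens_kappa(a, b), 3))  # 0.167
```

Here the raters agree on 3 of 5 sites (60%), but since both marked 3 of 5 sites relevant, chance alone predicts 52% agreement, so kappa ends up much lower than the raw percentage suggests.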