Thursday, May 26, 2011

Take 2

Hooray!  Thanks to a speedy response from the cs303 staff, I will be running the Turk experiment again with slight modifications.  Hopefully it doesn't get flagged again, but all is not lost!

Take 2 of the experiment is identical to the original Turk experiment except for the following:
  • I will be using different questions, as some of them (e.g. "Find a website that gives a tutorial of how to type Chinese characters on a Windows computer.") now have this blog as the first Google hit!  I find this highly amusing, if also a bit annoying to deal with, given its slightly Heisenberg-Uncertainty-Principle-esque nature (observing the experiment has changed the thing being measured).  So, I will hold off on posting the modified questions publicly until the experiment is over.
  • I will separate out any Turkers who participated in the first run of the experiment (using their unique Turk ID), since they are familiar with how the experiment works, which could bias their behavior.
  • Logging has been improved even further!  Each user now has their own log file (thanks to cookies) instead of all users being lumped into a single giant file, which will make parsing and reading much easier.  (A rough sketch of the idea is below.)
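
A minimal sketch of what per-user logging keyed on a cookie might look like; this is only an illustration, not the actual Zoogle code, and the cookie name "zoogle_user" and the "logs" directory are placeholders I made up:

    import os
    import time
    from http.cookies import SimpleCookie

    LOG_DIR = "logs"  # hypothetical: one log file per user

    def current_user_id():
        # In a CGI script, the browser's cookies arrive in the HTTP_COOKIE
        # environment variable; "zoogle_user" is a made-up cookie name.
        cookies = SimpleCookie(os.environ.get("HTTP_COOKIE", ""))
        return cookies["zoogle_user"].value if "zoogle_user" in cookies else "anonymous"

    def log_event(event, detail=""):
        # Each user appends to their own log file instead of one shared file.
        os.makedirs(LOG_DIR, exist_ok=True)
        path = os.path.join(LOG_DIR, current_user_id() + ".log")
        with open(path, "a") as f:
            f.write("%f\t%s\t%s\n" % (time.time(), event, detail))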

Now to rewait for results to slowly roll in...

Monday, May 23, 2011

NO TURK FOR YOU!

Well, this is helpful:

"Your Human Intelligence Tasks 'Search for an answer to 1 simple question' have been removed from Amazon Mechanical Turk because they violated the terms of our HIT listing policy."

I . . . what?  I went to look at the policies page for the list of "prohibited activities," which include the following:
  • collecting personal identifiable information
    • definitely did not do this
  • fraud
    • hahaha
  • disrupting or degrading the operation of any website or internet service
    • uh, does making Google searches count as being disruptive??
  • direct marketing
    • nope
  • spamming
    • nope
  • unsolicited contacting of users
    • nope
  • impermissible advertising or marketing activities
    • nope
  • infringing or misappropriating the rights of any third party
    • Zoogle's name is obviously a riff on Google, but I did that deliberately, and I explicitly stated that the whole point is that people are performing searches "on a modified Google search engine."  I cited my sources!
  • posting illegal or objectionable content
    • My logs do not show any inappropriate search terms, and I don't see how else the content could be illegal or objectionable?
  • disrupting operation of the Mechanical Turk website
    • nope
  • creating a security risk for Mechanical Turk, any Mechanical Turk user, or any third party
    • nope
I don't even know whether this happened because a random Turker flagged my HIT (why would they do that? :( ), or whether Amazon has people or algorithms that regularly comb the HITs for "objectionable content" and happened to stumble upon my apparently not-so-innocent little experiment.

Since I am not entirely clear on which policy I violated, I have not modified or reposted the HIT, for obvious reasons.

Abruptly halted experiment is sad.  :(

Fortunately, I collected data from 104 Turkers before the shutdown, so maybe I will be able to glean some interesting conclusions from that anyway.  I deliberately wanted to avoid polling people like my friends, because they would be a self-selected population of young adults who are probably reasonably computer savvy, which would not make for as interesting an experimental group, I think.

Research is so exciting.

    Friday, May 20, 2011

    Initial Turk Data

    Questions used (including some from the Joachims paper)
    Open-ended (multiple correct answers possible)
    Find a website that gives a tutorial of how to type Chinese characters on a Windows computer.
    About how many calories are in an apple?
    List a movie that is coming out in the US on 07/08/11.
    List a movie that is coming out in the US on 06/03/11.

    Factual (only one correct answer)
    What is the name of the tallest mountain in New York?
    What is the elevation of the tallest mountain in New York?
    Find the homepage of Michael Jordan, the statistician.
    What is the name of the researcher who discovered the first modern antibiotic?


    Disclaimer
    The initial run on Turk revealed that my logging was not quite detailed enough, since I only noted the user's unique ID when they started and ended the task.  This created a problem when more than one user was using the website at the same time, since I was unable to tell which user was issuing which search query.  I have since fixed this and will continue collecting data so that I can create more fine-grained graphs, like the distribution of links clicked.
    Because of this, I was unable to graph very much data besides super high-level stuff that may not be that interesting.

    Data




    Average time per assignment: 2 minutes
    Average number of search queries per task: 1.86

    Comments
    • My worries about people using Google instead of Zoogle turned out to be unfounded, because it would actually take more effort to open Google than to just click on the Zoogle link that I provide.
    • One user commented that "the first few results were not helpful for the answer I was looking for," basically trying to provide helpful criticism.  However, this is actually desired behavior, since it forces the user to search more (giving me more data) before they complete the task!  :)
    • More than one user put something like "good HIT" or "good task, thank you!" in the optional comments section.  I was amused and surprised by this behavior because I feel that people tend to think of Turk as a bunch of anonymous workers who mechanically do what they are told.  Here, however, we have examples of workers trying to influence the experimenter, perhaps to encourage the experimenter to accept the HIT or even to give the worker a Worker Bonus.

      Do note that I do not regard this as a bad thing; it appears to be a way for good workers to distinguish themselves.  This could ultimately result in a mutually beneficial relationship (the worker gets more tasks that they like, and the experimenter gets higher-quality data).  Obviously, this would not hold if bad workers also left nice comments, but it would be interesting to see if there is a correlation.

    To Do
    • Graph the distribution of number of links clicked
    • Graph the distribution of number of queries made
    • Graph the distribution of time spent on each link
    • Graph the number of HITs per user
      • Compare the results of repeat users to those who only completed a single task (answering one question).
    • Think of more ambiguous questions that force users to search through multiple links
    • Run a randomized dummy user

    Saturday, May 14, 2011

    Notes from class

    Additional things to do
    • (completed) track Turk ID and/or have user enter a random code after they complete the task
    • force reload of results page for better timing and more fine-grained logging
    • (completed) run half of the users with Google's original rankings and half with randomized rankings
    • (completed) log the randomization of the results that each user sees
    • (completed) put google analytics on my site

    Thursday, May 12, 2011

    Weekly Update

    Progress made
    I have more or less completed the coding portion of my website. This week, I worked on tracking the amount of time that users spend on the websites returned in the search results, which I can't do directly because the links go to external webpages. Instead, my site logs the timestamp of each click on a result link and approximates the difference between consecutive clicks as the amount of time spent on each external site. This approximation is not exact because it includes the time that the user spends looking at the search results page, but I am making the simplifying assumption that this time remains relatively constant between different page views and is therefore negligible.
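
    As a rough illustration of that approximation (the log format below — timestamp, event, detail — is a placeholder, not necessarily what my logs actually look like), the dwell time for a result is just the gap between its click and the next logged event:

        def approximate_dwell_times(events):
            # events: chronologically sorted (timestamp, event, detail) tuples
            dwell = []
            for (t, event, detail), (t_next, _, _) in zip(events, events[1:]):
                if event == "click_result":
                    # Includes whatever time was spent back on the results page,
                    # which I am assuming is roughly constant and negligible.
                    dwell.append((detail, t_next - t))
            return dwell

        # Example with made-up timestamps:
        log = [(0.0,  "query",        "tallest mountain in New York"),
               (5.0,  "click_result", "http://example.com/peaks"),
               (42.0, "click_result", "http://example.com/mount-marcy"),
               (60.0, "end_task",     "")]
        print(approximate_dwell_times(log))
        # [('http://example.com/peaks', 37.0), ('http://example.com/mount-marcy', 18.0)]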

    I also spent a fair amount of time making sure that the logging worked on multiple browsers, since I discovered that some calls were handled differently. For example, my initial implementation would not log any data if the user was using Chrome, due to the way Chrome handles javascript. I also (re)discovered that Internet Explorer is a giant pain in the butt and does silly things like reload pages for no reason.

    Timeline
    5/15/11 – Go over the final implementation with a few people in person, so they can point out any errors or clarifications that I should make before running it live on AMT
    5/18/11 – Start preliminary tests on AMT to see what kind of data is collected, fixing bugs in my script or making AMT adjustments if necessary
    5/23/11 – Run a randomized “dummy user” who clicks on links randomly and views the page for a random period of time. Use the data from the dummy user to compare with the actual collected data, to see whether there are real patterns in the collected data or whether the behavior is basically random. (A rough sketch of such a dummy user appears below the timeline.)
    5/27/11 – Final presentation
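
    Regarding the 5/23 dummy-user entry above, here is a minimal sketch of what that baseline might look like; the number of clicks and the dwell-time range are arbitrary placeholders, not parameters from the real experiment:

        import random

        def simulate_dummy_user(num_results=10, max_clicks=4):
            # Click a random handful of results and "read" each for a random time.
            clicks = []
            for _ in range(random.randint(1, max_clicks)):
                rank = random.randrange(num_results)   # which result was clicked
                dwell = random.uniform(5, 120)         # seconds "spent" on that page
                clicks.append((rank, dwell))
            return clicks

        random.seed(303)
        for task in range(3):
            print("dummy task", task, simulate_dummy_user())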

    Questions
    1. Right now, my website randomizes the order of the top ten Google results, but I am concerned that this might introduce unnecessary noise into the data. The Joachims paper only used three orderings: original Google order, reversed order, and original Google order with the first and second results swapped. I originally decided to randomize the order of the results in case the time spent on each website was affected by the ordering of the websites, but complete randomization may be too much. Should I randomize all of the results? Leave them in the order that Google provides? Or do half of both?
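
    For concreteness, here is roughly what the candidate ordering conditions look like, where "results" is Google's top ten in Google's original order; the function names are my own shorthand, not anything from the Joachims paper:

        import random

        def reversed_order(results):
            return list(reversed(results))

        def swap_top_two(results):
            swapped = list(results)
            swapped[0], swapped[1] = swapped[1], swapped[0]
            return swapped

        def fully_random(results):
            shuffled = list(results)
            random.shuffle(shuffled)
            return shuffled

        # "Half of both": flip a coin per user between Google's order and a shuffle.
        def assign_condition(results):
            return list(results) if random.random() < 0.5 else fully_random(results)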

    Friday, May 6, 2011

    Implementation and initial rollout

    Zoogle, written in Python, emulates the Google homepage by prompting the user to enter a search query.  The experiment begins by giving the user an overview of the project as "seeing how modifications to the current Google search affect the quality of the search experience."  For full disclosure, they will also be told that their behavior on the site will be logged.

    When the user first opens the homepage, their session ID and the current time are logged in a file.  Let's say that the user enters "Stanford University" as their search.


    The user is then redirected to the results page, which (in the current implementation) is a randomized ordering of the first ten Google search results for the query.


    The user's query is noted in the logs.
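
    As a sketch of the flow described above (how Zoogle actually pulls Google's results is not shown here, so fetch_google_results is a placeholder, and the log format is only illustrative):

        import random
        import time

        def fetch_google_results(query):
            # Placeholder: return Google's top ten (title, url) pairs for the query.
            raise NotImplementedError

        def build_results_page(session_id, query, log):
            # log: an open, appendable file object for this session
            log.write("%f\t%s\tquery\t%s\n" % (time.time(), session_id, query))
            results = list(fetch_google_results(query))[:10]
            random.shuffle(results)  # current implementation: fully randomized ordering
            log.write("%f\t%s\tordering\t%s\n"
                      % (time.time(), session_id, ",".join(url for _, url in results)))
            return results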

    After the user completes their search task, they will be asked to complete a survey.  One question I am particularly interested in is how much time users think they personally spend on relevant websites vs. irrelevant ones.  I will then compare their perceived visit times with the actual visit times, to see how accurate (or inaccurate) users are at judging their own search behavior.
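
    One simple way to score that comparison (the numbers below are made up, and the actual survey and log formats may differ):

        def perception_errors(pairs):
            # pairs: (perceived_seconds, logged_seconds) per user;
            # a positive error means the user overestimated their time.
            errors = [perceived - logged for perceived, logged in pairs]
            mean_error = sum(errors) / len(errors)
            mean_abs_error = sum(abs(e) for e in errors) / len(errors)
            return mean_error, mean_abs_error

        print(perception_errors([(90, 47.0), (30, 62.0), (120, 118.0)]))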

    In order to deal with the problem of not being able to track the user's behavior on external websites, I developed a workaround: logging the time at which the user clicks on a link on the results page and the time at which they return to the Zoogle results.  This way, I cannot track detailed navigation within the external website (such as how many internal pages the user looked at), but I can estimate the total amount of time spent on it.  I am thus modifying my hypothesis to use the time spent on websites as the measure of implicit feedback, instead of the original plan of looking at the bounce rate.  I think this behavior can also be revealing and possibly even correlated with the bounce rate, even though the bounce rate is no longer the goal of the experiment.

    Testing
    I will be running the experiments on Amazon's Mechanical Turk, where I hope to collect at least 105 data points for a 95% confidence interval with a Cohen's d of 0.5.  Collecting enough data by running the experiments manually would otherwise have been a near-impossible task.

    Questions
    1. How should I ask users whether or not they judged a website to be relevant?  If I ask them right after they finish viewing the site, it could be too distracting and possibly affect their search behavior.  However, if I ask them after they have completed the experiment, it is very likely that they will have forgotten which websites they looked at, much less whether they considered each one relevant.
    2. Alternatively, if I ask simple enough questions, would it be a good idea to make the simplifying assumption that the last site visited is the most relevant, and that the previously visited sites were less relevant or not relevant?
    3. Other feedback is also welcome!

    Friday, April 29, 2011

    Creating the initial experiment

    Although fixing the set of possible search results could be advantageous, because we would have a predetermined list of websites to examine, I decided that this approach was too contrived and not representative of reality.  Additionally, users often refine their queries by doing multiple searches, and taking this ability away from them could adversely affect their actual search behavior.

    Instead, I created Zoogle, which is more or less just a wrapper for Google.  The live (and ever-changing) implementation can be seen at http://stanford.edu/~nchen11/cgi-bin/cs303/index.py.

    As in the Joachims paper, users would be asked to search for answers to questions I have prepared in advance.  I would probably ask the users to rate whether the visited sites were relevant or not, as well as rating the relevance myself.

    While writing the Python script that generates Zoogle, I also considered other experiment possibilities, like testing how users respond to security warnings or error messages.  For example, do different kinds of error messages prompt different kinds of user behavior?  Or do users generally do the same thing (ignore the warning and continue to the malicious site, or heed the warning and leave) regardless of the type of message they see?  This was something that I threw into the code very quickly as a rough prototype, and it is not necessarily what I will run the experiment on in the end.  It is, however, an interesting question, although my ability to stay focused seems to be slightly lacking.  :)

    As implementation took longer than I expected, I do not yet have preliminary results.  Ideally, I would be able to track user behavior even when they navigate to different websites that are not under my domain, thus negating my current need to observe users in person, on a prepared machine.  If this is possible, I would be able to test many more users online, making for a more statistically robust experiment overall.