Thursday, May 26, 2011

Take 2

Hooray!  Thanks to a speedy response from the cs303 staff, I will be running the Turk experiment again with slight modifications.  Hopefully it doesn't get flagged again, but all is not lost!

Take 2 of the experiment is identical to the original Turk experiment except for the following:
  • I will be using different questions, as some of them (e.g. "Find a website that gives a tutorial of how to type Chinese characters on a Windows computer.") now have this blog as the first Google hit!  I find this highly amusing, if a bit annoying to deal with, given its slightly Heisenberg-Uncertainty-Principle-esque nature.  So, I will hold off on posting the modified questions publicly until the experiment is over.
  • I will separate out any Turkers who participated in the first run of the experiment (using their unique Turk ID), since they are familiar with how the experiment works, which could bias their behavior.
  • Logging has been improved even further!  Each user now has their own log file (thanks to cookies) instead of being listed in a single giant file, which will make parsing and reading much easier.

Now to wait once more for results to slowly roll in...

Monday, May 23, 2011


Well, this is helpful:

"Your Human Intelligence Tasks 'Search for an answer to 1 simple question' have been removed from Amazon Mechanical Turk because they violated the terms of our HIT listing policy."

I . . . what?  I went to look at the policies page for the list of "prohibited activities," which include the following:
  • collecting personal identifiable information
    • definitely did not do this
  • fraud
    • hahaha
  • disrupting or degrading the operation of any website or internet service
    • uh, does making Google searches count as being disruptive??
  • direct marketing
    • nope
  • spamming
    • nope
  • unsolicited contacting of users
    • nope
  • impermissible advertising or marketing activities
    • nope
  • infringing or misappropriating the rights of any third party
    • Although Zoogle's name obviously plays on Google's, I chose it deliberately, and I explicitly stated that the whole point is that people were performing searches "on a modified Google search engine."  I cited my sources!
  • posting illegal or objectionable content
    • My logs do not show any inappropriate search terms, and I don't see how else the content could be illegal or objectionable?
  • disrupting operation of the Mechanical Turk website
    • nope
  • creating a security risk for Mechanical Turk, any Mechanical Turk user, or any third party
    • nope
I don't even know whether this happened because a random Turker flagged my HIT (why would they do that? :( ), or whether Amazon has people/algorithms that regularly comb the HITs for "objectionable content" and happened to stumble upon my apparently not-so-innocent little experiment.

Since I am not even entirely clear what policy I violated, I have not modified/reposted the HIT for obvious reasons.

Abruptly halted experiment is sad.  :(

Fortunately, I collected data from 104 Turkers before the shutdown, so maybe I will be able to glean some interesting conclusions from that anyway.  I deliberately wanted to avoid polling people like my friends, because they would be a self-selected population of young adults who are probably reasonably computer savvy, which would not make for as interesting an experimental group, I think.

Research is so exciting.

    Friday, May 20, 2011

    Initial Turk Data

    Questions used (including some from the Joachims paper)
    Open-ended (multiple correct answers possible)
    Find a website that gives a tutorial of how to type Chinese characters on a Windows computer.
    About how many calories are in an apple?
    List a movie that is coming out in the US on 07/08/11.
    List a movie that is coming out in the US on 06/03/11.

    Factual (only one correct answer)
    What is the name of the tallest mountain in New York?
    What is the elevation of the tallest mountain in New York?
    Find the homepage of Michael Jordan, the statistician.
    What is the name of the researcher who discovered the first modern antibiotic?

    The initial run on Turk revealed that my logging was not quite detailed enough, since I only noted the user's unique ID when they started and ended the task.  This created a problem when more than one user was using the website at the same time, since I would be unable to tell which user was issuing which search query.  I have since fixed this and will continue to collect more data so that I can actually create more fine-grained graphs, like graphing the distribution of links clicked.
    Because of this, I was unable to graph very much data besides super high-level stuff that may not be that interesting.


    Average time per assignment: 2 minutes
    Average number of search queries per task: 1.86

    • My worries about people using Google instead of Zoogle turned out to be unfounded, because it would actually be more effort to use Google, instead of just clicking on the Zoogle link that I provide. 
    • One user commented that "the first few results were not helpful for the answer I was looking for," basically trying to provide helpful criticism.  However, this is actually desired behavior, since it forces the user to search more (giving me more data) before they complete the task!  :)
    • More than one user would put something like "good HIT" or "good task, thank you!" in the optional comments section.  I was amused and surprised by this behavior because I feel that people tend to think of Turk as a bunch of random workers who mechanically do what they are told.  However, here we have examples of users subsequently trying to influence the experimenters, perhaps either to encourage the experimenter to accept the HIT or to even give the worker a Worker Bonus.

      Do note that I do not regard this as a bad thing, as it appears to be a way for good workers to attempt to distinguish themselves.  This could ultimately result in a mutually beneficial relationship (the worker gets more tasks that they like, and the experimenter gets higher quality data).  Obviously, this would not be the case if bad workers also put nice comments, but it would be interesting to see if there is a correlation.

    To Do
    • Graph the distribution of number of links clicked
    • Graph the distribution of number of queries made
    • Graph the distribution of time spent on each link
    • Graph the number of HITs per user
      • Compare the results of repeat users to those who only completed a single task (answering one question).
    • Think of more ambiguous questions that force users to search through multiple links
    • Run a randomized dummy user

    Saturday, May 14, 2011

    Notes from class

    Additional things to do
    • (completed) track turk ID and/or have user enter a random code after they complete the task
    • force reload of results page for better timing and more fine-grain logging
    • (completed) run half of the users with Google's original rankings and half with randomized rankings
    • (completed) log the randomization of the results that each user sees
    • (completed) put google analytics on my site

    Thursday, May 12, 2011

    Weekly Update

    Progress made
    I have more or less completed the coding portion of my website. This week, I had been working on how to track the amount of time that users spend on the websites returned in the search results, which I can't do directly because the links go to external webpages. Instead, my site logs the timestamp of when a user clicks on each result link and approximates the difference to be the amount of time spent on each external site. This approximation is not exact because it includes the time that the user spends looking at the search results page, but I am making the simplifying assumption that this time remains relatively constant between different page views and is therefore negligible.
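    As a concrete sketch of this approximation, here is how dwell times could be computed from a chronological click log.  This is toy code for illustration only; the log structure, field names, and URLs are my own assumptions, not the actual implementation:

```python
from datetime import datetime

def dwell_times(click_log):
    """Given a chronological list of (timestamp, url) click events for one
    user, approximate the time spent on each external site as the gap
    until the next click.  The final click has no follow-up event, so it
    is necessarily skipped."""
    times = {}
    for (t0, url), (t1, _) in zip(click_log, click_log[1:]):
        times[url] = times.get(url, 0.0) + (t1 - t0).total_seconds()
    return times

# Hypothetical click log for one user
log = [
    (datetime(2011, 5, 12, 10, 0, 0), "http://example.com/a"),
    (datetime(2011, 5, 12, 10, 0, 45), "http://example.com/b"),
    (datetime(2011, 5, 12, 10, 2, 0), "http://example.com/a"),
]
print(dwell_times(log))  # {'http://example.com/a': 45.0, 'http://example.com/b': 75.0}
```

    Note that, as described above, each estimate silently includes the time spent back on the results page before the next click.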

    I also spent a fair amount of time making sure that the logging worked on multiple browsers, since I discovered that some calls were handled differently. For example, my initial implementation would not log any data if the user was using Chrome, due to the way Chrome handles javascript. I also (re)discovered that Internet Explorer is a giant pain in the butt and does silly things like reload pages for no reason.

    5/15/11 – Go over the final implementation with a few people in person, so they can point out any errors or clarifications that I should make before running it live on AMT
    5/18/11 – Start preliminary tests on AMT to see what kind of data is collected, fixing bugs in my script or making AMT adjustments if necessary
    5/23/11 – Run a randomized “dummy user” who clicks on links randomly and views the page for a random period of time. Use the data from the dummy user to compare with actual collected data, to see if there are actual patterns in collected data or whether the behavior is basically random.
    5/27/11 – Final presentation
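    The randomized "dummy user" planned for 5/23 could be simulated with something as simple as the sketch below; the click-count and viewing-time ranges are arbitrary assumptions, not values I have settled on:

```python
import random

def dummy_user(num_results=10, max_clicks=5, max_view_secs=120, seed=None):
    """Simulate a dummy user who clicks result links at random and views
    each page for a random period of time.  Returns a list of
    (result_rank, seconds_viewed) pairs.  All parameter ranges are
    placeholder assumptions."""
    rng = random.Random(seed)
    num_clicks = rng.randint(1, max_clicks)
    return [(rng.randrange(num_results), rng.uniform(1, max_view_secs))
            for _ in range(num_clicks)]

# One simulated session, seeded so the baseline run is reproducible
session = dummy_user(seed=42)
```

    Comparing many such sessions against the real Turker data should show whether the collected behavior differs from random clicking.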

    1. Right now, my website randomizes the order of the top ten Google results, but I am concerned that this might introduce unnecessary noise into the data. The Joachims paper only used three orderings: original Google order, reversed order, and original Google order with the first and second results swapped. I originally decided to randomize the order of the results in case the time spent on each website was affected by the ordering of the websites, but complete randomization may be too much. Should I randomize all of the results? Leave them in the order that Google provides? Or do half of both?
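    For reference, the candidate orderings could be expressed as something like the following sketch (the function and condition names are mine, not actual Zoogle code):

```python
import random

def reorder(results, mode="original", rng=random):
    """Return a reordered copy of the top-ten result list.  'mode' names
    the experimental condition: Google's original order, the two
    Joachims-style variants, or the full randomization I currently use."""
    r = list(results)
    if mode == "reversed":
        r.reverse()
    elif mode == "swap_top_two" and len(r) >= 2:
        r[0], r[1] = r[1], r[0]
    elif mode == "random":
        rng.shuffle(r)
    return r
```

    Running half of the users under "original" and half under "random" (the third option in the question above) would keep full randomization while providing a baseline for comparison.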

    Friday, May 6, 2011

    Implementation and initial rollout

    Zoogle, written in Python, starts by prompting the user to enter a search query, emulating the Google homepage.  The experiment begins by giving the user an overview of the project as "seeing how modifications to the current Google search affect the quality of the search experience."  For full disclosure, they will be told that their behaviors on the site will be logged.

    When the user first opens the homepage, their session ID and the current time are logged in a file.  Let's say that the user enters "Stanford University" as their search.

    The user is then redirected to the results page, which is (at the current implementation) a randomized ordering of the first ten Google search results for the query.

    The user's query is noted in the logs.
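    The logging described above might look like the following minimal sketch; the tab-separated format, filename, and event names are my own assumptions rather than the real Zoogle log format:

```python
import time
import uuid

def log_event(logfile, session_id, event, detail=""):
    """Append one timestamped, tab-separated event line for a session.
    The event vocabulary ('open', 'query', ...) is a placeholder."""
    with open(logfile, "a") as f:
        f.write("%f\t%s\t%s\t%s\n" % (time.time(), session_id, event, detail))

# Hypothetical session: the user opens the homepage, then searches.
session = uuid.uuid4().hex  # stand-in for the real session ID
log_event("zoogle.log", session, "open")
log_event("zoogle.log", session, "query", "Stanford University")
```

    Keeping every line stamped with the session ID is what makes it possible to untangle concurrent users later.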

    After the user completes their search task, they will be asked to complete a survey.  One question I am interested in seeing the answer to is the amount of time users think they personally spend on relevant websites vs. irrelevant websites.  I will then compare their perceived visit times with the actual visit times, to see how accurate (or not accurate) users are at judging their own search behavior.

    In order to deal with the problem of not being able to track the user's behavior on external websites, I developed a workaround: logging the time at which the user clicks on a link on the results page and the time at which they return to the Zoogle results.  In this way, I cannot track detailed navigation within the website (such as how many internal pages the user looked at), but I can estimate the total amount of time spent on the website.  I am thus modifying my hypothesis to use the time spent on websites as the measure of implicit feedback, instead of the original plan of looking at the bounce rate.  I think that this behavior can also be revealing and possibly even correlated with the bounce rate, even though bounce rate is no longer the goal of the experiment.

    I will be running experiments on Amazon's Mechanical Turk, where I hope to collect at least 105 datapoints for a 95% confidence interval with a Cohen's d of 0.5.  Collecting that much data by running experiments manually would have been a near-impossible task.

    1. How should I ask users whether or not they judged a website to be relevant?  If I ask them right after they finish viewing the site, it could be too distracting and possibly affect their search behavior.  However, if I ask them after they have completed the experiment, it is very likely that they will have forgotten which websites they looked at, much less whether or not they considered each website to be relevant.
    2. Alternatively, if I ask simple enough questions, would it be a good idea to make the simplifying assumption that the last site visited is most relevant, and the previously visited sites were less relevant or not relevant?
    3. Other feedback is also welcome!

    Friday, April 29, 2011

    Creating the initial experiment

    Although fixing the set of possible search results could be advantageous because we would have a predetermined website list to examine, I decided that this approach was too contrived and not representative of reality.  Additionally, users often refine their queries by doing multiple searches, and taking this ability away from them could adversely affect their actual search behavior.

    Instead, I created Zoogle, which is more or less just a wrapper for Google.  The live (and ever-changing) implementation can be seen at

    Like in the Joachims paper, users would be asked to search for answers to questions I have prepared in advance.  I would probably ask the users to rate whether the visited sites were relevant or not relevant, as well as rating the relevance myself.

    While writing the Python script that generates Zoogle, I also considered other experiment possibilities, like testing how users respond to security warnings or error messages.  For example, do different kinds of error messages prompt different kinds of user behavior?  Or do users generally do the same thing (ignore the warning and continue to the malicious site, or heed the warning and leave) regardless of the type of message they see?  This was something that I threw into the code very quickly as a rough prototype and is not necessarily what I will run the experiment on in the end.  It is, however, an interesting question, although my ability to stay focused seems to be slightly lacking.  :)

    As implementation took longer than I expected, I do not yet have preliminary results.  Ideally, I would be able to track user behavior even when they navigate to different websites that are not under my domain, thus negating my current need to observe users in person, on a prepared machine.  If this is possible, I would be able to test many more users online, making for a more statistically robust experiment overall.

    Friday, April 22, 2011

    Background info and experimental methodology

    So, after reading the Joachims et al. paper "Accurately Interpreting Clickthrough Data as Implicit Feedback," I have reconsidered my original idea and have instead been motivated to pursue a new experiment: analyzing the predictive qualities of website navigation and bounce rates as a quality indicator.  A website's "bounce rate" refers to the percentage of visitors who view only one page of the site before leaving, out of the total number of unique visitors to the site.
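    As a quick sketch of that definition (my own toy code, not part of the experiment):

```python
def bounce_rate(page_views_per_visit):
    """page_views_per_visit: one page-view count per unique visit.
    Bounce rate = fraction of visits that viewed exactly one page."""
    visits = list(page_views_per_visit)
    if not visits:
        return 0.0
    return sum(1 for n in visits if n == 1) / len(visits)

# Four visits, two of which bounced after a single page
print(bounce_rate([1, 3, 1, 2]))  # 0.5
```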

    I found the Joachims paper to be enlightening and novel, and I became interested in whether other aspects of user navigation, such as website bounce rates, can also be used as implicit feedback.  I hypothesize that "bad" or irrelevant websites will have a bounce rate above a certain threshold (e.g. users will leave the site within X seconds), and that users stay on more relevant websites for a longer period of time, because they find the information given to be useful.

    Alternative outcomes may also be possible, which is why this would be a worthwhile experiment to pursue.  For example, if a user is looking for something that can be explained or answered within one or two sentences (such as, "Who was the 23rd president of the United States?"), he may be able to find what he is looking for rather quickly on a candidate website.  Since his goal was fulfilled, he has no reason to stay on the website any longer and thus has a high bounce rate for said website, even though it was a high quality result.  Conversely, users may be inclined to spend more time exploring lower quality websites if they believe that they can find their goal within that website, leading to lower bounce rates for "bad" websites.

    Experimental Method
    To test my hypothesis, I would emulate the Joachims paper by asking study participants to look for specific information using a standard web search engine, like Google.  Participants would do all of their searching on a prepared computer that records their clicks and navigation behavior, which can be reviewed later for further analysis.

    In order to objectively determine whether various websites are useful, I would either ask a panel of experts or Mechanical Turk users to rate the relevancy of each website compared to the initial search query.  If possible, I would use both sources for cross-validation, in case there are any biases within the groups.

    Users in the study group would be asked to think aloud while they are executing their searches and to comment on whether or not the websites they visited were useful.  Each website will then be plotted against the average amount of time that all the users spent on it, with separate graphs for "useful" and "nonuseful" websites.
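    The averaging step could look something like this toy sketch, where the observations and relevance judgments are entirely hypothetical:

```python
def average_times(visits):
    """visits: list of (url, seconds, is_useful) observations pooled
    across all users.  Returns two dicts mapping url -> average seconds,
    one for sites judged useful and one for the rest."""
    sums = {}
    for url, secs, useful in visits:
        total, n = sums.get((url, useful), (0.0, 0))
        sums[(url, useful)] = (total + secs, n + 1)
    useful_avg = {u: t / n for (u, flag), (t, n) in sums.items() if flag}
    other_avg = {u: t / n for (u, flag), (t, n) in sums.items() if not flag}
    return useful_avg, other_avg

# Made-up observations from three page visits
visits = [("site-a", 10, True), ("site-a", 20, True), ("site-b", 5, False)]
useful, other = average_times(visits)
```

    The two dicts would then feed directly into the "useful" and "nonuseful" graphs described above.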

    In the interest of keeping the selection of websites as uniform as possible, I may attempt to constrain the results somehow so that search results are limited to a preselected list of sites.  While this has the advantage of providing a fixed set that we could observe multiple users on, there is the potential problem that this arbitrary constraint would affect the user's behavior in an undesirable way.  I could potentially separate the users into two groups, where one group gets a fixed set of website results to view, and the other group is given free rein to explore whatever sites they wish.  If the behaviors between the two groups are drastically different, then I would know that artificially constraining the search results is a bad idea.  If so, it may be easier to just treat all good websites as one group, and all bad websites as another group, even though the actual content of the sites may differ widely.

    I was concerned that my original experiment idea of observing the effect of rewards on user behavior in citizen science apps was little more than a reskinned version of the basic "How do external rewards affect intrinsic motivation" studies that are popular in psychology.  As prior studies have shown, users tend to be less happy or motivated to produce quality output when they are motivated by an external reward (like getting a high score in a game) as opposed to an internal one (they like participating in the app because it contributes to society).  Thus, the true question would not be whether or not external rewards affect motivation (since they have been shown to do so), but rather, how could we structure the rewards system so that it discourages undesired behavior/output?  I could not think of a way to determine this with experimental data, and subsequently became distracted with a new idea after reading the Joachims paper.  If it turns out that my new idea is not viable, perhaps I will need to reflect on the original idea more...

    Friday, April 15, 2011

    Experiment Outline

    With the current trend of app development being focused on the social aspect, it is not surprising that crowdsourcing is a popular topic.  Thus, I am interested in determining what kinds of motivation drive people to participate in crowdsourcing activities and how these motivations may be used to encourage either more data being generated, higher quality data, or ideally, both!

    This experiment is inspired by the citizen-science iPhone app CreekWatch.  CreekWatch asks users to submit information and pictures of local water sources, so that the city has a cost-effective way of knowing the status of various waterways and which areas may need more work or support.  CreekWatch has a fairly large user base (I forget the exact number), which is fascinating because the app provides no external rewards but is driven solely by the internal motivations that users create for themselves.  This in itself is a fairly desirable state, because numerous psychological studies have shown that people are happier and more motivated when their reward is internal (such as the feeling that they are contributing to society) vs. external (they are receiving payment to participate).

    The question, then, is: can we (and how do we) encourage the collection of more data without sacrificing data quality?  Introducing a game-like system might encourage more new users to participate and/or existing users to participate more frequently, which would be nice.  However, if we cannot ensure the quality of this sudden influx of new data, then adding such elements would actually be detrimental.  A basic experiment built on this idea could compare the data collected on a standard (control) version of a crowdsourced app with that of a modified version of the app.  Different types of modification can also be tested, to fully explore the effectiveness of various alternatives.