Friday, April 29, 2011

Creating the initial experiment

Although fixing the set of possible search results could be advantageous because we would have a predetermined list of websites to examine, I decided that this approach was too contrived and not representative of reality.  Additionally, users often refine their queries by doing multiple searches, and taking this ability away from them could adversely affect their actual search behavior.

Instead, I created Zoogle, which is more or less just a wrapper for Google.  The live (and ever-changing) implementation can be seen at
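The Zoogle code itself isn't included in this post, but the core trick — serving search results whose links pass through a logging redirect so that every click can be recorded — can be sketched in a few lines.  The `/visit` endpoint and parameter name below are my own placeholders, not necessarily what Zoogle actually uses:

```python
import re
from urllib.parse import quote

def rewrite_links(html, redirect_base="/visit?url="):
    """Rewrite every href so that clicks pass through a logging redirect.

    A hypothetical /visit handler would log the click (user, URL, timestamp)
    and then forward the browser to the real destination.
    """
    def repl(match):
        original = match.group(1)
        return 'href="%s%s"' % (redirect_base, quote(original, safe=""))

    # Naive regex matching is fine for a sketch; real HTML should be parsed.
    return re.sub(r'href="([^"]+)"', repl, html)

page = '<a href="http://example.com/result">A result</a>'
print(rewrite_links(page))
# -> <a href="/visit?url=http%3A%2F%2Fexample.com%2Fresult">A result</a>
```

Because the rewritten links all point back at my own domain, clicks get logged even though the destination sites are not under my control — which is exactly the part that normal server logs can't capture.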

As in the Joachims paper, users would be asked to search for answers to questions I have prepared in advance.  I would probably ask the users to rate the visited sites as relevant or not relevant, and I would rate their relevance myself as well.

While writing the Python script that generates Zoogle, I also considered other experiment possibilities, like testing how users respond to security warnings or error messages.  For example, do different kinds of error messages prompt different kinds of user behavior?  Or do users generally do the same thing (ignoring the warning and continuing to the malicious site, or heeding the warning and leaving) regardless of the type of message they see?  This was something that I threw into the code very quickly as a rough prototype and is not necessarily what I will run the experiment on in the end.  It is, however, an interesting question, although my ability to stay focused seems to be slightly lacking.  :)

As implementation took longer than I expected, I do not yet have preliminary results.  Ideally, I would be able to track user behavior even when they navigate to different websites that are not under my domain, thus negating my current need to observe users in person, on a prepared machine.  If this is possible, I would be able to test many more users online, making for a more statistically robust experiment overall.

Friday, April 22, 2011

Background info and experimental methodology

So, after reading the Joachims et al. paper, "Accurately Interpreting Clickthrough Data as Implicit Feedback," I have reconsidered my original idea and have instead been motivated to pursue a new experiment: analyzing the predictive qualities of website navigation and bounce rates as a quality indicator.  A website's "bounce rate" is the percentage of visitors who view only one page of the site before leaving, out of the total number of unique visitors to the site.
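Given that definition, bounce rate is straightforward to compute from a visit log.  A minimal sketch, where the (visitor, page) record format is my own assumption about how the logged data might look:

```python
from collections import defaultdict

def bounce_rate(visits):
    """Compute bounce rate from an iterable of (visitor_id, page) records.

    Bounce rate = visitors with exactly one page view / total unique visitors.
    """
    views = defaultdict(int)
    for visitor, _page in visits:
        views[visitor] += 1
    if not views:
        return 0.0
    bounced = sum(1 for n in views.values() if n == 1)
    return bounced / len(views)

log = [("u1", "/home"), ("u1", "/about"), ("u2", "/home"), ("u3", "/home")]
print(bounce_rate(log))  # 2 of 3 visitors viewed only one page
```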

I found the Joachims paper to be enlightening and novel, and I became interested in whether other aspects of user navigation, such as website bounce rates, can also be used as implicit feedback.  I hypothesize that "bad" or irrelevant websites will have high bounce rates and short dwell times (e.g. users will not stay on the site for more than X seconds), and that users stay on more relevant websites for a longer period of time, because they find the information given to be useful.

Alternative outcomes may also be possible, which is why this would be a worthwhile experiment to pursue.  For example, if a user is looking for something that can be explained or answered within one or two sentences (such as, "Who was the 23rd president of the United States?"), he may be able to find what he is looking for rather quickly on a candidate website.  Since his goal was fulfilled, he has no reason to stay on the website any longer, so the site ends up with a high bounce rate even though it was a high quality result.  Conversely, users may be inclined to spend more time exploring lower quality websites if they believe that they can find their goal within that website, leading to lower bounce rates for "bad" websites.

Experimental Method
To test my hypothesis, I would emulate the Joachims paper by asking study participants to look for specific information using a standard web search engine, like Google.  Participants would do all of their searching on a prepared computer that records their clicks and navigation behavior, which can be reviewed later for further analysis.

In order to objectively determine whether various websites are useful, I would ask either a panel of experts or Mechanical Turk users to rate the relevance of each website relative to the initial search query.  If possible, I would use both sources for cross-validation, in case there are any biases within the groups.

Users in the study group would be asked to think aloud while executing their searches and to comment on whether the websites they visited were useful.  Each website will then be plotted against the average amount of time that all the users spent on that website, with separate graphs for "useful" and "nonuseful" websites.
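That per-site averaging step is simple enough to sketch.  The record format and the "useful"/"nonuseful" labels below are my own assumptions about how the logged data and ratings might be stored:

```python
from collections import defaultdict

def average_dwell_times(records):
    """records: iterable of (site, seconds_spent) tuples, one per visit."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for site, seconds in records:
        totals[site] += seconds
        counts[site] += 1
    return {site: totals[site] / counts[site] for site in totals}

def split_by_rating(averages, ratings):
    """Partition per-site averages by an externally assigned usefulness label."""
    groups = {"useful": {}, "nonuseful": {}}
    for site, avg in averages.items():
        groups[ratings[site]][site] = avg
    return groups

records = [("a.com", 30), ("a.com", 90), ("b.com", 5)]   # made-up data
ratings = {"a.com": "useful", "b.com": "nonuseful"}
print(split_by_rating(average_dwell_times(records), ratings))
# -> {'useful': {'a.com': 60.0}, 'nonuseful': {'b.com': 5.0}}
```

The two dictionaries returned by `split_by_rating` correspond directly to the two planned graphs.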

In the interest of keeping the selection of websites as uniform as possible, I may attempt to constrain the results somehow so that search results are limited to a preselected list of sites.  While this has the advantage of providing a fixed set that we could observe multiple users on, there is the potential problem that this arbitrary constraint would affect the users' behavior in an undesirable way.  I could potentially separate the users into two groups, where one group gets a fixed set of website results to view, and the other group is given free rein to explore whatever sites they wish.  If the behaviors between the two groups are drastically different, then I would know that artificially constraining the search results is a bad idea.  If so, it may be easier to just treat all good websites as one group, and all bad websites as another group, even though the actual content of the sites may differ widely.

I was concerned that my original experiment idea of observing the effect of rewards on user behavior in citizen science apps was none other than a reskinned version of the basic "How do external rewards affect intrinsic motivation" studies that are popular in psychology.  As prior studies have shown, users tend to be less happy or motivated to produce quality output when they are motivated by an external reward (like getting a high score in a game) as opposed to an internal one (they like participating in the app because it contributes to society).  Thus, the true question would not be whether or not external rewards affect motivation (since they have been shown to do so), but rather, how could we structure the rewards system so that it discourages undesired behavior or low-quality output?  I could not think of a way to determine this with experimental data, and subsequently became distracted with a new idea after reading the Joachims paper.  If it turns out that my new idea is not viable, perhaps I will need to reflect on the original idea more...

Friday, April 15, 2011

Experiment Outline

With the current trend of app development being focused on the social aspect, it is not surprising that crowdsourcing is a popular topic.  Thus, I am interested in determining what kinds of motivation drive people to participate in crowdsourcing activities and how these motivations may be used to encourage either more data being generated, higher quality data, or ideally, both!

This experiment is inspired by the citizen-science iPhone app CreekWatch.  CreekWatch asks users to submit information and pictures of local water sources, so that the city has a cost-effective way of knowing the status of various waterways and which areas may need more work or support.  CreekWatch has a fairly large user base (I forget the exact number), which is fascinating because the app provides no external rewards but is driven solely by the internal motivations that users create for themselves.  This in itself is a fairly desirable state, because numerous psychological studies have shown that people are happier and more motivated when their reward is internal (such as the feeling that they are contributing to society) vs. external (they are receiving payment to participate).

The question, then, is whether (and how) we can encourage the collection of more data without sacrificing data quality.  An introduction of a game-like system might encourage more new users to participate and/or existing users to participate more frequently, which would be nice.  However, if we cannot ensure the quality of this sudden influx of new data, then adding such elements would actually be detrimental.  A basic experiment could compare the data collected on a standard (control) version of a crowdsourced app with that of a modified version of the app.  Different types of modification can also be tested, to fully explore the effectiveness of various alternatives.
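The control-vs-modified comparison could start as nothing fancier than per-group summary statistics on submission quality.  The rating scale and all of the numbers below are invented purely for illustration:

```python
def summarize(quality_scores):
    """Mean and count for one group's data-quality ratings (scale assumed 1-5)."""
    return sum(quality_scores) / len(quality_scores), len(quality_scores)

# Hypothetical expert ratings of submissions from each app version.
control = [4, 5, 3, 4, 4]   # standard app
variant = [3, 4, 2, 3, 3]   # gamified app

c_mean, c_n = summarize(control)
v_mean, v_n = summarize(variant)
print(c_mean - v_mean)  # positive -> control produced higher-rated data
```

With real data, a proper significance test would be needed before drawing any conclusion, but this is the basic shape of the comparison.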