# AB Testing and Eyetracking

For this project, I worked with a group designing two different versions of a website. We were given a template (below) for a website providing different options for transportation in Memphis, Tennessee, and expanded that template into two different sites. The task was to determine which version of the site a transportation company should use.

For each new version, we made alignment, formatting, and aesthetic changes, and added information about each transportation option. Below is a list of the changes we made for each version.

## Version B

- Larger bold titles
- Pictures on the left
- Buttons under text
- Change button color to yellow
- Bold the names of companies
- Make photos a consistent size
- Add titles for each company
- Add price estimates
- Add ratings
- Add longer descriptions for each company

## Version A

- Center all items
- Text under photos
- Buttons under text
- Change background color to yellow
- Change button color to blue
- Bold the names of companies
- Increase font size
- Make photos a consistent size
- Add titles for each company
- Add price estimates
- Add ratings
- Add three-word descriptions

## Design Justifications for Version A

The objective of Version A is to increase its effectiveness and efficiency. To do this, we center-aligned all of the information to prevent cognitive overload; a user's eyes only have to travel down the page to get the relevant information. Additionally, instead of a paragraph for each company, we used three-word summaries that include only the most necessary information: the company name, a price estimate, and a rating (out of 5 stars). We hypothesize that these changes, combined with the more aesthetically pleasing site, will reduce the effort required of users, since they no longer need to scan from the left to the right side of the page, improving the overall user experience. The changes should also increase clickthrough rate and reduce dwell time and time to click, because the buttons are larger and users need to read less information.

## Design Justifications for Version B

Our main goal for Version B is to improve the site's usability and satisfaction. To do this, we elaborated on each company's description by adding a colloquial summary, a price estimate, contact information, and a rating. We hope that the conversational language and larger quantity of information will increase satisfaction because users will have a better idea of what each company does. To improve usability, we decluttered the left side of the page by spreading out the layout (moving descriptions and buttons to the right) and changed the button color to a brighter yellow so that it is clearer where to click to learn more. This format also increases affordance and clickthrough rate because each button sits right next to its company's information, allowing users to easily locate and click it to learn more about companies they are interested in. Finally, we believe these changes will also improve return rate because users will be incentivized to return to our informational site to compare their options.

## AB Testing

After designing our sites, we ran an AB test with the two versions, measuring clickthrough rate, time to click, return rate, and dwell time. For each of those metrics we came up with null and alternative hypotheses.

**AB Testing Hypotheses**

**Clickthrough rate**

Null Hypothesis: The clickthrough rate of version A will be equal to that of version B.

Alternative Hypothesis: The clickthrough rate for version A will be greater than that of version B because version A's buttons are larger, centered, and a familiar blue color, making them easier to recognize and click.

**Time to click**

Null Hypothesis: Time to click for version A will be equal to that for version B.

Alternative Hypothesis: Time to click for version A will be less than that for version B because the limited amount of information listed on the website will entice users to click through to the taxi sites faster in order to learn more.

**Dwell time**

Null Hypothesis: Dwell time on version B is equal to that of version A.

Alternative Hypothesis: Dwell time on version B is less than that of version A. Users will spend less time away from version B because B already has a lot of information, reducing the need to go to other sites.

**Return rate**

Null Hypothesis: Return rate for version A will be equal to that for version B.

Alternative Hypothesis: The return rate for version B will be higher than that for version A because version B has more information on it. Therefore, if users want to return to the landing page and re-evaluate which company to use, they will be more likely to go to the site that has more data.

Ultimately I failed to reject the null hypothesis for all four metrics. Below are the statistics for each.

**Clickthrough rate**

Version A: 15/31 or 0.484

Version B: 24/42 or 0.571

I calculated clickthrough rate by finding the number of unique clicks and dividing that by the number of unique sessions for each respective version. I used a chi-squared test because we're looking at categorical data: whether or not the users clicked.

Test statistic: 0.548

Degrees of freedom = 1

According to the chi-squared table, the test statistic needed to reject the null hypothesis at a 95% confidence level is 3.84, so the test statistic of 0.548 is too small, and therefore we fail to reject the null hypothesis that the clickthrough rates for interfaces A and B are equal.
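As an illustration of that calculation (a Python sketch rather than the R used for our other script), the function below recomputes the chi-squared statistic of independence from the 2×2 table of clicks and non-clicks reported above:

```python
# Chi-squared test of independence for a 2x2 click table.
# Counts from the AB test above: A had 15 clicks in 31 sessions,
# B had 24 clicks in 42 sessions.
def chi_squared_2x2(clicks_a, n_a, clicks_b, n_b):
    observed = [
        [clicks_a, n_a - clicks_a],  # row A: clicks, non-clicks
        [clicks_b, n_b - clicks_b],  # row B: clicks, non-clicks
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [observed[0][0] + observed[1][0],
                  observed[0][1] + observed[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of version and clicking.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

print(round(chi_squared_2x2(15, 31, 24, 42), 2))  # 0.55, well below 3.84 (df = 1)
```

This reproduces the reported statistic of about 0.548 (small differences come only from rounding).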

**Time to click**

Version A: 75698.667 ms

Version B: 22917.125 ms

I calculated average time to click by subtracting the page-load time from the time of each user's first click and then taking the average of those differences. I used a t-test for difference of means because we're comparing two numerical averages for equality.

Test statistic: 1.853

Degrees of freedom = 37

According to the t-test lookup table, the test statistic required to reject the null hypothesis at a 95% confidence level is 2.0262, so the test statistic of 1.853 is too small, and therefore we fail to reject the null hypothesis that the average time to click is the same for interfaces A and B.
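The comparison above can be sketched as a pooled (equal-variance) two-sample t-test, whose degrees of freedom, n_A + n_B − 2, come out to 37 when A has 15 timed clicks and B has 24. The sample arrays below are placeholders, since the raw per-user timings are not reproduced here:

```python
import math

def pooled_t_test(sample_a, sample_b):
    """Two-sample t-test with pooled variance (equal-variance assumption).

    Returns (t statistic, degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Sample variances (n - 1 in the denominator).
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    # Pool the variances, weighted by each sample's degrees of freedom.
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df

# Illustrative times-to-click in ms (placeholder data, not the study's).
a_times = [61000, 82000, 75000, 84000]
b_times = [20000, 24000, 23000, 25000]
t_stat, df = pooled_t_test(a_times, b_times)
# Reject at the 95% level only if |t_stat| exceeds the critical value
# for df from a t-table (2.0262 for df = 37, as used above).
```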

**Dwell time**

Version A: 12900.842 ms

Version B: 5973 ms

I calculated average dwell time by subtracting the time of the returning page load from the time of the click for each return and taking the average of those times. I used a t-test for difference of means because we’re looking at numerical averages and whether they are equal.

Test statistic: 1.799

Degrees of freedom = 34

According to the t-test lookup table, the test statistic required to reject the null hypothesis at a 95% confidence level is 2.0322, so the test statistic of 1.799 is too small, and therefore we fail to reject the null hypothesis that the average dwell time is the same for interfaces A and B.

**Return rate**

Version A: 19/24 or 0.792

Version B: 17/31 or 0.586

I calculated return rate by finding the number of returns and dividing that by the number of total clicks for each version of the website. I used a chi-squared test because we’re looking at categorical data again, whether or not users returned.

Test statistic: 2.458

Degrees of freedom = 1

According to the chi-squared table, the test statistic needed to reject the null hypothesis at a 95% confidence level is 3.84, so the test statistic of 2.458 is too small, and therefore we fail to reject the null hypothesis that the return rates for interfaces A and B are equal.

I also calculated the Bayesian probability for clickthrough rate.

To estimate the probability that Interface B has a higher clickthrough rate than Interface A, I wrote an R script (located at bayesian testing.R). I implemented the formula as an R function, treating Interface B as B and Interface A as A: alpha_a is the number of clicks on A plus 1, which is 16; beta_a is the number of non-clicks on A plus 1, which is 17; alpha_b is the number of clicks on B plus 1, which is 25; and beta_b is the number of non-clicks on B plus 1, which is 19.

I got about 0.768, a 76.8% chance that Interface B has a higher clickthrough rate than Interface A.
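The same probability can be approximated without the closed-form Beta integral by Monte Carlo sampling (a Python sketch of what the R script computes, using the same posteriors: Beta(16, 17) for A and Beta(25, 19) for B):

```python
import random

def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b, n=200_000, seed=0):
    """Estimate P(p_B > p_A) where p_A ~ Beta(alpha_a, beta_a)
    and p_B ~ Beta(alpha_b, beta_b), via Monte Carlo sampling."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        p_a = rng.betavariate(alpha_a, beta_a)  # plausible CTR for A
        p_b = rng.betavariate(alpha_b, beta_b)  # plausible CTR for B
        if p_b > p_a:
            wins += 1
    return wins / n

# Posteriors under a uniform Beta(1, 1) prior: clicks + 1, non-clicks + 1.
print(round(prob_b_beats_a(16, 17, 25, 19), 3))  # roughly 0.77, matching 0.768
```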

In summary, the significance tests showed no statistically significant difference between the two versions on any of the four metrics, while the Bayesian analysis of clickthrough rate suggests that Interface B most likely has the higher rate.

## Eyetracking

After we finished the AB test, we also ran an eyetracking test, with one user for each version of the website. We hypothesized that Version A would draw more focus to the center of the page, because A has large, attention-grabbing pictures in the center and its entire layout is center-aligned, whereas the information on B is spread across the page.

After collecting the data, I created a heatmap and a replay simulation of each eyetracking session. Below are photos of the heatmaps and replays for both Version A and Version B.

Visualization analysis: The eyetracking data shows that attention on Version A is focused on the center of the page, where the images, buttons, and text are. By contrast, on Version B the focus is on the left side, where the pictures are aligned. This supports our hypothesis that Version A would draw more focus to the center.

## Conclusion

Based on our research, I recommend that the company conduct more testing, as our AB testing yielded no statistically significant results. Without statistical significance, we cannot conclude that one version of the website leads to more user interaction with the interface. Additionally, the eyetracking data reveals that users concentrate on the areas of the page with images and attention-grabbing text; while on Version A this was the center and on Version B the left side, there does not seem to be a difference in how focus clusters around the areas containing information. It is therefore difficult to tell whether Version A or Version B would fare better, and I recommend that the company do more testing before proceeding.

As mentioned earlier, the AB testing tracked clicks and page loads, with timestamps for each, and produced no statistically significant difference between the two websites. The eyetracking data, which shows where a user was looking at each point in time, did, however, pick up on users focusing on different locations on the page for the two versions. AB-test data is better for tracking clicks and actual user interaction with the site, but doesn't tell us where users were looking or focusing; eyetracking data is better suited to showing where users paid the most attention, but tells us nothing about user clicks and page loads.