Wah, meh...huzzah!

I followed a Twitter / blog thread today which led me to some discussions about how the familiar 5-star rating system may be fundamentally flawed (i.e. useless).

This makes some intuitive sense, both as a viewer of and a contributor to product ratings. I’m working on an app right now that I had planned to casually throw a 5-star rating into, but now I’m starting to reconsider.

Thinking about various complicated “comparison” methods makes me feel a little nauseous, though. The idea of an absolute rating is very accessible (people are used to being evaluated on an apparently absolute scale themselves, thanks to the prevailing culture of constant assessment that starts in education and lingers for far too long into adult life). Could a much simpler solution be to drop two of the stars, turning the 5-star system into a 3-star system?

Maybe the stars would need to be changed to communicate the difference, but I would suggest that the interpretation would have to be something like this:

  • 1 star = terrible aka “wah!”
  • 2 stars = mediocre aka “meh…”
  • 3 stars = fabulous aka “huzzah!”
  • Oh, also 0 stars = unrated aka “huh?”
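If I did build this into the app, the scale itself would be trivial to represent; something like this rough Python sketch (the names are just my wah/meh/huzzah labels, nothing standard):

```python
from enum import IntEnum

class Rating(IntEnum):
    """Three-point rating scale, with 0 reserved for 'not yet rated'."""
    UNRATED = 0  # "huh?"
    WAH = 1      # terrible
    MEH = 2      # mediocre
    HUZZAH = 3   # fabulous
```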

The advantages of this method are that there is less scope for temporal variation (e.g. my 4 on an angry day = my 5 on a happy day), and that it limits the effect of personal variation, since less interpretation is involved in discriminating between “fabulous” and “mediocre” than between abstract numerical values.

How should we report on this type of rating? Clearly there are serious shortcomings in taking a mean numerical value (can one hundred “terribles” and one hundred “fabulouses” really average out to an aggregate score of “mediocre”?). With fewer categories, it should be easy enough to present the data as simple percentages. Actually, Amazon do this for their 5-star ratings, which does solve the problem of taking the mean, although given the difficulty of discriminating 3 vs. 4 and 4 vs. 5 etc., the usefulness of the aggregate figures is unfortunately limited.
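As a rough illustration (plain Python with made-up ratings; this isn’t how Amazon compute anything), the percentage report is just a count:

```python
from collections import Counter

def percentage_breakdown(ratings):
    """Share of each rating category, as a percentage of all ratings cast."""
    counts = Counter(ratings)
    total = len(ratings)
    return {category: 100 * n / total for category, n in counts.items()}

# One hundred "terribles" and one hundred "fabulouses":
sample = ["wah"] * 100 + ["huzzah"] * 100
print(percentage_breakdown(sample))  # {'wah': 50.0, 'huzzah': 50.0}
```

Which tells you rather more than averaging the 1s and 3s to 2.0 and calling the whole thing “meh”.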

Simpler still, it would make sense to highlight the modal rating for this system (e.g. most people thought this was “fabulous”); a modal 4 out of 5 doesn’t say nearly as much.
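Pulling out the mode is just as easy (another sketch; tied categories would need a policy of their own):

```python
from collections import Counter

def modal_rating(ratings):
    """Return the most common rating (arbitrary between tied categories)."""
    [(category, _count)] = Counter(ratings).most_common(1)
    return category

print(modal_rating(["meh", "huzzah", "huzzah", "wah", "huzzah"]))  # huzzah
```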

I’d love to do an experiment converting existing 5-star ratings down to 3-star but, of course, that’s impossible: there’s no way to recover the rating each person would actually have given on the coarser scale. To computer scientists it may seem backwards, but this is a case of increased “precision” actually destroying data: the addition of extra rating increments inevitably leads to more and more arbitrary decisions.

What would be interesting, though, would be to map some 5-star ratings to 3-star using some vaguely sensible method, and then see how viewers’ perceptions of the ratings differ.
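One “vaguely sensible method” might be as blunt as collapsing the ends of the scale. This particular bucketing is purely my guess, and deciding where the middle values land is exactly the kind of arbitrary call I’m complaining about:

```python
def collapse_to_three(five_star_rating):
    """Map a 1-5 star rating onto the wah/meh/huzzah scale.

    1-2 -> "wah", 3 -> "meh", 4-5 -> "huzzah" is just one guess; shifting
    the boundaries gives equally defensible (and different) results.
    """
    if five_star_rating <= 2:
        return "wah"
    if five_star_rating == 3:
        return "meh"
    return "huzzah"

print([collapse_to_three(stars) for stars in [1, 2, 3, 4, 5]])
# ['wah', 'wah', 'meh', 'huzzah', 'huzzah']
```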