Visualizing Value Distributions and Event Time Clusters in One Dimension (Update: Source Code Added)

(Update: The source is available for download at the bottom of the entry.)

I must admit I have always felt a little uncomfortable generating or referencing histograms. The enormous bias that might be introduced by changing the segment/bin size always nags at the back of my mind. True, thoughtfully constructed histograms with a fairly large sample set can be very illuminating, but the reader must either trust the visualization architect completely or rebuild the graph from source data and verify that the parameters were well chosen.

For those who share these concerns, I will outline a set of visualization techniques that are less susceptible to accidental or intentional bias. These visualizations are simple to implement, very spatially compact, easy to understand, and applicable not only to value distributions within a set but also to the clustering of events in time.

Enough talk, take a look at an example visualization:
A taste of visualizations to come

Continue on for an explanation and all the details…

The “basic” form of the visualization is simple: one-pixel-wide strokes at each data point’s position. Because the strokes are partially transparent, two or more points in the same position add up to a bolder stroke than a single one, in proportion to the number of overlapping lines. This is well suited to showing event clustering in time as well as modestly dense value distributions across a range. The information is encoded very tightly along the x-axis, so graphics can often be made remarkably small without loss of utility.
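The overlap behavior can be sketched in a few lines of plain JavaScript. This is a minimal illustration, not the Protovis code from the archive: the function names `strokeColumns` and `compositeOpacity`, the stroke opacity of 0.25, and the rounding of values to whole pixel columns are all assumptions made for the example. It also assumes ordinary source-over alpha blending, under which n stacked strokes of opacity a reach a combined opacity of 1 - (1 - a)^n. That is close to proportional for small counts and saturates toward fully opaque as strokes pile up.

```javascript
// Map each data value to a pixel column and count how many strokes land there.
// (Hypothetical helper; rounding to whole columns is an assumption.)
function strokeColumns(values, min, max, widthPx) {
  const counts = new Array(widthPx).fill(0);
  for (const v of values) {
    if (v < min || v > max) continue; // skip out-of-range points
    const x = Math.min(widthPx - 1, Math.floor(((v - min) / (max - min)) * widthPx));
    counts[x] += 1;
  }
  return counts;
}

// Combined opacity of n identical strokes drawn on top of each other,
// assuming source-over blending (other blend modes behave differently).
function compositeOpacity(n, alpha) {
  return 1 - Math.pow(1 - alpha, n);
}

// Three coincident points, plus two isolated ones, on a 100 px wide strip.
const counts = strokeColumns([1, 1, 1, 4, 9.5], 0, 10, 100);
const opacities = counts.map((n) => compositeOpacity(n, 0.25));
```

A lower stroke opacity keeps the response closer to linear over a wider range of overlap counts, at the cost of fainter single strokes.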

Here are some examples of the simple stroke format:
An Event Timeline:
event_cluster_1d_sample1-optimized-blue.svg

Value Distribution:
val_distrib_1d_sample1-optimized-blue.svg

Value Distribution, Narrow format:
val_distrib_n_1d_sample1-optimized-blue.svg

Sometimes data may be too dense for the simple stroke form. This is often the case when showing distributions of large data sets that were previously rendered as histograms. In this case, the sharp pixel-wide lines are replaced by gradients. The key to these visualizations is that the human brain is excellent at recognizing changes in color values (determining absolute color values in different contexts is a different story; working around that flaw is addressed later on). By rendering each data position as a segment of translucent color, the constructive interaction of nearby points creates visually intuitive “bright” spots proportional to the degree of clustering.

The exact width of the segments is the display width divided by the total number of data points; the color fades from the center of the segment to 100% transparent. These two properties mean that an exactly even distribution of data points produces a solid unchanging band, whereas two points on top of each other are roughly twice as “intense”, and two points separated by twice the average distance will fade to transparent at exactly their center point. The end product manages to capture both the overall shape/global trends and the detail of smaller sub-clusters and groupings. With the addition of a reference color stripe below, the viewer can always re-orient their color perception to the global average to overcome local color perceptual effects.
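The three properties described here can be checked with a small sketch in plain JavaScript. It rests on several assumptions that are not confirmed by the text: the fade is linear (a triangular profile), intensities from overlapping segments simply add, and the “width” of a segment is read as the distance from a data point to where its color becomes fully transparent. That last reading is the one under which a linear fade gives a flat band for evenly spaced points. The names `kernel`, `intensityAt`, and `intensityProfile` are hypothetical and are not taken from the archive.

```javascript
// Triangular (linear-fade) kernel: 1 at the data point, 0 at `fade` pixels away.
// The linear shape is an assumption; the post does not specify the fade curve.
function kernel(distance, fade) {
  return Math.max(0, 1 - Math.abs(distance) / fade);
}

// Summed intensity at pixel position x, for data points already mapped to pixels.
function intensityAt(x, positionsPx, fade) {
  let total = 0;
  for (const p of positionsPx) total += kernel(x - p, fade);
  return total;
}

// Intensity profile across the display, one sample per pixel column.
function intensityProfile(positionsPx, widthPx) {
  // Assumed: fade distance = display width / number of points (the average spacing).
  const fade = widthPx / positionsPx.length;
  const profile = [];
  for (let x = 0; x < widthPx; x++) profile.push(intensityAt(x, positionsPx, fade));
  return profile;
}

// Five evenly spaced points on a 100 px display, so the fade distance is 20 px.
const even = [10, 30, 50, 70, 90];
const evenProfile = intensityProfile(even, 100);

// Two coincident points, and two points separated by twice the fade distance.
const stacked = intensityAt(50, [50, 50], 20);
const gapMidpoint = intensityAt(50, [30, 70], 20);
```

Under these assumptions the profile is flat between the first and last evenly spaced points, doubles where two points coincide, and drops to zero midway between two points that are twice the average spacing apart. The flat level is also a natural candidate for what a reference stripe would show, though how the original stripe is computed is not stated here.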

Below are gradient-based displays of the exact same data sets already shown with the simple stroke format.

An Event Timeline:
event_cluster_1d_sample2-optimized-blue.svg

Value Distribution:
val_distrib_1d_sample2-optimized-blue.svg

Value Distribution, Narrow format:
val_distrib_n_1d_sample2-optimized-blue.svg

Additionally, the two forms can be combined to highlight both individual events and overall clustering patterns. Below I have stacked all three variants along with optimal histograms (where appropriate) to illustrate each of their relative merits. I am sure better-tweaked renderings are possible, but this shows what an hour with a common visualization library (protovis) can produce along these lines.


An Event Timeline:

event_cluster_1d_sample1-optimized-blue.svg Simple Ticks

event_cluster_1d_sample2-optimized-blue.svg Gradient

event_cluster_1d_sample3-optimized-blue.svg Gradient (with stroke marks)


Value Distribution:

Possible Histograms, 2–28 Segments
histograms-blue.png

Ticks at Data Positions
val_distrib_1d_sample1-optimized-blue.svg
optimal_histogram_bins_14-blue.png
Optimal histogram, per this algorithm (14 Bins)

Distribution Gradient
val_distrib_1d_sample2-optimized-blue.svg

Distribution Gradient (+ticks)
val_distrib_1d_sample3-optimized-blue.svg


Value Distribution, Narrow:

val_distrib_n_1d_sample1-optimized-blue.svg Ticks at Data Positions
optimal_histogram_bins_14-blue.png Optimal Histogram

val_distrib_n_1d_sample2-optimized-blue.svg Distribution Gradient

val_distrib_n_1d_sample3-optimized-blue.svg Distribution Gradient (+ticks)


Update: For anyone interested in the code I used to produce these graphs, you can download a zip archive containing a working copy of the page and its resources. Beware, the code is not especially well factored yet, as it was just a test bed to generate the images for this post. Regardless, it should convey some simple algorithms for implementing this data visualization technique. If you are familiar with protovis and/or JavaScript, it should be trivial to modify and extend for your own purposes. The code original to me is licensed under GPL v3; the images are licensed under the Creative Commons Attribution-ShareAlike 3.0 License.

archive_of_data_distribution_in_1d_code_workspace
