Monthly Archives: May 2014

Text Figures and Lining Figures

Are you using the correct sort of numbers? Numbers come in two forms: old style and lining.

oldstyle-proportional-equal
Old style figures

lining-proportional-equal
Lining figures

The difference between old style and lining figures is how they sit relative to the text’s baseline and x-height.

handgloves-oldstyle-baseline handgloves-lining-baseline

Generally speaking, old style numbers look better within normal text, because they keep the same “pattern” of ascenders and descenders, and lining numbers look better with text set in all caps, because they match the height of the capitals.

supernova-capitals-oldstyle

With the text in all caps the old style numbers stand out as odd, but the all caps text looks much better when the old style figures are replaced with lining figures.

supernova-capitals-lining

With normal text, the old style figures help the text to look more normal, but the name of the supernova still stands out as looking a bit odd.

supernova-oldstyle

But the best option is a combination of both old style and lining figures as shown below.

supernova-mixed

Both old style and lining figures can come in proportional and monospaced (fixed-width) varieties. Monospaced figures are sometimes called tabular figures, because they are used in tables so that columns of tens, hundreds, thousands, etc. line up properly.

comparison-oldstyle
L-R: Proportional old style figures and tabular old style figures.

comparison-lining
L-R: Proportional lining figures and tabular lining figures.

Stop Putting Commas In Your Numbers

or Why you need to read Le Système international d’unités (8e édition)

How do you write very large or very small numbers? How, for example, would you write the speed of light out in full?

If you would write c = 299,792,458 m/s then please stop, because you’re doing it wrong. You can argue all you want about tradition, and “the way things have always been done” but you are still totally, absolutely, unequivocally wrong. There is a right way, an official, standardised way, to write very large and very small numbers, and it’s not with commas in them.

“Following the 9th CGPM (1948, Resolution 7) and the 22nd CGPM (2003, Resolution 10), for numbers with many digits the digits may be divided into groups of three by a thin space, in order to facilitate reading. Neither dots nor commas are inserted in the spaces between groups of three.”

The correct way to write the speed of light is c = 299 792 458 m/s. Ideally you’d use a special Unicode character, known as “NARROW NO-BREAK SPACE (U+202F)”, which stops text from wrapping around half-way through a number, but this isn’t very well supported, so the better-supported “THIN SPACE (U+2009)” or even just a normal space will do.

The reason for this is that the decimal mark isn’t always a full stop. Only 60% of countries use a full stop, whereas other countries use other marks. For example, a number that would traditionally be written in the UK as 123,456,789.01 would be written in France, Germany, Spain and many other countries as 123.456.789,01, and in Canada as either, depending on whether you’re working in English or French. This confusion was deemed undesirable, and so the scientific community declared in 2003 that:

The 22nd General Conference [on Weights and Measures],
considering that a principal purpose of the International System of Units is to enable values of quantities to be expressed in a manner that can be readily understood throughout the world …
reaffirms that “Numbers may be divided in groups of three in order to facilitate reading; neither dots nor commas are ever inserted in the spaces between groups”, as stated in Resolution 7 of the 9th CGPM, 1948.

Remember that thousand separators are also used when dealing with very small numbers. I’ve provided some examples below if you’re struggling to get your head around them.

| Incorrect | Correct | Incorrect | Correct |
|---|---|---|---|
| 123 | 123 | 0.123 | 0.123 |
| 1234 | 1234 | 0.1234 | 0.1234 |
| 12,345 | 12 345 | 0.12345 | 0.123 45 |
| 123,456 | 123 456 | 0.123456 | 0.123 456 |
| 1,234,567 | 1 234 567 | 0.1234567 | 0.123 456 7 |
| 12,345,678 | 12 345 678 | 0.12345678 | 0.123 456 78 |
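The grouping rule can be sketched in a few lines of Python. This is only an illustration, not an official algorithm: the function name `group_digits` is made up for this example, it works on a number already written as a plain string of digits, and it assumes (as in the table above) that a run of four digits or fewer is left unsplit.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE, the separator suggested above


def group_digits(number: str, sep: str = NNBSP) -> str:
    """Split the digits of a plain decimal string into groups of three.

    Assumption for this sketch: runs of four digits or fewer are left
    alone, matching the rows for 1234 and 0.1234 in the table above.
    """
    sign = ""
    if number.startswith(("+", "-")):
        sign, number = number[0], number[1:]

    integer, point, fraction = number.partition(".")

    if len(integer) > 4:
        # Group from the right, so any short group ends up at the front.
        head = len(integer) % 3
        groups = [integer[:head]] if head else []
        groups += [integer[i:i + 3] for i in range(head, len(integer), 3)]
        integer = sep.join(groups)

    if len(fraction) > 4:
        # Group from the left, so any short group ends up at the end.
        fraction = sep.join(fraction[i:i + 3] for i in range(0, len(fraction), 3))

    return sign + integer + point + fraction


# An ordinary space is passed here only so the result is easy to read.
print("c =", group_digits("299792458", " "), "m/s")
```

Passing `sep="\u2009"` would give the THIN SPACE fallback mentioned above.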

Ranking Ratings

Imagine that you’re trying to rank items that people have either voted for or against. What is the best way to do this?

You could simply take the number of for votes, and subtract the number of against votes. But this doesn’t work if there are a different number of votes for different items: an item with 100 for votes and 50 against votes would be ranked higher than an item with 30 for votes and 1 against vote. You could rank items by their ratio of for votes to total votes, essentially calculating the average score, but this doesn’t work either: an item with just a single for vote (ratio 1.000) would beat an item with 999 for votes and one against vote (ratio 0.999).

A better method in this binomial case is to rank by the lower bound of Wilson’s score interval:

\mathrm{WSI} = \frac{1}{1 + \frac{1}{n}z^2} \left( \hat p + \frac{1}{2n}z^2 \pm z \sqrt{\frac{1}{n}\hat p \left( 1 - \hat p \right) + \frac{1}{4n^2}z^2} \right)

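Here \hat p is the observed fraction of for votes, n is the total number of votes, and z is the standard normal quantile for the chosen confidence level (about 1.96 for 95%). This is a fairly imposing equation, but what’s important is what it does, not how it works.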

When ranking items using Wilson’s score interval we are still considering the for-against ratio, but we’re also taking into account the uncertainty created by having a different number of votes for each. For example, consider the following four items:

| Item | Total Votes | Votes For | Votes Against | Ratio |
|---|---|---|---|---|
| Item 1 | 10 | 5 | 5 | 0.5 |
| Item 2 | 20 | 10 | 10 | 0.5 |
| Item 3 | 50 | 25 | 25 | 0.5 |
| Item 4 | 100 | 50 | 50 | 0.5 |

As you can see, the ratio for each item is the same, but Item 4 received ten times the votes of Item 1 and should therefore be ranked higher.

items-graph

In the graph above, each item has the same score ratio, but the curve for Item 4 (n=100) is much sharper around 0.5 because there is less uncertainty about whether it has the “correct” score. An item with only 10 votes might have a “correct” ratio of 0.5, but it’s less likely than for an item with 100 votes.

If we now calculate the lower bound of Wilson’s score interval, we obtain the following results which we can then rank correctly:

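| Item | Total Votes | Votes For | Votes Against | Ratio | Wilson SI (lower bound) |
|---|---|---|---|---|---|
| Item 1 | 10 | 5 | 5 | 0.5 | 0.2366 |
| Item 2 | 20 | 10 | 10 | 0.5 | 0.2993 |
| Item 3 | 50 | 25 | 25 | 0.5 | 0.3664 |
| Item 4 | 100 | 50 | 50 | 0.5 | 0.4038 |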

items-graph-indicators

The position of each arrow indicates the lower bound of the Wilson score interval.

In this case we are taking the lower bound of a 95% confidence interval. Taking the lower bound means that you are finding, given the data you have, a score that the “correct” score is very unlikely to fall below. We cannot be 100% sure, so 95% is a conventional choice – scientists like 95% confidence intervals.
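For anyone who wants to check the numbers, here is a minimal Python sketch of the lower bound. The function name `wilson_lower_bound` is my own, z is fixed at the usual approximation of 1.96 for a 95% interval, and returning 0 for an item with no votes is an assumption made for this example rather than part of the formula.

```python
from math import sqrt

Z_95 = 1.96  # approximate standard normal quantile for a 95% interval


def wilson_lower_bound(votes_for: int, votes_against: int, z: float = Z_95) -> float:
    """Lower bound of Wilson's score interval for the fraction of for votes."""
    n = votes_for + votes_against
    if n == 0:
        return 0.0  # assumption: an item with no votes ranks at the bottom

    p_hat = votes_for / n
    centre = p_hat + z * z / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# The four items from the table above, as (votes for, votes against).
items = {
    "Item 1": (5, 5),
    "Item 2": (10, 10),
    "Item 3": (25, 25),
    "Item 4": (50, 50),
}

# Rank from highest to lowest lower bound.
ranked = sorted(items, key=lambda name: wilson_lower_bound(*items[name]), reverse=True)
for name in ranked:
    print(name, round(wilson_lower_bound(*items[name]), 4))
```

The same function also handles the earlier problem case: an item with a single for vote scores far lower than an item with 999 for votes and one against vote.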

This system could be extended to sites like Amazon that use star rating systems. Currently Amazon calculates a weighted average, which places a product with one five-star rating above a product with one hundred five-star ratings and a single lower rating. A better idea would be to convert the star ratings to for and against votes and use Wilson’s score interval, or a Bayesian model, to rank products.

UPC Barcodes

This post was inspired by episode 108 of the fantastic 99% Invisible podcast.

UPC barcodes are genuinely ubiquitous. You’ve probably seen a dozen or more UPC barcodes today without even realising or thinking about it. But how do they work?

A UPC barcode is a graphical representation of a twelve-digit number. What a barcode reader does with that twelve-digit number is the important part from a user’s point of view, but we’re interested in how the barcode reader turns a barcode into that twelve-digit number.

barcode-original

An example UPC barcode. There are thirteen written digits because the last digit is a check digit, which ensures that the code has been entered correctly if it has to be entered by hand.

The first thing to realise is that a barcode is not a pattern of black lines: it is a pattern of black lines and white spaces. A UPC barcode encodes each of the twelve digits in binary, with the black bars representing 1s and the white bars representing 0s. Each digit is represented by seven bits, with each bit represented by a “sub-bar” that is either black or white.

seven-lhs

seven-rhs

The number 7 as represented on the left-hand side (top) and right-hand side (bottom) of a UPC barcode. In binary the representations would be 0111011 and 1000100 respectively.

barcode-coloured

The same barcode as above, but with each line of bits (or “pixels”) coloured blue or yellow.

The barcode begins (reading in either direction) with two guide bars that let the barcode scanner know the width of each bit in the barcode, and features another set of guide bars in the centre. The number of black sub-bars for each digit is always odd on the left-hand side of the central guide bars, and even on the right-hand side, which enables a barcode scanner to tell if it is scanning a barcode right-side up or upside-down (see the representations of the number 7 above). The digits start immediately after the guide bars, but as each of the left-hand digits begins with a 0-bit, and each of the right-hand digits ends with a 0-bit, these digits never run into the guide bars or into a following or preceding digit.

With three bits (101) for each of the two guide bars on either side, plus five bits (01010) for the central guide bar, and seven bits for each of the twelve digits, this makes a total of ninety-five bits for the entire barcode. The complete binary representation of the barcode at the top of this post would be:

101 0110111 0110001 0001011 0100011 0100011 0001101 0111101 01010 1100110 1100110 1110010 1100110 1100110 101

(The guide bars are the 101 groups at either end and the 01010 group in the centre.)
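The encoding can be sketched in Python as follows. This is an illustration rather than a description of how any real scanner works. It assumes the standard UPC-A layout of six digits on each side of the central guide bars, and that each right-hand pattern is the bitwise complement of the left-hand pattern for the same digit (as with the two representations of 7 above). The names `encode_upc` and `decode_upc` and the sample number are made up for this example.

```python
# Left-hand 7-bit patterns for the digits 0-9; each has an odd number of 1s.
LEFT = ["0001101", "0011001", "0010011", "0111101", "0100011",
        "0110001", "0101111", "0111011", "0110111", "0001011"]
# Right-hand patterns are the complements, so each has an even number of 1s.
RIGHT = ["".join("1" if bit == "0" else "0" for bit in code) for code in LEFT]

END_GUARD = "101"
CENTRE_GUARD = "01010"


def encode_upc(digits: str) -> str:
    """Turn a twelve-digit string into the ninety-five bits of the barcode."""
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly twelve digits")
    left = "".join(LEFT[int(d)] for d in digits[:6])
    right = "".join(RIGHT[int(d)] for d in digits[6:])
    return END_GUARD + left + CENTRE_GUARD + right + END_GUARD


def decode_upc(bits: str) -> str:
    """Turn ninety-five bits back into twelve digits, scanned either way up."""
    bits = bits.replace(" ", "")
    if len(bits) != 95:
        raise ValueError("expected ninety-five bits")
    # Left-hand digits have an odd number of 1s. If the first digit has an
    # even number, the barcode was scanned upside-down, so flip it.
    if bits[3:10].count("1") % 2 == 0:
        bits = bits[::-1]
    if (bits[:3], bits[45:50], bits[92:]) != (END_GUARD, CENTRE_GUARD, END_GUARD):
        raise ValueError("guide bars not found where expected")
    left = [bits[i:i + 7] for i in range(3, 45, 7)]
    right = [bits[i:i + 7] for i in range(50, 92, 7)]
    # list.index raises ValueError if a group is not a valid digit pattern.
    return ("".join(str(LEFT.index(group)) for group in left)
            + "".join(str(RIGHT.index(group)) for group in right))


sample = "036000291452"  # arbitrary twelve-digit number used for illustration
bits = encode_upc(sample)
print(len(bits), decode_upc(bits), decode_upc(bits[::-1]))
```

Decoding the reversed bit string gives the same twelve digits, which is the odd/even trick described above at work.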

By reading and decoding this binary series the barcode reader then provides a computer with the twelve-digit UPC number, which the computer can then use to control stock, add up prices, etc.

Why Tokyo Looks Different From Space

When observed from space at night, most cities look very similar.

porto-night

Porto, Portugal

istanbul-night

Istanbul, Turkey

moscow-night

Moscow, Russia

But Tokyo looks very different.

tokyo-night

Unlike most major cities, Tokyo still uses mercury-vapour lamps (which were invented in 1901) rather than sodium-vapour lamps (which were invented in 1920) for its street lighting. The spectra of light emitted by mercury- and sodium-vapour lamps are very different:

sodium-spectrum

mercury-spectrum

Above: the sodium spectrum; Below: the mercury spectrum.

The overall colour of light produced by a sodium-vapour lamp is a bright yellow,* whereas the colour of light produced by a mercury-vapour lamp is a bright turquoise-white.

helsinki-street
Source: naystin

tokyo-street
Source: sinkdd

In the photographs above, Helsinki (top) is using sodium-vapour bulbs for its street lighting (it still has some mercury-vapour lamps, but it is replacing them), and Tokyo (bottom) is using mercury-vapour bulbs. In Berlin, the division between the former East and West parts of the city is still visible from space because of the different types of bulbs used in their streetlamps.

berlin-night

The former West Berlin (on the left of the image) uses mercury-vapour bulbs, and the former East Berlin (on the right) uses sodium-vapour bulbs.

* Light from a sodium-vapour lamp is almost monochromatic, at 589.3 nm. Optical telescope users prefer sodium-vapour light pollution because it is easier to filter out.