I was working on a data analysis that included telephone numbers, and I noticed an intriguing filter on the data. A database analyst had specified this:
phone_number BETWEEN ‘2010000000’ and ‘9899999999’
A glance at the North American Numbering Plan, which defines valid telephone numbers in North America, showed me why the analyst chose these two endpoints: 201 is a valid area code, and 999 is not. However, this filter did not accomplish everything the analyst intended.
This particular set of data to be filtered includes a number of invalid telephone numbers, including ‘40’, ‘0’, ‘0000000000’, and several seven-digit telephone numbers. The condition above will successfully filter out the ‘0000000000’ and ‘0’ telephone numbers, but ‘40’ and most seven-digit numbers would remain in the resulting query results.
The reason for this surrounds each phone number I’ve mentioned: the tick character, or single quotation mark. Numeric data — which may be sent to functions or used in calculations — is not surrounded by tick marks. The pair of ticks delimits string data, which is interpreted by the computer as text rather than as numbers. When the computer sorts a set of strings, it does not see numeric digits as numbers that can be used for calculations. Rather, to the computer, a numeric digit is no more than a code, just as each letter and punctuation mark is a code.
Consider: you have a stack of file folders to alphabetize and file. As you work through the stack, you group the As together, then the Bs and the Cs and so on. You may recall similar exercises in school, and you know to put “Mom” before “Mother” and “Fascinating” before “Fat.”
When you reach the folder labeled ‘2,010,000,000’ you place it at the top of the stack, before the letter A folders. Then you reach the folder labeled ‘40’, and you stop to think. What do you do with this? If you were sorting these as numbers, then forty should come first as it’s smaller than two billion. But you aren’t sorting numbers; you are alphabetizing. When we alphabetize, we look at each letter on its own. Two billion begins with 2, and forty begins with 4. Which of those comes first? The 2 does.
Two billion comes before forty in the same way that antidisestablishmentarianism comes before dog.
As a result, the filter
phone_number BETWEEN ‘2010000000’ and ‘9899999999’ will include
- any string of digits that begins with 2011 to 2999
- any string of digits with a first digit between 3 and 8
- any string of digits that begins with 98
- and several other configurations.
The phone number ‘40’ will be included, as will ‘5784923’ and ‘8675309’ and ‘80000 Leagues Under the Sea’. The number ‘9888888888’ will be included, but ‘99’ will be filtered out.
The upshot is this: Be aware of how the computer sees, compares, and sorts the data, so that you always know the answer.
40 > 2,010,000,000 when it is a string like ‘40’ or ‘2010000000’.