7.6. Input data mangling

Input data are not stored as a literal chunk of text by refdb. If you import a dataset, and later retrieve it using the same format, the dataset is not guaranteed to be the same character by character. Instead, the data are sliced up, sometimes slightly modified, and sometimes refdb creates additional information. This section tries to explain what happens to your data behind the scenes, and why this is good for you.

7.6.1. Information that refdb generates for you

In a few cases refdb fills in some default values if the datasets do not specify them. This happens in the following cases:

7.6.2. Information that refdb mangles

Citation keys are supposed to work as ID values in SGML and XML documents. To avoid any character encoding hassles, only the first 127 characters of the US-ASCII character set are permitted. These characters work in most character encodings. Any characters outside of the permitted range will simply be deleted from a citation key, regardless of whether it is specified by the dataset or generated from the author name.
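
As a rough illustration of this filtering, the following Python sketch drops every character outside US-ASCII from a citation key. The function name clean_citation_key is hypothetical, and the exact cut-off (code points below 128) is an assumption; refdb's own implementation may differ in detail.

```python
def clean_citation_key(key: str) -> str:
    """Delete every character that is not plain US-ASCII.

    Assumption: "permitted" means a code point below 128. Characters
    outside that range are removed, not replaced or transliterated.
    """
    return "".join(ch for ch in key if ord(ch) < 128)


if __name__ == "__main__":
    # The umlaut is deleted; the rest of the key is left alone.
    print(clean_citation_key("Müller1999"))
```

Note that deletion can shorten a key noticeably if an author name contains many non-ASCII characters.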

Periodical names and author/editor names receive some special treatment in order to make them usable for refdb. Both periodical names and person names should be provided in a particular format. However, if you retrieve your data from an electronic source instead of writing them from scratch, the names may not conform to the rules. In order to make best use of these data, refdb attempts to normalize the incoming periodical and person names until they conform to the rules.

There is a good reason for this normalization. Consider a periodical name like "The Journal of Biological Chemistry". Different electronic sources may abbreviate this as one of:

Although a human reader has no trouble recognizing the same journal in all three cases, a database cannot make that connection. If you add the periodical abbreviations as they are, you'll end up with three different journal entries. As a consequence, a query like getref :JO:='J.Biol.Chem.' will miss every paper that was stored under one of the other two spellings. This is not a good thing.

7.6.2.1. Periodical names

refdb normalizes abbreviated periodical names as follows. First, the name is tokenized, with periods and spaces acting as separators. Each token is then handled according to these rules:

  • If a token has a trailing period, it is assumed to be an abbreviated word and used as such.

  • If a token has no trailing period, it is compared to an internal list of unabbreviated words (see the listword and addword commands for further information about this list). If a match is found, no period is added.

  • If no match is found, the token is assumed to be an abbreviation of something else and a period is added.

Spaces after periods are removed, as one separator is sufficient. All three versions of the journal name above would therefore be normalized to the first one.
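
The tokenizing rules above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function name normalize_periodical and the KNOWN_WORDS set are hypothetical stand-ins (refdb keeps its own word list, managed with listword and addword), and the case-insensitive comparison is a guess.

```python
import re

# Hypothetical stand-in for refdb's internal list of unabbreviated words.
KNOWN_WORDS = {"journal", "of", "the", "and", "science", "nature"}


def normalize_periodical(name: str) -> str:
    """Normalize an abbreviated periodical name (illustrative sketch)."""
    # A token is a run of characters that are neither periods nor
    # whitespace, optionally followed by one trailing period.
    tokens = re.findall(r"[^.\s]+\.?", name)
    out = ""
    for token in tokens:
        if token.endswith("."):
            out += token            # already marked as an abbreviation
        elif token.lower() in KNOWN_WORDS:
            out += token + " "      # full word: no period, keep a space
        else:
            out += token + "."      # unknown token: treat as abbreviation
    return out.rstrip()


if __name__ == "__main__":
    for variant in ("J.Biol.Chem.", "J. Biol. Chem.", "J Biol Chem"):
        print(normalize_periodical(variant))
```

With this sketch, differently spaced and punctuated spellings of the same abbreviation collapse into one string, which is what allows a single query to find all of them.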

7.6.2.2. Person names

The names of authors and editors are normalized like this: Everything to the left of the first comma is assumed to be the last name and remains untouched. The next item (separated by a space, a period, or both) is assumed to be the first name. If it consists of a single capital letter, a period is added and any trailing spaces are removed. If the first name is spelled out, it is used as such. All following name parts to the left of the second comma, if any, are assumed to be middle names. Each of them receives the same treatment as a first name. Finally, everything to the right of the second comma, if any, is assumed to be an honorific or lineage part and is used as such. All spaces following either a period or a comma are removed. A few examples should make this procedure clear:

  • "Miller, John S" -> "Miller,John S."

  • "Chun, H-K" -> "Chun,H.-K."

  • "Delorie, DJ" -> "Delorie,DJ"

  • "Doe, J S" -> "Doe,J.S."

  • "Random,Jane,Jr." -> "Random,Jane,Jr."

The last example shows that your data will not be modified as long as they stick to the input format.
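
The person name rules can be sketched in Python as follows. The names normalize_person and _normalize_part are hypothetical, and the handling of hyphenated initials is inferred from the "Chun, H-K" example rather than from a documented rule, so treat this as an approximation of the behaviour and not as refdb's actual code.

```python
import re


def _normalize_part(token: str) -> str:
    """Add a period to every single capital letter in a name part."""
    had_period = token.endswith(".")
    body = token.rstrip(".")
    # Assumption: hyphenated initials such as "H-K" are treated piecewise.
    pieces = [p + "." if len(p) == 1 and p.isupper() else p
              for p in body.split("-")]
    out = "-".join(pieces)
    if had_period and not out.endswith("."):
        out += "."
    return out


def normalize_person(name: str) -> str:
    """Normalize "Last, First Middle, Lineage" (illustrative sketch)."""
    fields = name.split(",", 2)
    last = fields[0]                      # left of first comma: untouched
    if len(fields) == 1:
        return last
    # First and middle names: split on whitespace, keep trailing periods.
    tokens = re.findall(r"[^.\s]+\.?", fields[1])
    given = ""
    for part in (_normalize_part(t) for t in tokens):
        # No space after a period; spelled-out names keep one space.
        given += part if part.endswith(".") else part + " "
    result = last + "," + given.rstrip()
    if len(fields) == 3:
        result += "," + fields[2].strip()  # honorific or lineage, as given
    return result


if __name__ == "__main__":
    for raw in ("Miller, John S", "Chun, H-K", "Delorie, DJ",
                "Doe, J S", "Random,Jane,Jr."):
        print(raw, "->", normalize_person(raw))
```

Run on the five examples above, the sketch reproduces the listed results, including leaving "DJ" alone because it is not a single capital letter.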