Input data mangling

RefDB does not store input data as a literal chunk of text. If you import a dataset and later retrieve it in the same format, the result is not guaranteed to match the original character by character. Instead, the data are split into fields, sometimes slightly modified, and sometimes supplemented with information that RefDB creates. This section explains what happens to your data behind the scenes, and why this is good for you.

Information that RefDB generates for you

RefDB fills in default values when a dataset does not specify them. This happens in the following cases:

  • Each reference and extended note is assigned a unique numeric identifier. This is mainly used internally, but you can also retrieve references and extended notes by their ID. The ID is always created by the database server; there is no way to enforce specific IDs for your datasets.

  • Both references and extended notes require a unique alphanumeric citation key. With a few limitations this is an arbitrary string consisting of letters (at least one) and optional digits. If you do not specify a citation key, RefDB creates one automatically. For references, the publication year is appended to the last name of the first author. If the resulting string is not unique, RefDB tries sequential suffixes ("a" through "z", then "aa" and so forth) until it finds a unique string. The same algorithm is used for extended note keys, except that the user name serves as the base instead of an author name.

  • If no reprint status is specified, RefDB inserts "NOT IN FILE" as the default value.

  • If an extended note does not specify a date, RefDB inserts the current date.
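
The key generation described in the list above can be sketched as follows. This is a minimal illustration, not RefDB's actual code: the function names, the `existing` set standing in for the keys already in the database, and the assumption that the suffix sequence continues "aa", "ab", "ac" after "z" are all hypothetical.

```python
import itertools
import string


def suffixes():
    """Yield "a".."z", then "aa", "ab", ... (assumed ordering)."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters)


def make_citation_key(base, year, existing):
    """Return base+year, adding a suffix only if that key is already taken.

    base     -- last name of the first author (or the user name for notes)
    year     -- publication year
    existing -- set of keys already in use (stand-in for the database)
    """
    stem = f"{base}{year}"
    if stem not in existing:
        return stem
    for suffix in suffixes():
        candidate = stem + suffix
        if candidate not in existing:
            return candidate


print(make_citation_key("Miller", 1999, {"Miller1999", "Miller1999a"}))
```

With the two keys shown already taken, the sketch moves on to the next free suffix.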

Information that RefDB mangles

Citation keys are supposed to work as ID values in SGML and XML documents. To avoid any character encoding hassles, only characters from the 7-bit US-ASCII character set are permitted. These characters work in most character encodings. Some special characters which are not allowed as part of an XML attribute value are stripped. Non-ASCII characters are converted to a reasonable ASCII equivalent, or they are dropped if no replacement is possible.
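
As a rough illustration of this kind of clean-up, the sketch below uses Python's standard Unicode decomposition to find ASCII equivalents. It is not RefDB's conversion table, and the set of stripped special characters is an assumption, since this section does not list them.

```python
import unicodedata

# Assumed set of stripped characters; the real list is not given in this section.
STRIPPED = set("<>&\"'")


def to_ascii_key(text):
    """Reduce text to US-ASCII and drop the assumed special characters."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything that still has no ASCII form (including the marks) is dropped.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only if ch not in STRIPPED)


print(to_ascii_key("Müller2001"))
```

Characters without a decomposition, such as "ß", are simply dropped in this sketch; a real implementation might map them to a replacement such as "ss" instead.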

Periodical names and author/editor names receive some special treatment in order to make them usable for RefDB. Both periodical names and person names should be provided in a particular format. However, if you retrieve your data from an electronic source instead of writing them from scratch, the names may not conform to the rules. In order to make best use of these data, RefDB attempts to normalize the incoming periodical and person names until they conform to the rules.

There is a good reason for this normalization. Consider a periodical name like "The Journal of Biological Chemistry". Different electronic sources may abbreviate this as one of:

  • J.Biol.Chem.

  • J. Biol. Chem.

  • J Biol Chem

Although a human reader has no trouble recognizing that all three forms refer to the same journal, a database compares the strings literally and treats them as different. If you add the periodical abbreviations as they are, you'll end up with three different journal entries. As a consequence, a query like getref :JO:='J.Biol.Chem.' will miss the papers stored under the other two spellings. This is not a good thing.

Periodical names

RefDB normalizes abbreviated periodical names like this: First, the name is tokenized, with periods and spaces as separators. If a token has a trailing period, it is assumed to be an abbreviated word and is used as such. If a token has no trailing period, it is compared to an internal list of unabbreviated words (see the listword and addword commands for further information about this list). If a match is found, no period is added. If no match is found, the token is assumed to be an abbreviation and a period is added. Spaces after periods are removed, as one separator is sufficient. Applied to the three versions of the journal name above, this normalizes all of them to the first one, "J.Biol.Chem.".
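
The procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RefDB's implementation: `FULL_WORDS` is a hypothetical stand-in for the internal word list, the comparison against it is assumed to be case-insensitive, and unabbreviated words are assumed to keep a following space.

```python
import re

# Hypothetical stand-in for RefDB's internal list of unabbreviated words.
FULL_WORDS = {"of", "the", "and", "journal", "nature", "science"}


def normalize_periodical(name, full_words=FULL_WORDS):
    """Normalize an abbreviated periodical name as described above."""
    # A token is a run of non-separator characters plus an optional
    # trailing period; spaces between tokens are discarded here.
    tokens = re.findall(r"[^\s.]+\.?", name)
    out = []
    for tok in tokens:
        if tok.endswith("."):
            out.append(tok)            # already marked as an abbreviation
        elif tok.lower() in full_words:
            out.append(tok + " ")      # known full word: no period added
        else:
            out.append(tok + ".")      # unknown word: treat as abbreviation
    return "".join(out).strip()


for variant in ("J.Biol.Chem.", "J. Biol. Chem.", "J Biol Chem"):
    print(normalize_periodical(variant))
```

All three variants come out as "J.Biol.Chem.", so they map to a single journal entry.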

Person names

The names of authors and editors are normalized like this: Everything to the left of the first comma is assumed to be the last name and remains untouched. The next item (separated by either a space, a period, or both) is assumed to be the first name. If it consists of one capital letter, a period is added and any trailing spaces are removed. If the first name is spelled out, it is used as such. All following name parts to the left of the second comma, if any, are assumed to be middle names. Each part receives the same treatment as a first name. Finally, everything to the right of the second comma, if any, is assumed to be an honorific or lineage part and is used as such. All spaces following either a period or a comma are removed. A few examples should make this procedure clear:

  • "Miller, John S" -> "Miller,John S."

  • "Chun, H-K" -> "Chun,H.-K."

  • "Delorie, DJ" -> "Delorie,DJ"

  • "Doe, J S" -> "Doe,J.S."

  • "Random,Jane,Jr." -> "Random,Jane,Jr."

The last example shows that your data will not be modified as long as they stick to the input format.
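
The rules for person names can be sketched as follows. This is an illustration that reproduces the five examples above, not RefDB's actual code; the function names are hypothetical, and treating each half of a hyphenated name separately is an assumption inferred from the "H-K" example.

```python
import re


def _fix_part(part):
    """Add a period to single capital letters, also inside hyphenated parts."""
    pieces = part.split("-")
    fixed = [p + "." if len(p) == 1 and p.isupper() else p for p in pieces]
    return "-".join(fixed)


def normalize_person(name):
    """Normalize "Last, First Middle, Lineage" as described above."""
    fields = name.split(",", 2)
    last = fields[0]                                   # left untouched
    given = fields[1] if len(fields) > 1 else ""
    lineage = fields[2].lstrip() if len(fields) > 2 else ""

    given_out = ""
    # Tokens are separated by spaces; a trailing period stays with its token.
    for tok in re.findall(r"[^\s.]+\.?", given):
        part = tok if tok.endswith(".") else _fix_part(tok)
        # Keep a space between parts unless the previous one ends in a period.
        if given_out and not given_out.endswith("."):
            given_out += " "
        given_out += part

    result = last
    if given_out or lineage:
        result += "," + given_out
    if lineage:
        result += "," + lineage
    return result


for raw in ("Miller, John S", "Chun, H-K", "Delorie, DJ",
            "Doe, J S", "Random,Jane,Jr."):
    print(raw, "->", normalize_person(raw))
```

Note that "DJ" is left alone because it is not a single capital letter, which matches the third example.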