Adding references

If you're new to refdb and don't have a database yet, you'll want to start by adding a couple of references. This chapter first teaches you how to add references in the main input formats RIS and risx. The subsequent sections cover the import of data from various data sources like PubMed, BibTeX, or Z39.50 servers.

Tip: If you're really new to refdb but have access to an existing database, e.g. the one your department built, you might want to get acquainted with refdb by retrieving existing references. Retrieving references does not alter the database, so it is safe to experiment. Once you feel comfortable, return to this section.

How to create RIS datasets

The RIS format is a plain-text tagged file format used by most Windows- and Mac-based reference management tools. A variety of other data formats supported by refdb can be converted to RIS using the conversion tools described in the following sections.

What a RIS dataset looks like

The main advantages of RIS are:

  • The tag set is well suited for managing references and offprints used by scientists. There is no overhead for library capabilities.

  • As a plain-text, tagged format it is easy to create and edit with basically any text editor.

To get started, we'll assume you have some reference data in RIS format handy. This could be your existing Reference Manager or EndNote database exported to RIS. However, for your first experiments the example file shipped with refdb is just fine. This file is usually installed as /usr/local/share/refdb/examples/testrefs.ris (ask your administrator if there is no such file).

At first you should have a look at the data, just to know what they look like. Use either a pager or a text editor to display the example file:

~$  less /usr/local/share/refdb/examples/testrefs.ris
            (1)
            TY  - BOOK                                    (2)
            ID  - smith1975metalloporphyrins              (3)
            T1  - Porphyrins and metalloporphyrins        (4)
            A1  - Smith,K.M.                              (5)
            Y1  - 1975///                                 (6)  
            KW  - Porphyrins                              (7)
            KW  - Metalloporphyrins
            KW  - Spectrophotometry [methods]
            KW  - spectroscopy
            RP  - NOT IN FILE                             (8)
            CY  - Amsterdam                               (9)
            PB  - Elsevier Scientific Publishing Company  (10)
            ER  -                                         (11) 
         

Ok, I've cheated a little. This reference is not the first and only one you'll see if you run that command, but it is the shortest one, good enough to discuss the general anatomy of a RIS dataset (in order to see this reference in the real example file, scroll just about halfway down).

We can easily discover these main features of the RIS format:

  • A RIS file can contain one or more datasets.

  • Each dataset starts with an empty line (more precisely, with a linefeed character). This also means that each file containing RIS datasets starts with an empty line.

  • The tags start at the beginning of a line and consist of two characters (an uppercase letter followed by a second uppercase letter or a digit, as in TY or A1), followed by a double space, a dash, and another space.

  • The TY tag is always the first one, and the ER tag is always the last one in each dataset.

  • The sequence of the other tags is arbitrary with one exception: The sequence of the author fields (A1 or AU) determines the sequence of the authors in bibliographies or citations. The sequence should follow the sequence in the original publication.
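The features listed above are enough to read a RIS file programmatically. The following Python sketch is only an illustration and not part of refdb: the function name parse_ris is made up, and the sketch assumes that every value fits on a single line (continuation lines are silently skipped).

```python
import re

# A tag line: two characters (an uppercase letter, then an uppercase letter
# or a digit), two spaces, a dash, and one space. The value may be empty,
# as in the ER tag.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Split RIS text into datasets, each a list of (tag, value) pairs."""
    datasets = []
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line.rstrip())
        if match is None:
            continue  # empty separator lines and anything unrecognized
        tag, value = match.groups()
        if tag == "TY":
            current = [(tag, value)]  # TY opens a new dataset
        elif current is not None:
            if tag == "ER":
                datasets.append(current)  # ER closes the dataset
                current = None
            else:
                current.append((tag, value))  # order is kept, e.g. for authors
    return datasets


SAMPLE = """
TY  - BOOK
ID  - smith1975metalloporphyrins
T1  - Porphyrins and metalloporphyrins
A1  - Smith,K.M.
Y1  - 1975///
KW  - Porphyrins
KW  - Metalloporphyrins
RP  - NOT IN FILE
CY  - Amsterdam
PB  - Elsevier Scientific Publishing Company
ER  - 
"""

for dataset in parse_ris(SAMPLE):
    for tag, value in dataset:
        print(tag, value)
```

Keeping each dataset as a list of pairs rather than a dictionary preserves repeated tags such as KW and the order of the author fields.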

Now let us look at the RIS dataset shown above. It contains the following tags:

(1)
The mandatory empty first line
(2)
The type tag is always the first one. The type of the reference is specified using one of about two dozen predefined types. This reference describes a complete book. Other common types are JOUR for a journal article and CHAP for a chapter in a book. See below for a full listing.
(3)
This field contains an optional citation key. This is an easy-to-remember name for a reference, often constructed from the first author, the publication year, and maybe a word or two from the title. If you do not provide a citation key, refdb will create one for you.
(4)
This is the title of the publication, in this case the title of the book. Other title tags are available. E.g. if you cite a chapter in a book which is part of a series, T2 and T3 can be used to code the book title and the series title, respectively, whereas T1 would be the chapter title.
(5)
The name of the author, written in the order Last,First Middle,Suffix, or Last,F.M.,Suffix. If a publication has several authors, list each of them on a separate line with the A1 tag.
(6)
The publication date in the format YYYY/MM/DD/other info. You can leave out any information that is not available, but you must keep the slashes regardless.
(7)
A keyword that lets you retrieve the reference by topic. You can specify as many keywords as you like by putting each one on a separate KW line.
(8)
This denotes the reprint status. This field can hold only the values "NOT IN FILE", "ON REQUEST", or "IN FILE".
(9)
The publication place.
(10)
The name of the publishing company.
(11)
This is the mandatory last tag of each reference. Please be aware that this tag consists of "ER", two spaces, a dash, and another space, just like all other tags.
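The date format described in callout (6) is easy to get wrong because of the mandatory slashes. The following Python sketch shows one way to split and build such strings. The function names are hypothetical, and the sketch assumes that a value with fewer than three slashes should be rejected; how refdb itself reacts to such input is not covered here.

```python
def parse_ris_date(value):
    """Split a RIS date of the form YYYY/MM/DD/other info into its parts."""
    # Split at the first three slashes only; "other info" may contain more.
    parts = value.split("/", 3)
    if len(parts) != 4:
        raise ValueError("a RIS date needs three slashes: %r" % value)
    year, month, day, other = (part.strip() for part in parts)
    return {
        "year": int(year) if year else None,
        "month": int(month) if month else None,
        "day": int(day) if day else None,
        "other": other or None,
    }


def format_ris_date(year=None, month=None, day=None, other=""):
    """Build a RIS date string, keeping all three slashes for missing parts."""
    return "%s/%s/%s/%s" % (
        "%04d" % year if year is not None else "",
        "%02d" % month if month is not None else "",
        "%02d" % day if day is not None else "",
        other,
    )


print(parse_ris_date("1975///"))
print(format_ris_date(1999, other="Christmas edition"))
```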

The following list gives an overview of all available tags and their use.

TY

This tag specifies the type of the reference and must be the first tag of each RIS dataset, preceded by a newline.

Format: This can be any of the following strings:

  • ABST (abstract reference)

  • ADVS (audiovisual material)

  • ART (art work)

  • BILL (bill/resolution)

  • BOOK (whole book reference)

  • CASE (case)

  • CHAP (book chapter reference)

  • COMP (computer program)

  • CONF (conference proceeding)

  • CTLG (catalog)

  • DATA (data file)

  • ELEC (electronic citation)

  • GEN (generic)

  • ICOMM (internet communication)

  • INPR (in press reference)

  • JFULL (journal - full)

  • JOUR (journal reference)

  • MAP (map)

  • MGZN (magazine article)

  • MPCT (motion picture)

  • MUSIC (music score)

  • NEWS (newspaper)

  • PAMP (pamphlet)

  • PAT (patent)

  • PCOMM (personal communication)

  • RPRT (report)

  • SER (serial - book, monograph)

  • SLIDE (slide)

  • SOUND (sound recording)

  • STAT (statute)

  • THES (thesis/dissertation)

  • UNBILL (unenacted bill/resolution)

  • UNPB (unpublished work reference)

  • VIDEO (video recording)

ER

This empty tag denotes the end of the reference. It must be the last tag of each RIS dataset.

ID

This tag is used to uniquely identify the reference in the database. The value is either the unique ID that refdb generates when a reference is imported into a database, or a unique citation key. The latter can be supplied by the user. If no citation key is specified when adding a reference, refdb will automatically generate a unique citation key, based on the name of the first author and the publication year. refdb will create a unique ID value for internal use regardless of whether a citation key is provided or not.

Note: ID values are always numerical (e.g. "11"), whereas citation keys are alphanumerical (e.g. "Miller1999").

While you are free to choose any reasonable citation key (as long as it is unique within the database), you should not attempt to create an ID value manually. It is ignored when adding the dataset, but it may overwrite an existing entry if you update a reference. Along the same lines, you should leave the ID tag alone if you retrieve a dataset from the database and plan to update it. The citation key in the retrieved dataset is essential to match the modified data with the copy in the database.

ID Format: Integer >0.

Citation key Format: A string with up to 255 characters

TI

Synonym: T1. This is the title of a publication. For BOOK and UNPB references this is the same as the BT tag.

Format: A string with unlimited length.

T2

This is the secondary title of a publication, e.g. the book title for a CHAP reference.

Format: A string with unlimited length.

T3

This is the tertiary title of a publication, e.g. the series title for a CHAP reference.

Format: A string with unlimited length.

AU

Synonym: A1. This is the name of one author of the reference. If a reference has multiple authors, each author is specified with an AU tag on a separate line. The number of authors per RIS dataset is not limited. The sequence of the authors in the authorlist will be determined from the sequence as they appear in the RIS dataset.

Format: A string with up to 255 characters in the form: Lastname[,(F.|First)[(M.|Middle)[,Suffix]]]. First and middle names can either be abbreviated or spelled out. Some examples for valid entries:

  • King,B.B.

  • Benberg,Steven C.

  • Mellencamp,John Cougar,Jr.

  • Van Zandt,Steven

A2

Synonym: ED. This is the name of an editor of the reference, e.g. an editor of the book in which a CHAP reference was published. The same formatting requirements as for AU apply.

A3

This is the name of a series editor of the reference, e.g. an editor of a series of books in one of which a CHAP reference was published. The same formatting requirements as for AU apply.

PY

Synonym: Y1. This is the primary publication date.

Format: A string with the format "YYYY/MM/DD/otherinfo", where YYYY denotes the four-digit year, MM and DD denote the two-digit month and day, respectively, and otherinfo denotes any other information with up to 255 characters. If any of these parts is not available, it can be left out, but the slashes must be present. E.g. "1999///Christmas edition" is a valid string.

Y2

This is the secondary publication date.

Format: A string with the format "YYYY/MM/DD/otherinfo", where YYYY denotes the four-digit year, MM and DD denote the two-digit month and day, respectively, and otherinfo denotes any other information with up to 255 characters. If any of these parts is not available, it can be left out, but the slashes must be present. E.g. "1999///Christmas edition" is a valid string.

N1

The notes. This can be any form of additional information, like pointers to corrections or editorials, or just personal notes about the contents of the reference.

Format: A string with unlimited length

N2

Synonym: AB. The abstract of a reference.

Format: A string with unlimited length

KW

A keyword. If a publication has multiple keywords, each goes on a separate line preceded by this tag. Keywords are crucial to find references in larger databases.

Format: A string with up to 255 characters

RP

The reprint status of a reference. This can be any of the following strings:

  • IN FILE

  • NOT IN FILE

  • ON REQUEST MM/DD/YY

AV

The availability information. This is a hint where you can find an offprint or a file containing the reference.

Format: A string with up to 255 characters. This can either be a plain-text description like "methods folder, second drawer from top in the green cabinet on the yellow hallway", or a URL pointing to a file. In the latter case, this field has to start with the string "PATH:" with no space between this and the path proper. Using this feature requires some thought and is therefore explained in a separate section.

SP

The start page of the reference

Format: A string with up to 255 characters

EP

The end page of the reference

Format: A string with up to 255 characters

JO

The abbreviated name of a journal.

Format: A string with up to 255 characters. The journal words should be separated by a single space without a period after abbreviated words. If you use periods, these should not be followed by spaces.

JF

The full name of a journal.

Format: A string with up to 255 characters

J1

The abbreviated name of a journal (user abbreviation 1).

Format: A string with up to 255 characters

J2

The abbreviated name of a journal (user abbreviation 2).

Format: A string with up to 255 characters

VL

The volume of the journal.

Format: A string with up to 255 characters

IS

The issue of the journal

Format: A string with up to 255 characters

CY

City of publication of a book.

Format: A string with up to 255 characters

PB

Name of the publisher or the publishing company.

Format: A string with up to 255 characters

SN

The ISBN or ISSN number.

Format: A string with up to 255 characters

AD

The contact address, usually any combination of postal or email address and the phone or fax number of the corresponding author.

Format: A string of unlimited length

UR

The URL of an online version of the reference.

Format: A string with up to 255 characters

U1 through U5

The user-defined fields 1 through 5. These fields are not intended to be filled with random bits of information. Each database should have a set of rules what information is to be stored in these fields.

A possible use for these fields is some relevance indicator (e.g. "*" means low, "*****" means high relevance).

You may also use one of these fields to create the equivalents of "folders" that some other reference databases praise as the panacea to organize your references. Just assign the same value to one of these fields for all references that belong to the same folder. Retrieve them by specifying this value in addition to your other search criteria.

Format: A string with up to 255 characters

M1 through M3

The miscellaneous fields 1 through 3. The distinction between Ux and Mx fields is somewhat unclear, and maybe only the inventors of the RIS format have a vague idea why there are two different types of fields for user-defined information.

Format: A string with up to 255 characters
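The author name format given for the AU tag above (Last,First Middle,Suffix) can be checked before a dataset is added. The following Python sketch is a hedged illustration: split_author is a made-up helper, not a refdb function, and it only checks the comma structure and the 255-character limit, not the spelling of the names.

```python
def split_author(name):
    """Split 'Last,First Middle,Suffix' into its parts.

    Given names and suffix are optional; the last name is required.
    """
    if not name or len(name) > 255:
        raise ValueError("author names must have 1 to 255 characters")
    parts = [part.strip() for part in name.split(",")]
    if len(parts) > 3 or not parts[0]:
        raise ValueError("expected Last[,Given[,Suffix]]: %r" % name)
    parts += [""] * (3 - len(parts))  # pad the missing optional parts
    last, given, suffix = parts
    return {"last": last, "given": given, "suffix": suffix}


for example in ("King,B.B.", "Benberg,Steven C.",
                "Mellencamp,John Cougar,Jr.", "Van Zandt,Steven"):
    print(split_author(example))
```

Note that a last name containing a space, like "Van Zandt", needs no special treatment because only commas separate the parts.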

Creating a RIS dataset from scratch

As noted previously, you can use any text editor that creates Unix-style line endings (linefeed) to create and edit RIS files. refdb ships a ris-mode for Emacs which makes the task a little more pleasant. Ask your administrator whether this mode is available on your system.

Creating a dataset is basically monkey business. The only intelligence required is to use the correct tags for your chunks of bibliographic information. Use the example RIS file as a guideline for the most common reference types: journal article, book chapter, and book. We'll look at a few issues related to these references first.

The first issue is certainly the type of the reference. Most cases are pretty clear if you look at the list, but there are some borderline cases too which are not covered by the predefined types. You should keep in mind that refdb does not restrict the fields available for a particular reference type. You can fill any available field for any reference type. The only restriction is that the bibliography styles for most reference types use only a subset of the available fields. E.g. a bibliography entry of a journal article will not show a series editor even if you filled in the A3 field, whereas the bibliography entry of a book chapter might show the series editor if the book the chapter is published in is a part of a series.

The general rule is to use the closest matching type, to be consistent in this decision (all similar borderline cases should use the same type), and to use the GEN type if nothing else helps. Most bibliography styles display all available fields for the GEN type.

To people new to bibliographic software, the various levels of titles and authors are often confusing. RIS offers three levels of authors:

AU (synonym: A1)

The author of a publication. This is the person (or the persons) responsible for the smallest unit of the publication you're looking at.

ED (synonym: A2)

The editor of a collection of publications.

A3

The editor of a series of collections of publications.

Let's consider a few examples. If your reference contains a journal article, published in some scientific journal, the AU fields contain the names of those who wrote the article. The same holds true for the authors of a chapter published in a book like "Methods in Enzymology". The chapter authors would be in the AU fields, the volume editors in the ED field. The editors of the whole "Methods in Enzymology" series of books would be entered in A3 fields. However, if your reference points to one particular volume of "Methods in Enzymology" as a whole, you'd rather put the volume editors in the AU fields. The same logic holds true for the title fields:

TI (synonym: T1)

The title of the smallest unit of publication you're looking at.

T2

The title of the collection of publications

T3

The title of the series of collections of publications

Using our previous example, an article published in "Methods in Enzymology" might have a TI field "An apparatus to turn urine into gold". The T2 field would be the title of the volume, e.g. "Alchemy and related techniques", whereas the T3 field would contain "Methods in Enzymology". However, if your reference points to the "Alchemy and related techniques" volume as a whole, this title would go into the T1 field.
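To make the levels concrete, the following Python sketch writes the chapter example as a RIS dataset. It is a minimal sketch under stated assumptions: write_ris is a hypothetical helper, and the author and editor names are invented placeholders. Only the titles are taken from the example in the text.

```python
def write_ris(fields):
    """Serialize a list of (tag, value) pairs into one RIS dataset.

    TY is moved to the front and ER is appended; all other tags keep
    their order, which matters for the author fields.
    """
    types = [value for tag, value in fields if tag == "TY"]
    if len(types) != 1:
        raise ValueError("exactly one TY field is required")
    lines = ["", "TY  - " + types[0]]  # mandatory empty line, then the type
    for tag, value in fields:
        if tag not in ("TY", "ER"):
            lines.append("%s  - %s" % (tag, value))
    lines.append("ER  - ")
    return "\n".join(lines) + "\n"  # Unix-style line endings


chapter = write_ris([
    ("TY", "CHAP"),
    ("AU", "Doe,J."),     # hypothetical chapter author
    ("TI", "An apparatus to turn urine into gold"),
    ("ED", "Miller,A."),  # hypothetical volume editor
    ("T2", "Alchemy and related techniques"),
    ("T3", "Methods in Enzymology"),
])
print(chapter)
```

If you write the result to a file from Python, open the file with newline="\n" so that the line endings stay Unix-style on every platform.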

Retrieving datasets from PubMed

The primary source of reference data in the biomedical field is the PubMed database maintained by the National Center for Biotechnology Information. This section shows the simplest and most common way to retrieve bibliographic information about interesting articles from this database using a web browser (other methods use web service clients or email subscription services, but this is beyond the scope of this tutorial).

After visiting the site with your favourite web browser, select "PubMed" from the drop-down box called "Search" and type a query in the provided field, e.g. "Doe J 2002" to find articles published by J. Doe in 2002. After hitting "Enter" you'll receive a list of publications matching your query. Select the ones you're interested in by clicking the check box right next to the publication (convenience beating logic, you can also check none of the boxes in order to retrieve all publications). Select "XML" from the drop-down box next to the Display button and hit the latter. You'll receive the list of the publications in the PubMed XML format. Now click the Save button on that page and save the information to a plain-text file, e.g. pm001.xml. You could then return to the search, run a few more queries, and save your results in additional files according to this pattern.

We'll use the Perl script med2ris.pl to turn our XML data into RIS data. This tool reads PubMed data either from standard input or from files specified as arguments. The result will be sent to standard output, so you can either view it with a pager or write it to a file.

~$ med2ris.pl < pm001.xml | less

This command converts the data in the file pm001.xml and displays the result in a pager.

~$ med2ris.pl pm*.xml > pm.ris

This command converts all files that match the pattern pm*.xml, like pm001.xml or pmnew.xml, and writes the resulting RIS datasets to pm.ris.

Now you should add your personal information to each dataset, as outlined above. Then you could go ahead and add the references to your default database with refdbc:

refdbc: addref pm.ris

Importing BibTeX datasets

If you have a BibTeX database that you want to import into refdb, you'll have to convert these data to RIS first. This is again best done by using one of the converters shipped with refdb. The tool bib2ris will by default convert all standard BibTeX fields to the RIS equivalents. If you used non-standard fields in your BibTeX database, bib2ris can be configured to import these too, but this requires additional entries in the ~/.bib2risrc configuration file. The manual has all the details, but for the purposes of this tutorial we'll assume that you only use an "ABSTRACT" field as the only additional non-standard field.

Just like refdbc, bib2ris reads configuration files at startup which you can modify to permanently set some defaults. The syntax of the configuration file is the same as outlined above, but the only line we have to enter at this time is the following:

nsf_abstract N2
         

This will tell bib2ris to import your non-standard abstract field (this is case-insensitive, so your BibTeX file might use ABSTRACT or Abstract as well) into the N2 RIS field. Use the following command to see the results:

~$ bib2ris < myrefs.bib | less

This command will convert the contents of myrefs.bib in the current directory and display the result in a pager.

~$ bib2ris *.bib > myrefs.ris

This command reads the data from all *.bib files in the current directory and redirects the output into the file myrefs.ris.

Now there is a little issue with the data generated by bib2ris: they still contain all TeX markup that you may have used in your input data. If you want to use refdb only to maintain references for LaTeX files, this is probably ok, but if you want to use the data for SGML and XML documents too, it is necessary to strip the TeX markup before adding the references to the database. To this end, run the RIS file through the Perl script tex2mail shipped with refdb.

~$ tex2mail -noindent -ragged -linelength 65535 -ris < myrefs.ris > myrefs-notex.ris

As described in the previous section, you should now add your personal information and then use refdbc to add the datasets to your database.

Retrieving datasets from a Z39.50 server

Many libraries allow remote access to their electronic catalogs via the Z39.50 protocol. With a suitable client you can search the catalogs and retrieve the bibliographic information of interesting references to your local computer. For this tutorial we'll use the free client provided in the YAZ toolkit, although you could use any other client as well. One of the largest libraries accessible via the Z39.50 protocol is the Library of Congress. We'll try to find computer books written by some Mr. Knuth in this library.

The following command connects to the library using the host name "z3950.loc.gov", the port 7090, and the database name "Voyager". All this information is usually provided either online or in a printed pamphlet about electronic catalog access published by libraries offering Z39.50 services:

~$ yaz-client z3950.loc.gov:7090/Voyager
Sent initrequest.
Connection accepted by target.
ID     : 34
Name   : Voyager LMS - Z39.50 Server
Version: 1.13
Options: search present
Elapsed: 5.176489
Z>

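Now you can go ahead and type a query. The full query syntax of Z39.50 is beyond the scope of this tutorial, but the following query retrieves entries with the author name "knuth" and a title term beginning with "computer" (in the Bib-1 attribute set, 1=1003 selects the author index, 1=4 the title index, and 5=1 requests right truncation):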

Z> f @and @attr 1=1003 knuth @attr 1=4 @attr 5=1 computer
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 28
records returned: 0
Elapsed: 5.187307
Z>
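
The query typed at the Z> prompt above uses prefix notation: the operator @and comes first, followed by its two operands, and each search term is preceded by its @attr settings. If you assemble such queries in a script, a small helper keeps them readable. The following Python sketch is an illustration only; the function names are made up, and it covers just the two Bib-1 attribute types that appear in the query above (type 1 selects the index to search, type 5 sets truncation).

```python
def pqf_term(term, **attributes):
    """Prefix a search term with @attr settings, e.g. use=1003 -> '@attr 1=1003'."""
    # Bib-1 attribute types handled by this sketch.
    type_numbers = {"use": 1, "truncation": 5}
    prefix = " ".join(
        "@attr %d=%d" % (type_numbers[name], value)
        for name, value in attributes.items()
    )
    return ("%s %s" % (prefix, term)).strip()


def pqf_and(left, right):
    """Combine two sub-queries with the prefix operator @and."""
    return "@and %s %s" % (left, right)


query = pqf_and(
    pqf_term("knuth", use=1003),
    pqf_term("computer", use=4, truncation=1),
)
print("f " + query)
```

The sketch does not quote its terms, so it is only suitable for single-word search terms.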

We've found 28 entries that match our search pattern. We could just go ahead and display some or all of them, but we'd like to write them to a file, so we let our client dump all retrieved references to the file knuth.loc.usmarc and make sure the data are retrieved as "usmarc" (again, the library should be able to inform you which formats are available). Other formats acceptable for refdb are "ukmarc" and "unimarc".

Z> set_marcdump knuth.loc.usmarc
Z> format usmarc
Z>

Now we retrieve all of the matching entries. The show command uses an argument like X+Y, where X is the record number where the retrieval should start and Y is the number of consecutive records to be retrieved. The data will be displayed on the screen and written to our file in the background.

Z> show 1+28
Sent presentRequest (1+28).
Records: 28
[VOYAGER]Record type: USmarc
[...real data left out for brevity...]
nextResultSetPosition = 29
Elapsed: 8.438264
Z>

Finally we can leave the client by typing:

Z> quit
~$ 
       

If you attempt to open the resulting MARC file with a text editor or display it with a pager, you'll notice a couple of strange characters. MARC is a binary data format which is not supposed to be readable as plain text. If you want to display the file in a human-readable form, use the tool marcdump (this is part of the MARC::Record Perl module which is required for the marc2ris.pl converter shipped with refdb; if there is no marcdump on your system, ask your administrator):

~$ marcdump knuth.loc.usmarc | less

The structure of a MARC record is quite complex. It is divided into a leader, fields with three-digit names, indicators, and subfields. No need to understand it at this point, though.

refdb ships the Perl script marc2ris.pl which attempts to convert MARC datasets to the RIS format like this:

~$ marc2ris.pl knuth.loc.usmarc > knuth.loc.ris

This will convert the references we downloaded to corresponding references in RIS format which will be written to the file knuth.loc.ris. If you retrieved the data in a different format, use the -t command line option to specify the input file format: "marc21", which is equivalent to USMARC, "unimarc", or "ukmarc". Now you can proceed as described above and add the contents of this RIS file to your database.

How to create risx datasets

risx datasets basically carry the same information as RIS datasets, but they use an XML format instead of tagged lines.

What a risx dataset looks like

The main advantages of risx are:

  • The XML format is a good target for transformation of other bibliographic data in SGML or XML formats.

  • XML can be edited using either a general-purpose editor or, even more conveniently, with any XML editor.

  • XML datasets can be validated, i.e. checked for completeness and for an appropriate structure.

Now let's have a look at what the same dataset we had in RIS format above would look like in risx:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ris PUBLIC "-//Markus Hoenicka//DTD Ris V1.0.2//EN" "http://refdb.sourceforge.net/dtd/risx.dtd">
<ris>
  <entry type="BOOK" citekey="smith1975metalloporphyrins">
    <publication>
      <title>Porphyrins and metalloporphyrins</title>
      <author>
        <lastname>Smith</lastname>
        <firstname>K</firstname>
        <middlename>M</middlename>
      </author>
      <pubinfo>
        <pubdate>
          <date>
            <year>1975</year>
          </date>
        </pubdate>
        <city>Amsterdam</city>
        <publisher>Elsevier Scientific Publishing Company</publisher>
      </pubinfo>
    </publication>
    <libinfo user="markus">
      <reprint status="NOTINFILE" />
    </libinfo>
    <contents>
      <keyword>Porphyrins</keyword>
      <keyword>Metalloporphyrins</keyword>
      <keyword>Spectrophotometry [methods]</keyword>
      <keyword>spectroscopy</keyword>
    </contents>
  </entry>
</ris>

This is a complete file with just one reference (you could add more entry elements for additional references). As you can see, the entry is a lot more verbose compared to RIS due to the spelled-out start and end tags. However, modern XML editors compensate for this verbosity with nifty features like tab completion and automatic insertion of end tags.

You will notice three large blocks of data in this dataset:

  • The publication element contains the bulk of the bibliographic data proper, like the title, the author(s), and the publication date. Simple entry types like the book we see here make do with one level of bibliographic information. Complex types need more than one level. A journal article needs the part element for the article proper and the publication element for the information related to the journal. A book published as a part of the series needs the set element for the series information in addition to the publication element.

  • The libinfo element contains the information specific to one refdb user, like availability information and personal notes. As a matter of fact, a risx dataset can contain an unlimited number of libinfo elements, one per user of the system. See also the section about global and personal fields.

  • The contents element holds the abstract (not shown here) and keywords.

For further information please visit the documentation of the risx DTD.
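If you generate risx data with a script rather than an editor, the three blocks described above map naturally onto nested elements. The following Python sketch rebuilds the book example with the standard xml.etree.ElementTree module. It is only a sketch: build_book_entry is a hypothetical helper, it handles just the single-level case (a publication element, no part or set element), and it uses only the elements and attributes that appear in the example above.

```python
import xml.etree.ElementTree as ET


def build_book_entry(citekey, title, authors, year, city, publisher,
                     keywords, user):
    """Build one risx entry element for a single-level BOOK reference."""
    entry = ET.Element("entry", {"type": "BOOK", "citekey": citekey})

    # Block 1: the bibliographic data proper
    publication = ET.SubElement(entry, "publication")
    ET.SubElement(publication, "title").text = title
    for last, first, middle in authors:
        author = ET.SubElement(publication, "author")
        ET.SubElement(author, "lastname").text = last
        ET.SubElement(author, "firstname").text = first
        if middle:
            ET.SubElement(author, "middlename").text = middle
    pubinfo = ET.SubElement(publication, "pubinfo")
    date = ET.SubElement(ET.SubElement(pubinfo, "pubdate"), "date")
    ET.SubElement(date, "year").text = str(year)
    ET.SubElement(pubinfo, "city").text = city
    ET.SubElement(pubinfo, "publisher").text = publisher

    # Block 2: the information specific to one user
    libinfo = ET.SubElement(entry, "libinfo", {"user": user})
    ET.SubElement(libinfo, "reprint", {"status": "NOTINFILE"})

    # Block 3: the keywords (an abstract is not handled in this sketch)
    contents = ET.SubElement(entry, "contents")
    for keyword in keywords:
        ET.SubElement(contents, "keyword").text = keyword
    return entry


ris = ET.Element("ris")
ris.append(build_book_entry(
    citekey="smith1975metalloporphyrins",
    title="Porphyrins and metalloporphyrins",
    authors=[("Smith", "K", "M")],
    year=1975,
    city="Amsterdam",
    publisher="Elsevier Scientific Publishing Company",
    keywords=["Porphyrins", "Metalloporphyrins",
              "Spectrophotometry [methods]", "spectroscopy"],
    user="markus",
))
xml_text = ET.tostring(ris, encoding="unicode")
print(xml_text)
```

The resulting string contains neither the XML declaration nor the DOCTYPE line. Both would have to be prepended, as in the example file above, before the data can be validated against the risx DTD.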

Writing risx datasets from scratch

Using your favourite XML editor, writing a risx dataset from scratch should not be exceedingly difficult. The editor should prompt you for required elements and attributes, and refuse to enter an invalid structure. See the example datasets shipped with refdb to get an idea what different entry types should look like.

Transforming SGML and XML bibliographic data to risx

This topic is somewhat beyond the scope of this introductory tutorial, but if you're familiar with SGML or XML transformations in general, this should not be too hard either. Each input bibliographic format will require a custom DSSSL (for SGML data) or XSLT (for XML data) stylesheet that transforms the data to risx.

Validating risx datasets

If you write risx datasets from scratch or develop your own stylesheets for SGML/XML transformations, it is strongly recommended to validate the results of your laborious efforts. refdbd uses a non-validating parser to map the risx data to the appropriate database columns. If your input data are invalid, the results might not be to your liking. Two tools come in handy to validate your input data:

onsgmls

This tool is part of the OpenJade package of SGML/XML tools. The following command can be used to validate a risx document:

~$ onsgmls -wxml -s /usr/local/share/refdb/declarations/xml.dcl risxrefs.xml

xmllint

This tool is part of the libxml2 package. Use it like this to validate a risx document:

~$ xmllint --valid  --noout risxrefs.xml

Global and personal fields

refdb differs from other reference management tools because a main goal of its design is to encourage people to share their references. However, you may have figured from the tag list above that some of these entries only make sense if they can be maintained by each user individually. This is precisely the approach used by refdb: The "hard" bibliographic data are global and identical for each user. The "soft" personal data, which are the only ones likely to change after the reference was added anyway, are maintained for each user individually. These personal fields are:

  • N1 (the notes)

  • RP (the reprint status)

  • AV (the availability information)

Even if you use one of the import filters described in this chapter or RIS files exported from other bibliographic software, you should take the time to fill these fields with useful values. If you don't specify values, the AV and N1 fields will be blank (this is ok), and the RP field will have the default value "NOT IN FILE".

Character encodings

One seemingly intimidating detail about reference data is the character encoding issue. At the lowest level, a computer doesn't understand anything but two states of a bit: off and on, usually represented as 0 (zero) and 1 (one). Concatenating several bits still doesn't make a text, but a series of binary numbers at best. This is why even text strings are represented as numbers in a computer's memory. Simply put, a character encoding is a lookup table that tells a computer which character to print if it encounters a particular binary number in a byte sequence that represents a text. The well-known ASCII encoding is understood by most computers but covers only 128 characters. Other encodings like Latin-1 contain all ASCII characters plus many special characters used in European languages. Still other encodings are far more versatile in that they allow you to encode all characters used by recent and extinct languages. These are the various forms of the Unicode character encodings.
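The lookup-table idea is easy to see in practice. The following Python sketch, which is independent of refdb, encodes the same name with three encodings and shows that the resulting bytes differ. The helper encodable is a made-up name used for illustration.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


name = "M\u00fcller"  # "Mueller" spelled with a u-umlaut

# The same six characters turn into different byte sequences.
for encoding in ("ISO-8859-1", "UTF-8", "UTF-16"):
    data = name.encode(encoding)
    print(encoding, len(data), data)
    # Decoding with the matching table restores the original text.
    assert data.decode(encoding) == name

# ASCII has no entry for the u-umlaut, so the name cannot be stored.
print(encodable(name, "ASCII"))
```

Decoding bytes with the wrong table either fails or yields wrong characters, which is why refdb needs to know the encoding of your input data.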

Although many encodings are known by several names, each character encoding has a preferred name which is usually identical with the MIME encoding name (the one you sometimes see in the header of emails). The names are case-insensitive, but otherwise the spelling must match precisely. E.g. UTF-16 and utf-16 are both recognized, whereas utf16 is incorrect.

Now where do these character encodings come into play? First of all, your reference database uses one particular character encoding, set by your administrator. All data that you add to this database will be converted to that encoding, and all data that you retrieve from this database will have to be converted from that encoding if necessary. To find out what encoding your database uses, run this command in refdbc:

refdbc: whichdb
Current database: refs
Number of references: 34
Highest reference ID: 34
Number of notes: 0
Highest note ID: 0
Encoding: UTF-8
Database type: risx
Database server: pgsql
Created: 2004-02-07 20:39:02 UTC
Using refdb version: 0.9.4-pre6
Last modified: 2004-02-09 21:43:10 UTC

In this example, the database uses the UTF-8 encoding, one of the most versatile Unicode encodings. Now, how do you convert your input data to e.g. UTF-8? Fortunately you don't have to, at least in most cases, as refdb does this for you on the fly. It just needs to know what encoding your input data use. The two input data formats employ two different ways to specify the character encoding:

RIS

RIS data do not have a built-in mechanism to record the character encoding, so you have to tell refdb explicitly. You do this by using the -E encoding option with the addref command. All encodings that your operating system can handle are allowed. The most common examples are UTF-8, ASCII, and the ISO-8859 family (ISO-8859-1 through ISO-8859-15, which includes the various character sets for European languages).
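Because a RIS file does not say how it is encoded, it can help to check your guess before passing it to -E. The following Python sketch is not part of refdb; the function name decodable_as and the list of candidates are illustrative assumptions. It reports which candidate encodings decode the given bytes without errors. Keep in mind that ISO-8859-1 accepts any byte sequence, so a clean decode there proves little; the stricter encodings are the informative ones.

```python
def decodable_as(data, candidates=("ASCII", "UTF-8", "ISO-8859-1")):
    """Return the candidate encodings that decode the bytes without error."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        ok.append(enc)
    return ok

# A RIS title line containing a u-umlaut, stored as Latin-1 bytes.
sample = "T1  - M\u00fcller".encode("ISO-8859-1")
print(decodable_as(sample))
```

To test a real file, read it in binary mode, e.g. open("newrefs.ris", "rb").read(), and pass the result to the function.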

risx

As shown in the risx example above, each file containing risx data should specify the encoding in the XML declaration (the very first line of an XML file). Only four encodings are allowed: UTF-8, UTF-16, ASCII, and ISO-8859-1.
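If you are unsure what a risx file declares, you can inspect its first line. The Python sketch below is an illustration only and not part of refdb; the function names and the sample line are hypothetical. It extracts the encoding name from the first line of an XML file and checks it against the four encodings listed above. It works on an already decoded text line, so a UTF-16 file would have to be decoded before the check.

```python
import re

ALLOWED = {"UTF-8", "UTF-16", "ASCII", "ISO-8859-1"}  # the four named above

def declared_encoding(first_line):
    """Return the encoding named on the first line of an XML file, or None."""
    m = re.match(
        r'\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
        first_line,
    )
    return m.group(1) if m else None

def is_allowed(name):
    """True if the name is one of the four encodings, ignoring case."""
    return name is not None and name.upper() in ALLOWED

line = '<?xml version="1.0" encoding="utf-8"?>'  # hypothetical first line
print(declared_encoding(line), is_allowed(declared_encoding(line)))
```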

After you've added your data to the database, you're not yet done with encodings. The same issue pops up when you retrieve datasets from the database. By default, refdb sends the data using the same encoding as the database uses. However, you can retrieve the data using any encoding that your platform supports.

How to add and update references

In most cases you have a new set of references and want to add them to your database. Assuming your RIS references are stored in a file newrefs.ris in the present working directory, all you need to do is:

refdbc: addref newrefs.ris

To simply add the references in the example RIS file, use this command:

refdbc: addref /usr/local/share/refdb/examples/testrefs.ris

We have not used the -E option to select an encoding here, as the example data use the ISO-8859-1 encoding, which we have set as the default in our config file. If the input data were encoded in UTF-8, we'd do the following instead:

refdbc: addref -E UTF-8 /usr/local/share/refdb/examples/testrefs.ris

If you have references in the risx format instead, you'll have to tell refdb so. Do this with the -t risx switch:

refdbc: addref -t risx newrefs.xml

To add the example risx data, use this command:

refdbc: addref -t risx /usr/local/share/refdb/examples/testrefs.xml

Remember that risx data carry their encoding information in the XML declaration on the first line of the file, so there is no need to use the -E option.

refdbc will try to add the references stored in these files to your default database. The diagnostic messages will be displayed in your pager, so if you send a dozen or so references it might take a few seconds until the results are displayed. You'll see a message for each reference found in the input file: which ID was assigned, which citation key was used (or generated if you didn't specify one), and whether the operation was successful. Adding references usually fails only for two reasons:

Once a reference is added to the database, you might still feel the urge to change it. Perhaps you would like to add further keywords, or your personal information, like the reprint status, has changed since. The most straightforward way to change a reference is to retrieve it in either RIS or risx format, save it to a file, edit it, and send the updated copy back to where it came from. The following sequence of commands shows this, assuming we want to use the RIS format:

Note: In this example we'll assume that you already know the ID of the reference that you want to change. You'll learn later how to find a reference by all sorts of criteria like authors, keywords and the like. That later section also has additional information on the getref command used here.

refdbc: getref -t ris -o editme.ris :ID:=7
2595 byte written to /usr/home/markus/refdb/editme.ris
refdbc:

Now you can use your favourite text editor to change the file editme.ris as you see fit. For this exercise we'll just change the reprint status (the RP field) from "NOT IN FILE" to "IN FILE". When you're done, save the file and go back to refdbc:

refdbc: updateref -P editme.ris
Updating input set 1 successful
1 dataset(s) updated, 0 added, 0 failed
1 dataset(s) sent.

Now what was that -P switch good for? This switch tells refdb that it should only update your personal information for this reference, i.e. the RP, AV, and N1 fields. This is a lot faster than updating the whole reference. It is also safer, as you might have accidentally changed something else in the file without noticing. On the other hand, if you correct, say, a typo you noticed in the title (TI) field, you must not use the -P switch.
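The edit step of this exercise, changing the RP field by hand, can also be scripted if you need to do it often. The Python sketch below is an illustration and not part of refdb; the function name set_reprint_status is an assumption. It relies on the RIS line layout of a two-letter tag, two spaces, a hyphen, and a space, and it replaces the value of every RP line in the text, which is what you want for a file holding a single reference such as editme.ris.

```python
def set_reprint_status(ris_text, new_status="IN FILE"):
    """Return RIS text with the value of every RP line replaced."""
    out = []
    for line in ris_text.splitlines(keepends=True):
        if line.startswith("RP  - "):
            # Keep the original line ending (LF or CRLF) untouched.
            ending = line[len(line.rstrip("\r\n")):]
            line = "RP  - " + new_status + ending
        out.append(line)
    return "".join(out)

sample = "TY  - BOOK\nRP  - NOT IN FILE\nER  - \n"
print(set_reprint_status(sample), end="")
```

To apply this to a real file, read editme.ris with the encoding it was retrieved in, pass the text through the function, write the result back, and then run updateref -P as shown above.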