refdb handbook: covers version 0.9.6
Chapter 15. Reference data conversion tools

15.7. marc2ris

This Perl script attempts to extract the information useful to refdb from MARC datasets. MARC (Machine-Readable Cataloging) is a standard originating from the 1960s that is widely used by libraries and bibliographic agencies. Most libraries that offer Z39.50 access can provide records in at least one MARC format (as with most other "standards", there are several to choose from). Currently the following MARC dialects are supported:

MARC21: This is an attempt to consolidate existing MARC variants (mainly USMARC and CANMARC) and will most likely be the format supported by all libraries in the near future. The format is described on the Library of Congress MARC pages.
UNIMARC: This is the European counterpart, a comparable standardization attempt. The specification is maintained by IFLA.
UKMARC: This format is fairly close to the USMARC variant and is mainly used by libraries in the United Kingdom and in Ireland. Libraries supporting this format may switch to MARC21 in the future. Unfortunately there is no online description of this format, but a PDF document describing the main differences between USMARC and UKMARC is available.

15.7.1. Starting marc2ris

There are several ways to run this script. By default the script reads MARC21 (formerly USMARC) data from stdin and sends RIS data to stdout, like this:

~$ perl marc2ris < foo.marc | less

Note: You can save some typing if (on Unix) the first line of the script points correctly to your Perl interpreter or if (on Windows) the filename suffix .pl is associated with the Perl interpreter. The following examples use this shorter invocation.

Alternatively you can specify one or more input files as arguments. Instead of displaying the results with a pager as in the previous command, we'll send the output to a file this time:

~$ marc2ris foo.marc bar.marc > foobar.ris

In either case you can specify an output file instead of sending the data to stdout. The following command will do exactly the same as the previous one:

~$ marc2ris -o foobar.ris foo.marc bar.marc

marc2ris accepts the following command line options and arguments:

marc2ris [-e log-destination] [-h] [-l log-level] [-L log-file] [-m] [-o outfile] [-O outfile] [-t input_type] [-u] [file...]

The -e, -l, and -L options set the log destination, the log level, and the log file, respectively. These correspond to the logdest, loglevel, and logfile configuration variables described in the configuration table at the end of this section.

The -h option displays a brief usage screen and exits.

The -m option switches on additional MARC output, see below for details.

The -o and -O options send the RIS output to a file instead of to stdout. The lowercase option will overwrite an existing file, whereas the uppercase option will append to an existing file with the specified name.

The -t option lets you specify the MARC input type. The default is MARC21. Other available types are UNIMARC and UKMARC.

Use the -u option to request Unicode output. marc2ris attempts to convert the input data into Unicode (unless the dataset explicitly states that it already uses Unicode). Use this option with care as some MARC variants do not state the character encoding explicitly.

Note: The conversion routine supplied by the MARC::Record module uses a character conversion table designed for USMARC. This may or may not work with other MARC variants.
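
The options described above can be combined. The following sketch converts two UNIMARC files to Unicode RIS and appends the result to an existing file. The file names are hypothetical, and the sketch assumes that the type name is passed to -t exactly as spelled above. The guard only avoids an error if marc2ris is not in your PATH.

```shell
# Sketch only: all file names here are hypothetical.
outfile=collection.ris

if command -v marc2ris >/dev/null 2>&1; then
    # -t selects the MARC dialect, -u requests Unicode output,
    # -O appends to $outfile instead of overwriting it.
    marc2ris -t UNIMARC -u -O "$outfile" first.marc second.marc
else
    echo "marc2ris not found in PATH; nothing converted" >&2
fi
```

Using -O rather than -o keeps any records already collected in the output file, which is convenient when you convert batches from several libraries into one RIS file.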

15.7.2. marc2ris data mangling

The purpose of the MARC format is entirely different from the purpose of the RIS format, so you shouldn't be too surprised that the import of MARC data is somewhat rough around the edges. The filter appears to handle quite a lot of datasets well, but the following shortcomings are known (and more are likely to be discovered by the interested reader):

Some fields, like 846, are currently ignored completely. This, of course, is bound to change.
Author names specified in the natural order, i.e. something like First Middle Last, are not normalized due to the problems with multiple middle or last names. Author names in the inverse order, i.e. something like Last, First Middle, are normalized correctly in most cases. Handling of non-European names is a matter of trial and error.
Character set handling is somewhat limited. Only the unaltered input character encoding or UTF-8 are available for the output data.

That said, there is still some hope. The -m command line option switches on additional MARC output. That is, the generated output will contain interspersed lines that show the contents of the original MARC fields used to generate the following RIS line or lines. For example, the following output snippet shows how marc2ris generated the author lines from the MARC input:

<marc>empty author field (100)
<marc>:Author(Ind1): 1
<marc>:Author($a): Ershov, A. P.
<marc>:Author($b): 
<marc>:Author($c): 
<marc>:Author(Ind1): 1
<marc>:Author($a): Knuth, Donald Ervin,
<marc>:Author($b): 
<marc>:Author($c): 
AU  - Ershov,A.P.
AU  - Knuth,Donald Ervin
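
To make the transformation visible in the sample above easier to follow, here is a minimal sketch that reproduces it for names in inverted order. This is not the code marc2ris uses; normalize_author is a hypothetical helper, and the three sed rules are merely inferred from the two sample lines.

```shell
# Illustration only: this is NOT the code marc2ris uses.
# normalize_author is a hypothetical helper name.
#
# The three sed expressions, in order:
#   1. drop a trailing comma (and any whitespace after it)
#   2. drop the whitespace after the comma that follows the last name
#   3. drop the whitespace after each period, joining initials
normalize_author() {
    printf '%s\n' "$1" | sed \
        -e 's/,[[:space:]]*$//' \
        -e 's/,[[:space:]]*/,/' \
        -e 's/\.[[:space:]]\{1,\}/./g'
}

normalize_author 'Ershov, A. P.'          # prints: Ershov,A.P.
normalize_author 'Knuth, Donald Ervin,'   # prints: Knuth,Donald Ervin
```

A sketch like this can be handy for checking by hand what the AU line for a given $a subfield should roughly look like before you edit the RIS output.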

If you feel marc2ris does not translate your data appropriately, the easiest approach may be to use the -m switch and redirect the output into a file. Then you can analyze the situation and fix the RIS lines as you see fit. Finally you can strip the MARC lines off with a command like:

~$ grep -v "<marc>" < withmarc.ris > womarc.ris
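
The pattern in the command above matches <marc> anywhere in a line. A slightly more cautious variant anchors the pattern to the start of the line, so that an RIS line which happens to contain the string in its content is kept. The worked example below assumes, as in the sample output shown earlier, that every MARC line begins with <marc> in the first column; the sample data is abridged from that output.

```shell
# Build a small sample file, abridged from the -m output shown earlier.
cat > withmarc.ris <<'EOF'
<marc>:Author(Ind1): 1
<marc>:Author($a): Ershov, A. P.
AU  - Ershov,A.P.
<marc>:Author($a): Knuth, Donald Ervin,
AU  - Knuth,Donald Ervin
EOF

# Drop only those lines that start with <marc>.
grep -v '^<marc>' withmarc.ris > womarc.ris

# Show what is left: the two AU lines.
cat womarc.ris
```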

15.7.3. The marc2ris configuration variables

Table 15-8. marc2risrc

Variable	Default	Comment
outfile	(none)	The default output file name.
outappend	t	Determines whether output is appended (`t`) to an existing file or overwrites (`f`) an existing file.
unmapped	t	If set to `t`, unknown tags in the input data will be output following a <unmapped> tag; the resulting data can be inspected and then be sent through sed to strip off these additional lines. If set to `f`, unknown tags will be gracefully ignored.
logfile	/var/log/med2ris.log	The full path of a custom log file. This is used only if logdest is set appropriately.
logdest	1	The destination of the log information. 0 = print to stderr; 1 = use the syslog facility; 2 = use a custom logfile. The latter needs a proper setting of logfile.
loglevel	6	The log level up to which messages will be sent. A low setting (0) allows only the most important messages, a high setting (7) allows all messages including debug messages. -1 means nothing will be logged.
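
As an illustration, a configuration file using the variables from Table 15-8 might look like the sketch below. It assumes the usual refdb init-file layout of one variable name and value per line, separated by whitespace, with # starting a comment. The file name marc2risrc.example and all paths are chosen for the example only.

```shell
# Write an example configuration to ./marc2risrc.example.
# File name and paths are hypothetical; the line format is an assumption.
cat > marc2risrc.example <<'EOF'
# send RIS output to this file and append to it
outfile /home/me/refs/import.ris
outappend t
# log to a custom file (logdest 2 requires logfile to be set)
logdest 2
logfile /home/me/refs/marc2ris.log
# keep the default log level
loglevel 6
EOF

cat marc2risrc.example
```

Setting logdest to 2 without a usable logfile would leave the log destination undefined, so the two variables are best changed together.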
