med2ris.pl

This Perl script converts PubMed reference data into RIS data. The converter understands both the tagged PubMed format (which superficially resembles RIS) and the XML format defined by the PubMedArticle DTD. In most cases, med2ris.pl detects the input data type automatically.

Starting med2ris.pl

Start the script with the following command:

[perl] med2ris.pl [-e dest] [-f enc] [-h] [-i] [-l level] [-L logfile] [-o file] [-O file] [-q] [-t enc] [-T type] [-y path] [infile...]

Note: Specifying the command interpreter perl on the command line is not necessary if it is in the default location /usr/bin/perl.

The -e option takes either a numeric (0|1|2) or a symbolic (stderr|syslog|file) argument to specify the log destination.

The -f and -t options select the input and output character encoding, respectively. The supported encodings are platform-dependent; you can usually list them by running man iconv or man iconv_open. If no encodings are specified, "ISO-8859-1" (also known as Latin-1) is assumed for both input and output.

The -h option displays a brief usage message.

Set the -i option to output additional information about unknown or unused tags.

Use the -l option to set the log level to a numeric value between 0 and 7 or to a symbolic value (alert|crit|err|warning|notice|info|debug). If the log destination is "file", the -L option specifies the full path of a custom log file.
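
The relationship between numeric and symbolic log levels can be sketched as follows. This is an illustration only, not code taken from med2ris.pl: it assumes the conventional syslog severity numbering, which also includes "emerg" for level 0, and the function names parse_log_level and should_log are hypothetical.

```python
# Conventional syslog severities, most important first.
# Assumption: med2ris.pl follows this numbering ("info" == 6, "debug" == 7).
SYSLOG_LEVELS = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def parse_log_level(value):
    """Turn a numeric or symbolic -l argument into a number from 0 to 7."""
    text = str(value).strip().lower()
    if text in SYSLOG_LEVELS:
        return SYSLOG_LEVELS[text]
    level = int(text)  # raises ValueError for anything that is not a number
    if not 0 <= level <= 7:
        raise ValueError("log level must be between 0 and 7")
    return level


def should_log(message_level, threshold):
    """A message passes if its level does not exceed the threshold."""
    return message_level <= threshold
```

With a threshold of 6 (info), debug messages (7) are dropped while errors (3) still pass.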

The -o and -O options cause med2ris.pl to write the output data into a file. The lowercase -o option overwrites any existing file of the same name, while the uppercase -O option appends the output to an existing file. If neither option is used, the output is written to stdout.

The -q option causes med2ris.pl to skip the configuration file, which is mainly useful for debugging.

Use the -T option to override the automatic input data type detection. Possible values for type are "xml" and "tag" for the XML and tagged data formats, respectively.
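
How the two input types might be told apart can be sketched as below. This is a guess at a plausible heuristic, not the detection logic med2ris.pl actually uses: it assumes that XML input starts with a "<" character and that tagged input starts with a line such as "PMID- 12345678". The function name guess_input_type is hypothetical.

```python
import re

# Tagged PubMed lines start with a short uppercase tag, optional padding
# spaces, and "- " (for example "PMID- 123" or "TI  - Some title").
TAG_LINE = re.compile(r"[A-Z]{2,4} *- ")


def guess_input_type(data):
    """Return "xml", "tag", or None if neither format is recognized."""
    head = data.lstrip("\ufeff \t\r\n")  # skip a byte order mark and blank lines
    if head.startswith("<"):
        return "xml"
    if TAG_LINE.match(head):
        return "tag"
    return None  # caller should fall back to an explicit -T value
```

When a heuristic like this fails, passing -T xml or -T tag states the type explicitly.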

The -y switch can be used to specify the location of the refdb shared data in case the automatic script configuration is not appropriate on your system.

The input data are read from stdin unless one or more filenames are specified on the command line. In the latter case, the output generated from all files will be sent to stdout or to the output file. med2ris.pl is also used as a CGI application to provide its data conversion service in the refdb web interface.

The following examples show the usage of med2ris.pl for file-based and stream-based in/output, respectively.

~# perl med2ris.pl -o out.ris pm*

This will convert all files in the current directory whose names start with pm and write the output into out.ris, overwriting any existing file with the same name.

Note: You can leave out the "perl" in the above command if your Perl interpreter is in the default location /usr/bin/perl, as shown in the next example.

~# med2ris.pl -f "ISO-8859-1" -t "UTF-8" < pm001.txt >> out.ris

This will send the contents of pm001.txt to med2ris.pl on stdin and append the converted result to the file out.ris. The input data are assumed to be Latin-1 (ISO-8859-1), whereas the output will be encoded as UTF-8.
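
What a Latin-1 to UTF-8 conversion does to the bytes can be illustrated with a short sketch. It uses Python's built-in codecs rather than the iconv machinery referred to above, and the sample name and the function name transcode are made up for the illustration.

```python
def transcode(raw, from_enc="ISO-8859-1", to_enc="UTF-8"):
    """Decode bytes from one encoding and re-encode them in another."""
    return raw.decode(from_enc).encode(to_enc)


# "ü" is a single byte (0xFC) in Latin-1 but two bytes (0xC3 0xBC) in UTF-8.
latin1_bytes = "Müller".encode("ISO-8859-1")
utf8_bytes = transcode(latin1_bytes)
```

This is why choosing the wrong -f value typically shows up as garbled accented characters in author names and titles.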

The med2ris.pl configuration variables

Depending on how med2ris.pl is run, it will consult one of two configuration files. If it runs as a regular application, the file med2risrc will be used. If it runs as a CGI application, med2riscgirc will be used instead. This way you can use different configurations even if the user program and the CGI program run on the same computer.

Table 16-1. med2risrc

| Variable | Default | Comment |
|----------|---------|---------|
| outfile | (none) | The default output file name. |
| outappend | t | Determines whether output is appended (t) to an existing file or overwrites (f) an existing file. |
| unmapped | t | If set to t, unknown tags in the input data will be output following a <unmapped> tag; the resulting data can be inspected and then be sent through sed to strip off these additional lines. If set to f, unknown tags will be gracefully ignored. |
| from_enc | ISO-8859-1 | The character encoding of the input data. |
| to_enc | ISO-8859-1 | The character encoding of the output data. |
| logfile | /var/log/med2ris.log | The full path of a custom log file. This is used only if logdest is set appropriately. |
| logdest | 1 | The destination of the log information. 0 = print to stderr; 1 = use the syslog facility; 2 = use a custom logfile. The latter needs a proper setting of logfile. |
| loglevel | 6 | The log level up to which messages will be sent. A low setting (0) allows only the most important messages, a high setting (7) allows all messages including debug messages. -1 means nothing will be logged. |
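
How configuration values override the defaults listed above can be sketched as follows. The file syntax is an assumption (one "variable value" pair per line, with "#" starting a comment), the sample contents are hypothetical, and read_config is not part of med2ris.pl.

```python
# Defaults as listed in the table; "(none)" is represented as an empty string.
DEFAULTS = {
    "outfile": "",
    "outappend": "t",
    "unmapped": "t",
    "from_enc": "ISO-8859-1",
    "to_enc": "ISO-8859-1",
    "logfile": "/var/log/med2ris.log",
    "logdest": "1",
    "loglevel": "6",
}


def read_config(text):
    """Return the defaults, updated with the settings found in text."""
    settings = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # variable name, then the rest as value
        if len(parts) == 2 and parts[0] in settings:
            settings[parts[0]] = parts[1].strip()
    return settings


# Hypothetical file contents: log to a custom file and write UTF-8 output.
sample = """\
# example settings
to_enc UTF-8
logdest 2
logfile /tmp/med2ris.log
"""
config = read_config(sample)
```

Variables that do not appear in the file, such as from_enc here, keep their defaults.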

Table 16-2. med2riscgirc

| Variable | Default | Comment |
|----------|---------|---------|
| refdblib | (none) | The path of the directory containing shareable refdb files like DTDs, HTML templates etc. |
| outfile | (none) | The default output file name. |
| outappend | t | Determines whether output is appended (t) to an existing file or overwrites (f) an existing file. |
| unmapped | t | If set to t, unknown tags in the input data will be output following a <unmapped> tag; the resulting data can be inspected and then be sent through sed to strip off these additional lines. If set to f, unknown tags will be gracefully ignored. |
| from_enc | ISO-8859-1 | The character encoding of the input data. |
| to_enc | ISO-8859-1 | The character encoding of the output data. |
| logfile | /var/log/med2ris.log | The full path of a custom log file. This is used only if logdest is set appropriately. |
| logdest | 1 | The destination of the log information. 0 = print to stderr; 1 = use the syslog facility; 2 = use a custom logfile. The latter needs a proper setting of logfile. |
| loglevel | 6 | The log level up to which messages will be sent. A low setting (0) allows only the most important messages, a high setting (7) allows all messages including debug messages. -1 means nothing will be logged. |

med2ris' behind-the-scenes data mangling

Keywords with multiple MeSH subheadings are split into multiple keywords with one MeSH subheading each. This simplifies searching for MeSH subheadings greatly.
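
The splitting can be illustrated with a short sketch. It assumes the slash-separated form used in the MH field of tagged PubMed records, where an asterisk marks a major topic; how med2ris.pl formats the resulting keywords in its RIS output is not shown here, and split_mesh_keyword is a hypothetical name.

```python
def split_mesh_keyword(keyword):
    """Split "Heading/sub1/sub2" into one "Heading/sub" entry per subheading."""
    parts = [part.strip() for part in keyword.split("/")]
    heading, subheadings = parts[0], parts[1:]
    if not subheadings:
        return [heading]  # no subheading, nothing to split
    return [heading + "/" + sub for sub in subheadings]


keywords = split_mesh_keyword("Neoplasms/*drug therapy/genetics")
```

After the split, a search for "Neoplasms/genetics" matches a whole keyword instead of a fragment buried in a longer string.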

med2ris.pl does not validate the input files. The input files must therefore follow the rules of their data source; otherwise the conversion results are unpredictable.