refdb handbook: covers version 0.8.5
Chapter 15. Reference data conversion tools

nmed2ris

This input filter accepts a variety of Medline file formats and converts them to RIS format. The input is accepted in DOS (with CR/LF line endings) and Unix (LF) formats. nmed2ris is also used as a CGI application to provide its data conversion service in the refdb web interface.

Starting nmed2ris

Start nmed2ris with the command:

nmed2ris [-e log-destination] [-h] [-i t|f] [-l log-level] [-L log-file] [-o output-file | -O output-file] [-q] [-s data-source] [-v] [-f input-file | file]

Remember that you don't have to specify all these options each time if you define the values in nmed2risrc.

The -e option defines the destination of log output. In order for log output to appear at all, the log level has to be specified correctly with the -l option. A log-destination argument of 0 directs log output to stderr, 1 uses the syslog facility, 2 uses a custom log file. For the latter to work you have to specify a log filename with the -L option.

With the -h option nmed2ris displays a brief help screen and exits.

Use the -i option to determine what to do with unknown tags in the source files. Use the value t to simply ignore unknown tags; be aware that ignoring a tag may lead to unwanted loss of information. If you use f instead, any unknown tag will generate an error.

The -l option sets the maximum log level: messages with a level above this value are not logged. If you specify a high level (up to 7), all sorts of messages including debug messages are logged. If you specify a low level (down to 0), only critical errors are logged. Specify -1 to disable logging.
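The filtering rule can be pictured with a small shell sketch. The function name would_log is hypothetical and only mirrors the description above (a message is logged when its level does not exceed the configured maximum, and -1 disables logging); it is not taken from the nmed2ris sources.

```shell
# Hypothetical helper: succeeds if a message of level $1 would be logged
# when the maximum log level (the -l option) is $2.
would_log() {
    msg_level=$1
    max_level=$2
    # A maximum of -1 disables logging altogether.
    [ "$max_level" -ge 0 ] && [ "$msg_level" -le "$max_level" ]
}

would_log 3 6 && echo "level 3 is logged at -l 6"
would_log 7 6 || echo "level 7 (debug) is suppressed at -l 6"
would_log 0 -1 || echo "nothing is logged at -l -1"
```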

The -L option specifies a filename which is used as a custom log file if the -e option is set appropriately.

The -o and -O options specify the file that the output is written to (overwriting any existing file) or appended to, respectively. If neither of these options is used, the output is written to stdout.

Use the -q option to temporarily switch off the settings in the init files. nmed2ris will then use the compile-time defaults unless you specify things with the command line switches (useful for debugging configuration file settings).

Specify the data source with the -s option. Valid arguments are BM (BioMedNet), CC (Current Contents on disk), GM (Grateful Med), HG (HealthGate), KF (Knowledge Finder), PM (PubMed).

-v prints the version information and brief licensing information, then exits.

All other command line parameters will be interpreted as input filenames.

nmed2ris can read the incoming data either from files or from stdin. You can specify an input file either with the -f command line option or simply as a list of filenames.

By default, nmed2ris writes the converted data to stdout. You can pipe this through a pager to see the results, pipe it into another application for further mangling, or redirect the output to a file. Alternatively you can use the -o and -O options to write the output to a file or append it to a file, respectively.

The following examples show the usage of nmed2ris for file-based and stream-based in/output, respectively.

~#  nmed2ris -o out.ris pm*

This will convert all files in the current directory starting with pm and write the output into out.ris, overwriting any existing file with the same name.

~#  nmed2ris -s PM < pm001.txt >> out.ris

This will direct the contents of pm001.txt to stdin of nmed2ris and convert the contents. The result will be appended to the file out.ris.
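The options can also be combined. The following sketch wraps one such combination in a shell function; the function name convert_strict and the file names are hypothetical, while the flags are the ones described above. It ignores the init files (-q), treats unknown tags as errors (-i f), sends all messages including debug output to stderr (-e 0 -l 7) and appends the result to the output file (-O).

```shell
# Hypothetical wrapper: convert one PubMed file strictly and append the result.
# $1 = input file, $2 = output file
convert_strict() {
    nmed2ris -q -s PM -i f -e 0 -l 7 -O "$2" -f "$1"
}

# Example call (requires nmed2ris to be installed):
# convert_strict pm001.txt out.ris
```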

The nmed2ris configuration variables

Depending on how nmed2ris is run, it will consult one of two configuration files. If it runs as a regular application, the file nmed2risrc will be used. If it runs as a CGI application, nmed2riscgirc will be used instead. This way you can use different configurations even if the user program and the CGI program run on the same computer.

Table 15-1. nmed2risrc

Variable	Default	Comment
medsource	PUBMED	The default source of input data.
outfile	(none)	The default output file name.
outappend	t	Determines whether output is appended (`t`) to an existing file or overwrites (`f`) an existing file.
ignoretag	t	If set to `t`, unknown tags in the input data will be silently ignored. If set to `f`, each unknown tag will generate an error message.
logfile	/var/log/nmed2ris.log	The full path of a custom log file. This is used only if logdest is set appropriately.
logdest	1	The destination of the log information. 0 = print to stderr; 1 = use the syslog facility; 2 = use a custom logfile. The latter needs a proper setting of logfile.
loglevel	6	The log level up to which messages will be sent. A low setting (0) allows only the most important messages, a high setting (7) allows all messages including debug messages. -1 means nothing will be logged.
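As a rough illustration, the sketch below writes a sample configuration to a scratch file. It assumes that each setting is a variable name followed by its value on one line, as in the medsource PM example in the Data sources section. The scratch location and the chosen values are made up for the example, and where nmed2ris actually looks for nmed2risrc is not covered here.

```shell
# Write an example configuration to a scratch directory (hypothetical location).
example_dir=$(mktemp -d)
example_rc="$example_dir/nmed2risrc"

# One "variable value" pair per line (assumed layout, see the lead-in).
cat > "$example_rc" <<'EOF'
medsource PM
outfile /tmp/out.ris
outappend f
ignoretag f
logdest 2
logfile /tmp/nmed2ris.log
loglevel 7
EOF

cat "$example_rc"
```

With these values, unknown tags would be reported as errors, an existing output file would be overwritten, and all messages up to debug level would go to the custom log file.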

Table 15-2. nmed2riscgirc

Variable	Default	Comment
refdblib	(none)	The path of the directory containing shareable refdb files like DTDs, HTML templates etc.
medsource	PUBMED	The default source of input data.
outfile	(none)	The default output file name.
outappend	t	Determines whether output is appended (`t`) to an existing file or overwrites (`f`) an existing file.
ignoretag	t	If set to `t`, unknown tags in the input data will be silently ignored. If set to `f`, each unknown tag will generate an error message.
logfile	/var/log/nmed2ris.log	The full path of a custom log file. This is used only if logdest is set appropriately.
logdest	1	The destination of the log information. 0 = print to stderr; 1 = use the syslog facility; 2 = use a custom logfile. The latter needs a proper setting of logfile.
loglevel	6	The log level up to which messages will be sent. A low setting (0) allows only the most important messages, a high setting (7) allows all messages including debug messages. -1 means nothing will be logged.

nmed2ris' behind-the-scenes data mangling

While the primary purpose of nmed2ris is the conversion of various Medline formats to the RIS format digestible by refdb, it does some useful things on the fly:

Regardless of the number of original input files, you'll have to deal with only one output file or output stream at stdout.
Different Medline providers use different formats for the MeSH subheadings. All supported formats are consolidated into a single format to get a consistent database.
Keywords with multiple MeSH subheadings are split into multiple keywords with one MeSH subheading each. This simplifies searching for MeSH subheadings greatly.

However, nmed2ris is not a parser and does not validate the input files. That is, the input files must stick to the rules of the data sources, otherwise the conversion results are not predictable. nmed2ris will act according to "garbage in, garbage out" in most cases.
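The splitting of keywords with multiple MeSH subheadings can be pictured with the following sketch. It assumes, purely for illustration, that the heading and its subheadings are separated by slashes; the function name split_subheadings is hypothetical, and the output is not meant to reproduce the exact consolidated format that nmed2ris writes.

```shell
# Hypothetical helper: turn "Heading/sub1/sub2" into one line per subheading.
# A keyword without any subheading is passed through unchanged.
split_subheadings() {
    printf '%s\n' "$1" | awk -F/ '
        NF == 1 { print $0; next }
        { for (i = 2; i <= NF; i++) print $1 "/" $i }'
}

split_subheadings "Neoplasms/diagnosis/therapy"
split_subheadings "Aged"
```

Each resulting keyword carries exactly one subheading, so a search for a single heading and subheading pair needs to match only one pattern.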

Data sources

nmed2ris currently supports data from the following source:

PubMed

nmed2ris also contains code to import data from other sources. This code may be out of date as it is currently not maintained.

The data source can be explicitly specified with a medsource entry in the init file or with the -s command line option. Either of these should be specified if you read the data from stdin, otherwise nmed2ris defaults to PubMed. Alternatively, you can use semi-automatic data source recognition by filename prefixes (this clearly doesn't work for input on stdin). When downloading the files from any of the online sources, simply prefix the filenames with a case-insensitive two-letter code denoting the data source:

BM: BioMedNet
CC: Current Contents
GM: Grateful Med
HG: HealthGate
KF: Knowledge Finder
PM: PubMed

Thus pm001.txt would be recognized as a PubMed input file. A PubMed file not starting with pm would need either the command line switch -s PM or the init file setting medsource PM.
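The prefix rule can be sketched as a small shell function. The name guess_source is hypothetical, and the sketch assumes that only the file name itself (not its directory) is inspected; it illustrates the convention described above rather than the actual nmed2ris code.

```shell
# Hypothetical helper: print the two-letter source code implied by a file name.
# Returns a non-zero status if the name does not start with a known prefix.
guess_source() {
    name=$(basename "$1")
    # Upper-case the first two characters so the comparison is case-insensitive.
    prefix=$(printf '%s\n' "$name" | cut -c1-2 | tr '[:lower:]' '[:upper:]')
    case $prefix in
        BM|CC|GM|HG|KF|PM) printf '%s\n' "$prefix" ;;
        *) return 1 ;;
    esac
}

guess_source pm001.txt              # prints PM
guess_source downloads/Hg_42.txt    # prints HG
guess_source notes.txt || echo "no known prefix; use -s or medsource"
```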