CDF/5336

Input and Output Modules user guide

Fedor D. Ratnikov (Rutgers)

This is a changing document; this version has date 2/14/03 12:47 PM.

Look for the latest version at cdfkits.fnal.gov

This is a hypertext document: cross-references will be lost on hardcopy.

 

Important note: This document describes the DHInput and DHOutput modules as they were before CDF software release 4.8.3. Some commands have changed since then. The up-to-date description of the new commands may be found elsewhere.


Introduction

 

The CDF AC++ DHInputModule and DHOutputModule are the preferred interfaces for accessing input data and directing output data in a standard AC++ analysis job.

They are able to access data archived by the DH system as well as user files on disk.

These modules are in package FrameMods/DH of the CDF offline CVS repository.

Explanations of the main components and definitions of key terms are summarized at the end of this document.

 

Input (DHInputModule)

Include InputModule into executable

At the moment, DHInputModule is included by default in addCDFrequiredModules, which should be called in AppUserBuild::AppUserBuild, so no special action is needed.

Specification of the input data

The commands for specifying input data are part of the DHInput module user interface. To use them, switch to communication with DHInput:

AC++> talk DHInput

The commands are include, exclude and reset. Note that these commands affect the current input data list immediately, so their order matters (they are not commutative). The initial input data list is empty. The data units that can be included in or excluded from the input data list are datasets, filesets and files, owned either by the Data Handling system or by the user.

To get online help, type

help

at the command prompt and follow the instructions.

Command:

show include

displays the current input data list.

Include command

Any data unit may be included in the current input data list. Restrictions may be applied to limit the list by run and section numbers and/or to use a particular user book.

A data unit is specified by its name.

A run restriction is specified in the format run=#, run<#, run>#, run<=# or run>=#. A section restriction is specified in the format section=#, section<#, section>#, section<=# or section>=#.

Here # is an integer in either decimal (like 12345) or hexadecimal (like 0xabcde) format.

The section keyword may be truncated, for example: sect=#, sec<#

There may be at most one run restriction and at most one section restriction per include command. If more restrictive selections are necessary, they should be specified in separate include commands. Note that every include command adds the corresponding files to the list of files to be processed by the job.
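The restriction syntax above can be illustrated with a small stand-alone Python sketch. It is not part of DHInput: the function name parse_restriction is hypothetical, and the rule that any prefix of "section" with at least three letters is accepted is an assumption based only on the two truncations shown above (sect, sec).

```python
import re

# keyword, relation, then a decimal or 0x-prefixed hexadecimal integer
_RESTRICTION = re.compile(r"^([A-Za-z]+)(<=|>=|=|<|>)(0[xX][0-9a-fA-F]+|\d+)$")


def parse_restriction(text):
    """Split a restriction such as 'run>=0xabcde' into (keyword, relation, value)."""
    match = _RESTRICTION.match(text.strip())
    if match is None:
        raise ValueError(f"not a run/section restriction: {text!r}")
    word, relation, number = match.groups()
    word = word.lower()
    if word == "run":
        keyword = "run"
    elif len(word) >= 3 and "section".startswith(word):
        # Assumption: any prefix of "section" with 3 or more letters is allowed.
        keyword = "section"
    else:
        raise ValueError(f"unknown keyword: {word!r}")
    base = 16 if number.lower().startswith("0x") else 10
    return keyword, relation, int(number, base)


if __name__ == "__main__":
    for sample in ("run=12345", "run>=0xabcde", "sect<=0x25", "sec<7"):
        print(sample, "->", parse_restriction(sample))
```

The sketch only checks the form of a single restriction; applying it to the catalogue is the job of the include command itself.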

 

If the requested data are catalogued in a private book, the book name should also be specified in format  book=<book name>.

Note that run # and section # restrictions and book name are applied to the data request specified in the same line only.

 

All files of all filesets of listed datasets are added to the input data list.

 

All files of listed filesets are added to the input data list.

 

 

All listed files are added to the input data list. The file name is either the name of the DFC file or the full absolute user file name. Note that a user file name must contain a slash (/). Use ./blah.dst to specify a file in the current directory.

User file name may include wildcards. For example, the following command adds all files with extension .dst in the directory /blah/blah into the input data list:

Remote files

Files on remote computers may also be accessed directly, provided that a ROOTD server is running at the remote site. The file specification must be prefixed by the string root://<hostname>/. For example, the following command adds all files with extension .dst in the directory /blah/blah on the remote computer fcdfsgi2.fnal.gov into the input data list:

Note that usually not every remote file is accessible, even if ROOTD is running.

For example, only data in directories

are accessible from fcdfsgi2 remotely.

Exclude command

Any data unit may be excluded from the current data list. Run restrictions are not supported for the exclude command, but the private book definition is allowed:

Any file of any fileset of listed datasets is removed from the input data list.

 

Any file of listed filesets is removed from the input data list.

 

All listed files are removed from the input data list. The file name is either the name of the DFC file or the full absolute user file name. Note that a full absolute user file name starts with /.

Reset command

Remove all items from the input data list.

Examples

The example Tcl file used for a validation job, which specifies all data of the "RAW generic physics trigger" dataset, looks like:

talk DHInput

  include dataset "RAW generic physics trigger"

  show include

exit

 

Other possibilities are:

All files belonging to the specified dataset and containing any events from run 1234.

All files belonging to the specified dataset and containing any events from run section 56 of run 1234.

All files containing events of runs 0xabcdb, 0xabcdc, 0xabcdd (hexadecimal), belonging to the specified dataset.

All files contained in filesets CA1234.0 CA1234.1 CA1234.2.

File ab123456.7890cdef, collected from the DH system

All files on the disk, corresponding to the specified pattern.

All files on the disk of remote computer, corresponding to the specified pattern.

All files of the specified dataset that is defined in DFC user book "ratnikov".

Exclude file ab123456.7890cdef from the previously specified list (if it crashes your job, for example).

 

Input options

Input options are set with the setInput command. The format of the command is:

 

DHInput> setInput <parameter1> [<parameter2> ...]

where

<parameter> is either a single keyword specifying an option or an expression of the form keyword=value.

Single keyword options are the following:

Keyword    Action
help       print a description of the command
list       print the current settings of the input options
inorder    events of every file are ordered before processing (default >= 4.1.0int3)
noorder    suppress event ordering before processing (default < 4.1.0int3)
full       EDM event is read in full, all objects are extracted (default)
compact    EDM event is read as a compact TBuffer, no separate objects are extracted
launch     launch tape staging jobs if the requested data are not available on disk
raw        assume missing runsections in the input file, rely on DFC information (needs access to the DFC)

 

Keyword=value choices are the following:

Keyword    Value type    Comment
timeout    int           timeout (in seconds) for waiting until the next staging transaction is completed
log        name          the file in which to log contributing runsections and other relevant information

 

There are two old flags specifying the input mode of event processing:

Reordering guarantees that events are grouped according to conventions, that all the necessary extra records are inserted, and that unnecessary records are suppressed. It is therefore especially important when an output data stream is produced. Building the File Content Catalog is very time-efficient, except for RAW data. For RAW data, however, reordering is essential, because events of different runsections are mixed in those files. Do not disable reordering unless no output data streams are created and you understand that processing events out of order is safe for your task.

 

Filtering of events to be processed

Along with specifying input datasets, filesets and files, the user may restrict processing only to particular events from the input stream or may filter out undesirable events. A pair of numbers – the run number and trigger number (trigger counter is reset every new run) – uniquely specifies every event. Every event also belongs to particular run/section. Both the run/trigger pair and the run/section pair may be used for the restriction of the events to be processed.

Select ranges of run/trigger numbers to be processed  

A range of events is specified by run# and trigger# of the first event and run# and trigger# of the end event.

selectEvents command

Range may be specified in a format using run and event keywords. Possible relations are: =, >, <, >=, <=. The event keyword specifies trigger #.

Examples:

run=100 event=1111

specifies only the event #1111 of the run #100

run=100

specifies all of the run #100

run=100 event>=1111 event<=1120

specifies events of the run #100 with event # between 1111 and 1120 inclusive.

run>=100 event>=1111 run<=200 event<9999

specifies all events between event #1111 of the run #100 inclusive and event #9999 of the run #200 exclusive.

 

Note:

The selectEvents command has exactly the same effect as the runsTrigs command. It specifies the same set of ranges of events to be processed by the analysis modules. Examples of selectEvents commands follow the runsTrigs command specification.

 

runsTrigs command

The expression R1(T1):R2(T2) specifies all events in the range between the event with trigger T1 of run R1 and the event with trigger T2 of run R2. Note that the range is half-open: the first event is included in the range and the end event is excluded from it.

The wildcard * can replace either the run# or the trigger# of a boundary; * is equivalent to 0 for a lower bound and to a very large number for an upper bound.

A short format of events range specification is allowed:

 

Short format    Full equivalent
R1:R2(T)        R1(*):R2(T)
R1(T):R2        R1(T):R2(*)
R(T1:T2)        R(T1):R(T2)
R1:R2           R1(*):R2(*)
R               R(*):R(*)
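The short formats and the half-open range rule can be checked with a small stand-alone Python sketch. It is not the DHInput parser: the names expand, bounds, selected and VERY_LARGE are hypothetical, only decimal numbers are handled, and the concrete value standing in for the "very large number" is an assumption.

```python
import re

VERY_LARGE = 2**31 - 1  # assumed stand-in for the upper-bound meaning of *

_POINT = re.compile(r"^(\*|\d+)(?:\((\*|\d+)\))?$")           # R or R(T)
_SAME_RUN = re.compile(r"^(\*|\d+)\((\*|\d+):(\*|\d+)\)$")     # R(T1:T2)


def expand(spec):
    """Rewrite a short runsTrigs range in the full R1(T1):R2(T2) form."""
    same_run = _SAME_RUN.match(spec)
    if same_run:
        run, first, end = same_run.groups()
        return f"{run}({first}):{run}({end})"
    parts = spec.split(":")
    if len(parts) == 1 and "(" not in spec:   # bare R means R(*):R(*)
        parts = [spec, spec]
    if len(parts) != 2:
        raise ValueError(f"unsupported range: {spec!r}")
    points = []
    for part in parts:
        match = _POINT.match(part)
        if match is None:
            raise ValueError(f"unsupported range: {spec!r}")
        run, trigger = match.groups()
        points.append(f"{run}({trigger or '*'})")
    return ":".join(points)


def bounds(spec):
    """Return ((run, trigger) of first event, (run, trigger) of end event)."""
    def point(text, star):
        run, trigger = _POINT.match(text).groups()
        return (star if run == "*" else int(run),
                star if trigger == "*" else int(trigger))
    first, end = expand(spec).split(":")
    return point(first, 0), point(end, VERY_LARGE)


def selected(spec, run, trigger):
    """True if the event lies in the range: first included, end excluded."""
    first, end = bounds(spec)
    return first <= (run, trigger) < end
```

Comparing (run, trigger) pairs as tuples gives the intended ordering: the run number is compared first, and the trigger number only breaks ties within a run.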

 

selectEvents and runsTrigs command examples:

Initial setting

Run/trigger range is set initially as a full range by default: *(*):*(*), so no selectEvents or runsTrigs command is necessary if all events in the input data list should be processed.

Set ranges of events to be selected:

selectEvents set <requirements>

runsTrigs set <range1> [<range2> ...]

Add ranges of events to be selected to the current selection:

selectEvents include <requirements>

runsTrigs add <range1> [<range2> ...]

Set ranges to be excluded from the current selection:

selectEvents exclude <requirements>

runsTrigs exclude <range1> [<range2> ...]

 

Example 1

selectEvents set run>=100 event>=1111 run<=200 event<=9999

runsTrigs set  100(1111):200(10000)

specifies all events of run 100 with trigger number greater than or equal to 1111, all events of runs 101, 102, ..., 199 and all events of run 200 with trigger numbers smaller than 10000.

Example 2

selectEvents set run>=100 run<=200

runsTrigs set  100(*):200(*)

specifies all events of runs 100, 101, ..., 200.

Example 3

selectEvents set run=100 event=1111  

runsTrigs set  100(1111:1112)

specifies only the event with run #100 and event #1111.

Example 4

selectEvents set run=100 event=1111  

selectEvents include run=200 event=2222  

selectEvents include run=300 event=3333  

specifies three different events. Note that the first command is set and all following commands are include.

 

 Select ranges of runsections to be processed (runsSections command) 

The meaning of runsection has been changed. The run number and the section number relative to the run now specify the single section of the data. Section numeration starts with 1 for every new run. The corresponding user interface to select a particular run/section range is not set up yet.

Parallel data processing

When a particular data sample is processed by several parallel jobs, the input data list must be split into sections, each section being processed by a separate job. The splitInput command splits the input data list into several bunches of files. The user then has two options:

Example:

talk DHInput

  include dataset ...

  splitInput slots=10 this=5

exit

Input data will be split into 10 sections and section #5 will be immediately processed by this job.

 

Example:

talk DHInput

  include dataset ...

  splitInput slots=10 save=myinput

exit

Input data will be split into 10 sections and the corresponding Tcl files myinput_10_0.tcl, myinput_10_1.tcl, ..., myinput_10_9.tcl will be created. The user can then request processing of a particular section of data with a Tcl command like

source myinput_10_5.tcl

 

The first approach is a bit simpler, but it assumes that the input data list does not change. If, for example, an extra fileset is added to the dataset between the start of the first and the last jobs, the data splitting could differ between jobs, and some data could therefore stay unprocessed or be processed twice. The second approach guarantees that all generated data sections are consistent and correspond to the dataset snapshot at the moment the splitInput command is issued.
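The difference between the two approaches can be sketched in stand-alone Python. This is only an illustration: the actual rule splitInput uses to distribute files is not described here, so the round-robin split over a sorted list is an assumption, and the names split_input and slot_file_names are hypothetical. Slot numbering from 0 follows the generated file names myinput_10_0.tcl ... myinput_10_9.tcl.

```python
def split_input(files, slots):
    """Deal a sorted file list into `slots` groups (assumed round-robin rule)."""
    ordered = sorted(files)
    return [ordered[i::slots] for i in range(slots)]


def slot_file_names(prefix, slots):
    """Names of the Tcl files written by 'splitInput slots=N save=prefix'."""
    return [f"{prefix}_{slots}_{i}.tcl" for i in range(slots)]


if __name__ == "__main__":
    print(slot_file_names("myinput", 10))

    # Hazard of splitting in every job: the list changes between two jobs.
    before = ["f0", "f1", "f2", "f3"]
    after = before + ["f0a"]                 # a file appears in the meantime
    job0 = split_input(before, 2)[0]         # first job, old snapshot
    job1 = split_input(after, 2)[1]          # second job, new snapshot
    print("job 0 reads", job0)               # ['f0', 'f2']
    print("job 1 reads", job1)               # ['f0a', 'f2']
    # f2 is read twice, f1 and f3 are never read.
```

Saving the split once and sourcing the saved files avoids this, because every job then works from the same snapshot.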

Navigating in the input file

Because data are written in the ROOT format, and a ROOT file is essentially a direct-access file, one can move through the input file randomly, jumping from one event to another. Nevertheless, even for random read-only access some conditions must be satisfied. For example, if the next event read belongs to a different run, a Begin-Of-Run condition must be generated first and the appropriate BOR record sent to the beginRun methods of the active modules.

jump to particular event

talk DHInput

  jump to run=12345 event=67890

exit

sets up the next event to be processed (jumps to just before this event). The command prints a complaint and does nothing if the requested event is not available in the file.

If the run=... specifier is omitted, the run# of the next event in the chain is used as the reference run# for the search. This should cause no problems for single-run files. For multi-run files it is a good idea to specify the desired run# explicitly anyway.

jump relatively

Commands:

talk DHInput

  jump 5

exit

jumps over the next 5 events, ready to read the 6th one. The command "jump 1" means exactly "skip the next event, please". Note that only physics events are counted, so the states after "jump 5" and "cont -nev 5" could be quite different. The reason is the extra control records, which are taken into account in the latter case and ignored in the former.

Commands:

talk DHInput

  jump -5

exit

jumps back over 5 events. The command "jump -1" means exactly "reprocess the last event once again", and the command "jump -2" means "process the event before the current one".

Commands:

talk DHInput

  show jump

exit

displays the next event to be processed.
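The counting rule for relative jumps can be modelled with a short stand-alone Python sketch. It is a toy model, not the DHInput implementation: the record layout, the names jump and cont_nev, and the treatment of "cont -nev N" as simply advancing over N records of any kind are assumptions made for illustration.

```python
def jump(records, pos, n):
    """Move the read cursor over n physics events; n < 0 moves backwards.

    records is a list of (kind, label) pairs with kind "event" or "control";
    pos is the index of the next record to be read.
    """
    step = 1 if n > 0 else -1
    remaining = abs(n)
    while remaining > 0:
        index = pos if step > 0 else pos - 1
        if not 0 <= index < len(records):
            break                      # ran off either end of the file
        if records[index][0] == "event":
            remaining -= 1             # control records are not counted
        pos += step
    return pos


def cont_nev(records, pos, n):
    """Assumed model of 'cont -nev n': every record counts, control ones too."""
    return min(pos + n, len(records))


if __name__ == "__main__":
    records = [("control", "BOR"), ("event", 1), ("event", 2),
               ("control", "other"), ("event", 3), ("event", 4),
               ("event", 5), ("event", 6)]
    print(records[jump(records, 0, 5)])      # the 6th physics event
    print(records[cont_nev(records, 0, 5)])  # an earlier event
```

In this toy file the two commands end at different records because two control records precede the sixth physics event.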

 

Important notes:

 

Hints to access events of the Commissioning Run

The effective and straightforward mechanism to select data of a particular run is implemented in the include command. For example,

talk DHInput

  include dataset "RAW generic physics trigger" run=12345

exit

requests data of run #12345 only.

 

Reconstructed data of the commissioning run are also collected in the Data Handling system in the dataset "Commisioning Run, Test Physics A, Production" (dataset id "aphysa").  It is a secondary dataset.

Files of primary (RAW) datasets like "RAW generic physics trigger" never contain more than one run. This is not true for secondary and subsequent selections. To guarantee that only a particular run is processed, one should specify the run number both in the include command and with the runsTrigs command or the equivalent selectEvents command.

The request to process reconstructed data of a single run therefore looks like:

talk DHInput

  include dataset "Commisioning Run, Test Physics A, Production" run=12345

  selectEvents set run=12345

exit

Specifics for RAW data access

RAW data contain no information about the empty runsections contributing to the data file. The only reliable information about all runsections contributing to a RAW file is the corresponding file record in the DFC. Producing a consistent output file requires that information about the missing runsections be retrieved from the DFC and delivered to the output module. DHInput can be instructed to do this with the setInput raw option.

Using this option one should be aware of:

Data sample luminosity calculation

Primary information about integrated luminosity is kept in the RUNSECTIONS table of the DataFileCatalog, so to retrieve the integrated luminosity for a given data sample one should know the exact list of runsections contributing to the sample.

Determining the luminosity of the processed data is therefore a two-step procedure.

·       When processing (creating) the data sample, logging should be initiated with the setInput log=<log-file-name> command. The list of all contributing runsections will be written into the file.

·       Process the log file with the stand-alone utility DHLumi, which retrieves the necessary information from the database and integrates it.

DHLumi utility is not currently released.

Output (DHOutputModule)

Include OutputModule into executable

At the moment, DHOutputModule is included by default in addCDFrequiredModules, which should be called in AppUserBuild::AppUserBuild, so no special action is needed.

Options for output streams

The main items that may be controlled for the Output Module are:

Destination directory

One may specify the directory where the output files will be collected. The user is responsible for choosing a directory, which is large enough for the output data.

One may also request the necessary disk space to be borrowed from Data Handling Disk Inventory Manager. In this case output files are essentially temporary and have to be archived by the DH system before becoming readable again.

Output file names

Files that are to be archived in the DH system should have standard names composed from the dataset name and the first run/section contained in the file. The OutputModule may generate those standard names. Alternatively, if the data will not go to the DH system, file names may be chosen quite arbitrarily.

Grouping data into files

The DH system imposes some requirements on the size of output files. The optimal file size is specified to be about 1 Gbyte. The OutputModule closes and releases the current output file when it exceeds this size. The OutputModule takes care to keep entire sections in a single file, so the current file is closed and a new one is opened as soon as the file size exceeds the threshold and all events of the last section have been processed. If a custom output file name is specified, subsequent files differ by an extra suffix giving the sequential number of the file.
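The file roll-over rule described above can be sketched in stand-alone Python. This is an illustration only: the function name group_into_files and the per-section input format are hypothetical, and sizes are taken as already known per section rather than accumulated event by event.

```python
def group_into_files(sections, threshold_kb):
    """Group whole sections into output files.

    sections is a sequence of (run, section, size_kb) in processing order.
    A file is closed only at a section boundary, once its accumulated size
    exceeds threshold_kb, so a section is never split between two files.
    """
    files, current, size = [], [], 0
    for run, section, size_kb in sections:
        current.append((run, section))
        size += size_kb
        if size > threshold_kb:        # section finished and threshold exceeded
            files.append(current)
            current, size = [], 0
    if current:                        # last, possibly undersized, file
        files.append(current)
    return files


if __name__ == "__main__":
    sample = [(1, 1, 400), (1, 2, 400), (1, 3, 400), (1, 4, 400), (2, 1, 300)]
    for number, content in enumerate(group_into_files(sample, 1000)):
        print(number, content)
```

Because the check happens only after a complete section, a file may end up somewhat larger than the threshold.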

Release output files to DIM

Output files may be released to the DIM for subsequent archiving. If output files are created in a user-controlled area, they stay on disk for the possible immediate use. If output files are created in the DIM controlled area, they are organized into filesets by the DH system as part of the archiving.

Insert FILE record into DFC

When a file is released to the DIM for archiving, the corresponding FILE record should be created in the DFC. This record contains the name of the file and some file statistics like first and last events of the file, runs and sections contributing to this file, number of events etc. The OutputModule collects all the necessary information and creates such a record before the file is released to the DIM for archiving.  

Specification of the output stream

To write data, an output stream should be declared, specified and connected to a particular data path. The commands to do this are defined in the DHGeneralOutputModule. So, first one should switch to communication with DHGeneralOutputModule:

AC++> talk DHOutput

Declaration

DHOutput> output create <stream name> <destination>

where

<stream name> is any name, used from then on to identify this particular stream.

<destination> is either an output dataset ID or the full name of the output file. Any <destination> containing / or . is treated as a file name; otherwise it is treated as the dataset ID of the output dataset, which will be used for automatic file name assignment. If the destination is a file name, the output directory and output file name are extracted from it.
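The rule for interpreting <destination> can be written down as a short stand-alone Python sketch. The function name parse_destination and the returned dictionary layout are hypothetical, and splitting the path at the last slash is an assumption about how the directory and file name are extracted.

```python
import posixpath


def parse_destination(destination):
    """Classify an 'output create' destination as a file or a dataset ID."""
    if "/" in destination or "." in destination:
        # Treated as a file name: split into directory and file name.
        directory, name = posixpath.split(destination)
        return {"kind": "file", "directory": directory, "name": name}
    # No slash and no dot: treated as the ID of the output dataset.
    return {"kind": "dataset", "dataset_id": destination}


if __name__ == "__main__":
    for sample in ("mydirectory/myfile", "./myfile", "mydset"):
        print(sample, "->", parse_destination(sample))
```

A bare name such as myfile would be classified as a dataset ID under this rule, so ./myfile is the way to address a file in the current directory.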

Specification

DHOutput> output setstream <stream name> <parameter1> [<parameter2> ...]

where

<stream name> is a name of the stream defined in the stream declaration. Wildcard  * may be used to address all declared streams.

<parameter> is either a single keyword specifying an option or an expression of the form keyword=value.

 

Single keyword options are the following:

Keyword        Action
list           print the current settings of the output stream
dfc            use the default Data File Catalog to fill file records
nodfc          do not fill a file record in any Data File Catalog
anysize        no restrictions on the size of the output file
compress       use ROOT compression for the output file
savecatalog    write the File Content Catalog into the file

 

Keyword=value choices are the following:

Keyword     Value type    Comment
book        name          Data File Catalog book keeping track of the output data
dfc         name          Data File Catalog keeping track of the output data
file        name          output is written to the specified file or its successors; the DFC is not filled
flush       int           period in Kb after which the output file is flushed; flushed events can still be read in, even if the writing job has crashed without closing the file
pathname    name          directory where the output files should be stored
size        int           desired size of the output file in Kb
skiptill    int           ignore data with run/section less than specified; useful for crash recovery

Note: Some options affect other options, and it is difficult to describe all the relations. It is best to use the list option after the stream definition to check the actual parameters that will be used for the output stream.

Connection

DHOutput> output paths <stream name> <path1> [<path2> ...]

where

<stream name> is a name of the stream defined in the stream declaration.

<path> is a name of AC++ data processing path to be connected to this output stream 

Example 1

DHOutput> output create mystream mydirectory/myfile

DHOutput> output paths mystream mypath

This is equivalent to, and backward compatible with, the old-fashioned FileOutput request. The stream named mystream is connected to the AC++ data path mypath. No communication with the DFC or DIM takes place. The output file will be mydirectory/myfile, without any restrictions on its size. (Note that the slash before myfile is important. Use ./myfile to write into the current directory.) The File Content Catalog will not be written to the data file.

Example 2

DHOutput> output create mystream mydset

DHOutput> output setstream mystream dfc=production_file_catalog size=1000000

DHOutput> output setstream mystream pathname=mydirectory list

DHOutput> output paths mystream mypath

The stream named mystream is connected to the AC++ data path mypath. The data will be assigned to dataset mydset and the corresponding output file names will be generated. Output files will be collected in directory mydirectory, and files will be closed when their size exceeds 1 Gb. A FILE record will be added to the DFC specified by the name production_file_catalog in the iomap.txt file, and files will be sent to the DIM for archiving. The current choices will be printed out. Note that files in mydirectory will not be deleted after archiving by the DIM.

Example 3

DHOutput> output create mystream mydirectory/myfile

DHOutput> output setstream mystream nodfc anysize savecatalog list

DHOutput> output paths mystream mypath

The stream named mystream is connected to the AC++ data path mypath. No communication with the DFC or DIM takes place. The output file will be mydirectory/myfile, without any restrictions on its size. (Note that the slash before myfile is important. Use ./myfile to write into the current directory.) The File Content Catalog will also be written to the data file.

Command line specification of the input and output files

In simple cases input and output files may be specified using command line options of the executable, avoiding specifications in the .tcl files or even avoiding .tcl files themselves. These options are:

            -i            to specify list of input files

            -o            to specify output file

For example for executable myprogram, the command line could look like:

myprogram -i <input file 1> [<input file 2> ...] -o <output file>

Only static user input and output files may be specified via command line options.

Example:

AC++ -i ./file1.dst ./file2.dst ./file3.dst -o ./file1_3.dst

copies the events of files file1.dst, file2.dst and file3.dst to the file file1_3.dst.

File Content Catalog

The File Content Catalog is an object containing information (trigger#, section#, and run#) about events in the main event tree of the ROOT file. It also contains a list of empty runsections for the file. The File Content Catalog is a stand-alone object that itself may be saved into and retrieved from the ROOT file. It may be used for:

The OutputModule creates the File Content Catalog in all cases. One can request that it be saved into the output file for use in subsequent read accesses. The File Content Catalog is necessary for the Input Module in two cases:

If the Input Module requests the use of the File Content Catalog and the catalog is not available in the file, the input file is scanned and a new catalog is created. Naturally, this does not work for unpuffed mode. In that case one should take care to save the catalog into the file at the preceding stage of data processing. Note that runsections almost certainly overlap in RAW data files.

Terms and definitions

Data File Catalog (DFC)

A database keeping track of available datasets, filesets, files and runsections as well as their properties and relations. It is an Oracle database at Fermilab and can have a different implementation at remote sites. More details about the CDF DFC may be found elsewhere. Customized tools are available to access the DFC using the command line or the API.

Dataset

Set of data of similar properties. The output streams from the Production or Reconstruction Farm are examples of datasets. The dataset can be specified either by its full name or its short 6-character dataset ID. A dataset has to be predefined in the DFC before DHOutput can be used to add files to it.

Data Handling (DH) system

Hardware and software system providing storage of and access to CDF data. The core components of the system are:

·       Data File Catalog keeping track of luminosity, data quality, and location of data.

·       Disk Inventory Manager (DIM) and Stager to control the use of the DH disk pool for data input/output, and to move data between disk and tape.

·       I/O modules interfacing data processing and DH system

·       Storage hardware including disk and tape pools

Empty section

All runsections contributing to the primary dataset should be taken into account when luminosity for the secondary data set is calculated. An empty runsection is a runsection, contributing to the luminosity calculations but containing no events in the current data sample (all events have been filtered out by the previous selection passes).

Event

Event is uniquely identified by the run number and trigger number of the event within the run. The L2 trigger assigns the trigger number sequentially. Run number and trigger number are kept inside an event. Note that an event may appear in more than one dataset.

File in the DFC

Any file referenced in the Data File Catalog and managed by DH system. DFC file is specified by the name in the DFC. Any DFC file belongs to exactly one fileset that belongs to exactly one dataset.

File name convention

Any file stored in the Data Handling system has a standard name that is uniquely defined by the ID of the dataset containing the file and by the minimal run number and section number of the events contained in the file. If the dataset ID is abcdef and the minimal event run#/section# are rrrrrr/ssss respectively (hex representation is used), then the corresponding file name is afrrrrrr.ssssbcde. If, for example, a file belongs to the dataset "RAW generic physics trigger" with dataset ID aphysr and contains events starting with run 0x194b3, section 0x25, the corresponding file name is ar0194b3.0025phys. Note that events of many sections of many runs may be contained in the file, but only the minimal run/section combination defines the name of the file.
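The naming convention can be reproduced with a few lines of stand-alone Python. The function names dfc_file_name and dfc_file_name_for are hypothetical; the layout follows the afrrrrrr.ssssbcde pattern and the ar0194b3.0025phys example given above.

```python
def dfc_file_name(dataset_id, run, section):
    """Build the standard file name from a 6-character dataset ID and the
    minimal run/section of the file (both written in hexadecimal)."""
    if len(dataset_id) != 6:
        raise ValueError("dataset ID must have exactly 6 characters")
    first, middle, last = dataset_id[0], dataset_id[1:5], dataset_id[5]
    # first and last ID characters, 6 hex digits of run, '.',
    # 4 hex digits of section, then the 4 middle ID characters
    return f"{first}{last}{run:06x}.{section:04x}{middle}"


def dfc_file_name_for(dataset_id, runsections):
    """Name a file from all (run, section) pairs it contains: only the
    minimal pair matters."""
    run, section = min(runsections)
    return dfc_file_name(dataset_id, run, section)


if __name__ == "__main__":
    print(dfc_file_name("aphysr", 0x194b3, 0x25))   # ar0194b3.0025phys
```

Taking min() over (run, section) tuples orders by run number first and by section number within a run, which matches the "minimal run/section combination" in the convention.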

Fileset

A set of several files of one dataset. The number of files is defined by tape properties. Fileset is the data unit that is moved to tape and staged back by DH system. A fileset belongs to exactly one dataset. Even if only one file of a fileset is requested from DH system, the full fileset containing this file is staged from tape.

Input data

A set of data from the Data File Catalog and/or user data files.

iomap.txt file

A text file containing relationships between the logical name of the database used by the AC++ job and actual database location and account. More details about iomap.txt file may be found elsewhere.

Run

A set of data collected continuously in similar conditions. Run is a set of sections. Shift crew decides start and finish of the run. Run number is kept inside an event.

Section (Runsection)

A data-taking period about a minute long. The section sequential number is assigned online, starting with 1 for every new run. Information about data properties, like luminosity, data quality etc., is kept in the DFC for each run/section pair. All events of one section of one run for one dataset are contained in the same file. The section number is kept inside an event.

User file

Any disk file outside DH system. User file is specified by the full absolute pathname. Any file name containing slash (/) is assumed to be the name of user file.