Hierarchical Data Format version 5 (HDF5) is a file format capable of archiving large data sets. Maintained by The HDF Group, this powerful open source package is used by many government and commercial organizations. HDF5 has bindings to various programming languages (e.g., C++ and Java) and packages (e.g., MATLAB and LabVIEW). With HDF’s focus on data storage and CloudTurbine’s focus on data streaming, potential synergy exists between these related but distinct technologies.
To explore the relationship between HDF5 and CT, we developed a prototype tool called HDF5toCT which reads data from an HDF5 file and stores it in CloudTurbine (CT) format. As an ancillary application, source code for HDF5toCT resides in its own GitHub repository (not the standard CloudTurbine repository) at https://github.com/jpw-erigo/HDF5toCT.
Usage information is shown below.
java -jar HDF5toCT.jar -h
-af,--attrtofile Write attributes to file (not standard CT output).
-b <basetime> Base time (seconds since epoch) to be added to all timestamps
from the HDF5 file; default = 1.4832468E9
-e <password> Encrypt the CT source using the given password.
-f <autoFlush> Flush interval (sec); amount of data per block; default = 1.0
-g,--gzip GZIP output data; data will also be ZIP'ed if this option is selected.
-h,--help Print this message.
-hrt,--hirestime Use high resolution (microsecond) time for CT data.
-i,--infile <hdf5file> Full path of the input HDF5 file.
-nz,--nozip Turn off ZIP output.
-p,--pack Pack data.
NOTE: Make sure "hdf5_java.dll" is in the same directory as HDF5toCT.jar
Our goal was to support the conversion of a limited subset of HDF files to CT; namely, those with the following structure:
- datasets must reside under the top parent group in the file
- each dataset must be of a “compound” datatype which contains 2 channels named “time” and either “data” or “value”
- the dataspace of each dataset must be a 1-D array of entries (i.e., data must be stored in a 1-D array of compound elements where each element contains 2 channels – “time” and either “data” or “value”)
A Java Native Interface (JNI) library is included with the HDF5 distribution. It provides access to the underlying HDF5 API and library (there is no native Java HDF5 implementation, so the JNI library must be used). HDF5toCT uses the open source HDF-Java Object package (from HDFView) to access HDF attributes.
As a “filesystem” of sorts, HDF stores data in a “channel (i.e., dataset) then time” hierarchy. That is, HDF files contain one or more datasets which are themselves containers for time/value data. For space and performance efficiency, CT typically stores data in a “time then channel” hierarchy; that is, channel data is “multiplexed” in time folders. HDF5toCT converts from HDF’s “channel then time” format to CT’s “time then channel” format by reading the content of all HDF datasets into memory and then writing out the data in time-sorted order to CT. Depending on the size of the HDF datasets, CloudTurbine read efficiency may still be challenged, since by default every data point has its own timestamp. For additional efficiency, HDF5toCT supports CT’s packed-block option (use the “-p” command line flag), where timestamps are linearly assigned over a block of data, significantly reducing the required number of timestamp folders. In addition, for space efficiency, HDF5toCT supports CT’s ZIP’ed or GZIP’ed block options: ZIP is turned on by default (disable it with the “-nz” command line flag); the “-g” command line flag enables GZIP.
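The “channel then time” to “time then channel” reordering can be sketched in a few lines of Java. This is an illustrative sketch only; the class and method names below (MergeByTime, Entry, merge) are invented for this example and do not come from the HDF5toCT source.

```java
import java.util.*;

// Sketch: flatten per-channel {time, value} arrays (HDF's "channel then
// time" layout) into one time-sorted stream (CT's "time then channel"
// layout). Illustrative only, not HDF5toCT code.
class MergeByTime {
    // One multiplexed output row: a timestamped value tagged with its channel.
    static class Entry {
        final double time;
        final String channel;
        final double value;
        Entry(double time, String channel, double value) {
            this.time = time;
            this.channel = channel;
            this.value = value;
        }
    }

    // channels maps a channel name to an array of {time, value} pairs.
    static List<Entry> merge(Map<String, double[][]> channels) {
        List<Entry> all = new ArrayList<>();
        for (Map.Entry<String, double[][]> ch : channels.entrySet()) {
            for (double[] tv : ch.getValue()) {
                all.add(new Entry(tv[0], ch.getKey(), tv[1]));
            }
        }
        // Sort every point from every channel by timestamp.
        all.sort(Comparator.comparingDouble(e -> e.time));
        return all;
    }
}
```

In the real tool this merge is done in memory over all datasets before the time-sorted result is written out to CT.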
The “-af” or “--attrtofile” option directs HDF5toCT to store HDF5 attributes (metadata) in files alongside (not integrated within) the CloudTurbine file structure. With this option, data viewers will need to know where to access the attribute files, since they will not be accessible via the CloudTurbine API. When this flag is not specified, metadata is stored at a single timestamp within the CloudTurbine hierarchy. In either case, attributes are saved in JSON format, with all of the attributes associated with a single HDF5 object stored together in one output file.
HDF5toCT includes a base time option, “-b”, which allows the user to supply a base time (seconds since epoch) to be added to each timestamp from the HDF5 file. Use this flag to convert relative timestamps to CT format timestamps (seconds or milliseconds since epoch). By default, the base time offset is 1483246800.0 (corresponding to January 1, 2017 12:00:00 AM GMT-05:00).
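The offset is a simple addition of the base time to each relative timestamp. The following sketch illustrates the arithmetic; the class and method names are invented for this example, not taken from HDF5toCT.

```java
// Sketch of the "-b" base time conversion: relative HDF5 timestamps
// become absolute CT timestamps by adding a base time. Illustrative
// only, not HDF5toCT code.
class BaseTime {
    // Default base per the usage text: 1483246800 seconds since epoch,
    // i.e., January 1, 2017 12:00:00 AM GMT-05:00.
    static final double DEFAULT_BASE = 1483246800.0;

    // relativeSec: timestamp read from the HDF5 file.
    static double toAbsolute(double relativeSec, double baseSec) {
        return relativeSec + baseSec;
    }
}
```

For example, a relative timestamp of 0.5 seconds with the default base becomes 1483246800.5 seconds since epoch.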
Use the “-e” option to encrypt the output CT source (“data at rest” encryption).
The “-f” option specifies how much data (in seconds) is stored in a data block. When data is being ZIP’ed, this flag specifies how much data is stored in each ZIP file and how frequently the ZIP files are flushed to disk. The default value is 1.0 second.
The “-g” or “--gzip” flag specifies that data be saved to GZIP output files; data is also ZIP’ed if this option is selected. It cannot be combined with the “-nz” option.
If the “-h” or “--help” option is included on the command line, the above usage information is printed and the program immediately exits.
The “-hrt” or “--hirestime” command line option turns on CT high resolution times (16-digit timestamps representing microseconds since epoch) for data sampled faster than 1000Hz.
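The digit counts follow from the units: for present-day epoch times, a milliseconds-since-epoch value has 13 digits while a microseconds-since-epoch value has 16. A small sketch of the two conversions (this is an illustration, not HDF5toCT code):

```java
// Sketch: epoch-time conversions underlying standard vs. "-hrt" high
// resolution CT timestamps. Illustrative only, not HDF5toCT code.
class EpochTime {
    // Milliseconds since epoch: 13 digits for present-day times.
    static long toMillis(double secondsSinceEpoch) {
        return Math.round(secondsSinceEpoch * 1_000.0);
    }

    // Microseconds since epoch: 16 digits for present-day times; keeps
    // distinct timestamps for data sampled faster than 1000 Hz.
    static long toMicros(double secondsSinceEpoch) {
        return Math.round(secondsSinceEpoch * 1_000_000.0);
    }
}
```

At 1000 Hz, consecutive samples are 1 ms apart, so millisecond timestamps just barely stay distinct; any faster and microsecond resolution is needed.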
Specify the full path to the input HDF5 file using either the “-i” or “--infile” command line flag. This is a required option.
By default, HDF5toCT stores output data in ZIP files. Use either the “-nz” or “--nozip” option to turn off this default and store the data in uncompressed files. This will result in the full CloudTurbine hierarchy being written directly to the file system.
Specify the “-p” or “--pack” flag to pack blocks of CloudTurbine data. With this option, each block (i.e., each ZIP file) will contain a single file per channel; all of the data for a channel over the time range of the block will be stored in this single file. With packed blocks, timestamps are interpolated over the data block (i.e., individual point times are not stored), so packed data can save considerable storage space. In addition to storage space savings, CloudTurbine access to packed channel data is very efficient. Since individual point times are not saved with the data, packing is best suited for periodically sampled data with regular “delta-T” time steps between samples.
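The linear timestamp assignment over a packed block can be sketched as follows, assuming (as one plausible convention) that the block’s first and last point times are known and the points are spaced evenly between them. The class and method names are invented for this illustration; this is not HDF5toCT’s implementation.

```java
// Sketch: assign linearly interpolated timestamps to n points spanning
// a packed block from startSec to endSec (endpoints included), so that
// individual point times need not be stored. Illustrative only.
class PackedTimes {
    static double[] interpolate(double startSec, double endSec, int n) {
        double[] t = new double[n];
        if (n == 1) {
            t[0] = startSec;
            return t;
        }
        // Regular "delta-T" step between consecutive samples.
        double dt = (endSec - startSec) / (n - 1);
        for (int i = 0; i < n; i++) {
            t[i] = startSec + i * dt;
        }
        return t;
    }
}
```

Because only the block boundaries are needed to recover every point time, the interpolation is exact for periodically sampled data and only approximate for irregularly sampled data.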