Data Storage¶
Processed data files can be stored on-disk using the GcmsStore object
located in the gcmstools.datastore module. Not only does this create a
convenient storage solution for processed data sets, it is also necessary when
running calibrations on a group of related data sets. The file is an HDF
file, which is an open-source, high-performance data storage container
optimized for numerical data. Creation and manipulation of this file is
controlled using a combination of two Python libraries: PyTables and
Pandas. PyTables provides a high-level interface to create and modify HDF
files, and Pandas is a very powerful package for working with tabular data.
Both of these projects have extensive documentation of their many advanced
features, so little detail on their usage is provided here.
About GcmsStore Implementation¶
The GcmsStore object is a subclass of the Pandas HDFStore object, so
it will contain all of the functions described for that object. The
GcmsStore class simply adds a number of custom functions specific to GCMS
data sets.
Create/Open the Container¶
A gcmstools GcmsStore object must be created with a file name argument.
If a file with this name already exists, it will be opened for appending or
modification. The default behavior is to compress all the data going into this
file using the ‘blosc’ compression library and the highest compression level
(9). See the Pandas HDFStore documentation for other accepted keyword
arguments, especially the compression arguments if different values are
required.
In : from gcmstools.datastore import GcmsStore
In : h5 = GcmsStore('data.h5')
Closing the File¶
In general, you will want to close the HDF file when you’re done, although this is not strictly necessary.
In : h5.close() # Only do this when you're done
Recompressing the HDF File¶
HDF files are designed to be written once and read many times. If you are
repeatedly adding new files to the HDF storage container, the file size may
become much larger than seems necessary. You can recompress the file using the
compress method (which first closes the HDF file).
In : h5.compress() # This closes the file as well.
Adding Data¶
Adding files to this storage container is done using the append_files
method, which can take either a single data object or a list of objects if
you have many objects to add at one time.
In : h5.append_files(data)
HDF Appending: datasample1.CDF
In : h5.append_files([otherdata1, otherdata2])
HDF Appending: otherdata1.CDF
HDF Appending: otherdata2.CDF
Data files can be added at any stage of the processing chain; however, the calibration process will not work properly if you don’t reference/fit the data first. You can also add an already existing data file. The GcmsStore object will check whether that file differs from the saved version before overwriting the existing object. If it is unchanged, the file will be skipped.
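One simple way to implement a changed-file check like the one described above is to compare content fingerprints. This is a purely illustrative sketch (gcmstools may use a different comparison internally); the `fingerprint` helper and the byte strings are hypothetical:

```python
import hashlib

def fingerprint(data_bytes):
    # Hash the raw contents so two versions can be compared cheaply.
    # Hypothetical helper; not part of the gcmstools API.
    return hashlib.sha1(data_bytes).hexdigest()

stored = fingerprint(b'time,intensity\n0.1,1.0\n')
incoming = fingerprint(b'time,intensity\n0.1,1.0\n')

# Identical fingerprints mean the data set is unchanged and can be skipped.
if stored == incoming:
    print('Skipping: unchanged')
```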
Viewing the File List¶
You can see a list of the files that are stored in this file by viewing the
files attribute, which is a Pandas DataFrame.
In : h5.files
name filename
0 datasample1 datasample1.CDF
1 otherdata1 otherdata1.CDF
2 otherdata2 otherdata2.CDF
There are two name columns in this table: “name” and “filename”. The latter is the full file and path name as given when the GcmsFile object was created. Keep in mind that the path information may not be correct if you’ve moved the location of this storage file. In order to efficiently store the data on disk, the full file name is internally simplified to the “name”. This simplification removes the path and file extension from the file name. In addition, it replaces all “.”, “-”, and space characters with “_”. If the file name starts with a number, the prefix “num” is added.
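The simplification rules just described can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the actual gcmstools code, and the internal function may differ in detail:

```python
import os
import re

def simplify_name(filename):
    # Sketch of the name simplification rules described above
    # (hypothetical; gcmstools' internal version may differ).
    base = os.path.basename(filename)      # drop the path
    base = os.path.splitext(base)[0]       # drop the file extension
    base = re.sub(r'[.\- ]', '_', base)    # replace ".", "-", and spaces
    if base and base[0].isdigit():         # numeric start gets a "num" prefix
        base = 'num' + base
    return base
```

For example, `simplify_name('/data/2014-05 sample.CDF')` would give `'num2014_05_sample'`, while `'datasample1.CDF'` simplifies to `'datasample1'` as shown in the table above.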
Warning
You will encounter problems if two or more file names simplify to the same “name”. However, if your file naming system does not produce unique names for different data sets, you will most certainly have bigger problems in the long run than just using these programs.
Extracting Stored Data¶
You can extract data from the storage file using the extract_gcms method.
This function takes one argument: the name of the dataset that you want to
extract. This name can be either the simplified name or the full filename
(with or without the path). The extracted data is the same type of file
object that you stored originally.
In : extracted = h5.extract_gcms('datasample1')
In : extracted.filetype
Out: "AiaFile"
Stored Data Tables¶
This HDF data file may contain a number of Pandas data tables (DataFrames)
with information about the files, calibration, etc. A list of currently
available tables can be obtained by directly examining the GcmsStore
instance. (Note: you won’t see these attributes using tab completion.)
In : h5
Out:
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/calibration frame (shape->[6,8])
/calinput frame (shape->[30,9])
/datacal frame (shape->[49,6])
/files frame (shape->[1,2])
Directly viewing these tables is trivial.
In : h5.calibration
Out:
Start Stop Standard slope intercept r \
Compound
benzene 2.9 3.5 NaN 38629.931565 -367129.586850 0.998767
phenol 14.6 15.1 NaN 30248.192619 65329.897933 0.999136
...
p stderr
Compound
benzene 0.000052 1108.344872
phenol 0.000030 726.257380
...
More information on using these tables is provided in Appendix B.
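The slope, intercept, r, p, and stderr columns in the calibration table are the outputs of a linear least-squares fit. As a rough illustration of where the first three values come from (this is a generic sketch with hypothetical data, not the gcmstools calibration routine):

```python
import math

def linear_fit(xs, ys):
    # Ordinary least-squares fit y = slope*x + intercept, plus the
    # correlation coefficient r, analogous to the calibration columns.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical concentration/peak-area pairs for a single compound:
slope, intercept, r = linear_fit([1, 2, 3, 4], [10.1, 19.8, 30.2, 39.9])
# slope ≈ 9.98, intercept ≈ 0.05, r ≈ 0.9999
```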