Resource metadata
Metadata can be associated with any type of DataResource.
Note that a DataResource is related to, but distinct from, a Resource. The latter tracks the various file-based resources in the database; it knows the file location, the size, and the type of the resource (as a string field). Since it represents a database table, it does not perform validation or similar checks on the actual files that hold the data. The DataResource, by contrast, is a base class from which the many specialized "types" of resources derive. For instance, an IntegerMatrix derives from DataResource. Thus, instead of being a database record, a DataResource captures the expected format and behavior of the resource. For instance, the child classes of DataResource contain validators and parsers.
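The distinction can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names on Resource, the validate method, the tab-delimited layout, and the rule that an IntegerMatrix has a header row and a leading row-name column are hypothetical and are not taken from the actual implementation.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for the database record: it only tracks the file."""
    path: str
    size: int
    resource_type: str  # the type is stored as a plain string


class DataResource:
    """Base class describing the expected format and behavior of a resource."""

    def validate(self, text: str) -> bool:
        raise NotImplementedError


class IntegerMatrix(DataResource):
    """Assumed rule: every value cell must parse as an integer."""

    def validate(self, text: str) -> bool:
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        if len(rows) < 2:
            return False  # need a header plus at least one data row
        # Skip the header row and the first (row-name) column.
        for row in rows[1:]:
            for cell in row[1:]:
                try:
                    int(cell)
                except ValueError:
                    return False
        return True


good = "gene\tS1\tS2\ngeneA\t5\t0\ngeneB\t12\t7\n"
bad = "gene\tS1\tS2\ngeneA\t5.5\t0\n"

record = Resource(path="/data/counts.tsv", size=len(good), resource_type="integer_matrix")
print(IntegerMatrix().validate(good), IntegerMatrix().validate(bad))
```

The record knows only where the file is and what it claims to be; checking the file contents is left to the class that models the type.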
Associated with each DataResource is some metadata. The specification may expand to incorporate additional fields, but at minimum, it should contain:
- An ObservationSet. For a FastQ file representing a single sample (the most common case), the ObservationSet would have a single item (of type Observation) containing information about that particular sample. For a count matrix of size (p, N), the ObservationSet would have N items (again, of type Observation) giving information about the samples in the columns.
- A FeatureSet. This is a collection of covariates corresponding to a single Observation. A Feature is something that is measured (e.g. read counts for a gene). For a count matrix of size (p, N), the FeatureSet would have p items (of type Feature), corresponding to the p genes measured for a single sample. For a sequence-based file like a FastQ, this would simply be null; there may be alternative interpretations of this concept, but the point is that the field can be null. A table of differentially expressed genes would have a FeatureSet but not an ObservationSet; in this case the Features are the genes, and we are given information like log-fold change and p-value.
- A parent operation. As an analysis workflow can be represented as a directed acyclic graph (DAG), we would like to track the flow of data and the operations on the data. Tracking the "parent" of a DataResource allows us to determine which operation generated the data and hence reconstruct the full DAG. The original input files would have a null parent.
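The three fields can be sketched as plain data classes. This is a rough illustration under stated assumptions: the class ResourceMetadata, the Operation record with its inputs list, the lineage helper, and all sample, gene, and operation names are hypothetical and do not come from the specification. Sets are modeled as simple lists, and null is modeled as None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    id: str  # e.g. a sample name


@dataclass
class Feature:
    id: str  # e.g. a gene name


@dataclass
class Operation:
    name: str
    inputs: List["ResourceMetadata"]  # resources the operation consumed


@dataclass
class ResourceMetadata:
    name: str
    observation_set: Optional[List[Observation]]
    feature_set: Optional[List[Feature]]
    parent_operation: Optional[Operation] = None  # None for original inputs


def lineage(meta: ResourceMetadata) -> List[str]:
    """Return operation names from the earliest ancestor to the producer of `meta`."""
    if meta.parent_operation is None:
        return []
    names: List[str] = []
    for upstream in meta.parent_operation.inputs:
        for name in lineage(upstream):
            if name not in names:
                names.append(name)
    names.append(meta.parent_operation.name)
    return names


# Two FastQ files: one Observation each, no FeatureSet, no parent.
fastq_a = ResourceMetadata("a.fastq", [Observation("sampleA")], None)
fastq_b = ResourceMetadata("b.fastq", [Observation("sampleB")], None)

# A count matrix of size (p, N) = (3, 2): three Features, two Observations.
genes = [Feature("gene1"), Feature("gene2"), Feature("gene3")]
quantify = Operation("quantify", [fastq_a, fastq_b])
counts = ResourceMetadata(
    "counts.tsv", [Observation("sampleA"), Observation("sampleB")], genes, quantify
)

# A differential expression table: Features only, no ObservationSet.
dge_op = Operation("differential_expression", [counts])
dge = ResourceMetadata("dge.tsv", None, genes, dge_op)

print(lineage(dge))
```

Following the parent links from the final table back to the inputs recovers the order of operations, which is the part of the DAG that produced that resource.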