About Workspace metadata
As described elsewhere, all analyses occur in the context of a user's `Workspace`; the `Workspace` allows users to organize files and analyses logically. To operate on the potentially many files contained in a user's `Workspace`, we must track metadata that spans the data resources and is maintained at the level of the user's `Workspace`. We typically need to provide this metadata to analyses/`Operation`s, including information about samples (`Observation`s), genes (`Feature`s), and possibly other annotation data.
A `Workspace` can have multiple file/`Resource` objects associated with it, each of which has its own unique metadata. We therefore conceive of "workspace metadata", which is composed from the union of the individual `Resource`s' metadata.
Consider two `Resource` instances in a `Workspace`. The first (`Resource` "A") is data generated by a user and has six samples, which we will denote S1,...,S6; S1-S3 are wild-type and S4-S6 are mutant. The `ObservationSet` associated with `Resource` A could look like:
```
{
    "elements": [
        {
            "id": "S1",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "WT"
                }
            }
        },
        ...
        {
            "id": "S6",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        }
    ]
}
```
The other (`Resource` B) is public-domain data and also has six samples, which we will denote P1,...,P6. The `ObservationSet` associated with `Resource` B could look like:
```
{
    "elements": [
        {
            "id": "P1",
            "attributes": {}
        },
        ...
        {
            "id": "P6",
            "attributes": {}
        }
    ]
}
```
(Note that for simplicity/brevity, the samples in this example do not have any annotations/attributes.)
Now, as far as the `Workspace` is concerned, there are 12 `Observation` instances, obtained by taking the union of the `Observation`s contained in the `ObservationSet` associated with each `Resource`.
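This union can be sketched as follows; `union_observation_sets` is an illustrative helper, not part of the actual API implementation:

```python
# Illustrative sketch (not the actual API code) of taking the union of each
# Resource's ObservationSet at the Workspace level.
def union_observation_sets(*observation_sets):
    """Merge ObservationSets by element id; attributes are combined."""
    merged = {}
    for obs_set in observation_sets:
        for element in obs_set["elements"]:
            entry = merged.setdefault(
                element["id"], {"id": element["id"], "attributes": {}}
            )
            entry["attributes"].update(element.get("attributes", {}))
    return {"elements": list(merged.values())}

# Resource A: six user-generated samples with genotype annotations.
resource_a = {"elements": [
    {"id": f"S{i}",
     "attributes": {"genotype": {"attribute_type": "String",
                                 "value": "WT" if i <= 3 else "mutant"}}}
    for i in range(1, 7)]}
# Resource B: six unannotated public samples.
resource_b = {"elements": [
    {"id": f"P{i}", "attributes": {}} for i in range(1, 7)]}

workspace_metadata = union_observation_sets(resource_a, resource_b)
print(len(workspace_metadata["elements"]))  # 12
```

Merging by element id also means that if two `Resource`s describe the same sample, their attributes are combined rather than duplicated.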
In the course of performing an analysis, the user might wish to create meaningful "groups" of these samples. Perhaps they merge the two count matrices underlying `Resource`s A and B and perform a principal component analysis (PCA) on the output. They then note a clustering of the samples which they perceive as meaningful:
(Via the dynamic user interface, we imagine the user selecting the five samples in the grey ellipse: the mutant samples S4-S6 plus two of the public "P" samples, P3 and P4, which cluster with them.) They can then choose to create a new `ObservationSet` from those five samples:
```
{
    "elements": [
        {
            "id": "S4",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        },
        {
            "id": "S5",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        },
        {
            "id": "S6",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        },
        {
            "id": "P3",
            "attributes": {}
        },
        {
            "id": "P4",
            "attributes": {}
        }
    ]
}
```
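Building such a custom `ObservationSet` amounts to selecting a subset of the workspace-level elements by id, which can be sketched as follows (the `select_observations` helper is hypothetical, not part of the API):

```python
# Hypothetical helper: build a custom ObservationSet from a selection of
# element ids, preserving the workspace-level metadata for each element.
def select_observations(workspace_metadata, ids):
    wanted = set(ids)
    return {"elements": [e for e in workspace_metadata["elements"]
                         if e["id"] in wanted]}

workspace_metadata = {"elements": (
    [{"id": f"S{i}",
      "attributes": {"genotype": {"attribute_type": "String",
                                  "value": "WT" if i <= 3 else "mutant"}}}
     for i in range(1, 7)]
    + [{"id": f"P{i}", "attributes": {}} for i in range(1, 7)])}

# The five samples selected from the grey ellipse in the PCA plot.
cluster_set = select_observations(workspace_metadata,
                                  ["S4", "S5", "S6", "P3", "P4"])
print([e["id"] for e in cluster_set["elements"]])  # ['S4', 'S5', 'S6', 'P3', 'P4']
```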
This information regarding user-defined groupings can be cached client-side; the API will not retain the additional grouping information the user has defined. However, the ultimate "source" of the metadata is the `Workspace`, which maintains the `ObservationSet`s, `FeatureSet`s, and possibly other metadata. The creation and visualization of custom `ObservationSet`s is merely a convenience provided by the front-end (if available); after all, in direct requests to the API for analyses that require `ObservationSet`s, the requester can create those at will.
Using the metadata for analyses
After the user has created their own `ObservationSet` instances, they can use them for analyses such as differential expression. For instance, the inputs to such an analysis would be an expression matrix (perhaps the result of merging the "S" and "P" samples/`Observation`s) and two `ObservationSet` instances. The payload to start such an analysis (sent to `/api/operations/run/`) would look something like:
```
{
    "operation_id": <UUID for Operation>,
    "workspace_id": <UUID for Workspace>,
    "inputs": {
        "count_matrix": <UUID for merged Resource>,
        "groupA": <ObservationSet with S4,S5,S6,P3,P4>,
        "groupB": <ObservationSet with S1,S2,S3,P5,P6>
    }
}
```
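Assembling this payload client-side might look like the following sketch; the UUID strings are placeholders, and authentication and the actual HTTP request are omitted:

```python
import json

# Sketch of building the request body for /api/operations/run/.
# The angle-bracketed strings stand in for real UUIDs.
def build_run_payload(operation_id, workspace_id, count_matrix_id,
                      group_a, group_b):
    return {
        "operation_id": operation_id,
        "workspace_id": workspace_id,
        "inputs": {
            "count_matrix": count_matrix_id,
            "groupA": group_a,
            "groupB": group_b,
        },
    }

group_a = {"elements": [{"id": s, "attributes": {}}
                        for s in ["S4", "S5", "S6", "P3", "P4"]]}
group_b = {"elements": [{"id": s, "attributes": {}}
                        for s in ["S1", "S2", "S3", "P5", "P6"]]}
payload = build_run_payload("<operation-uuid>", "<workspace-uuid>",
                            "<resource-uuid>", group_a, group_b)
body = json.dumps(payload)  # the JSON string sent as the POST body
print(sorted(payload["inputs"]))  # ['count_matrix', 'groupA', 'groupB']
```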
Additional user-supplied metadata
Finally, users might want to add further annotations to their metadata. For instance, assuming we are still working with `Resource` instances A and B, we could upload an additional `Resource` with type "ANN" (for annotation) and add it to this `Workspace`. It might look like:
| sample | sex | age |
|---|---|---|
| S1 | M | 43 |
| S2 | F | 44 |
| S3 | F | 54 |
| S4 | F | 33 |
| S5 | M | 65 |
| S6 | F | 58 |
Users can then create custom groups of samples based on these annotations. The new attributes would be incorporated into the existing `Observation` instances, which would now look like:
```
{
    "elements": [
        {
            "id": "S1",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 43
                },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "M"
                }
            }
        },
        ...
        {
            "id": "S6",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 58
                },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "F"
                }
            }
        }
    ]
}
```
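Folding an annotation file into the existing `Observation` instances could be sketched like this; the `Integer`-vs-`UnrestrictedString` inference mirrors the attribute types shown above, though the real backend logic may differ:

```python
import csv
import io

# Two of the six annotation rows, for brevity (tab-separated).
ANNOTATION_TSV = """sample\tsex\tage
S1\tM\t43
S6\tF\t58
"""

def typed_attribute(raw_value):
    """Guess the attribute_type from the raw string value."""
    try:
        return {"attribute_type": "Integer", "value": int(raw_value)}
    except ValueError:
        return {"attribute_type": "UnrestrictedString", "value": raw_value}

def merge_annotations(observation_set, tsv_text):
    """Fold annotation rows into the matching Observation elements."""
    by_id = {e["id"]: e for e in observation_set["elements"]}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        element = by_id.get(row.pop("sample"))
        if element is not None:
            element["attributes"].update(
                {name: typed_attribute(value) for name, value in row.items()})
    return observation_set

obs = {"elements": [{"id": "S1", "attributes": {}},
                    {"id": "S6", "attributes": {}}]}
merged = merge_annotations(obs, ANNOTATION_TSV)
print(merged["elements"][0]["attributes"]["age"])  # {'attribute_type': 'Integer', 'value': 43}
```

Rows whose sample id does not match any existing `Observation` are simply skipped in this sketch; the real implementation may instead flag them as an error.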
Backend endpoints
To provide a "single source of truth", there will be a "workspace metadata" endpoint at `/api/workspace/<UUID>/metadata/observations/`, which will track the union of all the `Resource` metadata (specifically `Observation`s) in the `Workspace`. Note that there is currently no endpoint for querying `Feature`s, since there are often too many for this to be feasible.
The front-end will maintain the various user selections (formerly "sample sets", now `ObservationSet`s), but the full set of available `Observation` instances will be kept on the backend.
Using the example above, a request to `/api/workspace/<UUID>/metadata/observations/` would return:
```
{
    "elements": [
        {
            "id": "S1",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 43
                },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "M"
                }
            }
        },
        ...
        {
            "id": "S6",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 58
                },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "F"
                }
            }
        },
        {
            "id": "P1",
            "attributes": {}
        },
        ...
        {
            "id": "P6",
            "attributes": {}
        }
    ]
}
```
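From a client's perspective, calling this endpoint could look like the following sketch; the base URL is a placeholder and the response text is canned, since a real client would issue an authenticated GET request:

```python
import json
from urllib.parse import urljoin

# Build the workspace metadata endpoint URL for a given Workspace UUID.
def observations_url(base_url, workspace_uuid):
    return urljoin(base_url,
                   f"/api/workspace/{workspace_uuid}/metadata/observations/")

url = observations_url("https://example.org", "<workspace-uuid>")
# A canned (abbreviated) response standing in for the real server reply.
canned_response = ('{"elements": [{"id": "S1", "attributes": {}}, '
                   '{"id": "P6", "attributes": {}}]}')
metadata = json.loads(canned_response)
print(url)
print([e["id"] for e in metadata["elements"]])  # ['S1', 'P6']
```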