About Workspace metadata

As described elsewhere, all analyses occur in the context of a user's Workspace, which lets users organize files and analyses logically. Because a Workspace may contain multiple files, we must track metadata that spans those data resources and is maintained at the level of the user's Workspace. This metadata, which includes information about samples (Observations), genes (Features), and possibly other annotation data, typically needs to be supplied to analyses/Operations.

A Workspace can have multiple file/Resource objects associated with it, each with its own metadata. We therefore define "workspace metadata" as the union of the individual Resources' metadata.

Consider two Resource instances in a Workspace. The first (Resource "A") is data generated by a user and has six samples, which we will denote as S1,...,S6; S1-S3 are wild-type and S4-S6 are mutant. The ObservationSet associated with Resource A could look like:

{
    "elements": [
        {
            "id": "S1",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "WT"
                }
            }
        },
        ...
        {
            "id": "S6",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        }
    ]
}

The other (Resource B) is public-domain data and also has six samples, which we will denote as P1,...,P6. The ObservationSet associated with Resource B could look like:

{
    "elements": [
        {
            "id": "P1",
            "attributes": {}
        },
        ...
        {
            "id": "P6",
            "attributes": {}
        }
    ]
}

(For brevity, the samples in this example do not carry any annotations/attributes.)

Now, as far as the Workspace is concerned, there are 12 Observation instances, obtained by taking the union of the Observations contained in each Resource's ObservationSet.
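
In code, this union might be sketched as follows. The function name and the merge policy for repeated sample ids are illustrative, not a description of the actual backend implementation:

```python
def union_observation_sets(*observation_sets):
    """Merge the "elements" lists of several ObservationSets, keyed by sample id."""
    merged = {}
    for obs_set in observation_sets:
        for element in obs_set["elements"]:
            # For a repeated sample id, later attributes extend earlier ones.
            entry = merged.setdefault(element["id"], {"id": element["id"], "attributes": {}})
            entry["attributes"].update(element.get("attributes", {}))
    return {"elements": list(merged.values())}

# Resource A: S1-S6 with genotype attributes; Resource B: P1-P6, unannotated.
resource_a = {"elements": [
    {"id": f"S{i}",
     "attributes": {"genotype": {"attribute_type": "String",
                                 "value": "WT" if i <= 3 else "mutant"}}}
    for i in range(1, 7)]}
resource_b = {"elements": [{"id": f"P{i}", "attributes": {}} for i in range(1, 7)]}

workspace_metadata = union_observation_sets(resource_a, resource_b)
```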

In the course of performing an analysis, the user might wish to create meaningful "groups" of these samples. Perhaps they merge the two count matrices underlying Resources A and B, and perform a principal component analysis (PCA) on the output. They then note a clustering of the samples which they perceive as meaningful:

(Via the dynamic user interface, we imagine the user selecting the five samples in the grey ellipse: two of the public "P" samples, P3 and P4, cluster with the mutant samples S4-S6.) They can then choose to create a new ObservationSet from those five samples:

{
    "elements": [
        {
            "id": "S4",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        },
        {
            "id": "S5",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        },
        {
            "id": "S6",
            "attributes": {
                "genotype": {
                    "attribute_type": "String",
                    "value": "mutant"
                }
            }
        },
        {
            "id": "P3",
            "attributes": {}
        },        
        {
            "id": "P4",
            "attributes": {}
        }
    ]
}
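
Such a selection amounts to filtering the workspace's Observations by sample id, carrying each Observation's attributes along unchanged. A minimal sketch (function and variable names are hypothetical):

```python
def subset_observation_set(full_set, selected_ids):
    """Create a new ObservationSet from a chosen subset of sample ids."""
    chosen = set(selected_ids)
    return {"elements": [e for e in full_set["elements"] if e["id"] in chosen]}

# The workspace's Observations: three annotated "S" samples plus the public "P" samples.
all_observations = {"elements": (
    [{"id": f"S{i}",
      "attributes": {"genotype": {"attribute_type": "String", "value": "mutant"}}}
     for i in (4, 5, 6)]
    + [{"id": f"P{i}", "attributes": {}} for i in range(1, 7)]
)}

# The five samples selected from the PCA plot.
pca_cluster = subset_observation_set(all_observations, ["S4", "S5", "S6", "P3", "P4"])
```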

This information regarding user-defined groupings can be cached client-side; the API will not store the additional grouping information that the user has defined. However, the ultimate "source" of the metadata is the Workspace, which maintains the ObservationSets, FeatureSets, and possibly other metadata. The creation and visualization of custom ObservationSets is merely a convenience provided by the front-end (if available). After all, in direct requests to the API for analyses that require ObservationSets, the requester can create those at will.

Using the metadata for analyses

After the user has created their own ObservationSet instances, they can use them for analyses such as a differential expression analysis. For instance, the inputs to such an analysis would be an expression matrix (perhaps the result of merging the "S" and "P" samples/Observations) and two ObservationSet instances. The payload to start such an analysis (sent to /api/operations/run/) would look something like:

{
    "operation_id": <UUID for Operation>,
    "workspace_id": <UUID for Workspace>,
    "inputs": {
        "count_matrix": <UUID for merged Resource>,
        "groupA": <ObservationSet with S4,S5,S6,P3,P4>,
        "groupB": <ObservationSet with S1,S2,S3,P5,P6>
    }
}
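
A client could assemble this payload programmatically before POSTing it to /api/operations/run/. A sketch, where the helper function is hypothetical and the UUID strings are placeholders; note that the input keys ("count_matrix", "groupA", "groupB") are specific to the Operation being run:

```python
def build_run_payload(operation_id, workspace_id, count_matrix_id, group_a, group_b):
    """Assemble the JSON body for /api/operations/run/.

    The keys under "inputs" are declared by the Operation itself;
    other Operations will expect different input names.
    """
    return {
        "operation_id": operation_id,
        "workspace_id": workspace_id,
        "inputs": {
            "count_matrix": count_matrix_id,
            "groupA": group_a,
            "groupB": group_b,
        },
    }

payload = build_run_payload(
    "operation-uuid", "workspace-uuid", "merged-resource-uuid",
    {"elements": [{"id": s, "attributes": {}} for s in ("S4", "S5", "S6", "P3", "P4")]},
    {"elements": [{"id": s, "attributes": {}} for s in ("S1", "S2", "S3", "P5", "P6")]},
)
```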

Additional user-supplied metadata

Finally, users might want to add further annotations to their metadata. Still working with Resource instances A and B, we could upload an additional Resource of type "ANN" (for annotation) and add it to this Workspace. For instance, it might look like:

sample  sex  age
S1      M    43
S2      F    44
S3      F    54
S4      F    33
S5      M    65
S6      F    58

This annotation data would then be incorporated into the existing Observation instances (from which users could create further custom groups), so they would now look like:

{
    "elements": [
        {
            "id": "S1",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 43
                    },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "M"
                    }
            }
        },
        ...
        {
            "id": "S6",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 58
                    },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "F"
                    }
            }
        }
    ]
}
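
The incorporation of annotation rows into Observations can be sketched as below. The type mapping here is hard-coded per column for illustration only; how the backend actually infers attribute types (e.g. Integer vs. UnrestrictedString) when validating the annotation file is not specified here:

```python
import csv
import io

# A fragment of the tab-delimited "ANN" Resource from the example above.
ANNOTATIONS = """sample\tsex\tage
S1\tM\t43
S6\tF\t58
"""

def merge_annotations(observation_set, ann_text):
    """Fold rows of an annotation file into the matching Observations."""
    rows = {r["sample"]: r
            for r in csv.DictReader(io.StringIO(ann_text), delimiter="\t")}
    for element in observation_set["elements"]:
        row = rows.get(element["id"])
        if row is None:
            continue  # e.g. the public "P" samples, which have no annotation rows
        element["attributes"]["age"] = {"attribute_type": "Integer",
                                        "value": int(row["age"])}
        element["attributes"]["sex"] = {"attribute_type": "UnrestrictedString",
                                        "value": row["sex"]}
    return observation_set

obs = {"elements": [{"id": "S1", "attributes": {}},
                    {"id": "P1", "attributes": {}}]}
merged = merge_annotations(obs, ANNOTATIONS)
```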

Backend endpoints

To provide a "single source of truth", there will be a "workspace metadata" endpoint at /api/workspace/<UUID>/metadata/observations/ which tracks the union of all the Resource metadata (specifically the Observations) in the Workspace. Note that there is currently no corresponding endpoint for querying Features, since there are often too many of them for such queries to be feasible.

The front-end will maintain the various user selections (formerly "sample sets", now ObservationSets), but the full set of available Observation instances will be kept on the backend.

Using the example above, a request to /api/workspace/<UUID>/metadata/observations/ would return:

{
    "elements": [
        {
            "id": "S1",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 43
                    },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "M"
                    }
            }
        },
        ...
        {
            "id": "S6",
            "attributes": {
                "age": {
                    "attribute_type": "Integer",
                    "value": 58
                    },
                "sex": {
                    "attribute_type": "UnrestrictedString",
                    "value": "F"
                    }
            }
        },
        {
            "id": "P1",
            "attributes": {}
        }, 
        ...
        {
            "id": "P6",
            "attributes": {}
        }
    ]
}
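
A front-end consuming this response could, for example, partition the Observations by an attribute value to propose candidate ObservationSets to the user. A hypothetical sketch of such client-side grouping:

```python
from collections import defaultdict

def group_by_attribute(workspace_metadata, attr_name):
    """Partition a /metadata/observations/ response into candidate
    ObservationSets keyed by the value of one attribute.

    Samples lacking the attribute (e.g. the public P1-P6 samples)
    are collected under the None key.
    """
    groups = defaultdict(list)
    for element in workspace_metadata["elements"]:
        attr = element["attributes"].get(attr_name)
        groups[attr["value"] if attr else None].append(element)
    return {value: {"elements": elements} for value, elements in groups.items()}

# A reduced version of the response shown above.
metadata = {"elements": [
    {"id": "S1", "attributes": {"sex": {"attribute_type": "UnrestrictedString", "value": "M"}}},
    {"id": "S2", "attributes": {"sex": {"attribute_type": "UnrestrictedString", "value": "F"}}},
    {"id": "P1", "attributes": {}},
]}
by_sex = group_by_attribute(metadata, "sex")
```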