Elements, Observations, and Features

We adopt the statistical learning convention of referring to the Observations and Features of data. Both data structures derive from the BaseElement class, which captures their common structure and behavior. Behavior specific to each can be overridden in the child classes.

In the context of a biological experiment, Observations are synonymous with samples. Further, each Observation can have Features associated with it (e.g. gene expression values for 30,000 genes). One can think of Observations and Features as comprising the columns and rows of a two-dimensional matrix. Note that in our convention, due to the typical format of expression matrices, we take each column to represent an Observation and each row to represent a Feature.
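The orientation convention can be illustrated with a small, plain-Python sketch. The sample and gene names, the values, and the two helper functions are hypothetical and are not part of WebMEV; they only show that a column holds one Observation and a row holds one Feature.

```python
# Hypothetical toy expression matrix: rows are Features (genes),
# columns are Observations (samples).
observation_ids = ["sampleA", "sampleB", "sampleC"]
feature_ids = ["geneX", "geneY"]
matrix = [
    [10.0, 12.5, 9.1],  # geneX measured in each of the three samples
    [0.0, 3.2, 1.7],    # geneY measured in each of the three samples
]


def observation_column(matrix, observation_ids, obs_id):
    """Return every Feature value for one Observation (one column)."""
    j = observation_ids.index(obs_id)
    return [row[j] for row in matrix]


def feature_row(matrix, feature_ids, feat_id):
    """Return one Feature's values across all Observations (one row)."""
    return matrix[feature_ids.index(feat_id)]


sample_b = observation_column(matrix, observation_ids, "sampleB")
gene_y = feature_row(matrix, feature_ids, "geneY")
```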

We use Observations and Features to hold metadata (as key-value pairs) about the data that we are manipulating in WebMEV. For instance, given a typical gene expression matrix, we know only the names of the Observations/samples and Features/genes. We can then specify attributes to annotate the Observations and Features, allowing users to define experimental groups or specify other information useful for visualization or filtering.

These data structures have similar (if not exactly the same) behavior, but we keep them separate for future compatibility in case specialization of either class is needed.

class data_structures.element.BaseElement(val, **kwargs)

A BaseElement is a base class from which we derive both Observation and Feature. For the purposes of clarity and potential customization, we keep those entities separate.

As a type of attribute, an Element (using an Observation below) would look like:

{
    "id": <string identifier>,
    "attributes": {
        "keyA": <Attribute>,
        "keyB": <Attribute>
    }
}

We require that all Element instances be created with an identifier. Equality (e.g. in set operations) is checked using this identifier member.
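One way identifier-based equality can behave is sketched below. SketchElement is a hypothetical stand-in written for illustration, not the WebMEV BaseElement implementation; the constructor signature and error type are assumptions.

```python
class SketchElement:
    """Illustrative stand-in; NOT the actual BaseElement class."""

    def __init__(self, identifier, attributes=None):
        # Assumption: a missing or empty identifier is rejected.
        if not identifier:
            raise ValueError("An element requires an identifier.")
        self.id = identifier
        self.attributes = attributes or {}

    def __eq__(self, other):
        if not isinstance(other, SketchElement):
            return NotImplemented
        # Only the identifier participates in the comparison.
        return self.id == other.id

    def __hash__(self):
        # Hash on the identifier so that sets and dict keys agree with __eq__.
        return hash(self.id)


a = SketchElement("sampleA", {"stage": "IV"})
b = SketchElement("sampleA", {"stage": "II"})
c = SketchElement("sampleB")

# a and b share an identifier, so the set keeps only one of them.
unique = {a, b, c}
```

Under these assumptions, two elements with the same identifier but different attributes compare equal, so set operations such as union or intersection treat them as the same element.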

Each nested attribute is an object that describes a simple attribute. For instance:

{
    "id": <string identifier>,
    "attributes": {
        "stage": {
            "attribute_type": "String",
            "value": "IV"
        },
        "age": {
            "attribute_type": "PositiveInteger",
            "value": 5
        }
    }
}
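As a rough illustration of how the nested attribute objects above might be checked, the following sketch validates the two attribute types shown in the example. The function validate_attribute and its rules are assumptions made for illustration; WebMEV's own attribute classes may apply different or additional checks.

```python
def validate_attribute(attr):
    """Return True if the value is consistent with its attribute_type.

    Only the two types from the example above are handled, and the
    checks themselves are assumptions.
    """
    attr_type = attr["attribute_type"]
    value = attr["value"]
    if attr_type == "String":
        return isinstance(value, str)
    if attr_type == "PositiveInteger":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and value > 0
        )
    raise ValueError(f"Unhandled attribute_type: {attr_type}")


element = {
    "id": "sampleA",
    "attributes": {
        "stage": {"attribute_type": "String", "value": "IV"},
        "age": {"attribute_type": "PositiveInteger", "value": 5},
    },
}

results = {
    key: validate_attribute(attr)
    for key, attr in element["attributes"].items()
}
```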

The nested "attributes" dict CAN be empty.

In situations like annotation tables, where some rows may lack values that other rows have, null attributes are permitted only if the constructor is explicitly passed the permit_null_attributes kwarg.
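The opt-in behavior of permit_null_attributes might look like the sketch below. Only the kwarg name comes from the description above; the helper build_element, the use of None for a null value, and the ValueError are assumptions and do not reflect the actual constructor.

```python
def build_element(identifier, attributes, permit_null_attributes=False):
    """Hypothetical helper: reject null values unless explicitly permitted."""
    for key, attr in attributes.items():
        if attr["value"] is None and not permit_null_attributes:
            raise ValueError(f"Attribute '{key}' has a null value.")
    return {"id": identifier, "attributes": attributes}


# An annotation-table row in which the "stage" column is empty.
row = {"stage": {"attribute_type": "String", "value": None}}

# Explicitly opting in allows the null value through.
element = build_element("sampleA", row, permit_null_attributes=True)

# Without the kwarg, the same row is rejected.
try:
    build_element("sampleA", row)
    rejected = False
except ValueError:
    rejected = True
```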

class data_structures.observation.Observation(val, **kwargs)

An Observation is the generalization of a "sample" in the typical context of biological studies. One may think of samples and observations as interchangeable concepts. We call it an observation so that we are not limited by this convention, however.

Observation instances act as metadata and can be used to filter and subset the data to which they are attached.

An Observation is structured as:

{
    "id": <string identifier>,
    "attributes": {
        "keyA": <Attribute>,
        "keyB": <Attribute>
    }
}

class data_structures.feature.Feature(val, **kwargs)

A Feature can also be referred to as a covariate or variable. These are measurements one can make about an Observation. For example, in the genomics context, a sample can have 30,000+ genes which we call "features" here. In the statistical learning context, these are feature vectors.

Feature instances act as metadata and can be used to filter and subset the data to which they are attached. For example, we can imagine filtering by genes/features which have a particular value, such as those genes where the attribute "oncogene" is set to "true".

A Feature is structured as:

{
    "id": <string identifier>,
    "attributes": {
        "keyA": <Attribute>,
        "keyB": <Attribute>
    }
}
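The attribute-based filtering described for Features can be sketched on plain dictionaries that follow the structure above. The gene names, the use of a "String" attribute for "oncogene", and the helper filter_by_attribute are hypothetical and not part of WebMEV.

```python
# Hypothetical Feature records following the structure shown above.
features = [
    {
        "id": "geneX",
        "attributes": {
            "oncogene": {"attribute_type": "String", "value": "true"}
        },
    },
    {
        "id": "geneY",
        "attributes": {
            "oncogene": {"attribute_type": "String", "value": "false"}
        },
    },
    # A Feature with no annotations at all: the attributes dict is empty.
    {"id": "geneZ", "attributes": {}},
]


def filter_by_attribute(elements, key, value):
    """Return the ids of elements whose attribute `key` equals `value`."""
    selected = []
    for element in elements:
        attr = element["attributes"].get(key)
        # Elements that lack the attribute are skipped rather than matched.
        if attr is not None and attr["value"] == value:
            selected.append(element["id"])
    return selected


oncogene_ids = filter_by_attribute(features, "oncogene", "true")
```

The resulting identifiers could then be used to subset the rows of an expression matrix, since each row corresponds to a Feature.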