Resource types

A Resource represents some generic notion of data and its resource_type field/member is a string identifier that identifies the specific format of the data. Resource types allow us to specify the format of input/output files of Operations. Therefore, we can predictably present options for those inputs and allow Resources to flow from one analysis to another.

The string identifiers map to concrete classes that implement validation methods for the Resource. For example, the string I_MTX indicates that the Resource is an integer matrix. When a new Resource is added (via upload or directly by an admin via the API), the validation method is called. Similarly, if a user tries to change the resource_type, it will trigger the validation process.

Current resource_types fall into several broad categories:

Table-based formats
Sequence-based formats
JSON
General. Not a true type, but rather denotes that a better, more specific type cannot be specified.

Table-based formats

Table-based formats are any matrix-like format, such as a typical CSV file. In addition to being a common file format for expression matrices and similar experimental data, this covers a wide variety of standard formats encountered in computational biology, including GTF annotation files and BED files.

Specific types are shown below in the auto-generated documentation, but we touch on some of the more general descriptions immediately below.

During validation, the primitive data types contained in each column are determined using Python's Pandas library, which refers to these as "dtypes"; for example, a column identified as int64 certainly qualifies as an integer type. If the column contains any non-integers (but all numbers), Pandas automatically converts it to a float type (e.g. float64) which allows us to easily validate the content of each column.

We enforce that specific sub-types of this general table-based format adhere to our expectations. For instance, an expression matrix requires a first row which contains samples/observation names. Furthermore, the first column should correspond to gene identifiers (Features more generally). While we cannot exhaustively validate every file, we make certain reasonable assumptions. For example, if the first row is all numbers, we assume that a header is missing. Certainly one could name their samples with numeric identifiers, but we enforce that they need to be strings. Failure to conform to these expectations will result in the file failing to validate. Users should be informed of the failure with a helpful message for resolution.

Also note that while the user may submit files in a format such as CSV, we internally convert to a common format (e.g. TSV) so that downstream tools can avoid having to include multiple file-parsing schemes.

Since table-based formats naturally lend themselves to arrays of atomic items (i.e. each row as a "record"), the contents of table-based formats can be requested in a paginated manner via the API.

Sequence-based formats

Sequence-based formats are formats like FastQ, Fasta, or SAM/BAM. These types of files cannot reasonably be validated up front, so any Operations which use these should plan on gracefully handling problems with their format.

JSON

For data that is not easily represented in a table-based format, we retain JSON as a general format. We use Python's internal json library to enforce the format of these files. Any failure to parse the file results in a validation error.

Note that the contents of array-based JSON files can be paginated, but general JSON objected-based resources cannot. An example of the former is:

[
    {
        "keyA": 1,
        "some_value": "abc"
    },
    ...
    {
        "keyA": 8,
        "some_value": "xyz"
    }
]

These can be paginated so that each internal "object" (e.g. {"keyA": 1,"some_value":"abc"}) is a record.

General

Generally this format should be avoided as it allows un-validated/unrestricted data formats to be passed around. However, for certain types (such as a tarball of many files), we sometimes have no other reasonable option.

Table-based resource types

class resource_types.table_types.TableResource()

The TableResource is the most generic form of a delimited file. Any type of data that can be represented as rows and columns.

This or any of the more specific subclasses can be contained in files saved in CSV, TSV, or Excel (xls/xlsx) format. If in Excel format, the data of interest must reside in the first sheet of the workbook.

Note that unless you create a "specialized" implementation (e.g. like for a BED file), then we assume you have features as rows and observables as columns.

class resource_types.table_types.Matrix()

A Matrix is a delimited table-based file that has only numeric types. These types can be mixed, like floats and integers

class resource_types.table_types.IntegerMatrix()

An IntegerMatrix further specializes the Matrix to admit only integers.

class resource_types.table_types.RnaSeqCountMatrix()

A very-explicit class (mainly for making things user-friendly) where we provide specialized behavior/messages specific to count matrices generated from RNA-seq data. The same as an integer matrix, but named to be suggestive for WebMEV users.

class resource_types.table_types.ElementTable()

An ElementTable captures common behavior of tables which annotate Observations (AnnotationTable) or Features (FeatureTable)

It's effectively an abstract class. See the derived classes which implement the specific behavior for Observations or Features.

class resource_types.table_types.AnnotationTable()

An AnnotationTable is a special type of table that will be responsible for annotating Observations/samples (e.g. adding sample names and associated attributes like experimental group or other covariates)

The first column will give the sample names and the remaining columns will each individually represent different covariates associated with that sample.

For example, if we received the following table:

sample	genotype	treatment
A	WT	Y
B	WT	N

Then this table can be used to add Attributes to the corresponding Observations. Note that the backend doesn't manage this. Instead, the front-end will be responsible for taking the AnnotationTable and creating/modifying Observations.

class resource_types.table_types.FeatureTable()

A FeatureTable is a type of table that has aggregate information about the features, but does not have any "observations" in the columns. An example would be the results of a differential expression analysis. Each row corresponds to a gene (feature) and the columns are information about that gene (such as p-value).

Another example could be a table of metadata about genes (e.g. pathways or perhaps a mapping to a different gene identifier).

The first column will give the feature/gene identifiers and the remaining columns will have information about that gene

class resource_types.table_types.BEDFile()

A file format that corresponds to the BED format. This is the minimal BED format, which has:

chromosome
start position
end position

Additional columns are ignored.

By default, BED files do NOT contain headers and we enforce that here.

Sequence-based formats

class resource_types.sequence_types.SequenceResource()

This class is used to represent sequence-based files such as Fasta, Fastq, SAM/BAM

We cannot (reasonably) locally validate the contents of these files quickly or exhaustively, so minimal validation is performed remotely

class resource_types.sequence_types.FastAResource()

This type is for compressed Fasta files

class resource_types.sequence_types.FastQResource()

This resource type is for gzip-compressed Fastq files

class resource_types.sequence_types.AlignedSequenceResource()

This resource type is for SAM/BAM files.