Resources
Resources represent data in some file-based format. They come in two types: those owned by specific users (Resource) and those that are user-independent and associated with analysis operations (OperationResource). Examples of the latter include files for reference genomes, aligner indexes, or other analysis-specific files that a user does not need to maintain or directly interact with.
Much of the information regarding Resource instances is provided in the auto-generated docstring below, but here we highlight some key elements of the Resource model, namely the kinds of operations users and admins can perform to create, delete, or otherwise manipulate Resources via the API.
Resource creation
- Regular MEV users can only create Resource instances by uploading files, either directly (upload from a local machine) or through one of our cloud-based uploaders (e.g. Dropbox). They cannot create Resource instances via the API.
- Admins can "override" and create Resource instances manually via the API.
- Regardless of who created the Resource, the validation process is started asynchronously. We cannot assume that the files are properly validated, even if the request was initiated by an admin.
- Upon creation of the Resource, it is immediately set to "inactive" (is_active = False) while we validate the particular type.
- Resource instances have a single owner, which is either the user who uploaded the file or the user directly specified by the admin in the API request. OperationResources do not have owners, but instead maintain a foreign-key relationship with their associated Operation.
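The creation rules above can be sketched in plain Python. This is only an illustration under stated assumptions: `NewResource`, `create_resource`, and the list used as a validation queue are hypothetical names, not the actual MEV model or task system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NewResource:
    """Hypothetical stand-in for the Resource model (not the real Django model)."""
    owner: str
    path: str
    resource_type: Optional[str] = None  # None means "not yet validated"
    is_active: bool = False              # inactive until validation finishes


def create_resource(owner: str, path: str, validation_queue: List[NewResource]) -> NewResource:
    """Create an inactive resource and schedule validation rather than running it inline."""
    resource = NewResource(owner=owner, path=path)
    # A real deployment would hand this to an asynchronous task runner;
    # a plain list stands in for that queue here.
    validation_queue.append(resource)
    return resource


queue: List[NewResource] = []
uploaded = create_resource("alice", "/uploads/counts.tsv", queue)
print(uploaded.is_active, uploaded.resource_type, len(queue))
```

The same path applies whether the request came from an upload or from an admin: the resource starts inactive and only the background validation can change that.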
Resource "type" and "format"
- To properly parse and validate a file, a Resource is required to have:
  - a "type" (e.g. an integer matrix), which we call a resource_type. This describes what the file represents (e.g. BED file, expression matrix, etc.). Upon creation, resource_type is set to None, which indicates that the Resource has not been validated.
  - a "format", which tells us how the data is stored (e.g. TSV, CSV, Excel).
- We need both the type and format to proceed with validation.
- The type and format of the Resource can be specified immediately following the file upload or at any other time (i.e. users can change the type if they desire). Each request to change type initiates an asynchronous validation process. Note that we can only validate certain types of files, like expression matrices. Validation of sequence-based files such as FastQ and BAM is not feasible and thus we skip validation.
- If the validation fails, we revert to the previous successfully validated type and format. If the type was previously None (as with a new upload), we simply revert to None and inform the user that the validation failed.
- Successfully validated files are converted to a convenient internal representation. For instance, we accept expression matrices in multiple formats (e.g. CSV, TSV, XLSX). However, to spare each analysis Operation from parsing many potential file formats, we internally convert the file to a consistent format, such as TSV. Thus, all the downstream tools expect that the validated resource passed as an input is saved in a TSV/tab-delimited format.
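As a rough sketch of the validate, revert, and standardize behavior described above: the function names, the `"integer_matrix"` identifier, and the delimiter table are assumptions made for illustration, and only CSV and TSV input are handled here.

```python
import csv
import io
from typing import List, Optional, Tuple

# Hypothetical delimiter table; the real set of accepted formats may differ.
DELIMITERS = {"csv": ",", "tsv": "\t"}

TypeAndFormat = Tuple[Optional[str], Optional[str]]


def parse_integer_matrix(text: str, file_format: str) -> List[List[str]]:
    """Parse a delimited table; every cell past the header row and first column must be an integer."""
    rows = list(csv.reader(io.StringIO(text), delimiter=DELIMITERS[file_format]))
    for row in rows[1:]:
        for cell in row[1:]:
            int(cell)  # raises ValueError on a non-integer cell
    return rows


def validate_and_standardize(text: str, previous: TypeAndFormat,
                             requested_type: str, requested_format: str):
    """Return (type_and_format, standardized_text, error_message)."""
    try:
        if requested_type != "integer_matrix":
            raise ValueError("type not handled in this sketch")
        rows = parse_integer_matrix(text, requested_format)
    except (ValueError, KeyError) as exc:
        # Validation failed: keep the last validated state (possibly (None, None)).
        return previous, None, str(exc)
    # Validation passed: store the data as TSV regardless of the input format.
    standardized = "\n".join("\t".join(row) for row in rows)
    return (requested_type, "tsv"), standardized, None
```

A new upload would pass `(None, None)` as `previous`, so a failed validation leaves the type unset, which matches the revert rule above.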
Resources and metadata
Depending on the type of Resource, we are able to infer and extract metadata from the file based on the format. For example, given a validated Resource that represents an RNA-seq count matrix, we assume that the column headers represent samples (Observations) and the rows represent genes (Features). These metadata allow us to create subsets of the Observations and Features for creating experimental contrasts and other typical analysis tasks. More on Observations and Features is described elsewhere.
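To make the count-matrix example concrete, here is a minimal sketch of pulling Observations and Features out of a validated TSV. It assumes the first header cell labels the gene column and the first column of each row holds the gene name; `extract_metadata` is a hypothetical helper, not part of the MEV codebase.

```python
import csv
import io
from typing import Dict, List


def extract_metadata(tsv_text: str) -> Dict[str, List[str]]:
    """Read sample names from the header row and gene names from the first column."""
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    return {
        "observations": header[1:],             # column headers after the gene-label cell
        "features": [row[0] for row in body],   # first cell of every data row
    }


matrix = "gene\tsampleA\tsampleB\nTP53\t10\t0\nBRCA1\t3\t7\n"
metadata = extract_metadata(matrix)
print(metadata["observations"], metadata["features"])
```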
Resources and Workspaces
- Resource instances are initially "unattached", meaning they are associated with their owner, but have not been associated with any user workspaces.
- When a user chooses to "add" a Resource to a Workspace, we append the Workspace to the set of Workspace instances associated with that Resource. That is, each Resource tracks which Workspaces it is associated with. This is accomplished via a many-to-many mapping in the database.
- Users can remove a Resource from a Workspace, but only if it has NOT been used for any portions of the analysis. We want to retain the completeness of the analysis, so deleting files that are part of the analysis "tree" would create gaps. Note that removing a Resource from a Workspace does not delete a file; it only modifies the workspaces field on the Resource database instance.
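The attach and detach behavior can be sketched with in-memory sets in place of the many-to-many table. `AttachableResource`, `used_in_analysis`, and both helper functions are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AttachableResource:
    """Hypothetical stand-in for a Resource row plus its workspace links."""
    name: str
    workspaces: Set[str] = field(default_factory=set)        # attached workspace ids
    used_in_analysis: Set[str] = field(default_factory=set)  # workspaces whose analyses consumed it


def add_to_workspace(resource: AttachableResource, workspace_id: str) -> None:
    """Attach the resource; the underlying file is untouched."""
    resource.workspaces.add(workspace_id)


def remove_from_workspace(resource: AttachableResource, workspace_id: str) -> None:
    """Detach the resource unless an analysis in that workspace depends on it."""
    if workspace_id in resource.used_in_analysis:
        raise PermissionError("resource is part of an analysis in this workspace")
    # Only the association changes; nothing is deleted from storage.
    resource.workspaces.discard(workspace_id)
```

A freshly created `AttachableResource` has an empty `workspaces` set, which corresponds to the "unattached" state.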
Deletion of Resources
- Resources can only be deleted from the file manager on the "home" screen in the UI (i.e. not in the Workspace view).
- If a Resource is associated with (attached to) one or more Workspaces, then you cannot delete the Resource.
- A Resource can only be deleted if:
  - It is associated with zero Workspaces
  - It is not used in any Operation

  Technically, we only need the first condition. If a Resource has been used in an Operation, we don't allow the user to remove it from the Workspace. Thus, a file being associated with zero Workspaces means that it has not been used in any Operations.
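The deletion rule reduces to a small predicate. The sketch below checks both conditions explicitly even though, as noted above, the first implies the second; `can_delete` and its parameters are hypothetical.

```python
from typing import Iterable


def can_delete(workspace_ids: Iterable[str], used_in_operation: bool = False) -> bool:
    """A resource is deletable only when it is unattached and unused."""
    attached = len(list(workspace_ids)) > 0
    return not attached and not used_in_operation


print(can_delete([]))        # unattached resource
print(can_delete(["ws1"]))   # still attached to a workspace
```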
Notes related to backend implementation
- In general, the is_active = False flag disallows any updating of the Resource attributes via the API. All post/patch/put requests will return a 400 status. This prevents multiple requests from interfering with an ongoing background process, such as validation.
- Users cannot change the path member. The actual storage of the files should not matter to the users, so they are unable to change the path member.
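A minimal sketch of how such a guard might look, using plain dictionaries instead of real request and model objects. `handle_update` is a hypothetical function, and answering a path change with a 400 is an assumption; the text above only states that users cannot change it.

```python
from typing import Dict, Tuple


def handle_update(resource: Dict, changes: Dict) -> Tuple[int, Dict]:
    """Return (status_code, resource_after_request) for an update request."""
    if not resource["is_active"]:
        # A background process (e.g. validation) may be running: refuse the update.
        return 400, resource
    if "path" in changes:
        # Assumption: attempts to modify the storage path are rejected outright.
        return 400, resource
    return 200, {**resource, **changes}


inactive = {"name": "counts.tsv", "path": "/store/abc", "is_active": False}
status, _ = handle_update(inactive, {"name": "renamed.tsv"})
print(status)
```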
api.models.abstract_resource.AbstractResource(*args, **kwargs)
This is the base class which holds common fields for both the user-owned
Resource model and the user-independent OperationResource model.
api.models.resource.Resource(*args, **kwargs)
A Resource is an abstraction of data. It represents some
piece of data we are analyzing or manipulating in the course of
an analysis workflow.
Resources are most often represented by flat files, but their
physical storage is not important. They could be stored locally
or in cloud storage accessible to MEV.
Various "types" of Resources implement specific constraints
on the data that are important for tracking inputs and outputs of
analyses. For example, if an analysis module needs to operate on
a matrix of integers, we can enforce that the only Resources
available as inputs are those identified (and verified) as
IntegerMatrix "types".
Note that we store all types of Resources in the database as
a single table and maintain the notion of "type" by a
string-field identifier. Creating specific database tables for
each type of Resource would be unnecessary. By connecting the
string stored in the database with a concrete implementation class
we can check the type of the Resource.
Resources are not active (is_active flag in the database) until their "type"
has been verified. API users will submit the intended type with
the request and the backend will check that. Violations are
reported and the Resource remains inactive (is_active=False).
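The idea of connecting a stored string identifier to a concrete implementation class can be sketched as a simple registry. The identifier `"integer_matrix"`, the class `IntegerMatrixType`, and the `is_valid` method are all hypothetical and are not taken from the MEV source.

```python
from typing import Dict, List, Optional, Type


class IntegerMatrixType:
    """Hypothetical implementation class for an integer-matrix resource type."""

    @staticmethod
    def is_valid(values: List[List[object]]) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return all(
            isinstance(v, int) and not isinstance(v, bool)
            for row in values
            for v in row
        )


# One database table, one string column; the mapping lives in code.
RESOURCE_TYPE_REGISTRY: Dict[str, Type] = {"integer_matrix": IntegerMatrixType}


def get_type_class(identifier: Optional[str]) -> Optional[Type]:
    """Look up the implementation class for a stored type string (None if unknown or unset)."""
    if identifier is None:
        return None
    return RESOURCE_TYPE_REGISTRY.get(identifier)
```

With this layout, adding a new resource type means registering one more class rather than creating another database table.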
api.models.operation_resource.OperationResource(*args, **kwargs)
An OperationResource is a specialization of a Resource
which is not owned by anyone specific, but is rather associated
with a single Operation. Used for things like genome indexes, etc.
where the user is not responsible for supplying or maintaining the
resource.
Note that it maintains a reference to the Operation input field
it corresponds to. This allows front-end components to easily map the OperationResource
to the proper input field for user selection.
The is_active and is_public fields default to True.