Resources
Resources represent data in some file-based format. They come in two types: those owned by specific users (Resource) and those that are user-independent and associated with analysis operations (OperationResource). Examples of the latter include files for reference genomes, aligner indexes, or other analysis-specific files that a user does not need to maintain or directly interact with.
Much of the information regarding Resource instances is provided in the auto-generated docstrings below, but here we highlight some key elements of the Resource model, namely the kinds of operations users and admins can take to create, delete, or otherwise manipulate Resources via the API.
Resource creation
- Regular MEV users can only create Resource instances by uploading files, either via a direct method (upload from local machine) or by using one of our cloud-based uploaders (e.g. Dropbox). They can't create them via the API.
- Admins can "override" this and create Resource instances manually via the API.
- Regardless of who created the Resource, the validation process is started asynchronously. We cannot assume that the files are properly validated, even if the request was initiated by an admin.
- Upon creation of the Resource, it is immediately set to "inactive" (is_active = False) while we validate the particular type.
- Resource instances have a single owner, which is either the user who uploaded the file or the user directly specified by the admin in the API request. OperationResources do not have owners, but instead maintain a foreign-key relationship with their associated Operation.
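The creation rules above can be sketched in plain Python. This is only an illustration under stated assumptions: ResourceSketch and create_resource are hypothetical names, not the actual Django model or API view, and the permission check is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceSketch:
    """Hypothetical stand-in for the Resource model (not the real class)."""
    name: str
    owner: str
    resource_type: Optional[str] = None  # None means "not yet validated"
    is_active: bool = False              # inactive while validation is pending


def create_resource(name, uploader, requested_by_admin=False, owner=None):
    """Create a new, inactive, unvalidated resource.

    A regular user always owns what they upload. Only an admin request
    may name a different owner explicitly.
    """
    if owner is not None and not requested_by_admin:
        raise PermissionError("Only admins can assign a different owner.")
    return ResourceSketch(name=name, owner=owner or uploader)
```

In this sketch a fresh resource always starts with is_active = False and resource_type = None, regardless of who created it, which mirrors the point that admin-created resources are not trusted to be valid either.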
Resource "type" and "format"
- To properly parse and validate a file, a Resource is required to have:
  - a "type" (e.g. an integer matrix) which we call a resource_type. This describes what the file represents (e.g. BED file, expression matrix, etc.). Upon creation, resource_type is set to None, which indicates that the Resource has not been validated.
  - a "format" which tells us how the data is stored (e.g. TSV, CSV, Excel)
- We need both the type and format to proceed with validation.
- The type and format of the Resource can be specified immediately following the file upload or at any other time (i.e. users can change the type if they desire). Each request to change the type initiates an asynchronous validation process. Note that we can only validate certain types of files, like expression matrices. Validation of sequence-based files such as FastQ and BAM is not feasible, so we skip validation for those.
- If the validation fails, we revert to the previous successfully validated type and format. If the type was previously None (as with a new upload), we simply revert to None and inform the user that the validation failed.
- Successfully validated files are converted to a convenient internal representation. For instance, we accept expression matrices in multiple formats (e.g. CSV, TSV, XLSX). However, to spare each analysis Operation from having to parse many potential file formats, we internally convert them to a consistent format, such as TSV. Thus, all the downstream tools expect that a validated resource passed as an input is saved in a TSV/tab-delimited format.
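The validate, revert, and convert behavior described above might look roughly like the following. This is a minimal sketch, not the backend's implementation: the "integer_matrix", "csv", and "tsv" identifiers, the VALIDATORS table, and the function names are all assumptions made for illustration, and only delimited text is handled.

```python
import csv
import io


def _is_integer_matrix(rows):
    """True if every cell outside the header row and first column is an integer."""
    if len(rows) < 2:
        return False
    try:
        for row in rows[1:]:
            for value in row[1:]:
                int(value)
    except ValueError:
        return False
    return True


# Hypothetical lookup tables; the real type and format identifiers may differ.
VALIDATORS = {"integer_matrix": _is_integer_matrix}
DELIMITERS = {"csv": ",", "tsv": "\t"}


def validate_and_standardize(text, requested_type, requested_format,
                             current_type=None, current_format=None):
    """Return (resource_type, file_format, contents, succeeded).

    On failure the previous type and format (possibly None) are kept and
    the contents are left untouched. On success the contents are rewritten
    as tab-delimited text.
    """
    delimiter = DELIMITERS.get(requested_format)
    validator = VALIDATORS.get(requested_type)
    if delimiter is None or validator is None:
        return current_type, current_format, text, False
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not validator(rows):
        return current_type, current_format, text, False
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return requested_type, "tsv", out.getvalue(), True
```

Returning the previous type and format on failure keeps the revert logic in one place, so the caller only needs to check the final flag to decide whether to notify the user.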
Resources and metadata
Depending on the type of Resource, we are able to infer and extract metadata from the file based on the format. For example, given a validated Resource that represents an RNA-seq count matrix, we assume that the column headers represent samples (Observations) and the rows represent genes (Features). These metadata allow us to create subsets of the Observations and Features for creating experimental contrasts and other typical analysis tasks. Observations and Features are described in more detail elsewhere.
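As a rough illustration of that inference, the sketch below pulls sample and gene names out of a tab-delimited count matrix. The function name and the returned dictionary layout are hypothetical, and it assumes the first header cell is a label for the row-name column.

```python
import csv
import io


def extract_metadata(tsv_text):
    """Infer Observations (columns) and Features (rows) from a count matrix."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    observations = header[1:]                  # sample names from the header
    features = [row[0] for row in body if row]  # gene names from the first column
    return {"observations": observations, "features": features}
```

For a matrix with header "gene, A, B" and rows "g1" and "g2", this yields observations ["A", "B"] and features ["g1", "g2"], which could then be subset to build a contrast.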
Resources and Workspaces
- Resource instances are initially "unattached", meaning they are associated with their owner but have not been associated with any user workspaces.
- When a user chooses to "add" a Resource to a Workspace, we append the Workspace to the set of Workspace instances associated with that Resource. That is, each Resource tracks which Workspaces it is associated with. This is accomplished via a many-to-many mapping in the database.
- Users can remove a Resource from a Workspace, but only if it has NOT been used for any portion of the analysis. We want to retain the completeness of the analysis, so deleting files that are part of the analysis "tree" would create gaps. Note that removing a Resource from a Workspace does not delete a file; it only modifies the workspaces field on the Resource database instance.
Deletion of Resources
- Resources can only be deleted from the file manager on the "home" screen (i.e. not in the Workspace view) in the UI.
- If a Resource is associated with (attached to) one or more Workspaces, then you cannot delete the Resource.
- A Resource can only be deleted if:
  - It is associated with zero Workspaces
  - It is not used in any Operation
  Technically, we only need the first condition. If a Resource has been used in an Operation, we don't allow the user to remove it from the Workspace. Thus, a file being associated with zero Workspaces means that it has not been used in any Operations.
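Because the second condition follows from the first, a deletion check can reduce to a single test on the workspace set. The sketch below uses hypothetical names and is not the backend's actual check.

```python
from dataclasses import dataclass, field


@dataclass
class DeletableResource:
    """Hypothetical stand-in holding only what the deletion rule needs."""
    name: str
    workspaces: set = field(default_factory=set)


def can_delete(resource):
    """Allow deletion only when the resource is attached to zero workspaces.

    A resource used in an Operation cannot be detached from its workspace,
    so an empty workspace set also implies it was never used.
    """
    return len(resource.workspaces) == 0
```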
Notes related to backend implementation
- In general, the is_active = False flag disallows any updating of the Resource attributes via the API. All POST/PATCH/PUT requests will return a 400 status. This prevents multiple requests from interfering with an ongoing background process, such as validation.
- Users cannot change the path member. Where the files are actually stored should not matter to users, so the path is not editable.
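A guard implementing these two rules might look like the following sketch. It models the resource as a plain dictionary rather than using the real serializers or views; handle_update and READ_ONLY_FIELDS are hypothetical names, and returning 400 for an attempted path change is an assumption, since the text above only specifies the status for inactive resources.

```python
from http import HTTPStatus

# Assumption: fields that users may never modify through the API.
READ_ONLY_FIELDS = {"path"}


def handle_update(resource, changes):
    """Apply changes to a resource dict and return (status, resource).

    Inactive resources reject every update so that a background job
    (such as validation) is never disturbed mid-flight.
    """
    if not resource.get("is_active", False):
        return HTTPStatus.BAD_REQUEST, resource
    if READ_ONLY_FIELDS.intersection(changes):
        return HTTPStatus.BAD_REQUEST, resource
    # Build a new dict so the caller's original is not mutated.
    return HTTPStatus.OK, {**resource, **changes}
```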
api.models.abstract_resource.AbstractResource(*args, **kwargs)
This is the base class which holds common fields for both the user-owned Resource model and the user-independent OperationResource model.
api.models.resource.Resource(*args, **kwargs)
A Resource is an abstraction of data. It represents some piece of data we are analyzing or manipulating in the course of an analysis workflow.
Resources are most often represented by flat files, but their physical storage is not important. They could be stored locally or in cloud storage accessible to MEV.
Various "types" of Resources implement specific constraints on the data that are important for tracking inputs and outputs of analyses. For example, if an analysis module needs to operate on a matrix of integers, we can enforce that the only Resources available as inputs are those identified (and verified) as IntegerMatrix "types".
Note that we store all types of Resources in the database as a single table and maintain the notion of "type" by a string-field identifier. Creating specific database tables for each type of Resource would be unnecessary. By connecting the string stored in the database with a concrete implementation class we can check the type of the Resource.
Resources are not active (is_active flag in the database) until their "type" has been verified. API users will submit the intended type with the request and the backend will check that. Violations are reported and the Resource remains inactive (is_active=False).
api.models.operation_resource.OperationResource(*args, **kwargs)
An OperationResource is a specialization of a Resource which is not owned by anyone specific, but is rather associated with a single Operation. It is used for things like genome indexes, where the user is not responsible for supplying or maintaining the resource.
Note that it maintains a reference to the Operation input field it corresponds to. This allows front-end components to easily map the OperationResource to the proper input field for user selection.
The is_active and is_public fields default to True.