Trellis v1.3 Update Design Doc

Functions

Using a single function to launch jobs

The current method for adding new bioinformatics (or other) tasks to Trellis is to create a new Cloud Function tailored to launching jobs for that specific task (e.g. “samtools flagstat”). Limitations of generating separate functions for each task include:

  • A lot of boilerplate code is copied across functions

  • Potential for differences in boilerplate code across functions

  • Changing the mechanism for launching jobs requires updating every launcher function

  • Creating a new launcher function is an opaque process, requiring knowledge of Python and the ‘trellisdata’ package. Without an example to work from, it would be a huge pain.

Pre-v1.3 Trellis architecture

graph TD
  gcs[CloudStorage] -- TRIGGERS --> create-blob-node
  create-blob-node -- QUERY-REQUEST --> db-query
  db-query -- QUERY-RESULT --> check-triggers
  check-triggers -- QUERY-REQUEST --> db-query
  db-query -- QUERY-RESULT --> job-launcher
  job-launcher -- JOB-RESULT --> create-job-node
  create-job-node -- QUERY-REQUEST --> db-query

What are the elements of a standard launcher function?

Inputs. Inputs to all jobs are nodes or relationship structures: (node)-[relationship]->(node). I already define these patterns in a standard way in the database trigger YAML, so I can replicate that approach in a launcher configuration file. I expect YAML configuration files to become the default method for configuring Trellis:

  • Database trigger configuration: database-triggers.yaml

  • Database query configuration: database-queries.yaml

  • Job launcher configuration: job-launchers.yaml

Also, regarding terminology: I’m thinking of a job as an instance of a task.
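To illustrate the replication idea, here is a minimal sketch of the same input pattern appearing in both the trigger and launcher configurations. All field names here are hypothetical, not the actual schema of either file:

import yaml

# Hypothetical entries showing one (node)-[relationship]->(node) input pattern
# reused in both configuration files; every field name is illustrative.
TRIGGER_ENTRY = yaml.safe_load("""
relateFastqMatePair:
  pattern: (r1:Fastq)-[:HAS_MATE_PAIR]->(r2:Fastq)
""")

LAUNCHER_ENTRY = yaml.safe_load("""
fastq-to-ubam:
  inputs:
    pattern: (r1:Fastq)-[:HAS_MATE_PAIR]->(r2:Fastq)
""")

# The launcher can reuse the exact pattern already defined for the trigger.
assert (TRIGGER_ENTRY["relateFastqMatePair"]["pattern"]
        == LAUNCHER_ENTRY["fastq-to-ubam"]["inputs"]["pattern"])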

Trellis configuration. Information that is stored in the Trellis configuration object and is uniform across Trellis (e.g. project, regions, buckets).

Dsub configuration. This might be the most challenging part: defining the input, output, and environment variables.

Virtual machine configuration. Information regarding CPUs, memory, and disk size can vary between tasks and should be specified in the configuration of each task.
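Putting these elements together, a single task entry in job-launchers.yaml might look something like the sketch below. Every field name and value is an assumption for illustration, including the container image; this is not a finalized schema:

import yaml

# Hypothetical job-launchers.yaml entry; all field names are assumptions.
LAUNCHER_CONFIG = yaml.safe_load("""
fastq-to-ubam:
  inputs:
    pattern: (r1:Fastq)-[:HAS_MATE_PAIR]->(r2:Fastq)
  dsub:
    image: gcr.io/my-project/fastq-to-ubam
    inputs:
      FASTQ_R1: "{r1.uri}"
      FASTQ_R2: "{r2.uri}"
    outputs:
      UBAM: "gs://{bucket}/ubams/{sample}.ubam"
    env:
      SAMPLE: "{sample}"
  virtual-machine:
    min-cores: 1
    min-ram: 7.5
    boot-disk-size: 20
    disk-size: 500
""")

vm = LAUNCHER_CONFIG["fastq-to-ubam"]["virtual-machine"]
print(vm["min-cores"], vm["min-ram"])

Trellis-wide values (project, regions, buckets) would still come from the Trellis configuration object rather than being repeated per task.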

Jobs are going to have different numbers of nodes as inputs.

  • fastq-to-ubam: 2 nodes connected by relationship

  • gatk-5-dollar: 2-16 nodes, not connected

  • extract-mapped-reads: 1 node
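Because input cardinality varies by task, the single launcher will need to validate each query result against per-task expectations before launching anything. A rough sketch, with hypothetical keys and limits:

# Hypothetical per-task input requirements; keys and limits are illustrative.
TASK_INPUT_LIMITS = {
    "fastq-to-ubam": {"min_nodes": 2, "max_nodes": 2, "connected": True},
    "gatk-5-dollar": {"min_nodes": 2, "max_nodes": 16, "connected": False},
    "extract-mapped-reads": {"min_nodes": 1, "max_nodes": 1, "connected": False},
}

def validate_inputs(task: str, nodes: list, has_relationship: bool) -> None:
    """Raise if a query result does not satisfy the task's input requirements."""
    limits = TASK_INPUT_LIMITS[task]
    if not limits["min_nodes"] <= len(nodes) <= limits["max_nodes"]:
        raise ValueError(
            f"{task} expects {limits['min_nodes']}-{limits['max_nodes']} "
            f"input nodes, got {len(nodes)}")
    if limits["connected"] and not has_relationship:
        raise ValueError(f"{task} expects its input nodes to be connected")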

How do I know which job to launch?

When every job had its own launcher function, the job was determined by the Pub/Sub topic that the database query result was published to. The topic was defined as part of the database query. How will I choose the job if all query results are routed through the same function?

I could update the QueryResponse classes to also include a field with the task to be launched.
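A minimal sketch of that change, assuming a dataclass-style QueryResponse; the job_task field name and the class shape are assumptions, not the current trellisdata implementation:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QueryResponse:
    # Existing-style result fields (shapes assumed for illustration).
    query_name: str
    nodes: list = field(default_factory=list)
    relationship: Optional[dict] = None
    # New: names the task to launch, e.g. "fastq-to-ubam", so a single
    # launcher function can pick the right configuration entry.
    job_task: Optional[str] = None

def select_launcher(response: QueryResponse, launchers: dict) -> dict:
    """Look up the launcher configuration for the task named in the result."""
    if not response.job_task:
        raise ValueError("query result does not name a task to launch")
    return launchers[response.job_task]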

Trellisdata Python Package

Operation Grapher

  • Added static methods _is_query_valid(query) and _is_trigger_valid(trigger) to the OperationGrapher class. These methods run on initialization of a new OperationGrapher instance and check for the presence and type of required and optional fields. Currently, there is no validation of field content.
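A sketch of what that validation might look like; the field lists below are assumptions for illustration, not the ones defined in the trellisdata package:

# Hypothetical required/optional field definitions.
REQUIRED_QUERY_FIELDS = {"name": str, "cypher": str}
OPTIONAL_QUERY_FIELDS = {"publish_to": str, "parameters": dict}
REQUIRED_TRIGGER_FIELDS = {"name": str, "pattern": str, "query": str}

class OperationGrapher:

    def __init__(self, queries: list, triggers: list):
        # Validation runs on initialization; only presence and type are
        # checked, not content.
        for query in queries:
            self._is_query_valid(query)
        for trigger in triggers:
            self._is_trigger_valid(trigger)
        self.queries = queries
        self.triggers = triggers

    @staticmethod
    def _is_query_valid(query: dict) -> bool:
        for name, ftype in REQUIRED_QUERY_FIELDS.items():
            if name not in query:
                raise ValueError(f"query missing required field '{name}'")
            if not isinstance(query[name], ftype):
                raise TypeError(f"query field '{name}' must be {ftype.__name__}")
        for name, ftype in OPTIONAL_QUERY_FIELDS.items():
            if name in query and not isinstance(query[name], ftype):
                raise TypeError(f"query field '{name}' must be {ftype.__name__}")
        return True

    @staticmethod
    def _is_trigger_valid(trigger: dict) -> bool:
        for name, ftype in REQUIRED_TRIGGER_FIELDS.items():
            if name not in trigger:
                raise ValueError(f"trigger missing required field '{name}'")
            if not isinstance(trigger[name], ftype):
                raise TypeError(f"trigger field '{name}' must be {ftype.__name__}")
        return True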

Database

Relate Fastq read/mate pairs

The old query for getting Fastq mate pairs (paired-end sequencing) required matching Fastqs by sample and read group (a composite index) and then collecting them into different groups based on the “matePair” node property. This resulted in queries that were unwieldy and hard to decipher (decypher).

Database schema

Diagrams generated using the Sphinxcontrib-mermaid extension.

Old model

graph TD
  seq[PersonalisSequencing] -- GENERATED --> r1[Fastq R1]
  seq -- GENERATED --> r2[Fastq R2]

New model

graph TD
  seq[PersonalisSequencing] -- GENERATED --> r1[Fastq R1]
  seq -- GENERATED --> r2[Fastq R2]
  r1 -- HAS_MATE_PAIR --> r2
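The new model also requires a trigger to create the HAS_MATE_PAIR relationship once both reads are in the database. A hedged sketch of what that could look like via the Python neo4j driver; the MERGE pattern reuses the sample, readGroup, and matePair properties from the old query below, but the actual trigger may differ:

from neo4j import GraphDatabase

# Illustrative query: relate R1/R2 Fastqs from the same sequencing run and
# read group. This is a sketch, not the actual trigger.
RELATE_MATE_PAIRS = """
MATCH (seq:PersonalisSequencing {sample: $sample})-[:GENERATED]->(r1:Fastq {matePair: 1}),
      (seq)-[:GENERATED]->(r2:Fastq {matePair: 2})
WHERE r1.readGroup = r2.readGroup
MERGE (r1)-[:HAS_MATE_PAIR]->(r2)
"""

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(RELATE_MATE_PAIRS, sample="SAMPLE-001")
driver.close()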

Cypher query

Old model (from database-triggers.py)

MATCH (n:Fastq {sample: $sample, readGroup: $read_group})
WHERE NOT (n)-[:WAS_USED_BY]->(:JobRequest:FastqToUbam)
WITH n.sample AS sample,
     n.matePair AS matePair,
     COLLECT(n) AS matePairNodes
// In the case of duplicate Fastqs, only use one from each sequencing mate pair group.
WITH sample, COLLECT(head(matePairNodes)) AS uniqueMatePairs
// Check that there is a pair of Fastqs in the read group
WHERE size(uniqueMatePairs) = 2
CREATE (j:JobRequest:FastqToUbam {
          sample:sample,
          nodeCreated: datetime(),
          nodeCreatedEpoch: datetime().epochSeconds,
          name: "fastq-to-ubam",
          eventId: $event_id })
WITH uniqueMatePairs, j, sample, j.eventId AS eventId
UNWIND uniqueMatePairs AS uniqueMatePair
MERGE (uniqueMatePair)-[:WAS_USED_BY]->(j)
RETURN DISTINCT(uniqueMatePairs) AS nodes

New model

MATCH (:PersonalisSequencing {sample: $sample})-[:GENERATED]->(r1:Fastq {readGroup: $read_group})-[rel:HAS_MATE_PAIR]->(r2:Fastq)
WHERE NOT (r1)-[:WAS_USED_BY]->(:JobRequest {name: "fastq-to-ubam"})
AND NOT (r2)-[:WAS_USED_BY]->(:JobRequest {name: "fastq-to-ubam"})
CREATE (job_request:JobRequest {
          name: 'fastq-to-ubam',
          sample: $sample,
          nodeCreated: datetime(),
          nodeCreatedEpoch: datetime().epochSeconds,
          eventId: $event_id })
WITH r1, rel, r2, job_request 
LIMIT 1
MERGE (r1)-[:WAS_USED_BY]->(job_request)
MERGE (r2)-[:WAS_USED_BY]->(job_request)
RETURN r1, rel, r2

Incorporating OMOP graph model

One of the challenges we face when expanding the scope of our use cases is how to incorporate phenotype data into the database. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is used by the Stanford Research Data Repository (STARR) and seemed like a good place to start. Unfortunately, OMOP is designed for relational databases. However, a group at Northwestern University has drafted a graph model implementation of the OMOP CDM.

Ours is a research use case, as opposed to a clinical one, and we don’t have much of the data that would go into a clinical database, so it’s not clear how useful this model will be for us. But I don’t know of a better alternative, so we are going to start from this model and see how it goes.

Right now we are only using a small subset of the nodes and relationships described by the OMOP CDM graph model.

  • Node labels (5): Concept, ConceptClass, Domain, Vocabulary, ConditionOccurrence

  • Relationship types (5): HAS_CONCEPT, BELONGS_TO_CLASS, USES_VOCABULARY, IN_DOMAIN, HAS_CONDITION_OCCURRENCE

Subset of OMOP graph model added to Trellis

graph TD
  person[Person] -- HAS_CONDITION_OCCURRENCE --> occurrence[Condition Occurrence]
  occurrence -- HAS_CONCEPT --> concept[Concept]
  concept -- IN_DOMAIN --> domain[Domain]
  domain -- HAS_CONCEPT --> concept
  concept -- BELONGS_TO_CLASS --> conceptclass[Concept Class]
  conceptclass -- HAS_CONCEPT --> concept
  concept -- USES_VOCABULARY --> vocab[Vocabulary]
  vocab -- HAS_CONCEPT --> concept

I sometimes find it difficult to interpret OMOP because the model is so abstract. But that’s the point: it’s general enough to fit all kinds of medical information. Right now, we are only using this model subset to describe phenotypes associated with Abdominal Aortic Aneurysm (AAA), so let’s look at what the AAA instances of these nodes look like.

Example of AAA diagnosis represented with OMOP

graph TD
  person[Person] -- HAS_CONDITION_OCCURRENCE --> occurrence[Condition Occurrence]
  occurrence -- HAS_CONCEPT --> concept[Abdominal Aortic Aneurysm]
  concept -- IN_DOMAIN --> domain[Condition]
  domain -- HAS_CONCEPT --> concept
  concept -- BELONGS_TO_CLASS --> conceptclass[Clinical Finding]
  conceptclass -- HAS_CONCEPT --> concept
  concept -- USES_VOCABULARY --> vocab[SNOMED]
  vocab -- HAS_CONCEPT --> concept
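For illustration, here is a minimal sketch of Cypher (run through the Python neo4j driver) that could create this AAA subgraph. The property names and parameter values are placeholders, not what the actual update script does:

from neo4j import GraphDatabase

# Placeholder property names and values for illustration only.
CREATE_AAA_SUBGRAPH = """
MERGE (person:Person {person_id: $person_id})
MERGE (occurrence:ConditionOccurrence {occurrence_id: $occurrence_id})
MERGE (domain:Domain {name: 'Condition'})
MERGE (vocab:Vocabulary {name: 'SNOMED'})
MERGE (class:ConceptClass {name: 'Clinical Finding'})
MERGE (concept:Concept {name: 'Abdominal Aortic Aneurysm'})
MERGE (person)-[:HAS_CONDITION_OCCURRENCE]->(occurrence)
MERGE (occurrence)-[:HAS_CONCEPT]->(concept)
MERGE (concept)-[:IN_DOMAIN]->(domain)
MERGE (domain)-[:HAS_CONCEPT]->(concept)
MERGE (concept)-[:BELONGS_TO_CLASS]->(class)
MERGE (class)-[:HAS_CONCEPT]->(concept)
MERGE (concept)-[:USES_VOCABULARY]->(vocab)
MERGE (vocab)-[:HAS_CONCEPT]->(concept)
"""

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(CREATE_AAA_SUBGRAPH, person_id="P-001", occurrence_id="O-001")
driver.close()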

The script I used to update the database with AAA data can be found in the trellis-mvp-data-modelling repository.

Node label ontology

graph TD
  root --> Blob
  Blob --> Fastq
  Blob --> Index
  root --> Person
  root --> GcpInstance
  GcpInstance --> CromwellAttempt
  GcpInstance --> DsubJob

Deployment configuration

Previously, the Terraform files for configuring Trellis were stored in a separate repository. I don’t think that’s a good setup: the deployment configuration and application version are tightly linked and should be tracked together. TODO: Add the Terraform resources to the Trellis functions repository.