Implementing a real-time HIV molecular epidemiology platform

Simon Frost, Mukarram Hossain, Art Poon, Sergei Kosakovsky Pond, Steven Weaver, Erik Volz, Trevor Bedford, Richard Neher, Sten Vermund, Timothy Sterling, Ann Dennis, Marcia Kalish

Introduction

Pathogen sequence data contain information about their evolutionary past
- Timing and dynamics of spread of epidemics
High density sampling and rapid sequencing technologies allow us to examine current epidemiological dynamics
- Need to have the appropriate infrastructure
When combined with models, we can assess the impact of interventions
- Changes in incidence/population structure may be reflected in the sequence data

Real-time molecular epidemiology

With increased availability, decreased cost, and higher throughput of sequence data, there is increased interest in (near) real-time molecular epidemiology of pathogens
Needs
- Samples
- Sequencing
- Algorithms
- Platforms

Platforms

Disease Bioportal (UC Davis)
- Originally for FMDV, now includes influenza A
Microreact (David Aanensen, Imperial/Sanger)
- Predominantly bacterial pathogens
nextflu/nextstrain (Trevor Bedford/Richard Neher)
- Originally influenza
- Extended to Ebola, Zika, and other acute viruses
HIV-trace (Sergei Pond/Steven Weaver/Joel Wertheim)
- HIV, but just for clustering sequences
Epimatics (Art Poon)
- Not open source

Real-time surveillance of HIV

At first glance, HIV does not stand out as a good candidate for real time surveillance
- Chronic infection, with diagnosis long after infection
- Typically does not cause outbreaks

Challenges in implementation

Most molecular epidemiological analyses are done manually
- Lack of reproducibility
- Labour-intensive
Data governance issues favour a decentralised system
- Cannot send data ‘out’
- Have to implement all tools within the same system
Databases may be large
- Need to have fast algorithms
- ‘Online’ approaches to facilitate real-time analyses
Data are complex
- Need to offer reports, visualisations, etc..

Why not nextstrain?

Pros:
- Winner of the NIH/Wellcome Trust Open Science Prize
Cons:
- Linked to specific database platform
- Runs analyses in batch, rather than ‘online’
- Light on data processing
- Moving target
Enter nextHIV…

nextHIV

Platform for real-time HIV surveillance
- Sequence data
- Metadata
- Allows multiple sequences per individual
Takes ideas and the ‘good bits’ from other platforms
- Benefits of open source
- In collaboration with Trevor, Richard, Sergei, etc.
Emphasis is on speed and to deploy locally

Application

Since 2001, the Vanderbilt Comprehensive Care Clinic (VCCC) has been generating HIV sequence data for clinical care
Most patients included in the North American AIDS Cohort Collaboration on Research and Design
- Validated clinical endpoints, demographics, treatment data and regimens
2,915 individuals, 4,727 sequences with demographic data
773 individuals, 1505 sequences without demographics
Over 170,000 PR/RT sequences from the US in public databases
- Very few come from Tennessee

nextHIV workflow

Imports data into a database
Aligns it against a standard HIV reference strain
Generates drug resistance interpretations
Compares sequences against a large reference database
Infers subtypes
Generates clusters of related sequences
Generates phylogenies
Generates interactive reports

Alignment

Multiple sequence alignment can be slow for large numbers of sequences
Map to the reference HXB2 using the Python library BioExt (Hepler/Weaver/Pond)
- Fast (and parallelised)
- Codon-aware
- Allows us to use HXB2 numbering
Aligns over 20 sequences per second on a standard laptop

Drug resistance

We employ the Stanford drug resistance scoring scheme
Rather than use a web service, the scheme is implemented locally
‘Naive’ individuals determined using regimen data

Transmitted NNRTI resistance

Distance calculations

For clustering we use an implementation of the TN93 distance by Sergei
- Part of HIV-TRACE
- Refactored into a Python/C library
Copes with ambiguous nucleotides systematically
Compute all pairwise distances between cohort sequences
Identify closest matches from a reference alignment

Clustering

To help overcome some of the problems with analysing large sequence datasets, many HIV studies focus on clusters of similar sequences
One simple approach we have used is to calculate the distance to the nearest sequence
- Closer sequences = more clustering
This approach does not involve choosing a threshold

Clustering

Clustering by gender

Clustering in PANGEA village simulations

Clustering by risk group

Clustering by age

Subtype assignment

Whilst SCUEAL and Rega v3 are arguably the standard, we would like something much faster
Candidates:
- COMET
  - Uses a prediction by partial matching approach
  - Good performance
  - Not open source
- STAR
  - Uses a position-specific scoring matrix
  - Not actively maintained

nextHIV subtying pipeline

Subtype of closest (TN93) reference
pyCOMET
- An open source implementation of COMET in Python
- Alignment free
PSSM
- An open source PSSM method along the lines of STAR
- Requires alignment

Non-B subtypes

Phylogenetic reconstruction

We use RAxML and IQTREE for phylogenetic reconstruction
- Both can add single sequences to an existing tree
For molecular epidemiological purposes, we would like time-calibrated trees
- We employ treedater (Volz)
  - Fast, relaxed clock approach

treedater performance

Phylogenetic processing

In many settings, we may have many sequences from a given subtype
- Results in phylogenies that are difficult to interpret
We consider smaller subtrees:
- Slice the phylogeny at a given time
- Carve into subtrees that are monophyletic with respect to the reference sequences

Subtype B

Parsimonious introductions

Cluster distributions

2783 subtype B sequences from unique patients
931 clusters

Largest monophyletic cluster

Characteristics of largest cluster

Young
- Median 26 yo
African American
- 90/105
Male, MSM
- 91/105 MSM
Geographically dispersed
- Majority come from 3 ZIP3 codes

Tip lengths

Reporting

Considerations
- Interactivity is desirable
- Self-contained (detached from source data)
- Easily customiseable
R Markdown document
- Easy to include interactive ‘widgets’

Configuration

Configuration is performed with a simply formatted text file

sequence:
  table: 'sequences'
  column: 'SEQUENCE'
  name: 'SEQNAME'
  pid: 'CFAR_PID'
  ordering: 'AGE_AT_SEQUENCE'
  collection_date: 'YEAR_OF_SEQUENCE'
  startpos: 0
  endpos: 1497
  processed_table: 'processed_sequences'
  processed_name: 'SEQUENCE_ALIGNED'

Implementation

Open source (Python/R)
Can use different database backends to store data
Is ‘containerised’ to allow running on Windows, Mac, Linux servers
- Currently Docker container
Many analyses run in parallel (if on a multicore machine)

Work in progress

NGS/WGS
Geographical analyses
Standards for interoperability?

Summary

We have developed a platform for real-time HIV surveillance
- Could be used for other viruses too
Alongside algorithmic development
- Time calibrated trees (Erik)
- Subtyping (Mukarram)
- Counting introductions

Implementing a real-time HIV molecular epidemiology platform

Simon Frost, Mukarram Hossain, Art Poon, Sergei Kosakovsky Pond, Steven Weaver, Erik Volz, Trevor Bedford, Richard Neher, Sten Vermund, Timothy Sterling, Ann Dennis, Marcia Kalish

Introduction

Real-time molecular epidemiology

Platforms

Real-time surveillance of HIV

Challenges in implementation

Why not nextstrain?

nextHIV

Application

nextHIV workflow

Alignment

Drug resistance

Transmitted NNRTI resistance

Distance calculations

Clustering

Clustering

Clustering by gender

Clustering in PANGEA village simulations

Clustering by risk group

Clustering by age

Subtype assignment

nextHIV subtying pipeline

Non-B subtypes

Phylogenetic reconstruction

treedater performance

Phylogenetic processing

Subtype B

Parsimonious introductions

Cluster distributions

Largest monophyletic cluster

Characteristics of largest cluster

Tip lengths

Reporting

Configuration

Implementation

Work in progress

Summary

Thanks!

sdwfrost@gmail.com @sdwfrost http://github.com/sdwfrost