Step-by-step instructions to interact HEST-1k

This tutorial will guide you to:

Read HEST data
Visualized the spots over a downscaled version of the WSI
Saving HESTData into Pyramidal Tif and anndata

This tutorial assumes that the user has already downloaded HEST-1k (in its entirety or partially).

Read HEST

from hest import iter_hest

# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=['TENX105']):

    # ST (adata):
    adata = st.adata
    print('\n* Scanpy adata:')
    print(adata)

    # WSI:
    wsi = st.wsi
    print('\n* WSI:')
    print(wsi)
    
    # Shapes:
    shapes = st.shapes
    print('\n* Shapes:')
    print(shapes)
    
    # Tissue segmentation
    tissue_contours = st.tissue_contours
    print('\n* Tissue contours:')
    print(tissue_contours)
    
    # Conversion to SpatialData
    sdata = st.to_spatial_data()
    print('\n* SpatialData conversion:')
    print(sdata)

st.adata is a spatial scanpy object containing the following:

Observations (st.adata.obs)

in_tissue: Indicator if the observation is within the tissue (in_tissue comes from the initial Visium/Xenium run and might not be accurate, prefer the segmentation obtained by st.segment_tissue() instead).
pxl_col_in_fullres: Pixel column position of the patch/spot centroid in the full resolution image.
pxl_row_in_fullres: Pixel row position of the patch/spot centroid in the full resolution image.
array_col: Patch/spot column position in the array.
array_row: Patch/spot row position in the array.
n_counts: Number of counts for each observation.
n_genes_by_counts: Number of genes detected by counts in each observation.
log1p_n_genes_by_counts: Log-transformed number of genes detected by counts.
total_counts: Total counts per observation.
log1p_total_counts: Log-transformed total counts.
pct_counts_in_top_50_genes: Percentage of counts in the top 50 genes.
pct_counts_in_top_100_genes: Percentage of counts in the top 100 genes.
pct_counts_in_top_200_genes: Percentage of counts in the top 200 genes.
pct_counts_in_top_500_genes: Percentage of counts in the top 500 genes.
total_counts_mito: Total mitochondrial counts per observation. (note that this field might not be accurate)
log1p_total_counts_mito: Log-transformed total mitochondrial counts. (note that this field might not be accurate)
pct_counts_mito: Percentage of counts that are mitochondrial. (note that this field might not be accurate)

Variables (st.adata.var)

n_cells_by_counts: Number of cells detected by counts for each variable.
mean_counts: Mean counts per variable.
log1p_mean_counts: Log-transformed mean counts.
pct_dropout_by_counts: Percentage of dropout events by counts.
total_counts: Total counts per variable.
log1p_total_counts: Log-transformed total counts.
mito: Indicator if the gene is mitochondrial. (note that this field might not be accurate)

Unstructured (st.adata.uns)

spatial: Contains a downscaled version of the full resolution image in st.adata.uns['spatial']['ST']['images']['downscaled_fullres']

Observation-wise Multidimensional (st.adata.obsm)

spatial: Pixel coordinates of spots/patches centroids on the full resolution image. (first column is x axis, second column is y axis)

Visualizing the spots over a downscaled version of the WSI

# visualize the spots over a downscaled version of the full resolution image
save_dir = '.'
st.save_spatial_plot(save_dir)

Saving to pyramidal tiff and h5

Save HESTData object to .tiff + expression .h5ad and a metadata file.

# Warning saving a large image to pyramidal tiff (>1GB) can be slow on a hard drive.
st.save(save_dir, pyramidal=True)

Tissue segmentation

We integrated 2 tissue segmentation methods:

Image processing-based using Otsu thresholding
Deep learning-based using a fine-tuned DeepLabV3 ResNet50

save_dir = '.'

name = 'tissue_seg_otsu'
st.segment_tissue(method='otsu') 
st.save_tissue_seg_pkl(save_dir, name)

name = 'tissue_seg_deep'
st.segment_tissue(method='deep') 

Changing the Patching/pooling size

Patching

You can change the size of patches around the spots with dump_patches:

# directory where the patch .h5 will be saved
patch_save_dir = './processed'
new_patch_size = 224


st.dump_patches(
    patch_save_dir,
    name='demo',
    target_patch_size=new_patch_size,
    target_pixel_size=0.5 # pixel size of the patches in um/px after rescaling
)

Changing the Pooling size

You can change the pooling size of Xenium/Visium HD samples stored in HESTData.adata with the following:

from hest.readers import pool_transcripts_xenium
from hest import iter_hest

new_spot_size = 200
new_patch_size = 224
patch_save_dir = './processed'

# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=['TENX105'], load_transcripts=True,):
    print(st.transcript_df)

    st.adata = pool_transcripts_xenium(
                    st.transcript_df, # Feel free to convert st.transcript_df to a Dask DataFrame if you are working with limited RAM.
                    st.pixel_size, 
                    key_x='he_x',
                    key_y='he_y',
                    spot_size_um=new_spot_size
                )


    # We change the spots, so we need to re-extract the patches
    st.dump_patches(
        patch_save_dir,
        name='demo',
        target_patch_size=new_patch_size, # target patch size in 224
        target_pixel_size=0.5 # pixel size of the patches in um/px after rescaling
    )

Batch effect visualization

import pandas as pd
from hest.batch_effect import filter_hest_stromal_housekeeping, get_silhouette_score, plot_umap

meta_df = pd.read_csv('../assets/HEST_v1_2_0.csv')


meta_df = meta_df[
    (meta_df['st_technology'] == 'Visium') & 
    (meta_df['disease_state'] == 'Cancer') & 
    (meta_df['oncotree_code'] == 'IDC') & 
    (meta_df['species'] == 'Homo sapiens')
]

# We filter spots, such that:
# - we only keep the most stable housekeeping genes (based on https://housekeeping.unicamp.br/?download)
# - we only keep the spots under the stroma (based on CellViT segmentation)
adata_list = filter_hest_stromal_housekeeping(meta_df, hest_dir='../hest_data')

labels = meta_df['id'].values
score = get_silhouette_score(adata_list, labels=labels)
# (-1: strong overlap between clusters, can be indicative of a small batch effect, 1: poor overlap between clusters, can be indicative of a strong batch effect)
print(f'Silhouette score is {score}')


plot_umap(adata_list, labels, 'batch_whole_tissue.png', umap_kwargs={'min_dist': 0.6})

Loading transcripts (for Xenium only)

The transcript dataframe contains the following columns:

cell_id (default Xenium Explorer flag): id of cell containing this transcript
he_x, he_y: x, y coordinate or transcript on the HE WSI in pixels (aligned with a non-rigid transformation)
feature_name: name of transcript
qv (default Xenium Explorer flag): Phred-scaled quality value (Q-Score) estimating the probability of incorrect call

from hest import iter_hest

# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=['TENX105'], load_transcripts=True):
    print(st.transcript_df)

Warning: The segmentation contained in ‘xenium_seg/’ (accessed with st.get_shapes(‘xenium_cell’, ‘he’) or st.get_shapes(‘xenium_nucleus’, ‘he’)) is different than the CellViT segmentation. It was obtained by the Xenium instrument directly by segmenting the DAPI stained image, the DAPI segmented cells were then aligned to H&E with a non rigid transform.

Linking transcripts with cells

Each transcript is associated with a cell id (derived by the Xenium analyzer directly). See column cell_id in the transcript dataframe.

transcripts = st.transcript_df
shapes = st.get_shapes('xenium_cell', 'he').shapes

cell_id = 1

# Print the transcripts associated with a particular cell
print(transcripts[transcripts['cell_id'] == cell_id])
print(shapes[shapes.index == cell_id])