Step-by-step instructions to assemble HEST data

I. Visium reader

This tutorial will guide you to convert a legacy Visium sample into a HEST-compatible object.

Download Visium sample from NCBI

%%bash
# As an example, download the files from the following NCBI study:
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6215674)

mkdir downloads
cd downloads
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6215nnn/GSM6215674/suppl/GSM6215674%5FS13.png.gz
gunzip GSM6215674_S13.png.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM6215nnn/GSM6215674/suppl/GSM6215674%5FS13%5Ffiltered%5Ffeature%5Fbc%5Fmatrix.h5

Create HESTData object from the image and count matrix

The library performs:

  • Creation of AnnData object

  • Creation of OpenSlide object

  • Automatic fiducial detection for spot alignment

Troubleshooting:

If you encounter: SystemError: ffi_prep_closure(): bad user_data (it seems that the version of the libffi library. Attempt: pip install --force-reinstall --no-binary :all: cffi

from hest import VisiumReader

fullres_img_path = 'downloads/GSM6215674_S13.png'
bc_matrix_path = 'downloads/GSM6215674_S13_filtered_feature_bc_matrix.h5'

st = VisiumReader().read(
    fullres_img_path, # path to a full res image
    bc_matrix_path, # path to filtered_feature_bc_matrix.h5
    save_autoalign=True # pass this argument to visualize the fiducial autodetection
)
alignment file is  None
AnnData object with n_obs × n_vars = 1182 × 32298
    var: 'gene_ids', 'feature_types', 'genome'
/home/guillaumejaume/anaconda3/envs/hest/lib/python3.11/site-packages/anndata/_core/anndata.py:1825: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/home/guillaumejaume/anaconda3/envs/hest/lib/python3.11/site-packages/anndata/_core/anndata.py:1825: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
trim the barcodes
no tissue_positions_list.csv/tissue_positions.csv or alignment file found
attempt fiducial auto alignment...

0: 608x640 1 filled_hexagon, 1 open_hexagon, 1 hourglass, 2 triangles, 10.6ms
Speed: 1.7ms preprocess, 10.6ms inference, 7.9ms postprocess per image at shape (1, 3, 608, 640)
/home/guillaumejaume/Documents/project_rosai/HEST/src/hest/autoalign.py:111: UserWarning: You passed both c and facecolor/facecolors for the markers. c has precedence over facecolor/facecolors. This behavior may change in the future.
  ax.scatter(x, y, c='b', s=15, facecolors='none', linewidth=0.2)
/home/guillaumejaume/Documents/project_rosai/HEST/src/hest/autoalign.py:126: UserWarning: Setting the 'color' property will override the edgecolor or facecolor properties.
  circles2 = [plt.Circle(aligned_fiducials[i] * factor, radius=s_fid, color='r', facecolor='none', linewidth=0.3) for i in range(len(aligned_fiducials))]
../_images/e48f34ba81620179950bc71fde18354887d0f7eb84b2da1ab86497c022a5cb81.png
st.save(path='processed')
saving to pyramidal tiff... can be slow

You can also visualize an overlay of the aligned spots on the downscaled WSI

st.save_spatial_plot(save_path='processed')
/home/guillaumejaume/Documents/project_rosai/HEST/src/hest/HESTData.py:1583: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  fig = sc.pl.spatial(adata, show=False, img_key="downscaled_fullres", color=[key], title=f"in_tissue spots", return_fig=True, **pl_kwargs)
../_images/0b4c905af81e8b0513c81d500f5586398796680bd84fdd744d3d4e14a2882f18.png

When should I provide an alignment file and when should I use autoalignment?

Step 1: check if a tissue_positions.csv/tissue_position_list.csv already provides a correct alignment

In most cases, if a spatial/ folder containing a tissue_positions.csv or tissue_position_list.csv is available you don’t need any autoalignment/alignment file.

Try the following:

st = VisiumReader().read(fullres_img_path, bc_matric_path, spatial_coord_path=spatial_path), where spatial_path is contains tissue_positions.csv or tissue_position_list.csv. You can manually inspect the alignment by saving a visualization plot that takes the full resolution image, downscale it and overlays it with the spots (using st.save_spatial_plot(save_dir)). If the alignment is incorrect, try step 2.

Step 2: check if a .json alignment file is provided

If a .json alignment file is available, try: VisiumReader().read(fullres_img_path, bc_matric_path, spatial_coord_path=spatial_path, alignment_file_path=align_path). You can also omit the spatial_coord_path as VisiumReader().read(fullres_img_path, bc_matric_path, alignment_file_path=align_path)

Step 3: attempt auto-alignment

If at least 3 corner fiducials are not cropped out and are reasonably visible, you can attempt an autoalignment with VisiumReader().read(fullres_img_path, bc_matric_path. (if no spatial folder and no alignment_file_path is provided, it will attempt autoalignment by default, you can also force auto-alignment by passing autoalign='always').

Examples:

from hest import VisiumReader

fullres_img_path = 'my_path/image.tif'
bc_matrix_path = 'my_path/filtered_bc_matrix.h5'
spatial_coord_path = 'my_path/spatial'
alignment_file_path = 'my_path/alignment.txt'

st = VisiumReader().read(
    fullres_img_path, # path to a full res image
    bc_matrix_path, # path to filtered_feature_bc_matrix.h5
    spatial_coord_path=spatial_coord_path # path to a space ranger spatial/ folder containing either a tissue_positions.csv or tissue_position_list.csv
)

# if no spatial folder is provided, but you have an alignment file
st = VisiumReader().read(
    fullres_img_path, # path to a full res image
    bc_matrix_path, # path to filtered_feature_bc_matrix.h5
    alignment_file_path=alignment_file_path # path to a .json alignment file
)

# if both the alignment file and the spatial folder are missing, attempt auto-alignment
st = VisiumReader().read(
    fullres_img_path, # path to a full res image
    bc_matrix_path, # path to filtered_feature_bc_matrix.h5
)
alignment file is  None
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[4], line 8
      5 spatial_coord_path = 'my_path/spatial'
      6 alignment_file_path = 'my_path/alignment.txt'
----> 8 st = VisiumReader().read(
      9     fullres_img_path, # path to a full res image
     10     bc_matrix_path, # path to filtered_feature_bc_matrix.h5
     11     spatial_coord_path=spatial_coord_path # path to a space ranger spatial/ folder containing either a tissue_positions.csv or tissue_position_list.csv
     12 )
     14 # if no spatial folder is provided, but you have an alignment file
     15 st = VisiumReader().read(
     16     fullres_img_path, # path to a full res image
     17     bc_matrix_path, # path to filtered_feature_bc_matrix.h5
     18     alignment_file_path=alignment_file_path # path to a .json alignment file
     19 )

File ~/Documents/project_rosai/HEST/src/hest/readers.py:374, in VisiumReader.read(self, img_path, filtered_bc_matrix_path, raw_bc_matrix_path, spatial_coord_path, alignment_file_path, mex_path, scanpy_h5_path, metric_file_path, custom_adata, autoalign, save_autoalign)
    371 check_arg(autoalign, 'autoalign', ['always', 'never', 'auto'])
    373 if filtered_bc_matrix_path is not None:
--> 374     adata = sc.read_10x_h5(filtered_bc_matrix_path)
    375 elif raw_bc_matrix_path is not None:
    376     adata = sc.read_10x_h5(raw_bc_matrix_path)

    [... skipping hidden 1 frame]

File ~/anaconda3/envs/hest/lib/python3.11/site-packages/scanpy/readwrite.py:221, in read_10x_h5(filename, genome, gex_only, backup_url)
    219 if not is_present:
    220     logg.debug(f"... did not find original file {path}")
--> 221 with h5py.File(str(path), "r") as f:
    222     v3 = "/matrix" in f
    223 if v3:

File ~/anaconda3/envs/hest/lib/python3.11/site-packages/h5py/_hl/files.py:566, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, track_times, **kwds)
    557     fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0,
    558                      locking, page_buf_size, min_meta_keep, min_raw_keep,
    559                      alignment_threshold=alignment_threshold,
    560                      alignment_interval=alignment_interval,
    561                      meta_block_size=meta_block_size,
    562                      **kwds)
    563     fcpl = make_fcpl(track_order=track_order, track_times=track_times,
    564                      fs_strategy=fs_strategy, fs_persist=fs_persist,
    565                      fs_threshold=fs_threshold, fs_page_size=fs_page_size)
--> 566     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    568 if isinstance(libver, tuple):
    569     self._libver = libver

File ~/anaconda3/envs/hest/lib/python3.11/site-packages/h5py/_hl/files.py:241, in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    239     if swmr and swmr_support:
    240         flags |= h5f.ACC_SWMR_READ
--> 241     fid = h5f.open(name, flags, fapl=fapl)
    242 elif mode == 'r+':
    243     fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py/h5f.pyx:104, in h5py.h5f.open()

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = 'my_path/filtered_bc_matrix.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Auto read

Given that visium_dir contains a full resolution image and all the necessary Visium files such as the filtered_bc_matrix.h5 and the spatial folder, VisiumReader.auto_read(path) should be able to automatically read the sample. Prefer read for a more fine grain control.

from hest import VisiumReader

# Set this to a folder containing a full-res image, filtered matrix and spatial files.
visium_dir = "PATH_TO_VISIUM_DIR"

# attempt autoread
st = VisiumReader().auto_read(visium_dir)

II. Xenium reader

Download Xenium sample from 10x genomics website

Download the following xenium files and place them in the same directory

https://www.10xgenomics.com/datasets/human-skin-data-xenium-human-multi-tissue-and-cancer-panel-1-standard

%%bash

mkdir downloads
cd downloads
wget https://cf.10xgenomics.com/samples/xenium/1.9.0/Xenium_V1_hSkin_nondiseased_section_1_FFPE/Xenium_V1_hSkin_nondiseased_section_1_FFPE_outs.zip
unzip Xenium_V1_hSkin_nondiseased_section_1_FFPE_outs.zip -d Xenium_V1_hSkin_nondiseased_section_1_FFPE_outs
%%bash

cd downloads/Xenium_V1_hSkin_nondiseased_section_1_FFPE_outs
wget https://cf.10xgenomics.com/samples/xenium/1.9.0/Xenium_V1_hSkin_nondiseased_section_1_FFPE/Xenium_V1_hSkin_nondiseased_section_1_FFPE_he_image.ome.tif
wget https://cf.10xgenomics.com/samples/xenium/1.9.0/Xenium_V1_hSkin_nondiseased_section_1_FFPE/Xenium_V1_hSkin_nondiseased_section_1_FFPE_he_imagealignment.csv
from hest import XeniumReader

xenium_folder_path = 'downloads/Xenium_V1_hSkin_nondiseased_section_1_FFPE_outs'

st = XeniumReader().auto_read(xenium_folder_path)

Working with larger than RAM Xenium samples (Xenium 5k)

We support larger than RAM transcripts pooling powered by dask. Dask will automatically chunk the data such that it never has to hold the entire transcript dataframe in memory.

Dask will attempt to process one partition per thread. To avoid loading large partitions on systems having a low amount of RAM, we advise using a high number of partitions (>100), as well as a single worker and a low number of threads (<4 depending on RAM available).

Note: feel free to open the dask dashboard to visualize workers, partitions and resources (usually on http://localhost:8787)

from dask.distributed import LocalCluster, Client

cluster = LocalCluster(
    "127.0.0.1:8786",
    n_workers=1, # increase depending on RAM available
    memory_limit="20GB", # dask will kill the worker if this is exceeded
    threads_per_worker=1, # increase depending on RAM available
)
client = Client(cluster)
print('dashboard is available at: ', client.dashboard_link)

st = XeniumReader().auto_read(
    xenium_folder_path, 
    use_dask=True, 
    nb_partitions=100
)
st.save_spatial_plot('./')
st.segment_tissue()

We then compute patches centered around pseudo-visium transcript bins.

Warning: note that patches might be larger than transcript bins.

st.dump_patches('patches', target_patch_size=224, target_pixel_size=0.5)
st.save('save', save_img=False)

Adding new samples to HEST

This section explains how to format new samples for HuggingFace

I. Sample preparation

1. Download raw datasets in structured folders

Create the following folder structure:

my_data/
    xenium/
        dataset_name_1/
            subseries1/
                sample_1/
                    ...
                sample_2/
                    ...
                ...
        dataset_name_2/
            sample_1/
                ...
    visium-hd/
        ...
    visium/
        ...

Then, download corresponding files for xenium, visium, visium-hd… See reader examples above or refer to the doc for the list of required files.

2. Add metadata rows to CSV

Fill columns in the sample CSV, specifically: dataset_title, subseries should match with the folder structure.

II. Sample processing

1. Process raw samples

Set the path to your sample CSV in tutorials/scripts/1_process_raw_samples.py, also modify the memory_limit, we recommend setting dask memory limit to at least 20GB for Xenium 5k, eventhough 30GB is safer.

Process and save raw samples by launching python tutorials/scripts/1_process_raw_samples.py.

Processing Xenium 5k will take time, you can monitor progress on the dask dashboard (usually at localhost:8787 after launch).

2. Check Xenium DAPI/HE alignment and realign if necessary

By default, the Xenium platform uses a single affine transform for the whole WSI. For some samples, the resulting alignment might be unsatisfactory. In order to visualize the affine alignment, either:

  • open the resulting .geojson files (in processed/) in QuPath (might not work with QuPath >=0.6)

    or

  • launch tutorials/scripts/2_check_xenium_alignment.py.

3. Micro-align DAPI to HE with Valis

If the alignment from steps (1-2) is unsatisfactory, we highly recommend using Valis non-rigid micro-alignment for sub-cellular alignment precision. This is crucial for correctly mapping transcripts and cells to H&E.

a. Register DAPI to HE with Valis

We provide a modified Valis version for simplified pythonic use and improved precision, please check-out the original repository here.

Open 3a_microalign_xenium.py, in this example, we use morphology_focus_0000.ome.tif as the DAPI slide, feel free to use morphology_focus.ome.tif.

Monitor alignment quality at results/{sample_name}/{date}/overlaps/, check both _rigid_overlap.png and micro_reg.png.

b. Warp transcripts, cells and nuclei using Valis and Dask

See 3b_microalign_xenium.py to warp transcripts, nuclei and cells from DAPI to H&E.

Then re-run step 3.b, in order to compare quality.

4. Segment with CellViT

See 4_segment_cellvit.py to segment with CellViT.

5. Copy processed files to HEST_results/

See 5_copy_processed.py to copy files to the structure expected by huggingface.

6. Generate a new HEST_vX_Y_Z.csv sheet

See 6_generate_new_meta.py to create a new HEST_vX_Y_Z.csv. Once created, copy it to the HEST_results/ folder.

7. Upload to HuggingFace

See 7_upload_huggingface.py to upload to HuggingFace via a PR.