Step-by-step instructions to download HEST-1k

This tutorial shows you how to:

Download HEST-1k in its entirety (scanpy objects, whole-slide images, patches, nuclear segmentation, alignment preview)
Download selected samples of HEST-1k by ID
Download samples that share given attributes (e.g., all breast cancer cases)
Inspect freshly downloaded samples

Instructions for Setting Up HuggingFace Account and Token

1. Create an Account on HuggingFace

Follow the instructions provided on the HuggingFace sign-up page.

2. Accept the HEST terms of use

Go to the HEST HuggingFace page
Request access (access is granted automatically)
At this stage, you can already inspect the data manually by browsing the Files and versions tab

3. Create a Hugging Face Token

Go to Settings: Navigate to your profile settings by clicking on your profile picture in the top right corner and selecting Settings from the dropdown menu.
Access Tokens: In the settings menu, find and click on Access tokens.
Create New Token:
- Click on New token.
- Set the token name (e.g., hest).
- Set the access level to Write.
- Click on Create.
Copy Token: After the token is created, copy it to your clipboard. You will need this token for authentication.

4. Log in

Install the Python library datasets (which installs huggingface_hub as a dependency) and run the cells below. If the login succeeds, you should see:

Your token has been saved to /home/usr/.cache/huggingface/token
Login successful

%%bash
pip install datasets

from huggingface_hub import login

login(token="YOUR HUGGING FACE TOKEN")
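A token pasted into a notebook cell is easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the token from an environment variable that you export beforehand. The name read_hf_token is hypothetical and exists only for this tutorial; HF_TOKEN is the variable name that huggingface_hub itself recognizes.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing or empty.

    Hypothetical helper for this tutorial; it is not part of huggingface_hub.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token
```

The returned string can then be passed to the login call shown above, i.e. login(token=read_hf_token()).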

Download full HEST-1k (~1TB)

import os
import zipfile

from huggingface_hub import snapshot_download
from tqdm import tqdm

def download_hest(patterns, local_dir):
    """Download the files of MahmoodLab/hest matching `patterns` into `local_dir`."""
    repo_id = 'MahmoodLab/hest'
    snapshot_download(repo_id=repo_id, allow_patterns=patterns, repo_type="dataset", local_dir=local_dir)

    # CellViT segmentations are shipped as .zip archives; extract them in place
    seg_dir = os.path.join(local_dir, 'cellvit_seg')
    if os.path.exists(seg_dir):
        print('Unzipping CellViT segmentation...')
        for filename in tqdm([s for s in os.listdir(seg_dir) if s.endswith('.zip')]):
            path_zip = os.path.join(seg_dir, filename)

            with zipfile.ZipFile(path_zip, 'r') as zip_ref:
                zip_ref.extractall(seg_dir)

local_dir='../hest_data' # HEST will be downloaded to this folder

# Note that the full dataset is around 1TB of data
download_hest('*', local_dir)
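Because the full download is around 1TB, it can be worth checking free disk space before starting. The sketch below uses only the standard library; has_free_space and REQUIRED_BYTES are hypothetical names, and the 1 TB threshold is simply the approximate figure quoted above (the unzipped segmentation archives need extra room on top of it).

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem containing `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Assumption: roughly 1 TB, the approximate size of the full dataset.
REQUIRED_BYTES = 10**12

# `path` must already exist; check the parent folder if the target
# download folder has not been created yet.
if not has_free_space(".", REQUIRED_BYTES):
    print("Less than ~1 TB free here; consider downloading a subset instead.")
```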

Download HEST-1k based on sample IDs

local_dir='../hest_data' # HEST will be downloaded to this folder

ids_to_query = ['INT1'] # list of ids to query

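# Match files whose name contains the ID followed by '_' or '.'
# (loop variable renamed to avoid shadowing the built-in id)
list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]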
download_hest(list_patterns, local_dir)
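To see why the pattern ends in [_.], the sketch below applies the same glob to a handful of made-up paths, assuming that allow_patterns follows Python's fnmatch-style matching. The file names are hypothetical and chosen only to illustrate the behavior: requiring '_' or '.' right after the ID keeps a query for INT1 from also selecting INT10.

```python
from fnmatch import fnmatch

# Hypothetical repository paths, for illustration only.
candidate_paths = [
    "st/INT1.h5ad",
    "wsis/INT1.tif",
    "tissue_seg/INT1_vis.jpg",
    "st/INT10.h5ad",
    "st/INT2.h5ad",
]

pattern = "*INT1[_.]**"

# Keep only the paths the glob would select.
selected = [p for p in candidate_paths if fnmatch(p, pattern)]
print(selected)
# → ['st/INT1.h5ad', 'wsis/INT1.tif', 'tissue_seg/INT1_vis.jpg']
```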

Download HEST-1k based on metadata keys (e.g., organ, technology, oncotree code)

import pandas as pd

local_dir='../hest_data' # HEST will be downloaded to this folder

meta_df = pd.read_csv("hf://datasets/MahmoodLab/hest/HEST_v1_3_0.csv")

# Filter the dataframe by organ, oncotree code...
meta_df = meta_df[meta_df['oncotree_code'] == 'IDC']
meta_df = meta_df[meta_df['organ'] == 'Breast']

ids_to_query = meta_df['id'].values

list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]
download_hest(list_patterns, local_dir)

Inspect freshly downloaded samples

For each sample, we provide:

wsis/: H&E-stained whole slide images in pyramidal Generic TIFF (or pyramidal Generic BigTIFF if >4.1GB)
st/: Spatial transcriptomics expression data in a scanpy .h5ad object
metadata/: Metadata
spatial_plots/: Overlay of the WSI with the ST spots
thumbnails/: Downscaled version of the WSI
tissue_seg/: Tissue segmentation masks:
- {id}.geojson: Tissue segmentation mask
- {id}_vis.jpg: Visualization of the tissue mask on the downscaled WSI
pixel_size_vis/: Visualization of the pixel size
patches/: 256x256 H&E patches (0.5µm/px) extracted around ST spots in a .h5 object optimized for deep-learning. Each patch is matched to the corresponding ST profile (see st/) with a barcode.
patches_vis/: Visualization of the mask and patches on a downscaled WSI.
transcripts/: Individual transcripts aligned to H&E for Xenium samples; read with pandas.read_parquet; aligned coordinates in pixels are in columns ['he_x', 'he_y']
cellvit_seg/: CellViT nuclei segmentation
xenium_seg/: Xenium segmentation on DAPI, aligned to H&E

from hest import iter_hest
import pandas as pd

# Example: inspect all Invasive Lobular Carcinoma (ILC) samples
meta_df = pd.read_csv('../assets/HEST_v1_3_0.csv')

id_list = meta_df[meta_df['oncotree_code'] == 'ILC']['id'].values

print('load hest...')
# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=id_list):
    print(st)
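Before iterating over samples, a quick sanity check is to count how many files landed in each of the folders listed above. The following is a minimal standard-library sketch; count_files_per_folder is a hypothetical helper, not part of the hest package, and it skips hidden folders (such as a download cache, if one is present).

```python
from pathlib import Path


def count_files_per_folder(local_dir):
    """Map each visible immediate subfolder of `local_dir` to its number of files."""
    root = Path(local_dir)
    return {
        sub.name: sum(1 for f in sub.rglob("*") if f.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir() and not sub.name.startswith(".")
    }


# Only report if the download folder used in this tutorial exists.
if Path("../hest_data").is_dir():
    for folder, n_files in count_files_per_folder("../hest_data").items():
        print(f"{folder}: {n_files} file(s)")
```

A folder that is missing or reports zero files usually means the corresponding pattern matched nothing, which is worth checking before loading samples.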