Step-by-step instructions to download HEST-1k

This tutorial will guide you to:

  • Download HEST-1k in its entirety (scanpy, whole-slide images, patches, nuclear segmentation, alignment preview)

  • Download some samples of HEST-1k

  • Download samples with some attributes (e.g., all breast cancer cases)

  • Inspect freshly downloaded samples

Instructions for Setting Up HuggingFace Account and Token

1. Create an Account on HuggingFace

Follow the instructions provided on the HuggingFace sign-up page.

2. Accept the HEST terms of use

  1. Go to the HEST HuggingFace page

  2. Request access (access will be automatically granted)

  3. At this stage, you can already inspect the data manually by browsing the Files and versions tab

3. Create a Hugging Face Token

  1. Go to Settings: Navigate to your profile settings by clicking on your profile picture in the top right corner and selecting Settings from the dropdown menu.

  2. Access Tokens: In the settings menu, find and click on Access tokens.

  3. Create New Token:

    • Click on New token.

    • Set the token name (e.g., hest).

    • Set the access level to Write.

    • Click on Create.

  4. Copy Token: After the token is created, copy it to your clipboard. You will need this token for authentication.

4. Log in

Install the Python library datasets (it installs huggingface_hub as a dependency) and run the cells below. If the login succeeds, you should see:

Your token has been saved to /home/usr/.cache/huggingface/token
Login successful
%%bash
pip install datasets

from huggingface_hub import login

login(token="YOUR HUGGING FACE TOKEN")
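
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. The variable name HF_TOKEN and the helper names below are choices made for this example, not part of the tutorial.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Hugging Face token")
    return token


def login_from_env(env_var="HF_TOKEN"):
    """Log in with the token taken from the environment (hypothetical helper)."""
    # Imported here so that get_hf_token() works even without huggingface_hub installed
    from huggingface_hub import login

    login(token=get_hf_token(env_var))
```

After exporting HF_TOKEN in the shell that starts the notebook, calling login_from_env() would take the place of the login(token=...) cell.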

Download full HEST-1k (~1TB)

import os
import zipfile

from huggingface_hub import snapshot_download
from tqdm import tqdm

def download_hest(patterns, local_dir):
    """Download the files of MahmoodLab/hest matching `patterns` into `local_dir`."""
    repo_id = 'MahmoodLab/hest'
    snapshot_download(repo_id=repo_id, allow_patterns=patterns, repo_type="dataset", local_dir=local_dir)

    # CellViT segmentations are downloaded as .zip archives; extract them in place
    seg_dir = os.path.join(local_dir, 'cellvit_seg')
    if os.path.exists(seg_dir):
        print('Unzipping CellViT segmentation...')
        for filename in tqdm([s for s in os.listdir(seg_dir) if s.endswith('.zip')]):
            path_zip = os.path.join(seg_dir, filename)

            with zipfile.ZipFile(path_zip, 'r') as zip_ref:
                zip_ref.extractall(seg_dir)

local_dir = '../hest_data'  # HEST will be downloaded to this folder

# Note that the full dataset is around 1TB of data
download_hest('*', local_dir)
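
Since the full download is around 1TB, it may be worth checking the free disk space before starting. The sketch below uses only the standard library. The helper name has_enough_space and the threshold of 10**12 bytes are assumptions made for illustration.

```python
import os
import shutil


def has_enough_space(path, required_bytes):
    """Return True if the filesystem containing `path` has at least `required_bytes` free."""
    # Walk up to the closest existing directory, since `path` may not exist yet
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    return shutil.disk_usage(probe).free >= required_bytes


required = 10**12  # ~1TB, the approximate size of the full dataset
if not has_enough_space('../hest_data', required):
    print('Warning: less than ~1TB free; consider downloading a subset instead')
```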

Download HEST-1k based on sample IDs

local_dir='../hest_data' # HEST will be downloaded to this folder

ids_to_query = ['INT1'] # list of ids to query

# Match files named <id>.<ext> or <id>_<suffix> in any subfolder
list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]
download_hest(list_patterns, local_dir)
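
To see what the pattern built above selects, it can be tried offline on a few file names. The sketch below assumes that allow_patterns follows fnmatch-style shell wildcards. The file names are hypothetical and only meant to show why querying INT1 does not also pull in INT10.

```python
from fnmatch import fnmatch


def select_files(filenames, patterns):
    """Return the file names matching at least one shell-style pattern."""
    return [f for f in filenames if any(fnmatch(f, p) for p in patterns)]


ids_to_query = ['INT1']
patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]

# Hypothetical file names, for illustration only
candidates = [
    'st/INT1.h5ad',
    'patches/INT1.h5',
    'cellvit_seg/INT1_cellvit_seg.zip',
    'st/INT10.h5ad',
    'st/INT2.h5ad',
]

print(select_files(candidates, patterns))
```

The [_.] part requires the ID to be followed by an underscore or a dot, which is what keeps longer IDs sharing the same prefix out of the selection.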

Download HEST-1k based on metadata keys (e.g., organ, technology, oncotree code)

import pandas as pd

local_dir='../hest_data' # HEST will be downloaded to this folder

meta_df = pd.read_csv("hf://datasets/MahmoodLab/hest/HEST_v1_3_0.csv")

# Filter the dataframe by organ, oncotree code...
meta_df = meta_df[meta_df['oncotree_code'] == 'IDC']
meta_df = meta_df[meta_df['organ'] == 'Breast']

ids_to_query = meta_df['id'].values

list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]
download_hest(list_patterns, local_dir)
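
The two filters above amount to keeping the rows that satisfy every criterion and turning their IDs into download patterns. A dependency-free sketch of that logic on a tiny in-memory table is shown below. The rows are made up, and only the column names id, organ, and oncotree_code are taken from the code above.

```python
import csv
import io

# Made-up rows for illustration; the real metadata CSV has more columns
toy_csv = """id,organ,oncotree_code
S1,Breast,IDC
S2,Breast,ILC
S3,Kidney,CCRCC
S4,Breast,IDC
"""


def query_ids(rows, **criteria):
    """Return the ids of rows whose columns equal every value in `criteria`."""
    return [r['id'] for r in rows if all(r[k] == v for k, v in criteria.items())]


rows = list(csv.DictReader(io.StringIO(toy_csv)))
ids_to_query = query_ids(rows, organ='Breast', oncotree_code='IDC')
list_patterns = [f"*{sample_id}[_.]**" for sample_id in ids_to_query]
print(list_patterns)
```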

Inspect freshly downloaded samples

For each sample, we provide:

  • wsis/: H&E-stained whole slide images in pyramidal Generic TIFF (or pyramidal Generic BigTIFF if >4.1GB)

  • st/: Spatial transcriptomics expressions in a scanpy .h5ad object

  • metadata/: Metadata

  • spatial_plots/: Overlay of the WSI with the ST spots

  • thumbnails/: Downscaled version of the WSI

  • tissue_seg/: Tissue segmentation masks:

    • {id}.geojson: Tissue segmentation mask

    • {id}_vis.jpg: Visualization of the tissue mask on the downscaled WSI

  • pixel_size_vis/: Visualization of the pixel size

  • patches/: 256x256 H&E patches (0.5µm/px) extracted around ST spots in a .h5 object optimized for deep-learning. Each patch is matched to the corresponding ST profile (see st/) with a barcode.

  • patches_vis/: Visualization of the mask and patches on a downscaled WSI.

  • transcripts/: Individual transcripts aligned to H&E for Xenium samples; read with pandas.read_parquet; aligned coordinates in pixels are in columns ['he_x', 'he_y']

  • cellvit_seg/: CellViT nuclei segmentation

  • xenium_seg/: Xenium segmentation on DAPI, aligned to H&E

from hest import iter_hest
import pandas as pd

# Ex: inspect all the Invasive Lobular Carcinoma samples (ILC)
meta_df = pd.read_csv('../assets/HEST_v1_3_0.csv')

id_list = meta_df[meta_df['oncotree_code'] == 'ILC']['id'].values

print('load hest...')
# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=id_list):
    print(st)
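
To check which files were actually downloaded for a sample, the folder layout described above can be scanned directly. This is a sketch that assumes every file of a sample is named <id>.<ext> or <id>_<suffix> inside a subfolder of the download directory, as implied by the download patterns. The helper name files_for_sample is hypothetical.

```python
from pathlib import Path


def files_for_sample(local_dir, sample_id):
    """Map each subfolder of `local_dir` to the files belonging to `sample_id`."""
    found = {}
    for subdir in sorted(Path(local_dir).iterdir()):
        if not subdir.is_dir():
            continue
        # Same naming rule as the download patterns: id followed by '_' or '.'
        names = sorted(p.name for p in subdir.glob(f"{sample_id}[_.]*"))
        if names:
            found[subdir.name] = names
    return found
```

For example, files_for_sample('../hest_data', 'INT1') would return a dictionary keyed by subfolder name (st, wsis, patches, and so on), listing only the folders in which a file for that sample was found.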