Step-by-step instructions to download HEST-1k
This tutorial will guide you to:
Download HEST-1k in its entirety (scanpy, whole-slide images, patches, nuclear segmentation, alignment preview)
Download some samples of HEST-1k
Download samples with some attributes (e.g., all breast cancer cases)
Inspect freshly downloaded samples
Instructions for Setting Up HuggingFace Account and Token
1. Create an Account on HuggingFace
Follow the instructions provided on the HuggingFace sign-up page.
2. Accept terms of use of HEST
Go to the HEST HuggingFace page
Request access (access will be automatically granted)
At this stage, you can already manually inspect the data by browsing the Files and versions tab.
3. Create a Hugging Face Token
Go to Settings: Navigate to your profile settings by clicking on your profile picture in the top right corner and selecting Settings from the dropdown menu.
Access Tokens: In the settings menu, find and click on Access tokens.
Create New Token:
Click on New token.
Set the token name (e.g., hest).
Set the access level to Write.
Click on Create.
Copy Token: After the token is created, copy it to your clipboard. You will need this token for authentication.
4. Log in
Install the Python library datasets and run the cells below. If successful, you should see:
Your token has been saved to /home/usr/.cache/huggingface/token
Login successful
%%bash
pip install datasets
from huggingface_hub import login
login(token="YOUR HUGGING FACE TOKEN")
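Hard-coding the token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the token has been exported beforehand in an environment variable named HF_TOKEN. The helper names resolve_hf_token and login_from_env are hypothetical and not part of huggingface_hub:

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token


def login_from_env(env_var="HF_TOKEN"):
    """Log in with the token read from the environment (hypothetical helper)."""
    # Imported lazily so the token lookup can be used without huggingface_hub.
    from huggingface_hub import login

    login(token=resolve_hf_token(env_var))
```

Calling login_from_env() can then stand in for the login(token=...) call, so the token never appears in the notebook itself.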
Download full HEST-1k (~1TB)
import os
import zipfile
from huggingface_hub import snapshot_download
from tqdm import tqdm
def download_hest(patterns, local_dir):
    repo_id = 'MahmoodLab/hest'
    snapshot_download(repo_id=repo_id, allow_patterns=patterns, repo_type="dataset", local_dir=local_dir)
    seg_dir = os.path.join(local_dir, 'cellvit_seg')
    if os.path.exists(seg_dir):
        print('Unzipping cell vit segmentation...')
        for filename in tqdm([s for s in os.listdir(seg_dir) if s.endswith('.zip')]):
            path_zip = os.path.join(seg_dir, filename)
            with zipfile.ZipFile(path_zip, 'r') as zip_ref:
                zip_ref.extractall(seg_dir)
local_dir='../hest_data' # hest will be downloaded to this folder
# Note that the full dataset is around 1TB of data
download_hest('*', local_dir)
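Since the full download is around 1TB, it can be worth checking free disk space before starting. The following is a rough sketch using only the standard library. The helper names and the 1000 GB threshold are assumptions for illustration, and the check ignores the extra space needed once the segmentation archives are unzipped:

```python
import os
import shutil


def nearest_existing_dir(path):
    """Walk up from `path` until an existing directory is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def has_room(local_dir, required_gb):
    """Return True if the disk holding `local_dir` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(nearest_existing_dir(local_dir)).free
    return free_bytes / 1e9 >= required_gb


if not has_room('../hest_data', 1000):  # 1000 GB is an approximate, assumed threshold
    print('Warning: less than ~1TB free on the target disk.')
```

The lookup walks up to the nearest existing parent because the target folder may not exist before the first download.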
Download HEST-1k based on sample IDs
local_dir='../hest_data' # hest will be downloaded to this folder
ids_to_query = ['INT1'] # list of ids to query
list_patterns = [f"*{id}[_.]**" for id in ids_to_query]
download_hest(list_patterns, local_dir)
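The pattern `*{id}[_.]**` requires the ID to be followed by an underscore or a dot, which keeps a query for INT1 from also pulling INT10 or INT12. The sketch below illustrates this with Python's fnmatch, assuming allow_patterns follows the same glob semantics. The file paths are hypothetical and serve only as examples:

```python
from fnmatch import fnmatch

sample_id = "INT1"
pattern = f"*{sample_id}[_.]**"

# Hypothetical repository paths, for illustration only.
candidates = [
    "st/INT1.h5ad",
    "tissue_seg/INT1_vis.jpg",
    "st/INT10.h5ad",
    "patches/INT12.h5",
]

# Keep only the paths where the ID is immediately followed by '_' or '.'.
matched = [path for path in candidates if fnmatch(path, pattern)]
print(matched)  # prints ['st/INT1.h5ad', 'tissue_seg/INT1_vis.jpg']
```

Note that the leading `*` matches any prefix, so an ID that is a suffix of another ID would still match both.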
Download HEST-1k based on metadata keys (e.g., organ, technology, oncotree code)
import pandas as pd
local_dir='../hest_data' # hest will be downloaded to this folder
meta_df = pd.read_csv("hf://datasets/MahmoodLab/hest/HEST_v1_3_0.csv")
# Filter the dataframe by organ, oncotree code...
meta_df = meta_df[meta_df['oncotree_code'] == 'IDC']
meta_df = meta_df[meta_df['organ'] == 'Breast']
ids_to_query = meta_df['id'].values
list_patterns = [f"*{id}[_.]**" for id in ids_to_query]
download_hest(list_patterns, local_dir)
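When filters are chained, a typo in a column value can silently leave no samples selected. A small sketch of a guard that builds the patterns and fails early on an empty selection follows. The helper name patterns_for_ids is hypothetical, and the IDs in the usage line are placeholders rather than guaranteed HEST samples:

```python
def patterns_for_ids(sample_ids):
    """Build download patterns from sample IDs, refusing an empty selection."""
    # Normalise to stripped strings and drop duplicates and blanks.
    ids = sorted({str(i).strip() for i in sample_ids if str(i).strip()})
    if not ids:
        raise ValueError("No sample IDs selected; check the metadata filters.")
    return [f"*{sample_id}[_.]**" for sample_id in ids]


# Placeholder IDs; in the tutorial this would be meta_df['id'].values.
print(patterns_for_ids(["TENX1", "TENX1", " NCBI2 "]))
```

The returned list can be passed to download_hest in place of list_patterns.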
Inspect freshly downloaded samples
For each sample, we provide:
wsis/: H&E-stained whole slide images in pyramidal Generic TIFF (or pyramidal Generic BigTIFF if >4.1GB)
st/: Spatial transcriptomics expressions in a scanpy .h5ad object
metadata/: Metadata
spatial_plots/: Overlay of the WSI with the st spots
thumbnails/: Downscaled version of the WSI
tissue_seg/: Tissue segmentation masks:
{id}.geojson: Tissue segmentation mask
{id}_vis.jpg: Visualization of the tissue mask on the downscaled WSI
pixel_size_vis/: Visualization of the pixel size
patches/: 256x256 H&E patches (0.5µm/px) extracted around ST spots in a .h5 object optimized for deep-learning. Each patch is matched to the corresponding ST profile (see st/) with a barcode.
patches_vis/: Visualization of the mask and patches on a downscaled WSI.
transcripts/: Individual transcripts aligned to H&E for Xenium samples; read with pandas.read_parquet; aligned coordinates in pixels are in columns ['he_x', 'he_y']
cellvit_seg/: CellViT nuclei segmentation
xenium_seg/: Xenium segmentation on DAPI, aligned to H&E
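To get a quick overview of what actually landed on disk, the folders listed above can be counted with the standard library. This is a minimal sketch: the helper name summarize_download is hypothetical, and hidden folders (for example a download cache) are skipped on the assumption that they hold no sample data:

```python
from pathlib import Path


def summarize_download(local_dir):
    """Return {subfolder name: number of files} for each top-level folder."""
    root = Path(local_dir)
    summary = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        if sub.name.startswith("."):
            continue  # skip hidden folders such as caches
        # Count files recursively, since some folders contain extracted archives.
        summary[sub.name] = sum(1 for f in sub.rglob("*") if f.is_file())
    return summary
```

For example, `summarize_download('../hest_data')` would return a dictionary keyed by folder names such as st or wsis, depending on which patterns were downloaded.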
from hest import iter_hest
import pandas as pd
# Ex: inspect all the Invasive Lobular Carcinoma samples (ILC)
meta_df = pd.read_csv('../assets/HEST_v1_3_0.csv')
id_list = meta_df[meta_df['oncotree_code'] == 'ILC']['id'].values
print('load hest...')
# Iterate through a subset of hest
for st in iter_hest('../hest_data', id_list=id_list):
    print(st)
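If only part of the dataset was downloaded, it can help to check which of the requested IDs are present before iterating. The sketch below assumes that each sample's expression file is stored as st/<id>.h5ad. That naming is inferred from the folder description and the download patterns, not confirmed by the HEST library, and the helper name is hypothetical:

```python
from pathlib import Path


def split_by_availability(local_dir, id_list):
    """Split IDs into (present, missing) based on st/<id>.h5ad existing on disk."""
    st_dir = Path(local_dir) / "st"
    present, missing = [], []
    for sample_id in id_list:
        # Assumed naming convention: st/<id>.h5ad
        if (st_dir / f"{sample_id}.h5ad").is_file():
            present.append(sample_id)
        else:
            missing.append(sample_id)
    return present, missing
```

The present list can then be passed as id_list to iter_hest, and the missing list fed back into the download step.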