{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Step-by-step instructions to download HEST-1k \n", "\n", "This tutorial will guide you to:\n", "\n", "- Download HEST-1k in its entirety (scanpy, whole-slide images, patches, nuclear segmentation, alignment preview)\n", "- Download some samples of HEST-1k \n", "- Download samples with some attributes (e.g., all breast cancer cases) \n", "- Inspect freshly downloaded samples\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Instructions for Setting Up HuggingFace Account and Token\n", "\n", "#### 1. Create an Account on HuggingFace\n", "Follow the instructions provided on the [HuggingFace sign-up page](https://huggingface.co/join).\n", "\n", "#### 2. Accept terms of use of HEST\n", "\n", "1. Go to [HEST HuggingFace page](https://huggingface.co/datasets/MahmoodLab/hest)\n", "2. Request access (access will be automatically granted)\n", "3. At this stage, you can already manually inspect the data by navigating in the `Files and version`\n", "\n", "#### 3. Create a Hugging Face Token\n", "\n", "1. **Go to Settings:** Navigate to your profile settings by clicking on your profile picture in the top right corner and selecting `Settings` from the dropdown menu.\n", "\n", "2. **Access Tokens:** In the settings menu, find and click on `Access tokens`.\n", "\n", "3. **Create New Token:**\n", " - Click on `New token`.\n", " - Set the token name (e.g., `hest`).\n", " - Set the access level to `Write`.\n", " - Click on `Create`.\n", "\n", "4. **Copy Token:** After the token is created, copy it to your clipboard. You will need this token for authentication." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. Logging\n", "\n", "Install the python library `datasets` and run cell below. If successful, you should see:\n", "\n", "```\n", "Your token has been saved to /home/usr/.cache/huggingface/token\n", "Login successful\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "pip install datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from huggingface_hub import login\n", "\n", "login(token=\"YOUR HUGGING FACE TOKEN\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download full HEST-1k (~1TB)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import zipfile\n", "\n", "from huggingface_hub import snapshot_download\n", "from tqdm import tqdm\n", "\n", "def download_hest(patterns, local_dir):\n", " repo_id = 'MahmoodLab/hest'\n", " snapshot_download(repo_id=repo_id, allow_patterns=patterns, repo_type=\"dataset\", local_dir=local_dir)\n", "\n", " seg_dir = os.path.join(local_dir, 'cellvit_seg')\n", " if os.path.exists(seg_dir):\n", " print('Unzipping cell vit segmentation...')\n", " for filename in tqdm([s for s in os.listdir(seg_dir) if s.endswith('.zip')]):\n", " path_zip = os.path.join(seg_dir, filename)\n", " \n", " with zipfile.ZipFile(path_zip, 'r') as zip_ref:\n", " zip_ref.extractall(seg_dir)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_dir='../hest_data' # hest will be dowloaded to this folder\n", "\n", "# Note that the full dataset is around 1TB of data\n", "download_hest('*', local_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download HEST-1k based on sample IDs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_dir='../hest_data' # hest will be dowloaded to this folder\n", "\n", "ids_to_query = ['INT1'] # list of ids to query\n", "\n", "list_patterns = [f\"*{id}[_.]**\" for id in ids_to_query]\n", "download_hest(list_patterns, local_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download HEST-1k based on metadata keys (e.g., organ, technology, oncotree code)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "local_dir='../hest_data' # hest will be dowloaded to this folder\n", "\n", "meta_df = pd.read_csv(\"hf://datasets/MahmoodLab/hest/HEST_v1_3_0.csv\")\n", "\n", "# Filter the dataframe by organ, oncotree code...\n", "meta_df = meta_df[meta_df['oncotree_code'] == 'IDC']\n", "meta_df = meta_df[meta_df['organ'] == 'Breast']\n", "\n", "ids_to_query = meta_df['id'].values\n", "\n", "list_patterns = [f\"*{id}[_.]**\" for id in ids_to_query]\n", "download_hest(list_patterns, local_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect freshly downloaded samples\n", "\n", "For each sample, we provide:\n", "\n", "- **wsis/**: H&E-stained whole slide images in pyramidal Generic TIFF (or pyramidal Generic BigTIFF if >4.1GB)\n", "- **st/**: Spatial transcriptomics expressions in a scanpy .h5ad object\n", "- **metadata/**: Metadata\n", "- **spatial_plots/**: Overlay of the WSI with the st spots\n", "- **thumbnails/**: Downscaled version of the WSI\n", "- **tissue_seg/**: Tissue segmentation masks:\n", " - `{id}.geojson`: Tissue segmentation mask\n", " - `{id}_vis.jpg`: Visualization of the tissue mask on the downscaled WSI\n", "- **pixel_size_vis/**: Visualization of the pixel size\n", "- **patches/**: 256x256 H&E patches (0.5µm/px) extracted around ST spots in a .h5 object optimized for deep-learning. Each patch is matched to the corresponding ST profile (see **st/**) with a barcode.\n", "- **patches_vis/**: Visualization of the mask and patches on a downscaled WSI.\n", "- **transcripts/**: individual transcripts aligned to H&E for xenium samples; read with pandas.read_parquet; aligned coordinates in pixel are in columns `['he_x', 'he_y']`\n", "- **cellvit_seg/**: Cellvit nuclei segmentation\n", "- **xenium_seg**: xenium segmentation on DAPI and aligned to H&E\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hest import iter_hest\n", "import pandas as pd\n", "\n", "# Ex: inspect all the Invasive Lobular Carcinoma samples (ILC)\n", "meta_df = pd.read_csv('../assets/HEST_v1_3_0.csv')\n", "\n", "id_list = meta_df[meta_df['oncotree_code'] == 'ILC']['id'].values\n", "\n", "print('load hest...')\n", "# Iterate through a subset of hest\n", "for st in iter_hest('../hest_data', id_list=id_list):\n", " print(st)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 4 }