{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-by-step instructions to download HEST-1k\n",
    "\n",
    "This tutorial shows you how to:\n",
    "\n",
    "- Download HEST-1k in its entirety (scanpy objects, whole-slide images, patches, nuclear segmentation, alignment preview)\n",
    "- Download selected HEST-1k samples by ID\n",
    "- Download samples that share given attributes (e.g., all breast cancer cases)\n",
    "- Inspect freshly downloaded samples\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Instructions for Setting Up a Hugging Face Account and Token\n",
    "\n",
    "#### 1. Create an Account on Hugging Face\n",
    "Follow the instructions provided on the [Hugging Face sign-up page](https://huggingface.co/join).\n",
    "\n",
    "#### 2. Accept the HEST terms of use\n",
    "\n",
    "1. Go to the [HEST Hugging Face page](https://huggingface.co/datasets/MahmoodLab/hest)\n",
    "2. Request access (access will be automatically granted)\n",
    "3. At this stage, you can already manually inspect the data by browsing the `Files and versions` tab\n",
    "\n",
    "#### 3. Create a Hugging Face Token\n",
    "\n",
    "1. **Go to Settings:** Navigate to your profile settings by clicking on your profile picture in the top right corner and selecting `Settings` from the dropdown menu.\n",
    "\n",
    "2. **Access Tokens:** In the settings menu, find and click on `Access tokens`.\n",
    "\n",
    "3. **Create New Token:**\n",
    "   - Click on `New token`.\n",
    "   - Set the token name (e.g., `hest`).\n",
    "   - Set the access level to `Write` (`Read` is sufficient if you only download data).\n",
    "   - Click on `Create`.\n",
    "\n",
    "4. **Copy Token:** After the token is created, copy it to your clipboard. You will need this token for authentication."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Logging in\n",
    "\n",
    "Install the Python library `datasets` (which also installs `huggingface_hub`) and run the cells below. If the login succeeds, you should see:\n",
    "\n",
    "```\n",
    "Your token has been saved to /home/usr/.cache/huggingface/token\n",
    "Login successful\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "pip install datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "\n",
    "login(token=\"YOUR HUGGING FACE TOKEN\")"
   ]
  },
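  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you would rather not paste the token into the notebook, one option is to read it from an environment variable. The cell below is a minimal sketch, assuming the token was exported beforehand under the name `HF_TOKEN` (the variable name is our choice for this example); it only calls `login` when the variable is set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from huggingface_hub import login\n",
    "\n",
    "# Assumption: the token was exported in the shell before starting Jupyter,\n",
    "# under the (example) variable name HF_TOKEN\n",
    "token = os.environ.get(\"HF_TOKEN\")\n",
    "\n",
    "if token is None:\n",
    "    print(\"HF_TOKEN is not set; use the explicit login cell above instead.\")\n",
    "else:\n",
    "    login(token=token)"
   ]
  },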
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download the full HEST-1k dataset (~1TB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import zipfile\n",
    "\n",
    "from huggingface_hub import snapshot_download\n",
    "from tqdm import tqdm\n",
    "\n",
    "def download_hest(patterns, local_dir):\n",
    "    \"\"\"Download the files matching `patterns` from the HEST dataset repo into `local_dir`.\"\"\"\n",
    "    repo_id = 'MahmoodLab/hest'\n",
    "    snapshot_download(repo_id=repo_id, allow_patterns=patterns, repo_type=\"dataset\", local_dir=local_dir)\n",
    "\n",
    "    # CellViT segmentations are downloaded as zip archives; extract them in place\n",
    "    seg_dir = os.path.join(local_dir, 'cellvit_seg')\n",
    "    if os.path.exists(seg_dir):\n",
    "        print('Unzipping CellViT segmentation...')\n",
    "        for filename in tqdm([s for s in os.listdir(seg_dir) if s.endswith('.zip')]):\n",
    "            path_zip = os.path.join(seg_dir, filename)\n",
    "\n",
    "            with zipfile.ZipFile(path_zip, 'r') as zip_ref:\n",
    "                zip_ref.extractall(seg_dir)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local_dir='../hest_data' # HEST will be downloaded to this folder\n",
    "\n",
    "# Note that the full dataset is around 1TB of data\n",
    "download_hest('*', local_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download HEST-1k based on sample IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
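    "local_dir='../hest_data' # HEST will be downloaded to this folder\n",
    "\n",
    "ids_to_query = ['INT1'] # list of sample IDs to query\n",
    "\n",
    "# The ID must be followed by '_' or '.', so 'INT1' does not also select 'INT10'\n",
    "list_patterns = [f\"*{sample_id}[_.]**\" for sample_id in ids_to_query]\n",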
    "download_hest(list_patterns, local_dir)"
   ]
  },
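  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The download patterns are glob-style. As a rough illustration, assuming `allow_patterns` follows `fnmatch`-style matching, the cell below checks a pattern against a few file paths. The paths are hypothetical and only serve to show that the `[_.]` part requires an underscore or a dot immediately after the sample ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fnmatch import fnmatch\n",
    "\n",
    "pattern = \"*INT1[_.]**\"  # pattern built for the sample ID 'INT1'\n",
    "\n",
    "# Hypothetical repository paths, for illustration only\n",
    "candidates = [\n",
    "    \"st/INT1.h5ad\",\n",
    "    \"cellvit_seg/INT1_cellvit_seg.geojson.zip\",\n",
    "    \"st/INT10.h5ad\",\n",
    "]\n",
    "\n",
    "# Print each path with whether the pattern selects it\n",
    "for path in candidates:\n",
    "    print(path, fnmatch(path, pattern))"
   ]
  },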
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download HEST-1k based on metadata keys (e.g., organ, technology, oncotree code)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "local_dir='../hest_data' # HEST will be downloaded to this folder\n",
    "\n",
    "meta_df = pd.read_csv(\"hf://datasets/MahmoodLab/hest/HEST_v1_3_0.csv\")\n",
    "\n",
    "# Filter the dataframe by organ, oncotree code...\n",
    "meta_df = meta_df[meta_df['oncotree_code'] == 'IDC']\n",
    "meta_df = meta_df[meta_df['organ'] == 'Breast']\n",
    "\n",
    "ids_to_query = meta_df['id'].values\n",
    "\n",
    "list_patterns = [f\"*{sample_id}[_.]**\" for sample_id in ids_to_query]\n",
    "download_hest(list_patterns, local_dir)"
   ]
  },
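  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before starting a large download, it can help to check how many samples a filter selects. The cell below is a small sketch on a made-up table (the rows are invented for illustration and are not taken from the HEST metadata); it combines two conditions in a single boolean mask and counts the matching IDs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for the metadata table (made-up rows, for illustration only)\n",
    "toy_df = pd.DataFrame({\n",
    "    \"id\": [\"A1\", \"A2\", \"A3\"],\n",
    "    \"organ\": [\"Breast\", \"Breast\", \"Kidney\"],\n",
    "    \"oncotree_code\": [\"IDC\", \"ILC\", \"CCRCC\"],\n",
    "})\n",
    "\n",
    "# Combine both conditions in one boolean mask\n",
    "mask = (toy_df[\"organ\"] == \"Breast\") & (toy_df[\"oncotree_code\"] == \"IDC\")\n",
    "selected_ids = toy_df.loc[mask, \"id\"].tolist()\n",
    "\n",
    "print(f\"{len(selected_ids)} sample(s) selected: {selected_ids}\")"
   ]
  },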
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspect freshly downloaded samples\n",
    "\n",
    "For each sample, we provide:\n",
    "\n",
    "- **wsis/**: H&E-stained whole slide images in pyramidal Generic TIFF (or pyramidal Generic BigTIFF if >4.1GB)\n",
    "- **st/**: Spatial transcriptomics expression in a scanpy `.h5ad` object\n",
    "- **metadata/**: Metadata\n",
    "- **spatial_plots/**: Overlay of the WSI with the ST spots\n",
    "- **thumbnails/**: Downscaled version of the WSI\n",
    "- **tissue_seg/**: Tissue segmentation masks:\n",
    "    - `{id}.geojson`: Tissue segmentation mask\n",
    "    - `{id}_vis.jpg`: Visualization of the tissue mask on the downscaled WSI\n",
    "- **pixel_size_vis/**: Visualization of the pixel size\n",
    "- **patches/**: 256x256 H&E patches (0.5µm/px) extracted around ST spots in a .h5 object optimized for deep-learning. Each patch is matched to the corresponding ST profile (see **st/**) with a barcode.\n",
    "- **patches_vis/**: Visualization of the mask and patches on a downscaled WSI.\n",
    "- **transcripts/**: Individual transcripts aligned to H&E for Xenium samples; read with `pandas.read_parquet`; the aligned pixel coordinates are in columns `['he_x', 'he_y']`\n",
    "- **cellvit_seg/**: CellViT nuclei segmentation\n",
    "- **xenium_seg/**: Xenium segmentation on DAPI, aligned to H&E\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hest import iter_hest\n",
    "import pandas as pd\n",
    "\n",
    "# Example: inspect all the Invasive Lobular Carcinoma (ILC) samples\n",
    "meta_df = pd.read_csv('../assets/HEST_v1_3_0.csv')\n",
    "\n",
    "id_list = meta_df[meta_df['oncotree_code'] == 'ILC']['id'].values\n",
    "\n",
    "print('Loading HEST...')\n",
    "# Iterate through a subset of HEST\n",
    "for st in iter_hest('../hest_data', id_list=id_list):\n",
    "    print(st)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}