{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step-by-step instructions to interact HEST-1k \n",
    "\n",
    "This tutorial will guide you to:\n",
    "\n",
    "- Read HEST data\n",
    "- Visualized the spots over a downscaled version of the WSI\n",
    "- Saving HESTData into Pyramidal Tif and anndata\n",
    "\n",
    "\n",
    "This tutorial assumes that the user has already downloaded HEST-1k (in its entirety or partially). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read HEST"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hest import iter_hest\n",
    "\n",
    "# Iterate through a subset of hest\n",
    "for st in iter_hest('../hest_data', id_list=['TENX105']):\n",
    "\n",
    "    # ST (adata):\n",
    "    adata = st.adata\n",
    "    print('\\n* Scanpy adata:')\n",
    "    print(adata)\n",
    "\n",
    "    # WSI:\n",
    "    wsi = st.wsi\n",
    "    print('\\n* WSI:')\n",
    "    print(wsi)\n",
    "    \n",
    "    # Shapes:\n",
    "    shapes = st.shapes\n",
    "    print('\\n* Shapes:')\n",
    "    print(shapes)\n",
    "    \n",
    "    # Tissue segmentation\n",
    "    tissue_contours = st.tissue_contours\n",
    "    print('\\n* Tissue contours:')\n",
    "    print(tissue_contours)\n",
    "    \n",
    "    # Conversion to SpatialData\n",
    "    sdata = st.to_spatial_data()\n",
    "    print('\\n* SpatialData conversion:')\n",
    "    print(sdata)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`st.adata` is a spatial scanpy object containing the following:\n",
    "#### Observations (st.adata.obs)\n",
    "- `in_tissue`: Indicator if the observation is within the tissue (`in_tissue` comes from the initial Visium/Xenium run and might not be accurate, prefer the segmentation obtained by st.segment_tissue() instead).\n",
    "- `pxl_col_in_fullres`: Pixel column position of the patch/spot centroid in the full resolution image.\n",
    "- `pxl_row_in_fullres`: Pixel row position of the patch/spot centroid in the full resolution image.\n",
    "- `array_col`: Patch/spot column position in the array.\n",
    "- `array_row`: Patch/spot row position in the array.\n",
    "- `n_counts`: Number of counts for each observation.\n",
    "- `n_genes_by_counts`: Number of genes detected by counts in each observation.\n",
    "- `log1p_n_genes_by_counts`: Log-transformed number of genes detected by counts.\n",
    "- `total_counts`: Total counts per observation.\n",
    "- `log1p_total_counts`: Log-transformed total counts.\n",
    "- `pct_counts_in_top_50_genes`: Percentage of counts in the top 50 genes.\n",
    "- `pct_counts_in_top_100_genes`: Percentage of counts in the top 100 genes.\n",
    "- `pct_counts_in_top_200_genes`: Percentage of counts in the top 200 genes.\n",
    "- `pct_counts_in_top_500_genes`: Percentage of counts in the top 500 genes.\n",
    "- `total_counts_mito`: Total mitochondrial counts per observation. (note that this field might not be accurate)\n",
    "- `log1p_total_counts_mito`: Log-transformed total mitochondrial counts. (note that this field might not be accurate)\n",
    "- `pct_counts_mito`: Percentage of counts that are mitochondrial. (note that this field might not be accurate)\n",
    "\n",
    "#### Variables (st.adata.var)\n",
    "- `n_cells_by_counts`: Number of cells detected by counts for each variable.\n",
    "- `mean_counts`: Mean counts per variable.\n",
    "- `log1p_mean_counts`: Log-transformed mean counts.\n",
    "- `pct_dropout_by_counts`: Percentage of dropout events by counts.\n",
    "- `total_counts`: Total counts per variable.\n",
    "- `log1p_total_counts`: Log-transformed total counts.\n",
    "- `mito`: Indicator if the gene is mitochondrial. (note that this field might not be accurate)\n",
    "\n",
    "#### Unstructured (st.adata.uns)\n",
    "- `spatial`: Contains a downscaled version of the full resolution image in `st.adata.uns['spatial']['ST']['images']['downscaled_fullres']`\n",
    "\n",
    "#### Observation-wise Multidimensional (st.adata.obsm)\n",
    "- `spatial`: Pixel coordinates of spots/patches centroids on the full resolution image. (first column is x axis, second column is y axis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualizing the spots over a downscaled version of the WSI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# visualize the spots over a downscaled version of the full resolution image\n",
    "save_dir = '.'\n",
    "st.save_spatial_plot(save_dir)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving to pyramidal tiff and h5\n",
    "Save `HESTData` object to `.tiff` + expression `.h5ad` and a metadata file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Warning saving a large image to pyramidal tiff (>1GB) can be slow on a hard drive.\n",
    "st.save(save_dir, pyramidal=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tissue segmentation\n",
    "\n",
    "We integrated 2 tissue segmentation methods:\n",
    "\n",
    "- Image processing-based using Otsu thresholding \n",
    "- Deep learning-based using a fine-tuned DeepLabV3 ResNet50\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_dir = '.'\n",
    "\n",
    "name = 'tissue_seg_otsu'\n",
    "st.segment_tissue(method='otsu') \n",
    "st.save_tissue_seg_pkl(save_dir, name)\n",
    "\n",
    "name = 'tissue_seg_deep'\n",
    "st.segment_tissue(method='deep') \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Changing the Patching/pooling size\n",
    "\n",
    "#### Patching\n",
    "You can change the size of patches around the spots with `dump_patches`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# directory where the patch .h5 will be saved\n",
    "patch_save_dir = './processed'\n",
    "new_patch_size = 224\n",
    "\n",
    "\n",
    "st.dump_patches(\n",
    "    patch_save_dir,\n",
    "    name='demo',\n",
    "    target_patch_size=new_patch_size,\n",
    "    target_pixel_size=0.5 # pixel size of the patches in um/px after rescaling\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Changing the Pooling size\n",
    "\n",
    "You can change the pooling size of Xenium/Visium HD samples stored in `HESTData.adata` with the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hest.readers import pool_transcripts_xenium\n",
    "from hest import iter_hest\n",
    "\n",
    "new_spot_size = 200\n",
    "new_patch_size = 224\n",
    "patch_save_dir = './processed'\n",
    "\n",
    "# Iterate through a subset of hest\n",
    "for st in iter_hest('../hest_data', id_list=['TENX105'], load_transcripts=True,):\n",
    "    print(st.transcript_df)\n",
    "\n",
    "    st.adata = pool_transcripts_xenium(\n",
    "                    st.transcript_df, # Feel free to convert st.transcript_df to a Dask DataFrame if you are working with limited RAM.\n",
    "                    st.pixel_size, \n",
    "                    key_x='he_x',\n",
    "                    key_y='he_y',\n",
    "                    spot_size_um=new_spot_size\n",
    "                )\n",
    "\n",
    "\n",
    "    # We change the spots, so we need to re-extract the patches\n",
    "    st.dump_patches(\n",
    "        patch_save_dir,\n",
    "        name='demo',\n",
    "        target_patch_size=new_patch_size, # target patch size in 224\n",
    "        target_pixel_size=0.5 # pixel size of the patches in um/px after rescaling\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch effect visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from hest.batch_effect import filter_hest_stromal_housekeeping, get_silhouette_score, plot_umap\n",
    "\n",
    "meta_df = pd.read_csv('../assets/HEST_v1_2_0.csv')\n",
    "\n",
    "\n",
    "meta_df = meta_df[\n",
    "    (meta_df['st_technology'] == 'Visium') & \n",
    "    (meta_df['disease_state'] == 'Cancer') & \n",
    "    (meta_df['oncotree_code'] == 'IDC') & \n",
    "    (meta_df['species'] == 'Homo sapiens')\n",
    "]\n",
    "\n",
    "# We filter spots, such that:\n",
    "# - we only keep the most stable housekeeping genes (based on https://housekeeping.unicamp.br/?download)\n",
    "# - we only keep the spots under the stroma (based on CellViT segmentation)\n",
    "adata_list = filter_hest_stromal_housekeeping(meta_df, hest_dir='../hest_data')\n",
    "\n",
    "labels = meta_df['id'].values\n",
    "score = get_silhouette_score(adata_list, labels=labels)\n",
    "# (-1: strong overlap between clusters, can be indicative of a small batch effect, 1: poor overlap between clusters, can be indicative of a strong batch effect)\n",
    "print(f'Silhouette score is {score}')\n",
    "\n",
    "\n",
    "plot_umap(adata_list, labels, 'batch_whole_tissue.png', umap_kwargs={'min_dist': 0.6})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading transcripts (for Xenium only)\n",
    "The transcript dataframe contains the following columns:\n",
    "- `cell_id` (default Xenium Explorer flag): id of cell containing this transcript\n",
    "- `he_x`, `he_y`: x, y coordinate or transcript on the HE WSI in pixels (aligned with a non-rigid transformation)\n",
    "- `feature_name`: name of transcript\n",
    "- `qv` (default Xenium Explorer flag): Phred-scaled quality value (Q-Score) estimating the probability of incorrect call"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hest import iter_hest\n",
    "\n",
    "# Iterate through a subset of hest\n",
    "for st in iter_hest('../hest_data', id_list=['TENX105'], load_transcripts=True):\n",
    "    print(st.transcript_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Warning:** The segmentation contained in 'xenium_seg/' (accessed with st.get_shapes('xenium_cell', 'he') or st.get_shapes('xenium_nucleus', 'he')) is different than the CellViT segmentation. It was obtained by the Xenium instrument directly by segmenting the DAPI stained image, the DAPI segmented cells were then aligned to H&E with a non rigid transform."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linking transcripts with cells\n",
    "Each transcript is associated with a cell id (derived by the Xenium analyzer directly). See column `cell_id` in the transcript dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transcripts = st.transcript_df\n",
    "shapes = st.get_shapes('xenium_cell', 'he').shapes\n",
    "\n",
    "cell_id = 1\n",
    "\n",
    "# Print the transcripts associated with a particular cell\n",
    "print(transcripts[transcripts['cell_id'] == cell_id])\n",
    "print(shapes[shapes.index == cell_id])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "HEST",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}