
Commit

update docs
Karl5766 committed Dec 13, 2024
1 parent cfd4229 commit 2b1bdda
Showing 10 changed files with 131 additions and 275 deletions.
16 changes: 16 additions & 0 deletions docs/GettingStarted/beta_amyloid_plaque_detection.rst
@@ -152,3 +152,19 @@ this. In a local test (without coiled, but still using gcs), we may set :code:`l

Here, the MAX_THRESHOLD variable specifies the threshold above which objects will be counted as plaques. This
threshold is applied over the corrected ome zarr image, which is computed from the original image and the bias image provided.

After running the above code, a numpy file will be written to :code:`f'{subject.COILED_CACHE_DIR_PATH}/final_lc.npy'`,
which contains the results in an N * 5 matrix of type np.float64. The first three columns are the Z, Y, X locations of
the plaques, and the fourth and fifth columns are the size (in voxels) and a unique ID of each plaque. To retrieve the results:

.. code-block:: Python

    import numpy as np
    from cvpl_tools.fsspec import RDirFileSystem

    cdir_fs = RDirFileSystem(subject.COILED_CACHE_DIR_PATH)
    with cdir_fs.open('final_lc.npy', mode='rb') as fd:
        lc = np.load(fd)
    print('First 10 rows of lc:')
    print(lc[:10])
Alternatively, use the :code:`cvpl_tools.fsspec.copyfile` function to copy the .npy file to a local path.
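As a rough sketch of doing this by hand (assuming :code:`subject.COILED_CACHE_DIR_PATH` is a gs:// url, as in the coiled
setup above), the file can also be downloaded with plain gcsfs and loaded locally:

.. code-block:: Python

    import gcsfs
    import numpy as np

    gfs = gcsfs.GCSFileSystem()
    # download the result table into the current working directory, then load it
    gfs.get(f'{subject.COILED_CACHE_DIR_PATH}/final_lc.npy', 'final_lc.npy')
    lc = np.load('final_lc.npy')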

8 changes: 4 additions & 4 deletions docs/GettingStarted/nnunet.rst
@@ -80,16 +80,16 @@ results in the following segmentation workflow we use:

2. Then select one of N scans, say M1

3. Downsample M1 and use a GUI to paint a binary mask, which contains 1 on regions of edges and 0 on plaques and
elsewhere

4. Split the M1 volume and its binary mask annotation vertically to Z slices, and train an nnUNet model on these slices

5. The above produces a model that can predict negative masks on any mouse brain scan of the same format; the remaining N-1
mouse brains are down-sampled, and we use this model to predict on them to obtain their corresponding negative
masks

6. These masks are used to remove edge areas of the image before we apply thresholding to find plaque objects.
Algorithmically, we compute M' where :code:`M'[z, y, x] = M[z, y, x] * (1 - NEG_MASK[z, y, x])` for each
voxel location (z, y, x); then, we apply a threshold on M' and take each connected component of value 1 as an individual
plaque object; their centroid locations and sizes (in number of voxels) are summarized in a numpy table and
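A minimal NumPy sketch of the masking and thresholding in step 6 (variable names here are illustrative; the actual
pipeline runs the equivalent computation over distributed dask arrays rather than a single in-memory volume):

.. code-block:: Python

    import numpy as np
    import scipy.ndimage as ndimage

    # im: 3d brightness volume (M); neg_mask: binary edge mask predicted by the nnUNet model
    masked = im * (1 - neg_mask)        # M'[z, y, x] = M[z, y, x] * (1 - NEG_MASK[z, y, x])
    binary = masked > MAX_THRESHOLD     # threshold to obtain candidate plaque voxels
    lbl, ncomp = ndimage.label(binary)  # connected components, one label per plaque
    idx = np.arange(1, ncomp + 1)
    centroids = ndimage.center_of_mass(binary, lbl, idx)  # (z, y, x) centroid per plaque
    sizes = ndimage.sum(binary, lbl, idx)                 # plaque sizes in voxels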
35 changes: 12 additions & 23 deletions docs/GettingStarted/ome_zarr.rst
@@ -18,34 +18,24 @@ To view an ome-zarr file this way with :code:`cvpl_tools`, use the command
::

    import cvpl_tools.ome_zarr.napari.add as napari_add_ome_zarr
    napari_add_ome_zarr.subarray_from_path(viewer, "/absolute/path/to/your/ome.zarr", kwargs=dict(name="displayed_name_in_ui"))

- This will create a new layer named **displayed_name_in_ui** which displays your ome zarr array.
:code:`subarray` uses Napari's :code:`add_image` function's :code:`multiscale` argument for this,
which loads different resolution levels based on the viewer's current zoom factor
- To open a **.zip** file, specify ome.zarr.zip in the path (and optionally set use_zip=True to
open the file as zip regardless of its suffix)
- To open both the **ome.zarr** file and any label files located at **ome.zarr/labels/label_name**
as specified by the ome zarr standard, use cvpl_zarr.add_ome_zarr_group_from_path instead.
- To specify additional arguments passed to Napari's add_image function, pass your arguments in
the kwargs as a dictionary.
- To open an ome zarr image on Google cloud storage, use the following code instead

.. code-block:: Python

    import cvpl_tools.ome_zarr.napari.add as napari_add_ome_zarr
    import gcsfs
    import zarr

    gfs = gcsfs.GCSFileSystem(token=None)
    store = gfs.get_mapper('gs://path_to_your.ome.zarr')
    zarr_group = zarr.open(store, mode='r')
    napari_add_ome_zarr.group(viewer, zarr_group, kwargs=dict(name="displayed_name_in_ui"))
- An extra argument :code:`is_label` can be passed into the function via the :code:`kwargs` dictionary.
This is a boolean value that specifies whether to use the :code:`viewer.add_labels`
(if :code:`True`) or :code:`viewer.add_image` (if :code:`False`) function. This is useful for
displaying instance segmentation masks, where each segmented object has a distinct color.

Similarly, you can open a zip file, or an image with multiple labels, this way.
The zip loading feature does not work with remote locations on gcs because of the limitations of
ZipStore.
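For example, a minimal sketch (path and layer name are illustrative) of displaying an instance segmentation mask
stored as ome zarr using the :code:`is_label` flag described above:

.. code-block:: Python

    import cvpl_tools.ome_zarr.napari.add as napari_add_ome_zarr

    # is_label=True routes the layer through viewer.add_labels(), so each object gets a distinct color
    napari_add_ome_zarr.subarray_from_path(
        viewer,
        "/absolute/path/to/your/instance_mask.ome.zarr",
        kwargs=dict(name="instance_mask", is_label=True),
    )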


Reading and Writing ome zarr files
@@ -59,7 +49,7 @@ the following:
- image.OME.ZARR/ # The ome zarr image, which is a directory in Windows/Linux
- 0/ # The original image, in ZARR format
+ 0/ # Slice of image at first axis Z=0
+ 1/
...
.zarray # .zarray is meta attribute for the ZARR image
@@ -73,11 +63,11 @@ the following:
Above, + denotes a collapsed folder and - denotes an expanded folder. A few things to note here:

- An image does not have to end with the .OME.ZARR suffix
- The image is not multiscaled if the maximum downsample level is 0 instead of 4, in which case there
will only be one 0/ folder
- An ome zarr image is easily confused with a ZARR image. An ome zarr image
is not a standard ZARR directory and contains no **.zarray** meta file. Loading an ome zarr
image as ZARR will crash if you forget to specify the **0/** subfolder as the path to load (see the sketch after this list)
- When saved as a zip file instead of a directory, the directory structure is the same except that
the root is zipped. When loading a zipped ome zarr, cvpl_tools uses :code:`ZipStore`'s features to
directly read individual chunks without having to unpack the whole archive
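As a minimal sketch of the point about the **0/** subfolder (the path is illustrative), reading the
highest-resolution level with the :code:`zarr` library directly:

.. code-block:: Python

    import zarr

    # This would fail: the ome zarr root is a group and contains no .zarray meta file
    # arr = zarr.open_array('/path/to/image.OME.ZARR', mode='r')

    # This works: downsample level 0 is a standard ZARR array
    arr = zarr.open_array('/path/to/image.OME.ZARR/0', mode='r')
    print(arr.shape, arr.dtype)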
@@ -105,10 +95,9 @@ Specifying slices in path
:code:`cvpl_tools` allows specifying the channel, or x/y/z slices to use in the path string when
reading or viewing an ome zarr file for convenience.

The functions :code:`cvpl_tools.ome_zarr.io.load_dask_array_from_path` and
:code:`cvpl_tools.ome_zarr.napari.add.group_from_path/subarray_from_path`
support specifying the slices in the following syntax, similar to torch or numpy array slicing:

.. code-block:: Python
32 changes: 16 additions & 16 deletions docs/GettingStarted/result_caching.rst
@@ -9,9 +9,9 @@ In many cases it's useful to cache some of the intermediate results instead of d
all at once. Think of the following cases where you may have encountered when writing a long-running image processing
workflow:

1. The cell density for each region in the scan is computed, but the number does not match up with what's expected,
so you want to debug by displaying a heatmap in a graphical viewer showing cell density. However,
the intermediate result is lost so you need to recompute the density in order to draw the heatmap.

2. Some error occurs and you need to find out why a step in the computation causes the issue, but it's rather
difficult to understand what went wrong without displaying some intermediate results to aid debugging.
@@ -21,7 +21,7 @@ issues, but requires saving all the results onto disk and chunked in a viewer-fr

In all cases above, caching all the intermediate results helps reduce headaches and the risk of unknown errors coming
from the difficulty of debugging in an image processing and distributed computing environment. The basic strategy
is to cache all the results inside a directory tree. Each step saves all its
intermediate and final results onto a node in the tree. The node's children are directories saved by its
sub-steps.

@@ -30,27 +30,24 @@ Here, the outputs of a processing step (function) may contain intermediate image

We describe the CacheDirectory interface in detail below.

cache directory
***************
Every cache directory tree starts with a root directory. In order to create a cache directory tree you
need a url to the root directory location:

.. code-block:: Python

    import cvpl_tools.tools.fs as tlfs

    loc = f'path/to/root'  # a remote url will work as well
    query = tlfs.cdir_init(loc)
    # Now a directory is created at the given path, so you can start writing cache files to it
    # ...
    tlfs.cidr_commit(loc)
This creates a directory 'path/to/root' on the first run.
The next time the program is run, it will not create new folders but return a query object with
:code:`query.commit=True`. A cache directory can be a child directory of a cache root directory
or other cache directories.

.. code-block:: Python
@@ -63,6 +60,9 @@ A cache directory can be a child directory of a cache root directory or other ca
print('subdirectory is already created')
else:
print('subdirectory created successfully')
# put your code here that writes to the subdir...
tlfs.cidr_commit(f'{loc}/subdir')
tlfs.cidr_commit(loc)
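Putting the pieces together, a sketch of the typical skip-if-cached pattern (here :code:`expensive_step` and
:code:`save_result` are hypothetical placeholders for your own computation and serialization code):

.. code-block:: Python

    subdir = f'{loc}/expensive_step'
    query = tlfs.cdir_init(subdir)
    if not query.commit:
        # only recompute when the cache was not committed by a previous run
        result = expensive_step()
        save_result(subdir, result)
        tlfs.cidr_commit(subdir)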
Tips
****
47 changes: 24 additions & 23 deletions docs/GettingStarted/segmentation_pipeline.rst
@@ -3,26 +3,24 @@
Segmentation Pipeline
#####################

Motivation
**********

In our use case, lightsheet microscopy of mouse brain produces several hundreds of GBs of
data, represented as 3-dimensional arrays with an extra channel axis. This array is stored as an OME
ZARR file. Distributed computation is necessary to make the analysis time
tractable, and we choose to use Dask as the distributed computing library for our project.

As part of our research, we need to use automated methods to count objects in
the image. Afterwards, we use a map from pixel locations of
the image to a brain region segmentation (this is the **atlas map**) to obtain the density of cells in each region of
the mouse brain. We need to test several methods and find one that would give the most
accurate counts, and for new incoming data volumes, we want to be able to quickly find a set of parameters
that works on the new data.

On the algorithm side, object counting is easy on a Numpy array with a small dataset size.
On larger datasets, we need long-running distributed computation
that is hard to debug and requires hours to run.

SegProcess
**********
@@ -43,10 +41,10 @@ Consider a function that counts the number of cells in a 3d-block of brightness
cell_cnt = count_inst(inst) # count each contour as a cell
return cell_cnt
This code is complete, but notice that:

1. Lack of interpretability. Often some bug shows up on the first run.
Debugging this becomes a problem without some way to look into intermediate results, since we don't know
whether one of the three steps in the cell_count function did not work as expected, or whether the algorithm does
not work well on the input data for some reason. In either case, if we want to find the root cause of
the problem we very often end up adding code and rerunning the pipeline to display the output of each step
@@ -55,10 +53,11 @@ to see if they match with our expectations.
2. Result is not cached. Ideally the pipeline is run once and we get the result, but more often than
not the result may need to be used in different places (visualization, analysis etc.). Caching these
results makes sure computation is done only once, which is necessary when we work with costly algorithms
on hundreds of GBs of data (of course it's still best to first test on a slice of <1GB of data).

The basic idea to address 1) is to put visualization as part of the cell_count function, and to address
2) is to cache the result of each step into a file in a :code:`CacheDirectory`. In more detail, we
want an image processing pipeline that provides:

1. dask-support. Inputs are expected to be numpy arrays, dask arrays, or
:code:`cvpl.im.ndblock.NDBlock` objects. In particular, dask.Array and NDBlock are suitable for
@@ -111,14 +110,16 @@ which takes arbitrary inputs and one parameter: :code:`context_args`, which will
provided, then the image will be cached via dask's persist() and its loaded copy will be returned

- storage_option (dict, optional): If provided, specifies the compression method to use for image chunks

- preferred_chunksize (tuple, optional): Re-chunk before save; this rechunking will be undone in load
- multiscale (int, optional): Specifies the number of downsampling levels on OME ZARR
- compressor (numcodecs.abc.Codec, optional): Compressor used to compress the chunks

- viewer_args (dict, optional): If provided, an image will be displayed as a new layer in Napari viewer

- viewer (napari.Viewer, optional): Only display if a viewer is provided
- is_label (bool, optional): defaults to False; if True, use viewer's add_labels() instead of
add_image() to display the array

- layer_args (dict, optional): If provided, used along with viewer_args to specify add_image() kwargs
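For illustration, a sketch of assembling a :code:`context_args` dictionary from the options listed above (only keys
visible in this excerpt are shown; :code:`process` and :code:`im` stand in for your own process function and input image):

.. code-block:: Python

    import napari

    viewer = napari.Viewer(ndisplay=2)
    context_args = dict(
        viewer_args=dict(
            viewer=viewer,  # only display if a viewer is provided
            is_label=True,  # use add_labels() so each object gets a distinct color
        ),
        layer_args=dict(name='cell_instance_mask'),
    )
    result = process(im, context_args=context_args)  # hypothetical call into your pipeline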

@@ -139,7 +140,7 @@ which takes arbitrary inputs and one parameter: :code:`context_args`, which will
- The :code:`viewer_args` parameter specifies the napari viewer to display the intermediate results. If not provided
(:code:`viewer_args=None`), then no computation will be done to visualize the image. Within the forward() method, you
should use :code:`viewer.add_labels()` or :code:`tlfs.cache_im()`
while passing in the :code:`viewer_args` argument to display your results:

.. code-block:: Python
@@ -174,9 +175,9 @@ Running the Pipeline
********************

See `Setting Up the Script <GettingStarted/setting_up_the_script>`_ to understand boilerplate code used below,
required to understand the following example.

With a :code:`process` function defined, the next step is to write our script that uses the pipeline
to segment an input dataset. Note we need a dask cluster and a temporary directory setup before running the
:code:`forward()` method.

@@ -202,8 +203,8 @@ to segment an input dataset. Note we need a dask cluster and a temporary directo
client.close()
viewer.show(block=True)
If instead :code:`viewer_args=None` is passed, the :code:`example_seg_process()` function will process
the image and cache it, but display nothing.
- A process function has signature :code:`process(arg1, ..., argn, context_args)`, where
arg1 to n are arbitrary arguments and :code:`context_args` is a dictionary
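A sketch of a function following this signature convention (the body and names are illustrative, not the library's
own example):

.. code-block:: Python

    def my_process(im, threshold, context_args):
        """Threshold an image and display the mask if a viewer is given in context_args."""
        viewer_args = context_args.get('viewer_args') or {}
        mask = im > threshold
        viewer = viewer_args.get('viewer')
        if viewer is not None:
            viewer.add_labels(mask.astype(int), name='mask')
        return mask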