We start with over 1000 mouse samples measured at the TOMCAT beamline of the Swiss Light Source at the Paul Scherrer Institute in Villigen, Switzerland. Each sample consists of 14 GB of image data as well as 98 genetic tags correlating each sample and phenotype to a specific pattern of inheritance. Together this corresponds to tens of terabytes of data, which, analyzed conventionally, would require dozens of scripts, cluster management tools, and a great deal of patience.
With IJSQL from 4Quant, such analyses become as easy as a SQL database query (even from Excel, if you wish). IJSQL handles loading the data, distributing it evenly, and optimizing queries, making a cluster, or even an entire cloud of computers, work like one super-fast machine using the latest generation of Big Data technology.
The first step is creating the cluster. This can be done on public clouds such as Amazon AWS, Google Compute Engine, or Databricks Cloud, or on your own cluster. Once the Spark cluster has been created, you have the SparkContext called `sc`, and the data can be loaded using the Spark Image Layer.
```scala
val uctImages = sc.readImage[Float]("s3n://bone-images/f2-study/*/ufilt.tif").cache
```

- `readImage[Float]` loads the data as floating-point values, since our images record the mineralization density at every pixel.
- The `*` in the path indicates that all folders should be included, which in this case means over 1000 samples, or 14 TB of data!
- The `.cache` suffix keeps the images in memory so they can be read faster, since many of our image processing tasks access them a number of times.
We can then move the data into our IJSQL database so instead of writing Scala code we can utilize easy SQL commands for further analysis.
```scala
uctImages.registerAsImageTable("ImageTable")
```
Although we execute the commands on only one machine, the data will be evenly loaded over all of the machines in the cluster (or cloud). We can show any of these images at any point by just typing
```scala
uctImages.first().show(1)
```
Once the table has been registered, we can perform our analysis using the easy IJSQL interface (or use our Python and Java APIs to build your own analysis). The next steps for this bone analysis are extracting the porosity data from the images and analyzing the cells (small pores) inside.
Since the measurements have some degree of noise from the detectors, we first clean up the images using a Gaussian Filter.
```sql
CREATE TABLE FilteredImages AS
SELECT boneId, GaussianFilter(image) FROM ImageTable
```
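To make the filtering step concrete, here is a minimal sketch, in plain Python, of what a Gaussian filter computes on one 2D slice. The function names (`gaussian_filter` and its helpers) are chosen for illustration only and are not part of the IJSQL API, which runs the equivalent operation distributed over the cluster.

```python
import math

def gaussian_kernel(sigma, radius):
    raw = [math.exp(-i * i / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [v / total for v in raw]  # normalize so a flat image stays flat

def convolve_rows(img, kernel):
    r = len(kernel) // 2
    out = []
    for row in img:
        w = len(row)
        out.append([
            sum(row[min(max(x + i, 0), w - 1)] * kernel[i + r]  # clamp at borders
                for i in range(-r, r + 1))
            for x in range(w)
        ])
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_filter(img, sigma):
    kernel = gaussian_kernel(sigma, radius=int(math.ceil(3 * sigma)))
    # separable filter: blur along rows, then along columns
    return transpose(convolve_rows(transpose(convolve_rows(img, kernel)), kernel))
```

The separable formulation (rows, then columns) is what makes Gaussian smoothing cheap even on large volumes, since each pass touches only one dimension at a time.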
Any ImageJ plugin can be used directly inside IJSQL; for example, a median filter of radius 3:

```sql
SELECT boneId, run(image, "Median...", "radius=3") FROM ImageTable
```
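For reference, a median filter replaces each pixel by the median of a square window around it, which removes isolated noise spikes while preserving edges. The sketch below, in plain Python, illustrates the idea; `median_filter` is an illustrative name, not ImageJ's or IJSQL's implementation.

```python
def median_filter(img, radius):
    # Rank filter: replace each pixel by the median of a (2*radius+1)^2 window,
    # clamping coordinates at the image border.
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = sorted(
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            )
            row.append(window[len(window) // 2])
        out.append(row)
    return out
```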
We also offer a number of 3D rendering options for when a single slice does not give enough detail. This is rendered using the cluster and just the final image is sent to your machine so even huge images can be rendered quickly.
```scala
uctImages.render3D(slice=0.2, lut="gray").first()
```
To segment the images into bone and air, we can either manually specify a cut-off or simply use an automated approach like Otsu, IsoData, or Intermodes.
```sql
CREATE TABLE ThresholdImages AS
SELECT boneId, ApplyThreshold(image, OTSU) FROM FilteredImages
```
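Otsu's method picks the threshold automatically from the image histogram by maximizing the between-class variance of the two resulting classes. A minimal sketch in plain Python (illustrative only, not the IJSQL built-in):

```python
def otsu_threshold(hist):
    # Choose the bin t maximizing between-class variance; pixels <= t are
    # treated as one class (e.g. air), pixels > t as the other (e.g. bone).
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))
    w_b = sum_b = 0.0
    best_t, best_var = 0, -1.0
    for t, count in enumerate(hist):
        w_b += count                    # background weight
        sum_b += t * count              # background intensity sum
        w_f = total - w_b               # foreground weight
        if w_b == 0 or w_f == 0:
            continue
        m_b = sum_b / w_b
        m_f = (sum_all - sum_b) / w_f
        between = w_b * w_f * (m_b - m_f) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t
```

For a strongly bimodal histogram, as with dense bone against air, the maximum lands in the valley between the two modes.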
As with the last steps, a slice can be immediately inspected for one or more images with
```scala
sql("SELECT image FROM ThresholdImages").first().show(1)
```
From the segmented image, we can extract the cells by first creating a mask with all of the holes filled in.
```sql
CREATE TABLE MaskImages AS
SELECT boneId, FillHoles(image) FROM ThresholdImages
```

```scala
sql("SELECT image FROM MaskImages").first().show(1)
```
![Filled Holes](ext-figures/bone-filled.png)
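The classic way to fill holes in a binary mask is to flood-fill the background from the image border: any background pixel the fill cannot reach is enclosed and must be a hole. The sketch below shows this in plain Python on a 2D mask (`fill_holes` is an illustrative name, not the IJSQL operator, which works in 3D and in a distributed fashion).

```python
def fill_holes(mask):
    # Flood-fill the background from the border; any background pixel the fill
    # cannot reach is an enclosed hole and becomes foreground.
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    stack = [(y, x) for y in range(h) for x in range(w)
             if (y in (0, h - 1) or x in (0, w - 1)) and not mask[y][x]]
    while stack:
        y, x = stack.pop()
        if 0 <= y < h and 0 <= x < w and not mask[y][x] and not outside[y][x]:
            outside[y][x] = True
            stack.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return [[mask[y][x] or not outside[y][x] for x in range(w)] for y in range(h)]
```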
Peeling the mask away from the thresholded image leaves the cortical bone shell.

```sql
CREATE TABLE CorticalImages AS
SELECT boneId, PeelMask(thr.image, mask.image) FROM ThresholdImages thr
INNER JOIN MaskImages mask ON thr.boneId = mask.boneId
```

```scala
sql("SELECT image FROM CorticalImages").first().show(1)
```
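The exact semantics of `PeelMask` are internal to IJSQL; a plausible reading, sketched below in plain Python, is "erode the mask slightly, then keep only the pixels of the first image that fall inside it", which strips the outermost shell before further analysis. Both function names here are hypothetical illustrations.

```python
def erode(mask):
    # A pixel survives only if its full 3x3 neighbourhood lies inside the mask.
    h, w = len(mask), len(mask[0])
    return [[all(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(w)] for y in range(h)]

def peel_mask(img, mask):
    # Keep thresholded pixels that fall inside the slightly eroded mask.
    peeled = erode(mask)
    return [[img[y][x] and peeled[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]
```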
Inverting the thresholded image before peeling isolates the porosity instead:

```sql
CREATE TABLE PorosityImages AS
SELECT boneId, PeelMask(run(thr.image, "Invert"), mask.image)
FROM ThresholdImages thr
INNER JOIN MaskImages mask ON thr.boneId = mask.boneId
```

```scala
sql("SELECT image FROM PorosityImages").first().show(1)
```
We can then identify the individual cells using connected component labeling.

```sql
CREATE TABLE LabelImages AS
SELECT boneId, ComponentLabel(image) FROM PorosityImages
```
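Connected component labeling assigns every connected blob of foreground pixels its own integer label. A minimal 2D sketch in plain Python using a flood fill per unlabeled pixel (`component_label` is an illustrative name; the IJSQL operator does this in 3D across distributed image blocks):

```python
def component_label(img):
    # Flood fill from each unlabeled foreground pixel; 0 = background.
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] and labels[sy][sx] == 0:
                next_label += 1
                stack = [(sy, sx)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and img[y][x] and labels[y][x] == 0:
                        labels[y][x] = next_label
                        stack.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return labels
```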
We can also utilize the labeled components to distinguish cells from vessels: connected objects with a volume above 1000 are classified as vessels, smaller ones as cells.

```sql
CREATE TABLE VesselImages AS
SELECT * FROM LabelImages WHERE obj.VOLUME > 1000

CREATE TABLE CellImages AS
SELECT * FROM LabelImages WHERE obj.VOLUME < 1000
```
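The split itself is simple once each label's volume is known. A sketch in plain Python, assuming volume is the voxel count per label (function names are illustrative, and the 1000 cutoff mirrors the queries above):

```python
from collections import Counter

def label_volumes(labels):
    # Voxel count per label, ignoring background (0).
    return Counter(v for row in labels for v in row if v != 0)

def split_by_volume(volumes, cutoff):
    # Mirror the SQL: strictly above the cutoff -> vessel, strictly below -> cell.
    vessels = {lbl for lbl, vol in volumes.items() if vol > cutoff}
    cells = {lbl for lbl, vol in volumes.items() if vol < cutoff}
    return vessels, cells
```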
```scala
multicolor3D(red=vesselImages.first, green=cellImages.first)
```
Now we can calculate the shape information for each cell and look at some of the statistics.

```sql
CREATE TABLE CellAnalysis AS
SELECT boneId, AnalyzeShape(image) FROM CellImages GROUP BY boneId
```
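At its core, a per-object shape analysis reports at least the volume and centroid of each labeled object; real tools add orientation, anisotropy, and more. A minimal sketch in plain Python on a 2D label image (`analyze_shape` is an illustrative name, not the IJSQL operator):

```python
def analyze_shape(labels):
    # Per label: voxel count ("volume") and centroid coordinates.
    points = {}
    for y, row in enumerate(labels):
        for x, lbl in enumerate(row):
            if lbl != 0:
                points.setdefault(lbl, []).append((x, y))
    return {
        lbl: {"volume": len(ps),
              "cx": sum(p[0] for p in ps) / len(ps),
              "cy": sum(p[1] for p in ps) / len(ps)}
        for lbl, ps in points.items()
    }
```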
Once this analysis is done, we can move back to SQL and use standard commands to analyze and visualize all of the shapes.

```scala
shapeAnalysis.toPointDF().registerTempTable("BoneAnalysis")
```
MGROUP | Animals | Female.Count | Male.Count | Source |
---|---|---|---|---|
Group 1 - B6 lit/lit female | 14 | 14 | 0 | PROGENITOR |
Group 10 - B6xC3.B6F1 lit/lit male | 5 | 0 | 5 | PROGENITOR |
Group 11 - B6xC3.B6F2 lit/lit | 1960 | 1017 | 933 | F2 |
Group 2 - B6 lit/lit male | 15 | 0 | 15 | PROGENITOR |
Group 3 - C3.B6 lit/lit female | 18 | 18 | 0 | PROGENITOR |
Group 4 - C3.B6 lit/lit male | 16 | 0 | 16 | PROGENITOR |
Group 5 - B6 lit/+ female | 15 | 15 | 0 | PROGENITOR |
Group 6 - B6 lit/+ male | 12 | 0 | 12 | PROGENITOR |
Group 7 - C3.B6 lit/+ female | 15 | 15 | 0 | PROGENITOR |
Group 8 - C3.B6 lit/+ male | 15 | 0 | 15 | PROGENITOR |
Group 9 - B6xC3.B6F1 lit/lit female | 11 | 11 | 0 | PROGENITOR |
We can then combine this with our shape information using a join. In this case the genomic information comes from a text file, but it could just as easily come from an Excel file, a SQL database, an S3 store, or another Spark analysis.
```sql
CREATE TABLE GenomicCellAnalysis AS
SELECT * FROM MouseHistory mh JOIN CellAnalysis ca
ON mh.boneId = ca.boneId
```
The raw data can be read out as a table for investigating individual samples.
MGROUP | Gender | SAN | VOLUME | LACUNA_NUMBER | POS_X | POS_Y | POS_Z | GROUP | Strain | Growth |
---|---|---|---|---|---|---|---|---|---|---|
Group 1 - B6 lit/lit female | female | 1 | 2.0e-07 | 1 | 0.3181591 | 0.0020761 | 0.0023529 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 1.9e-06 | 2 | 0.3435069 | 0.0064215 | 0.0067113 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 8.0e-07 | 3 | 0.4145227 | 0.0021684 | 0.0056980 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 1.0e-06 | 4 | 0.4562597 | 0.0024831 | 0.0108601 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 4.0e-07 | 5 | 0.6334585 | 0.0031225 | 0.0038936 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 7.0e-07 | 6 | 0.6546799 | 0.0121322 | 0.0020201 | B6 lit/lit female | B6 | lit/lit |
The shape analysis from a single sample can easily be brought up and rendered in the browser.
```sql
SELECT lacuna_points FROM GenomicCellAnalysis
WHERE strain = 'B6' AND gender = 'female' LIMIT 1
```
A number of further analyses can be made in both 2D and 3D plots looking at everything from cell size and shape to density.
```sql
SELECT lacuna_points FROM GenomicCellAnalysis
WHERE strain = 'C3H' AND gender = 'female' LIMIT 1
```
```scala
{
  sql("SELECT lacuna_points FROM GenomicCellAnalysis WHERE strain = 'B6' AND gender = 'female'") ++
  sql("SELECT lacuna_points FROM GenomicCellAnalysis WHERE strain = 'C3H' AND gender = 'female'")
}.groupBy("strain").show
```
Instead of storing the results in per-sample tables, we can store each cell as a row in a new table called AllLacunae (this time with more than 50 million rows).

```scala
GenomicCellAnalysis.flattenDF().registerTempTable("AllLacunae")
```
We can now run SQL commands and get results in seconds, where the same query would take minutes to hours on a standard MySQL instance.

```sql
SELECT AVG(VOLUME), SD(VOLUME) FROM AllLacunae GROUP BY boneId
```
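For clarity on what the aggregate reports, here is what per-group mean and standard deviation amount to, sketched in plain Python (assuming SD denotes the sample standard deviation; `avg_and_sd` is an illustrative name):

```python
import math

def avg_and_sd(values):
    # Mean and sample standard deviation, matching AVG and SD in the query.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return mean, math.sqrt(var)
```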
Analysis powered by IJSQL and the Spark Image Layer from 4Quant. Visualizations and document generation provided by:
- H. Wickham (2009). *ggplot2: Elegant Graphics for Data Analysis*. Springer New York. ISBN 978-0-387-98140-6. http://had.co.nz/ggplot2/book
- B. W. Lewis (2015). *threejs: 3D Graphics using Three.js and Htmlwidgets*. R package version 0.2.1. http://bwlewis.github.io/rthreejs
- H. Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. *Journal of Statistical Software*, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/
- JJ Allaire, J. Cheng, Y. Xie, J. McPherson, W. Chang, J. Allen, H. Wickham and R. Hyndman (2015). *rmarkdown: Dynamic Documents for R*. R package version 0.5.1. http://CRAN.R-project.org/package=rmarkdown
- Y. Xie (2015). *knitr: A General-Purpose Package for Dynamic Report Generation in R*. R package version 1.10.
- Y. Xie (2013). *Dynamic Documents with R and knitr*. Chapman and Hall/CRC. ISBN 978-1482203530
- Y. Xie (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In V. Stodden, F. Leisch and R. D. Peng, editors, *Implementing Reproducible Computational Research*. Chapman and Hall/CRC. ISBN 978-1466561595