We start with over 1000 mouse samples measured at the TOMCAT beamline of the Swiss Light Source at the Paul Scherrer Institute in Villigen, Switzerland. Each sample consists of 14 GB of image data as well as 98 genetic tags correlating each sample and phenotype to a specific pattern of inheritance. Together this corresponds to tens of terabytes of data, which, analyzed conventionally, would require dozens of scripts, cluster management tools, and a great deal of patience.
With IJSQL from 4Quant, such analyses become as easy as a SQL database query (even from Excel, if you wish). IJSQL handles loading the data, distributing it evenly, and optimizing queries, making a cluster, or even an entire cloud of computers, work like one super-fast machine using the latest generation of Big Data technology.
The first step is creating the cluster. This can be done on public clouds such as Amazon AWS, Google Compute Engine, or Databricks Cloud, or on your own cluster. Once the Spark cluster has been created, you have the SparkContext called `sc`, and the data can be loaded using the Spark Image Layer.
```scala
val uctImages = sc.readImage[Float]("s3n://bone-images/f2-study/*/ufilt.tif").cache
```

- `readImage[Float]` loads the data as floating-point values, since our images record the mineralization density at every pixel.
- The `*` in the path indicates that all folders should be included, which in this case means over 1000 samples, or 14 TB of data!
- The `.cache` suffix keeps the images in memory so they can be read faster, since many of our image processing tasks access them a number of times.
We can then move the data into our IJSQL database so instead of writing Scala code we can utilize easy SQL commands for further analysis.
```scala
uctImages.registerAsImageTable("ImageTable")
```
Although we execute the commands on only one machine, the data will be evenly loaded over all of the machines in the cluster (or cloud). We can show any of these images at any point by just typing
```scala
uctImages.first().show(1)
```
Once the table has been registered, we can perform our analysis using the easy IJSQL interface (or use our Python and Java APIs to build your own analysis). The next steps for this bone analysis are extracting the porosity data from the images and analyzing the cells (small pores) inside.
Since the measurements have some degree of noise from the detectors, we first clean up the images using a Gaussian Filter.
```sql
CREATE TABLE FilteredImages AS
SELECT boneId, GaussianFilter(image) FROM ImageTable
```
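To make the filtering step concrete, here is a minimal sketch, in plain Python, of what a Gaussian filter computes on one 2D slice. The function names (`gaussian_filter` and its helpers) are chosen for illustration only and are not part of the IJSQL API, which runs the equivalent operation distributed over the cluster.

```python
import math

def gaussian_kernel(sigma, radius):
    raw = [math.exp(-i * i / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [v / total for v in raw]  # normalize so a flat image stays flat

def convolve_rows(img, kernel):
    r = len(kernel) // 2
    out = []
    for row in img:
        w = len(row)
        out.append([
            sum(row[min(max(x + i, 0), w - 1)] * kernel[i + r]  # clamp at borders
                for i in range(-r, r + 1))
            for x in range(w)
        ])
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_filter(img, sigma):
    kernel = gaussian_kernel(sigma, radius=int(math.ceil(3 * sigma)))
    # separable filter: blur along rows, then along columns
    return transpose(convolve_rows(transpose(convolve_rows(img, kernel)), kernel))
```

The separable formulation (rows, then columns) is what makes Gaussian smoothing cheap even on large volumes, since each pass touches only one dimension at a time.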
Any ImageJ plugin can be used directly inside IJSQL; for example, a median filter of radius 3:

```sql
SELECT boneId, run(image, "Median...", "radius=3") FROM ImageTable
```
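For reference, a median filter replaces each pixel by the median of a square window around it, which removes isolated noise spikes while preserving edges. The sketch below, in plain Python, illustrates the idea; `median_filter` is an illustrative name, not ImageJ's or IJSQL's implementation.

```python
def median_filter(img, radius):
    # Rank filter: replace each pixel by the median of a (2*radius+1)^2 window,
    # clamping coordinates at the image border.
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = sorted(
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            )
            row.append(window[len(window) // 2])
        out.append(row)
    return out
```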
We also offer a number of 3D rendering options for when a single slice does not give enough detail. This is rendered using the cluster and just the final image is sent to your machine so even huge images can be rendered quickly.
```scala
uctImages.render3D(slice=0.2, lut="gray").first()
```
To segment the images into bone and air, we can either manually specify a cut-off or simply use an automated approach like Otsu, IsoData, or Intermodes.
```sql
CREATE TABLE ThresholdImages AS
SELECT boneId, ApplyThreshold(image, OTSU) FROM FilteredImages
```
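Otsu's method picks the threshold automatically from the image histogram by maximizing the between-class variance of the two resulting classes. A minimal sketch in plain Python (illustrative only, not the IJSQL built-in):

```python
def otsu_threshold(hist):
    # Choose the bin t maximizing between-class variance; pixels <= t are
    # treated as one class (e.g. air), pixels > t as the other (e.g. bone).
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))
    w_b = sum_b = 0.0
    best_t, best_var = 0, -1.0
    for t, count in enumerate(hist):
        w_b += count                    # background weight
        sum_b += t * count              # background intensity sum
        w_f = total - w_b               # foreground weight
        if w_b == 0 or w_f == 0:
            continue
        m_b = sum_b / w_b
        m_f = (sum_all - sum_b) / w_f
        between = w_b * w_f * (m_b - m_f) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t
```

For a strongly bimodal histogram, as with dense bone against air, the maximum lands in the valley between the two modes.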
As with the last steps, a slice can be immediately inspected for one or more images with
```scala
sql("SELECT image FROM ThresholdImages").first().show(1)
```
From the segmented image, we can extract the cells by first creating a mask with all of the holes filled in.
```sql
CREATE TABLE MaskImages AS
SELECT boneId, FillHoles(image) FROM ThresholdImages
```

```scala
sql("SELECT image FROM MaskImages").first().show(1)
```
![Filled Holes](ext-figures/bone-filled.png)
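The classic way to fill holes in a binary mask is to flood-fill the background from the image border: any background pixel the fill cannot reach is enclosed and must be a hole. The sketch below shows this in plain Python on a 2D mask (`fill_holes` is an illustrative name, not the IJSQL operator, which works in 3D and in a distributed fashion).

```python
def fill_holes(mask):
    # Flood-fill the background from the border; any background pixel the fill
    # cannot reach is an enclosed hole and becomes foreground.
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    stack = [(y, x) for y in range(h) for x in range(w)
             if (y in (0, h - 1) or x in (0, w - 1)) and not mask[y][x]]
    while stack:
        y, x = stack.pop()
        if 0 <= y < h and 0 <= x < w and not mask[y][x] and not outside[y][x]:
            outside[y][x] = True
            stack.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return [[mask[y][x] or not outside[y][x] for x in range(w)] for y in range(h)]
```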
Peeling the mask away from the thresholded image leaves the cortical bone shell.

```sql
CREATE TABLE CorticalImages AS
SELECT boneId, PeelMask(thr.image, mask.image) FROM ThresholdImages thr
INNER JOIN MaskImages mask ON thr.boneId = mask.boneId
```

```scala
sql("SELECT image FROM CorticalImages").first().show(1)
```
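The exact semantics of `PeelMask` are internal to IJSQL; a plausible reading, sketched below in plain Python, is "erode the mask slightly, then keep only the pixels of the first image that fall inside it", which strips the outermost shell before further analysis. Both function names here are hypothetical illustrations.

```python
def erode(mask):
    # A pixel survives only if its full 3x3 neighbourhood lies inside the mask.
    h, w = len(mask), len(mask[0])
    return [[all(0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(w)] for y in range(h)]

def peel_mask(img, mask):
    # Keep thresholded pixels that fall inside the slightly eroded mask.
    peeled = erode(mask)
    return [[img[y][x] and peeled[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]
```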
Inverting the thresholded image before peeling isolates the porosity instead:

```sql
CREATE TABLE PorosityImages AS
SELECT boneId, PeelMask(run(thr.image, "Invert"), mask.image)
FROM ThresholdImages thr
INNER JOIN MaskImages mask ON thr.boneId = mask.boneId
```

```scala
sql("SELECT image FROM PorosityImages").first().show(1)
```
We can then identify the individual cells using connected component labeling.

```sql
CREATE TABLE LabelImages AS
SELECT boneId, ComponentLabel(image) FROM PorosityImages
```
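Connected component labeling assigns every connected blob of foreground pixels its own integer label. A minimal 2D sketch in plain Python using a flood fill per unlabeled pixel (`component_label` is an illustrative name; the IJSQL operator does this in 3D across distributed image blocks):

```python
def component_label(img):
    # Flood fill from each unlabeled foreground pixel; 0 = background.
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] and labels[sy][sx] == 0:
                next_label += 1
                stack = [(sy, sx)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and img[y][x] and labels[y][x] == 0:
                        labels[y][x] = next_label
                        stack.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return labels
```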
We can also utilize the labeled components to distinguish cells from vessels: connected objects with a volume above 1000 are classified as vessels, smaller ones as cells.

```sql
CREATE TABLE VesselImages AS
SELECT * FROM LabelImages WHERE obj.VOLUME > 1000

CREATE TABLE CellImages AS
SELECT * FROM LabelImages WHERE obj.VOLUME < 1000
```
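The split itself is simple once each label's volume is known. A sketch in plain Python, assuming volume is the voxel count per label (function names are illustrative, and the 1000 cutoff mirrors the queries above):

```python
from collections import Counter

def label_volumes(labels):
    # Voxel count per label, ignoring background (0).
    return Counter(v for row in labels for v in row if v != 0)

def split_by_volume(volumes, cutoff):
    # Mirror the SQL: strictly above the cutoff -> vessel, strictly below -> cell.
    vessels = {lbl for lbl, vol in volumes.items() if vol > cutoff}
    cells = {lbl for lbl, vol in volumes.items() if vol < cutoff}
    return vessels, cells
```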
```scala
multicolor3D(red=vesselImages.first, green=cellImages.first)
```
Now we can calculate the shape information for each cell and look at some of the statistics.

```sql
CREATE TABLE CellAnalysis AS
SELECT boneId, AnalyzeShape(image) FROM CellImages GROUP BY boneId
```
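At its core, a per-object shape analysis reports at least the volume and centroid of each labeled object; real tools add orientation, anisotropy, and more. A minimal sketch in plain Python on a 2D label image (`analyze_shape` is an illustrative name, not the IJSQL operator):

```python
def analyze_shape(labels):
    # Per label: voxel count ("volume") and centroid coordinates.
    points = {}
    for y, row in enumerate(labels):
        for x, lbl in enumerate(row):
            if lbl != 0:
                points.setdefault(lbl, []).append((x, y))
    return {
        lbl: {"volume": len(ps),
              "cx": sum(p[0] for p in ps) / len(ps),
              "cy": sum(p[1] for p in ps) / len(ps)}
        for lbl, ps in points.items()
    }
```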
Once this analysis is done, we can move back to SQL and use standard commands to analyze and visualize all of the shapes.

```scala
shapeAnalysis.toPointDF().registerTempTable("BoneAnalysis")
```
MGROUP | Animals | Female.Count | Male.Count | Source |
---|---|---|---|---|
Group 1 - B6 lit/lit female | 14 | 14 | 0 | PROGENITOR |
Group 10 - B6xC3.B6F1 lit/lit male | 5 | 0 | 5 | PROGENITOR |
Group 11 - B6xC3.B6F2 lit/lit | 1960 | 1017 | 933 | F2 |
Group 2 - B6 lit/lit male | 15 | 0 | 15 | PROGENITOR |
Group 3 - C3.B6 lit/lit female | 18 | 18 | 0 | PROGENITOR |
Group 4 - C3.B6 lit/lit male | 16 | 0 | 16 | PROGENITOR |
Group 5 - B6 lit/+ female | 15 | 15 | 0 | PROGENITOR |
Group 6 - B6 lit/+ male | 12 | 0 | 12 | PROGENITOR |
Group 7 - C3.B6 lit/+ female | 15 | 15 | 0 | PROGENITOR |
Group 8 - C3.B6 lit/+ male | 15 | 0 | 15 | PROGENITOR |
Group 9 - B6xC3.B6F1 lit/lit female | 11 | 11 | 0 | PROGENITOR |
We can then combine this with our shape information using a join. In this case the genomic information comes from a text file, but it could just as easily come from an Excel file, a SQL database, an S3 store, or another Spark analysis.
```sql
CREATE TABLE GenomicCellAnalysis AS
SELECT * FROM MouseHistory mh JOIN CellAnalysis ca
ON mh.boneId = ca.boneId
```
The raw data can be read out as a table for investigating individual samples.
MGROUP | Gender | SAN | VOLUME | LACUNA_NUMBER | POS_X | POS_Y | POS_Z | GROUP | Strain | Growth |
---|---|---|---|---|---|---|---|---|---|---|
Group 1 - B6 lit/lit female | female | 1 | 2.0e-07 | 1 | 0.3181591 | 0.0020761 | 0.0023529 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 1.9e-06 | 2 | 0.3435069 | 0.0064215 | 0.0067113 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 8.0e-07 | 3 | 0.4145227 | 0.0021684 | 0.0056980 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 1.0e-06 | 4 | 0.4562597 | 0.0024831 | 0.0108601 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 4.0e-07 | 5 | 0.6334585 | 0.0031225 | 0.0038936 | B6 lit/lit female | B6 | lit/lit |
Group 1 - B6 lit/lit female | female | 1 | 7.0e-07 | 6 | 0.6546799 | 0.0121322 | 0.0020201 | B6 lit/lit female | B6 | lit/lit |
The shape analysis from a single sample can easily be brought up and rendered in the browser.
```sql
SELECT lacuna_points FROM GenomicCellAnalysis
WHERE strain = 'B6' AND gender = 'female' LIMIT 1
```
A number of further analyses can be made in both 2D and 3D plots looking at everything from cell size and shape to density.
```sql
SELECT lacuna_points FROM GenomicCellAnalysis
WHERE strain = 'C3H' AND gender = 'female' LIMIT 1
```
```scala
{
  sql("SELECT lacuna_points FROM GenomicCellAnalysis WHERE strain = 'B6' AND gender = 'female'") ++
  sql("SELECT lacuna_points FROM GenomicCellAnalysis WHERE strain = 'C3H' AND gender = 'female'")
}.groupBy("strain").show
```
Instead of storing the results in per-sample tables, we can store each cell as a row in a new table called AllLacunae (this time with more than 50 million rows).

```scala
GenomicCellAnalysis.flattenDF().registerTempTable("AllLacunae")
```
We can now run SQL commands and get results in seconds, where the same query would take minutes to hours on a standard MySQL instance.

```sql
SELECT AVG(VOLUME), SD(VOLUME) FROM AllLacunae GROUP BY boneId
```
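For clarity on what the aggregate reports, here is what per-group mean and standard deviation amount to, sketched in plain Python (assuming SD denotes the sample standard deviation; `avg_and_sd` is an illustrative name):

```python
import math

def avg_and_sd(values):
    # Mean and sample standard deviation, matching AVG and SD in the query.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return mean, math.sqrt(var)
```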
Analysis powered by IJSQL and the Spark Image Layer from 4Quant. Visualizations and document generation provided by:
- H. Wickham (2009). *ggplot2: Elegant Graphics for Data Analysis*. Springer New York. ISBN 978-0-387-98140-6. http://had.co.nz/ggplot2/book
- B. W. Lewis (2015). *threejs: 3D Graphics using Three.js and Htmlwidgets*. R package version 0.2.1. http://bwlewis.github.io/rthreejs
- H. Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. *Journal of Statistical Software*, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/
- JJ Allaire, J. Cheng, Y. Xie, J. McPherson, W. Chang, J. Allen, H. Wickham and R. Hyndman (2015). *rmarkdown: Dynamic Documents for R*. R package version 0.5.1. http://CRAN.R-project.org/package=rmarkdown
- Y. Xie (2015). *knitr: A General-Purpose Package for Dynamic Report Generation in R*. R package version 1.10.
- Y. Xie (2013). *Dynamic Documents with R and knitr*. Chapman and Hall/CRC. ISBN 978-1482203530
- Y. Xie (2014). knitr: A Comprehensive Tool for Reproducible Research in R. In V. Stodden, F. Leisch and R. D. Peng, editors, *Implementing Reproducible Computational Research*. Chapman and Hall/CRC. ISBN 978-1466561595