Skip to content

h5jan is a Java API for writing, reading and slicing Data Frames and nD arrays. It is file format neutral. It supports HDF5, Arrow and older formats. It is useful because it allows microservices with access to data to write data frames which can be reread in a performant way, for instance in Python. It requires Java 11 LTS or newer.

License

EPL-1.0, Unknown licenses found

Licenses found

EPL-1.0
LICENSE.md
Unknown
COPYING
Notifications You must be signed in to change notification settings

h5jan/h5jan-core

Repository files navigation

Java CI

h5jan

Maven

	<dependency>
		<groupId>io.github.h5jan</groupId>
		<artifactId>h5jan-core</artifactId>
		<version>0.8.0</version>
	</dependency>

Gradle

	compile group: 'io.github.h5jan', name: 'h5jan-core', version: '0.8.0'	

What is it

h5jan is a Java(TM) API for reading and writing Eclipse January* datasets to HDF5. It writes the HDF5 to a self describing format which is easily readable from Python as DataFrames using h5py. (The format is: NeXus)

It allows the reading and writing of:

  1. Datasets to/from HDF5 files (numpy nD array and pandas DataFrame)
  2. Lazy datasets to/from HDF5 files and working with slices. (Larger data than will fit in memory).

Why are these things useful? Well it means that binary data structures can be built in Java and saved as HDF5. That then means that you can use Java in the middleware or middle microservice, here it shines with tools like Spring Boot and Grails available. Then if a parallel execution of python process such as machine learning run are required, h5jan allows you to write scalable h5 files which can be loaded as dataframes or numpy arrays in python using h5py and pandas.

The reading and writing API are based around slicable lazy data. This allows code to be written which interacts with very large datasets without loading all the data into memory. You can write on the fly using ILazyWritableDataset and read slices as required for data analysis. This means that instead of holding large datasets in memory on vendor cloud and paying the cost, relatively cheap solutions around data slicing can be created - where that would work with your design of course!

* Eclipse January has a page on the Eclipse Foundation web site.

Examples

DataFrame Examples

Write DataFrame

	// Write in memory
	// We create a place to put our data
	IDataset someData = Random.rand(256, 3);
	someData.setName("fred");
		
	// Make a test frame
	DataFrame frame = new DataFrame(someData, 1, Arrays.asList("a", "b", "c"), Dataset.FLOAT32);
		
	// Save to HDF5
	frame.to_hdf("test-scratch/write_example/inmem_data_frame.h5", "/entry1/myData");

Reading DataFrame in Python

	# Make a reader
	reader = DataFrameReader()
	
	# Read the frame
	print(reader.read('test-scratch/write_example/lazy_data_frame-2d.h5'))
	

The python class DataFrameReader is not yet available on pypi it is here: DataFrameReader

Write Data Frame as Slices - save memory!

	// Write as slices, not all frame in memory at one time.
	// We create a place to put our data
	DataFrame frame = new DataFrame("data", Dataset.FLOAT32, new int[] { 256 });
		
	// Save to HDF5, columns can be large, these are not it's a test
	try (Appender app = frame.open_hdf("test-scratch/write_example/lazy_data_frame.h5", "/entry1/myData")) {
			
		// Add the columns incrementally without them all being in memory
		for (int i = 0; i < 10; i++) {
			app.append("slice_"+i, Random.rand(256));
		}
	}

Dataset Examples

Read

	// Read a slice
	try(NxsFile nfile = NxsFile.open("i05-4859.h5")) {
	
		// Data *not* read in
		ILazyDataset lz = nfile.getDataset("/entry1/instrument/analyser/data");
		
		// Read in a slice and squeeze it into an image. *Data now in memory*
		IDataset    mem = lz.getSlice(new Slice(), new Slice(100, 600), new Slice(200, 700));
		mem.squeeze();
	}

Write nD array

	// Write nD data to block not holding it in memory
	try(NxsFile nfile = NxsFile.open("my_example.h5")) {
	
		// We create a place to put our data
		ILazyWriteableDataset data = new LazyWriteableDataset("data", Dataset.FLOAT64, new int[] { 10, 1024, 1024 }, null, null, null);
	
		// We have all the data in memory, it might be large at this point!
		nfile.createData("/entry1/acme/experiment1/", data, true);
			
		// Make an image to write
		IDataset random = Random.rand(1, 1024, 1024);
			
		// Write one image, others may follow
		data.setSlice(new IMonitor.Stub(), random, new SliceND(random.getShape(), new Slice(0,1), new Slice(0,1024), new Slice(0,1024)));
	}

Write stream of images

	// Write stream of images in above example.
	for (int i = 0; i < 10; i++) {
		// Make an image to write
		IDataset random = Random.rand(1, 1024, 1024);
				
		// Write one image, others may follow. We use the int args for adding the randoms here.
		data.setSlice(new IMonitor.Stub(), random, SliceND.createSlice(data, new int[] {i,0,0}, new int[] {i+1,1024,1024}, new int[] {1,1,1}));
				
		// Optionally flush
		nfile.flush();
	}

Example if the images come from a directory structure, as many detectors and microscopes do. Each directory contains tiles of a larger image then each directory above that is a represents the stack. We write the HDF5 stack directly to a single file in a lazy way such that the whole thing is never in memory.

  // Make a writing frame, the tiled image is this size.
  // Each tile is 96,128 so as we are making 3x3 stitching, we need 288,384
  DataFrame frame = new DataFrame("scope_image", Dataset.FLOAT32, new int[] { 288,384 });
		
  // Make an object to read other formats, in this case TIFF
  DataFrameReader reader = new DataFrameReader();

  // Save to HDF5, columns can be large, these are not it's a test
  try (Appender app = frame.open_hdf("test-scratch/write_example/lazy_microscope_image.h5", "/entry1/myData")) {
  
  app.setCompression(NexusFile.COMPRESSION_LZW_L1); // Otherwise it will be too large. (Version 0.6 onwards.)
	
  // Directory structure is "microscope/0" image0, "microscope/1" image1 and in each
  // directory are nine images which need to be stitched.
  File[] dirs = JPaths.getTestResource("microscope").toFile().listFiles();
			
  for (int i = 0; i < dirs.length; i++) {
				
	// Directory of tiles
	File dir = dirs[i];

	// Read tiles, assuming their file name order is also their tile order.
	DataFrame tiles = reader.read(dir, Configuration.GREYSCALE, new IMonitor.Stub());
				
	// Stitch to make image based on a 3x3 matrix of tiles.
	Dataset image = tiles.stitch(new int[] {3,3});
	assertArrayEquals(new int[] {288,384}, frame.getColumnShape());
	image = DatasetUtils.cast(image, Dataset.FLOAT32);
				
	// Add the image - note they are not all in memory
	// so this process should scale reasonably well.
	app.append("image_"+i, image);
    }
  }

Advanced

Filters

Applying different operations to images using code from DAWN Science

	// Example data 
	Dataset data = Random.rand(256, 256);

	// Image filters
	Dataset sobel = Image.sobelFilter(data);
	Dataset fano = Image.fanoFilter(data, 5, 5);
	
	// Downsample (nD)
	Downsample ds = new Downsample(DownsampleMode.MAXIMUM, 4, 4);
	Dataset smaller = ds.value(data).get(0);
	
	// Integration of user selected regions
	RectangularROI rroi = new RectangularROI(new double[]{100,100}, new double[]{200,200});
	Dataset[] xAndY = ROIProfile.box(data, rroi); // Integrate x and y
	
	SectorROI sroi = new SectorROI(100,100, 10, 50, 0, Math.PI);
	Dataset[] radAndAzi = ROIProfile.sector(data, sroi); // Radial and azimuthal integration.

And we can write these transforms or analysis data as follows:

	Function<Dataset,Dataset> imageTransform = ...  // See above
	
	// Example of writing the data in a DataFrame
	// Make a writing frame
	DataFrame frame = new DataFrame("data", Dataset.FLOAT32, new int[] { 256, 256 });
		
	// Save to HDF5, columns can be large, these are not it's a test
	try (Appender app = frame.open_hdf("test-scratch/write_example/"+fileName, "/entry1/myData")) {
			
		// Add the columns incrementally without them all being in memory
		for (int i = 0; i < 10; i++) {
			Dataset data = Random.rand(256, 256); // Insert your real data here, it will be larger...
				
			// Apply the function and write that
			Dataset toWrite = imageTransform.apply(data);
			app.append(datasetName+i, toWrite); // Our data is transformed as we write it
				
			// Keeping raw data is a good idea
			// It might not be in this frame but it will be useful.
			app.record("raw", data);  // We append the raw data in a non-data frame record. 
		}
	}

Arrow

Apache Arrow is a data format popularly used in data science. The test ArrowTest shows it performance compared to HDF5. Because homogeneous data frames are fast to write in HDF5, it can be made very fast to read data which is useful for data analysis.

	// Example of writing the data in a DataFrame
	// Make a writing frame
	DataFrame frame = ... // A data frame we want to write to arrow.
		
	ArrowIO	arrowIO= new ArrowIO();
	
	try (FileOutputStream output = new FileOutputStream(new File("/tmp/mydata.arw")){
		// The channel could also be a Hadoop File System writable channel.
		// We use a File here just out of example.
		WritableByteChannel channel = output.getChannel();
		
		arrowIO.write(frame, output);
		output.flush();
	}	
	
	// Now lets read it!
	try(FileInputStream input = new FileInputStream(new File("/tmp/mydata.arw"))) {
		DataFrame readBackIn = arrowIO.read(input);
	}

	// Reading from python: You have to use pyarrow directly this format
	// is not guaranteed compatible with Pandas DataFrame. However reading the frame
	// is easy because nD arrays are stored as 1D and the shape stored in the metadata.
	// So a numpy array can be made using pyarrow and reshaped to the shape in the metadata "shape" field.
	// Unfortunately arrow supports string metadate so the shape has to be parsed from a string.

Additional Examples

The other examples are run as part of the unit tests. Eclipse January Examples

When the image stack nxs file is opened in DAWN you get: stack

Repackaging

This project is only possible by repackaging some code released on github using an EPL license. Unfortunately you cannot use this code in its current locations over several bundles in DAWN Science which are not repackaged for reuse in a normal gradle/maven manner (they are OSGi bundles).

h5jan uses those libraries and adds a data frame library on top.

Taking Part

Drop a message as to why you want to be part of the project or submit a merge request.

We support the Eclipse Science working group.
Eclipse Science NOTE: This project has not been proposed as an eclipse science project yet, however it is made up of code from the dawnsci project and depends on January.

Setting PATH and LD_LIBRARY_PATH

This library will load its native parts automatically. If you hit problems you can set PATH on windows or LD_LIBRARY_PATH on linux to your part of the libs folder. The code attempts to use load automatically and usually works.

How to use h5jan from a Gradle Build

Build

This guide assumes that you are familiar with gradle and have a build.gradle file or other named gradle where you would like to add a dependency into your build.

Steps

Still working publishing to maven central. If it does not work then binaries are on github release.

  1. Make sure
mavenCentral()

is in your repositories{} block

  1. Add dependency Gradle
compile "io.github.h5jan:h5jan-core:0.4.3" 

Maven

<dependency>
	<groupId>io.github.h5jan</groupId>
	<artifactId>h5jan-core</artifactId>
	<version>0.7.0</version>
</dependency>

When you rebuild now january will be available to your gradle project.

Setup

To run the Java build you will need Java8 or 9. Developers on the project currently use 9 to develop with an 8 minimum source code version.

Code Format Guidelines

  1. Follow the code format guidelines
  2. Do not reformat where you are not the primary author/committer without the author's agreement (including auto formatting). a. Unless the primary author/committer has left the organisation b. Even if you think they are not following the guidelines c. Do not mix up formatting changes which functional changes, there are hard to review reliably. The @author tag specifies the primary author/committer with whom you agree if auto formatting.

Build

Run gradlew (gradlew.bat on Windows) from the root directory.

    $ git clone [email protected]:h5jan/h5jan-core.git
    $ cd h5jan-core
    $ ./gradlew
    :compileJava UP-TO-DATE
    :processResources UP-TO-DATE
    :classes UP-TO-DATE
    :jar UP-TO-DATE

    BUILD SUCCESSFUL

If the test coverage is not at least 60%, this project will fail to build.

Proxy: If you get something like:

    $ ./gradlew
    Downloading https://services.gradle.org/distributions/gradle-2.10-bin.zip

    Exception in thread "main" java.net.UnknownHostException: services.gradle.org

You need to properly configure your machine with the HAL proxy.

You can create a file at %USERPROFILE%.gradle\gradle.properties. Inside this file add:

    systemProp.http.proxyHost=<your proxy host>
    systemProp.http.proxyPort=80
    systemProp.https.proxyHost=<your proxy host>
    systemProp.https.proxyPort=80

Idea setup:

    $ ./gradlew idea
  • Launch Idea
  • File -> Open...
  • Navigate to data-connector directory and select data-connector.ipr
  • Build -> Rebuild project

Eclipse setup:

    $ ./gradlew eclipse
  • Launch Eclipse
  • Close Welcome dialog
  • Import as Gradle project or as java project

Project layout

  • build/ - Build artifacts
  • build/reports - Source analysis reports
  • src- Source code
  • doc- Documentation
  • src/main/ - Shipped code
  • src/test/ - Test Non Shipped code
  • lib - Temporarily holds some of the libraries used for communication with DSIS. Hopefully will be removed in the future.

Running tests

Right click on project in Eclipse and choose run as junit tests.

Run all tests using:

    $ ./gradlew test

Get more test output:

    $ ./gradlew test -i

One single test using:

    $ ./gradlew test --tests "xx.xx.XXTest"

debug tests on localhost:5005 using:

    $ gradle test --debug-jvm

Generate Test Coverage Report:

    $ ./gradlew test testCoverageReport

Look in build\reports\coverage\ for the generated file.

Changing Log Levels in a Running Docker Image

You can change the log levels for an installed Docker image, by using the _JAVA_OPTIONS environment variable:

  • Edit the Configuration for the Marathon service

  • Click on "Environment"

  • Click on "ADD ENVIRONMENT VARIABLE"

  • Type in "_JAVA_OPTIONS" for the key

  • Type in "-DHAL_LOG_LEVEL=DEBUG" for the value

  • Click on "REVIEW & RUN"

  • It may be necessary to set more than one level for instance:

      -DROOT_LOG_LEVEL=DEBUG -DSTDOUT_LOG_LEVEL=DEBUG -DHAL_LOG_LEVEL=DEBUG -DFILE_LOG_LEVEL=DEBUG
    

License

See LICENSE

About

h5jan is a Java API for writing, reading and slicing Data Frames and nD arrays. It is file format neutral. It supports HDF5, Arrow and older formats. It is useful because it allows microservices with access to data to write data frames which can be reread in a performant way, for instance in Python. It requires Java 11 LTS or newer.

Resources

License

EPL-1.0, Unknown licenses found

Licenses found

EPL-1.0
LICENSE.md
Unknown
COPYING

Stars

Watchers

Forks

Packages

No packages published

Languages