Notes

It is very very important to export OMP_NUM_THREADS before running the program, else this will not be defined and program unusually seg faults


Date: 6 Feb
Writing d2_sample_2 with the assumption that in each run we get the same number of threads. This is needed so that the size of partitions remains the same throughout and I can use the cumulative distance array to optimize the procedure because distance value can change for a point only if it is smaller than the current value so we need not iterate over each center to find the minimum


Date: 1 March
Which random number generator do we use has an impact on the performance on the cost of the solution
MT19937 is better than using standard rand() function! Need to find its c variant and use it

Date: 2 March
Distribution of workload between threads is wrongly done in d2_sample.
This has been working fine all this time because number of threads was a factor of NUM_POINTS and so no points were left out.
If it is not the case then some points are left out towards the last for which no thread does the calculation.
I have used the same thing everywhere so may be I need to correct it so that if num_threads is not a factor of NUM_POINTS, things still work correctly

Date 19 March:
nvprof files:
outFile_optimized_pinnedMem_constMem:
	Pinned distance memory for efficient transfer
	ConstMem for storing centers chosen so far, on GPUs
	Using cost computed in last iteration to compute cost in current iteration
	Removing print statements from the loop that had all this seeding code

outFile_optimized_pinnedMem:
	Pinned distance memory for efficient transfer
	Using cost computed in last iteration to compute cost in current iteration
	Removing print statements from the loop that had all this seeding code

outFile_optimized_unpinnedMem:
	Using cost computed in last iteration to compute cost in current iteration
	Removing print statements from the loop that had all this seeding code

outFile_optimized_print:
	Using cost computed in last iteration to compute cost in current iteration
	Has print statements from the loop that had all this seeding code

outFile_unoptimized:
	NOT using cost computed in last iteration to compute cost in current iteration
	Has print statements from the loop that had all this seeding code

----------------------------------------------------------------


April 9
Organizing function arguments etc causes some gap between 2 kernel calls, can this be avoided by using 2 or more streams?, Reducing number of arguments may also help as this will ensure slightly faster setup of function call

In meanHeuristic replacement code we have
A large number of kernel calls each doing very little amount of computation if proably not a favourable way of doing things on GPUs
It might make sense to have mean heuristic on CPU, I can try to optimize that code and its performance will be better on CPUs anyway

Implement Lloyds algo on GPU
Implement Vornoi partitionin on GPU, it might not make sense as it is already very fast on CPU, and on GPU we might need several kernels to be able to do that compuation....

------------------------------

Using Nvidia Visual Profiler
First generate an output file using command line profiler nvprof
/usr/loca/cuda-8.0/bin/nvprof -o <outFilename> -f <program name> <program args>

-o: To create an output file
-f: To overwrite if file with same name already exists
Then open nvvp 
	/usr/local/cuda-8.0/bin/nvvp
Import the outFile created using nvprof

The details can be be also seen in terminal by following command
	/usr/local/cuda-8.0/bin/nvprof -i <outFileName>

-i: To import an outputfile already created


Using cudaMemcheck

/usr/local/cuda-8.0/bin/cuda-memcheck [--option value] <prog name> <prog args>

/usr/local/cuda-8.0/bin/cuda-memcheck --binary-patching no <prog name> <prog args>

binary-patching: If set to yes, then it patches the binary of the program on the go on.