## Resolving analytical histogram data into gaussian component populations

### A comparison of 3 different de-convolution programs.

#### Kingsley Burlinson

## Introduction

The baro-acoustic decrepitation technique produces histograms of counts versus temperature for each sample analysis. Each sample may contain multiple populations of fluid inclusions, all of which contribute to the overall histogram. Although there have been suggestions to resolve (de-convolute) the histogram mathematically into its component populations, this has not been easy to do. Recently, M. Gibbes and M. Clark from Lismore University used de-convolution in their study of the hydrothermal deposits in the Drake district, NSW. In that work, the decrepitation histograms were resolved into populations with a gaussian distribution, but they found that using skewed gaussian distributions for the populations gave better fits to the analytical data. They used the software program PLOT, for Macintosh computers. The samples were de-convolved into many individual populations, as many as 11 for each sample. Although hydrothermal systems probably do contain very many different populations of fluids, there is some concern that the large number of components might be a mathematical artifact of the de-convolution. Given this complexity, repeated fitting of the same data using different software and operators might well produce different results.

This discussion investigates the consistency and reproducibility of de-convolution of decrepitation histograms performed using different software packages and different operators over a one-year period. One specific sample was de-convolved numerous times, using 3 different software packages. One de-convolution was performed using PLOT on Macintosh, 3 de-convolutions were performed using Scidavis on Linux, and 7 de-convolutions were performed using Fityk on Linux.

## Discussion

During their study of the Drake mineral field, NSW, Gibbes and Clark collected 33 samples. One of these, sample number 28, from the Guy bell pit is the object of this study. This sample was analysed by baro-acoustic decrepitation as analysis number H2133. The raw analytical data was first smoothed using a weighted rolling mean of 3 samples and it is this smoothed decrepigram that was used for all the mathematical de-convolutions. The unsmoothed data is shown below in red, and the smoothed data in green.

## Software program PLOT

This software was used for one de-convolution by Gibbes and Clark, reported in their published work (still in press) and shown below. They resolved the data into 8 symmetrical gaussian populations. The parameters for each population are not available, so the peak centre (mode), peak height and peak width at half height were read off this graph. Some interpolation was necessary and the values are necessarily slightly imprecise. One of the populations in this de-convolution is very wide, extending from 200 C to over 600 C. It is not clear whether a fluid inclusion population could actually have such a wide spread of decrepitation temperatures. Such a spread could arise if many inclusions had necked down, giving erratic and widely varying filling densities. But there is a suspicion that this population is an artifact of the mathematics rather than a real and distinct fluid inclusion population.

## Software program SCIDAVIS

Three de-convolutions were carried out using Scidavis software. This software does not include gaussian distributions by default, but they can be added as a user-defined function. Although there is an "auto-fit" function in this software, the starting values it assumes fail to lead to convergence, so the manual fit procedure must be used with user-entered starting values for each parameter. This also often fails to converge, and repeated attempts with different starting parameters may be required. The method can be slow, but it does work. In addition, when using the manual fit procedure, the output plot does not include a plot of each component population, only the final fitted curve. To provide a complete plot it is necessary to use a custom Python script to read the fit parameters back in and add the population curves to the plot. This custom script also makes it easy to save the individual curve parameters to an external file for additional interpretation.

The de-convolutions were for 3 and 4 components using skewed gaussian populations and for 4 components using symmetrical gaussian populations. It was not possible to achieve reasonable de-convolutions using 5 components in this software, as quite improbable populations were generated for the fifth component. Once again, potentially unrealistic broad populations occurred in each fit (green). The 4 component skewed gaussian fit is better than the 4 component symmetrical gaussian fit, in accordance with the conclusion by Gibbes and Clark that skewed gaussian populations provide a better de-convolution of the decrepitation data. The quality of fit is given by the sum of squares of the residuals divided by the degrees of freedom (SSR/DoF), and lower values indicate a better fit to the input data. For the symmetrical gaussian populations, SSR/DoF was 40, while for the skewed gaussian populations SSR/DoF was 26.

Visually, this improvement in fit is small but noticeable, as seen in the next images of the 2 types of fit.

The following image of the skewed gaussian fit shows a slightly better match between the raw data (black) and the fitted sum curve (red). The improvement is most noticeable between 520 C and 620 C.
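The SSR/DoF measure quoted for these fits can be sketched as follows. This is a minimal illustration, not the calculation performed inside Scidavis or Fityk: it assumes the degrees of freedom are the number of data points minus the number of fitted parameters, that each symmetrical gaussian has 3 parameters (height, centre, width at half height), and it uses a hypothetical two-peak model with synthetic data rather than the H2133 analysis. All function names are illustrative.

```python
import math

def gaussian(t, height, centre, fwhm):
    """Symmetrical gaussian peak; fwhm is the full width at half height."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return height * math.exp(-0.5 * ((t - centre) / sigma) ** 2)

def fitted_sum(t, peaks):
    """Sum of all component populations at temperature t."""
    return sum(gaussian(t, h, c, w) for (h, c, w) in peaks)

def ssr_per_dof(temps, counts, peaks, params_per_peak=3):
    """Sum of squared residuals divided by the degrees of freedom.

    Assumption: DoF = number of data points - number of fitted parameters.
    """
    ssr = sum((y - fitted_sum(t, peaks)) ** 2 for t, y in zip(temps, counts))
    dof = len(temps) - params_per_peak * len(peaks)
    if dof <= 0:
        raise ValueError("too many parameters for the number of data points")
    return ssr / dof

# Hypothetical model and synthetic "data" (NOT sample H2133):
# the data are the model plus an alternating error of +/- 2 counts.
peaks = [(100.0, 450.0, 20.0), (60.0, 585.0, 40.0)]
temps = [float(t) for t in range(100, 801, 10)]
counts = [fitted_sum(t, peaks) + (2.0 if i % 2 else -2.0)
          for i, t in enumerate(temps)]

print(ssr_per_dof(temps, counts, peaks))
```

For skewed gaussian populations each peak carries an extra shape parameter, so `params_per_peak=4` would be the corresponding choice under the same assumption. Because adding components reduces the degrees of freedom only slightly while usually reducing the residuals a lot, a lower SSR/DoF on its own does not show that the extra components are physically real.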

## Software program Fityk

Seven fits were done using Fityk, with 5 and 6 peaks, all populations being skewed gaussian. This software has a convenient user interface to allow the selection of peak positions and sizes visually before performing the fit. It also has a simple scripting capability so that the parameters of the fitted peaks can be easily exported to a file. This program is widely used in the study of all types of spectra from numerous analytical methods and it has an astonishing selection of population shapes. Although the skewed gaussian population is not present by default, it can be added easily.

The fits for 5 populations all included a potentially unrealistic very broad population, as seen here in brown in fit C. This broad population is highly skewed. The SSR/DoF for this fit is 18, a very close fit, but is it real?

Attempts were made to achieve a fit which avoided a broad or highly skewed component. Only one of the 7 fit attempts, fit D, satisfied this criterion. However, fit D did not have a particularly low SSR/DoF value, which was 41.

The lowest SSR/DoF value was 14 for fit F. However, this includes a potentially unrealistic population (blue) with a very wide peak. Despite the better SSR/DoF, this is not considered to be the best fit because of this unrealistically broad component population. It was very difficult to reach a convergence which did not include an unacceptably broad population.

## Comparison of the results from the different software packages.

For each fit performed with Scidavis and Fityk, the parameters for the component populations were saved to an external file using some simple scripts. This data is normally used to measure subtle differences between samples in a full survey. But in this study, the results are used to ascertain the stability and reproducibility of fits to the same data set. The fit performed with PLOT software did not save these parameters and it has been necessary to try and read these values from the final plot which introduces some error.

Comparisons between results are normally done using the Mode temperature for each peak. This is because the "central temperature" used in the gaussian population formula does not occur at the maximum height of the peak for skewed distributions. (The mode is the temperature at the peak of the curve.)
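The difference between the "central temperature" parameter and the mode can be illustrated numerically. The exact skewed gaussian formula used in the fits is not given here, so the sketch below assumes a skew-normal shape (a gaussian multiplied by an error-function term with a hypothetical skew parameter `alpha`) and locates the mode by a simple grid search. The function names and parameter values are illustrative only.

```python
import math

def skewed_gaussian(t, height, centre, sigma, alpha):
    """Assumed skew-normal peak shape; alpha = 0 gives a symmetrical gaussian."""
    z = (t - centre) / sigma
    return (height * math.exp(-0.5 * z * z)
            * (1.0 + math.erf(alpha * z / math.sqrt(2.0))))

def mode_temperature(height, centre, sigma, alpha, step=0.1):
    """Temperature at the maximum of the curve, found on a grid of spacing step."""
    lo = centre - 5.0 * sigma
    n = int(round(10.0 * sigma / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: skewed_gaussian(t, height, centre, sigma, alpha))

# Hypothetical peak: centre parameter 450 C, sigma 20 C.
for alpha in (0.0, 3.0, -3.0):
    print(alpha, round(mode_temperature(100.0, 450.0, 20.0, alpha), 1))
```

With `alpha = 0` the mode coincides with the centre parameter. With a non-zero skew the mode moves away from it, which is why the mode, not the fitted centre, is the value compared between fits.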

The temperature and width of each component population on each of the 11 fits in this study are plotted in the following X-Y-SIZE plots, where the size of the plotted circle represents a linear function of the area of the population peak. The area is merely an estimate and calculated as Area = Constant * peak_height * peak_width_at_half_height.
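As a sketch of this area estimate: the value of the constant is not stated above, so the example assumes the factor that is exact for a symmetrical gaussian, sqrt(pi / (4 ln 2)), approximately 1.0645. For skewed populations the result is only an estimate, as noted. The names below are illustrative.

```python
import math

# Exact for a symmetrical gaussian; an assumption for the "Constant" in the text.
GAUSSIAN_AREA_CONSTANT = math.sqrt(math.pi / (4.0 * math.log(2.0)))

def peak_area(height, fwhm, constant=GAUSSIAN_AREA_CONSTANT):
    """Area estimate = constant * peak_height * peak_width_at_half_height."""
    return constant * height * fwhm

def log_peak_area(height, fwhm):
    """Natural logarithm of the area estimate, for plots with a wide range of peak sizes."""
    return math.log(peak_area(height, fwhm))

# Hypothetical peak: 100 counts high, 20 C wide at half height.
print(peak_area(100.0, 20.0))
print(log_peak_area(100.0, 20.0))
```

Since the same constant is applied to every population, its exact value does not change the relative circle sizes in the plots; only the product of height and width matters for comparison.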

Although all the fits have a disturbingly broad peak, fit 3 (fityk, 5 peaks) and fit 13 (PLOT software, 8 peaks) have exceptionally broad peaks which are of particular concern and are potentially unrealistic.

To compare variations in temperature between fits, the following plot uses a natural logarithmic representation of the peak area. Additional temperature grid lines highlight the differences between temperatures of the fitted populations.

There are considerable differences between the multiple fits on this single data set, and clearly de-convolution does not come close to providing a unique or readily reproducible result. A significant part of the problem is the number of component populations. For fits using 3, 4, 5 and 6 populations, Fityk and Scidavis give the same pair of peaks at about 450 C and 460 C. But PLOT, using 8 peaks, has these 2 peaks shifted higher to 460 C and 470 C. Fits 6 and 7, using Fityk, have almost identical and very good SSR/DoF values yet differ markedly: fit 6 has a peak at 470 C while fit 7 has this peak at 500 C. Fit 10, by Scidavis, has a peak at 525 C which does not match any other fit. This fit used symmetrical gaussian populations, which explains some of the difference, but fit 13 using PLOT also used symmetrical populations and did not locate a peak near this temperature.
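This kind of comparison can also be made programmatically from the tabulated peak temperatures. The sketch below copies the temperatures from the data tabulation for sample H2133 and, for a chosen temperature, reports the nearest peak in each fit and which fits have a peak within a tolerance. The 10 C tolerance and the function names are arbitrary choices for illustration, not part of the original study.

```python
# Peak temperatures (C) per fit number, from the data tabulation for H2133.
FITS = {
    1: [172, 448, 459, 483, 585],
    2: [171, 452, 459, 493, 583],
    3: [451, 459, 544, 585, 585],
    4: [171, 447, 459, 460, 585],
    5: [171, 261, 327, 446, 459, 586],
    6: [171, 330, 446, 459, 473, 586],
    7: [172, 330, 447, 459, 499, 587],
    9: [448, 458, 576],
    10: [452, 458, 526, 582],
    11: [448, 458, 492, 586],
    13: [160, 250, 335, 410, 460, 470, 505, 585],
}

def nearest_peak(temps, target):
    """Peak temperature in one fit that lies closest to the target temperature."""
    return min(temps, key=lambda t: abs(t - target))

def fits_with_peak_near(target, tolerance):
    """Fit numbers that have at least one peak within tolerance of the target."""
    return [fit for fit, temps in sorted(FITS.items())
            if abs(nearest_peak(temps, target) - target) <= tolerance]

# Nearest peak to 450 C in every fit, then the fits with a peak near 525 C.
for fit, temps in sorted(FITS.items()):
    print(fit, nearest_peak(temps, 450))
print(fits_with_peak_near(525, 10))
```

A nearest-peak match like this is deliberately crude. It ignores peak width and area, so it only flags where fits disagree and does not decide which fit is correct.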

## Conclusions

De-convolution of decrepigram curves does provide a way to compare variations and similarities within a set of samples and is much more precise than mere visual comparison of the decrepigrams. However, de-convolution does not provide unique component population results and there can be significant differences caused particularly by the choice of how many peaks to include in the de-convolution. Visually, fits with as few as 3 components give a close fit to the data. But as many as 8 peaks might be included if the software is allowed to automatically choose the number of peaks to use.

A major concern is that many of the fit components have broad or very skewed distribution shapes. Such populations are unlikely to be physically realistic.

The performance of de-convolution depends on the nature of the raw data, its noisiness and relative amplitudes. This particular study used a sample with very low data values between 100 and 350 C, which probably made the fitting of this part of the curve more difficult.

The mathematically calculated component populations are only an aid in the interpretation of the decrepigrams and are somewhat dependent on operator choices. They should always be visually checked and critically reviewed to ensure that the component populations are realistic. The automatic fitting procedures in some software seem to have a tendency to produce very large numbers of component populations in order to achieve a perfect fit. This may not be realistic, particularly if the input data is slightly noisy.

### Data Tabulation Decrepitation sample H2133

| Fit # | Program | Peaks | Temp 1 (C) | Temp 2 (C) | Temp 3 (C) | Temp 4 (C) | Temp 5 (C) | Temp 6 (C) | Temp 7 (C) | Temp 8 (C) | SSR/DoF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | fityk | 5 | 172 | 448 | 459 | 483 | 585 | | | | 17 |
| 2 | fityk | 5 | 171 | 452 | 459 | 493 | 583 | | | | 20 |
| 3 | fityk | 5 | 451 | 459 | 544 | 585 | 585 | | | | 34 |
| 4 | fityk | 5 | 171 | 447 | 459 | 460 | 585 | | | | 18 |
| 5 | fityk | 6 | 171 | 261 | 327 | 446 | 459 | 586 | | | 41 |
| 6 | fityk | 6 | 171 | 330 | 446 | 459 | 473 | 586 | | | 15 |
| 7 | fityk | 6 | 172 | 330 | 447 | 459 | 499 | 587 | | | 14 |
| 9 | scidavis | 3 | 448 | 458 | 576 | | | | | | 54 |
| 10 | scidavis | 4 (sym) | 452 | 458 | 526 | 582 | | | | | 40 |
| 11 | scidavis | 4 | 448 | 458 | 492 | 586 | | | | | 26 |
| 13 | PLOT | 8 (sym) | 160 | 250 | 335 | 410 | 460 | 470 | 505 | 585 | ?? |