Machine Learning Majorite barometer
Model posted on 2021-02-11, 10:38, authored by Andrew Thomson, Michael Walter, Anirudh Prabhu, Simon Kohn
A machine learning barometer (using Random Forest Regression) to calculate equilibration pressure for majoritic garnets
Updated 04/02/21 (21/01/21) (10/12/20):
The barometer code
The barometer is provided as python scripts (.py) and Jupyter Notebook (.ipynb) files. These are completely equivalent to one another, and which is used depends on the user's preference. Separate instructions are provided for each.
Data files included in this repository are:
• "Majorite_database_04022021.xlsm" (Excel sheet of literature majoritic garnet compositions: inclusions (up to date as of 04/02/2021) and experiments (up to date as of 03/07/2020). This data includes all compositions that are close to majoritic, although some are borderline. Filtering, as described in the paper accompanying this barometer, is performed in the python script prior to any data analysis or fitting)
• "lit_maj_nat_030720.txt" (python script input file of experimental literature majoritic garnet compositions - taken from dataset above)
• "di_incs_040221.txt" (python script input file of literature compilation of majoritic garnet inclusions observed in natural diamonds - taken from the dataset above)
The barometer as Jupyter Notebooks - including integrated Caret validation (added 21/01/2021)
For those less familiar with Python, running the barometer as a Notebook is somewhat more intuitive than running the scripts below. It also has the benefit of including the RFR validation using Caret within a single integrated notebook. The Jupyter Notebook requires a suitable Python3 environment (with pandas, numpy, matplotlib, sklearn, rpy2 and pickle packages + dependencies). We recommend installing the latest anaconda python distribution (found here https://docs.anaconda.com/anaconda/install/) and creating a custom environment containing the required packages to run the Jupyter Notebook (as both python3 and R must be active in the environment). Instructions on this procedure can be found here (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html); to assist, we have provided a copy of the environment used to produce the scripts (barom-spec-file.txt).
An identical conda environment (called myenv) can be created and used by:
1) copying barom-spec-file.txt to a suitable location (e.g. your home directory)
2) running the command
conda create --name myenv --file barom-spec-file.txt
3) entering this environment
conda activate myenv
4) running an instance of Jupyter Notebook by typing
jupyter notebook
Two Notebooks are provided:
• calculate_pressures_notebook.ipynb (equivalent to calculate_pressures.py described below)
• rfr_majbar_10122020_notebook.ipynb (equivalent to rfr_majbar_10122020.py described below) but also including integrated Caret validation performed using the rpy2 package in a single notebook environment
The barometer as scripts (10/12/2020)
The scripts below need to be run in a suitable Python3 environment (with pandas, numpy, matplotlib, sklearn and pickle packages + dependencies). For inexperienced users we recommend installing the latest anaconda python distribution (found here https://docs.anaconda.com/anaconda/install/) and running the scripts in Spyder (a GUI scripting environment provided with Anaconda).
Note - if running python 3.7 (or earlier) you will need to install the pickle5 package to use the provided barometer files, and comment / uncomment the appropriate lines in the “calculate_pressures.py” (lines 16/17) and “rfr_majbar_10122020.py” (lines 26/27) scripts.
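As an illustration of what the comment / uncomment switch achieves, the sketch below chooses the pickle module from the running Python version instead. It is not the code in the provided scripts: the helper name load_pickle is hypothetical, and it assumes pickle5 has been installed on Python 3.7 or earlier.

```python
import sys

# Assumption: on Python 3.7 or earlier the pickle5 backport has been installed.
# From Python 3.8 onward the standard-library pickle reads protocol-5 files itself.
if sys.version_info >= (3, 8):
    import pickle
else:
    import pickle5 as pickle


def load_pickle(path):
    """Load one pickled object (for example the RFR model or the scaler) from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

A call such as load_pickle("pickle_model_20201210.pkl") would then behave the same way on either Python version, provided the file is in the working directory.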
The user may additionally need to download and install any required packages that are not provided with the anaconda distribution (pandas, numpy, matplotlib, scikit-learn and pickle). A missing package will be obvious because, when run, the script will return an error similar to “No module named XXXX”.
Packages can either be installed using the anaconda package manager or in the command line / terminal via commands such as:
conda install -c conda-forge pickle5
Appropriate command line installation commands can be obtained via searching the anaconda cloud at anaconda.org for each required package.
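Before running the scripts, it can be quicker to check for missing packages in one go than to wait for an import error. The following is a minimal sketch using only the Python standard library; the function name missing_packages and the REQUIRED list are illustrative and not part of the provided scripts.

```python
import importlib.util

# Import names of the packages listed above; scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name reported as missing can then be installed with the anaconda package manager as described above.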
A python script (.py) is provided to calculate pressures for any majoritic garnet using the barometer calibrated in Thomson et al. (2021)
• calculate_pressures.py script takes an input file of any majoritic garnet compositions (an example input file is provided, "example_test_data.txt", containing inclusion compositions reported by Zedgenizov et al., 2014, Chemical Geology, 363, pp 114-124).
• employs the published RFR model and scaler - both provided as pickle files (pickle_model_20201210.pkl, scaler_20201210.pkl)
The user can simply edit the input file name in the provided .py script and then run the script in a suitable python3 environment (requires pandas, numpy, sklearn and pickle packages). The script initially filters the data for majoritic compositions (according to the criteria used for barometer calibration) and predicts pressures for these compositions. It writes out the pressures and 2 x std_dev in the pressure estimates, alongside the input data, into "out_pressures_test.txt".
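The load, scale, predict and write sequence described above can be sketched as follows. This is not the contents of calculate_pressures.py: the stand-in model and scaler are fitted to synthetic numbers so that the example runs without the published .pkl files, the column names are hypothetical, StandardScaler is an assumption about the scaler type, and the majorite filtering and the 2 x std_dev column are omitted.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
columns = ["Si", "Al", "Mg"]  # hypothetical input columns, not the real variable list

# Stand-in calibration on synthetic numbers only (NOT the published barometer).
X_cal = rng.random((60, len(columns)))
p_cal = 6.0 + 10.0 * X_cal[:, 0] + rng.normal(0.0, 0.2, 60)
scaler = StandardScaler().fit(X_cal)  # assumption: a standardising scaler
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(scaler.transform(X_cal), p_cal)

# Save both objects as pickle files, mimicking the files shipped with the barometer.
workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model_sketch.pkl")
scaler_path = os.path.join(workdir, "scaler_sketch.pkl")
for path, obj in ((model_path, model), (scaler_path, scaler)):
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def predict_pressures(table, model_file, scaler_file):
    """Load the pickled model and scaler, then append predicted pressures to the table."""
    with open(model_file, "rb") as fh:
        rfr = pickle.load(fh)
    with open(scaler_file, "rb") as fh:
        sc = pickle.load(fh)
    scaled = sc.transform(table[columns].to_numpy())
    out = table.copy()
    out["P (GPa)"] = rfr.predict(scaled)
    return out


# Hypothetical unknown compositions, written out alongside their predicted pressures.
unknowns = pd.DataFrame(rng.random((5, len(columns))), columns=columns)
results = predict_pressures(unknowns, model_path, scaler_path)
out_path = os.path.join(workdir, "out_pressures_sketch.txt")
results.to_csv(out_path, sep="\t", index=False)
```

With the published files, the two paths would instead point at pickle_model_20201210.pkl and scaler_20201210.pkl, and the input table would be read from the user's composition file.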
*** If this script produces any errors or warnings, the most likely cause is that the serialised pickle files provided are not compatible with the python build being used (this is a common issue with serialised ML models). Please first try installing the pickle5 package and commenting/uncommenting lines 16/17. If this is unsuccessful, run the full barometer calibration script below (using the same input files as in Thomson et al. (2021), which are provided) to produce pickle files compatible with the python build on the local machine (action 5 of the script below). Then edit the filenames called in the “calculate_pressures.py” script (lines 22 & 27) to match the new barometer calibration files and re-run the calculate pressure script.
The output (predicted pressures) for the test dataset provided, using the published calibration, should be similar to the following results:
P (GPa) error (GPa)
Full RFR barometer calibration script -
rfr_majbar_10122020.py - the RFR barometer calibration script used and described in Thomson et al. (2021). This script performs the following actions:
1) filters input data - outputs this filtered data as a .txt file (which is the input expected for the RFR validation script using the R package Caret)
2) fits 1000 RFR models each using a randomly selected training dataset (70% of the input data)
3) performs leave-one-out validation
4) plots figure 5 from Thomson et al. (2021)
5) fits one single RFR barometer using all input data (saves this and the scaler as .pkl files with a datestamp for use in the "calculate_pressures.py" script)
6) calculates the pressure for all literature inclusion compositions over 100 iterations with randomly distributed compositional uncertainties added - provides the mean pressure and 2 std deviations, written alongside the input inclusion compositions, as a .txt output file "diout.txt"
7) plots the global distribution of majoritic inclusion pressures
The RFR barometer can be easily updated to include (or exclude) additional experimental compositions by modifying the literature data input files provided.
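Actions 2, 5 and 6 above can be illustrated with a short scikit-learn sketch on synthetic data. It is not the published script: the data, the number of repeats (5 here rather than 1000), the forest size, the Gaussian form of the compositional noise and its size (sigma) are all assumptions chosen to keep the example small and fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for the filtered experimental dataset (NOT real compositions).
X = rng.random((80, 3))
p = 6.0 + 10.0 * X[:, 0] + rng.normal(0.0, 0.2, 80)

# Action 2 analogue: refit on repeated random 70 % training subsets and
# record the misfit on the held-out 30 % each time.
n_repeats = 5  # the calibration script uses 1000
test_rmse = []
for seed in range(n_repeats):
    X_tr, X_te, p_tr, p_te = train_test_split(X, p, train_size=0.7, random_state=seed)
    rfr = RandomForestRegressor(n_estimators=30, random_state=seed).fit(X_tr, p_tr)
    residuals = rfr.predict(X_te) - p_te
    test_rmse.append(float(np.sqrt(np.mean(residuals ** 2))))

# Action 5 analogue: one final model fitted to all of the data.
final = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, p)

# Action 6 analogue: Monte Carlo propagation of compositional uncertainty.
n_iter = 100
sigma = 0.02  # hypothetical 1-sigma uncertainty, assumed Gaussian
unknowns = rng.random((4, 3))  # hypothetical inclusion compositions
draws = np.empty((n_iter, len(unknowns)))
for i in range(n_iter):
    perturbed = unknowns + rng.normal(0.0, sigma, unknowns.shape)
    draws[i] = final.predict(perturbed)

mean_p = draws.mean(axis=0)               # mean pressure per composition
two_sd = 2.0 * draws.std(axis=0, ddof=1)  # 2 standard deviations per composition
```

The spread of test_rmse over the repeats indicates how sensitive the fit is to the choice of training subset, while two_sd reflects only the assumed compositional noise.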
RFR validation using Caret in R (script titled “RFR_validation_03072020.R”)
Additional validation tests of the RFR barometer were completed using the Caret package in R. The script requires the filtered experimental dataset file "data_filteredforvalidation.txt" (which is generated by the rfr_majbar_10122020.py script if required for a new dataset). It performs bootstrap, K-fold and leave-one-out validation, and outputs validation statistics for 5, 7 and 9 input variables (elements).
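For readers who do not use R, the K-fold and leave-one-out schemes can be sketched in Python with scikit-learn's cross-validation utilities. This is a swapped-in analogue, not the Caret workflow in RFR_validation_03072020.R: the data are synthetic, the fold count and forest size are arbitrary, and bootstrap validation is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)

# Synthetic stand-in for "data_filteredforvalidation.txt" (NOT real data).
X = rng.random((40, 3))
p = 6.0 + 10.0 * X[:, 0] + rng.normal(0.0, 0.2, 40)


def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed values."""
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))


model = RandomForestRegressor(n_estimators=20, random_state=0)

# K-fold: every sample is predicted by a model trained on the other folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_pred = cross_val_predict(model, X, p, cv=kfold)

# Leave-one-out: every sample is predicted by a model trained on all the others.
loo_pred = cross_val_predict(model, X, p, cv=LeaveOneOut())

print("K-fold RMSE:", rmse(kfold_pred, p))
print("Leave-one-out RMSE:", rmse(loo_pred, p))
```

Repeating the calls with different column subsets of X would mimic the comparison across 5, 7 and 9 input variables.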
Please email Andrew Thomson (firstname.lastname@example.org) if you have any questions or queries.
Calcium Perovskite: the forgotten mantle phase
Natural Environment Research Council