4 Python Extraction Code

The following data extraction and cleaning code is from simulation.ipynb. This file used to be where I did exploratory analysis of the data and tried some models. Further Python code for the model and analysis was later ported to R so has not been included.

The Python notebook is also used to generate a rudimentary config file, which I later added additional data to (note to self: don’t run it or it will overwrite!). It seems that now Rmarkdown nicely integrates Python, but the Reticulate package was only released in August 2018. Oh well. I’ve put the Python code in here so technically, if I ran this notebook right this moment, it would do it all.

4.1 Pre-processing

(Unix) Launch with

cd src
jupyter notebook

Notes

  • read_binary_solution is able to read from Vida’s outputs and turn it into a well/FP mapping. However I don’t use it now because she doesn’t map all of the wells. Instead I manually enter it from Data for AU.xlsx.
  • Here we read in the Liquid Wells spreadsheet. My script manages to pick up at least 30 of the approx. 60 wells in here, but I manually copy/pasted the rest because it was too annoying.
  • The network structure and parameters are read from a config spreadsheet, config.xlsx. I generate as much as I can automatically and then fill any gaps manually.
  • The AIC for the linear model (the one I eventually used) is the lowest. This backs up the DIC figures I came up with for the Bayesian model later.
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from matplotlib.colorbar import ColorbarBase
from IPython.display import display, HTML
import itertools
import os
import pyjags
import warnings
base_year = '2000'    # numeric dates calculated from Jan-01
configpath = '../wairakei_data/config.xlsx'
def read_binary_solution(path='../wairakei_data/toy-network-v4.xlsm'):
    # read from Vida's toy model workbook
    xlfile = pd.ExcelFile(path)
    sheet = xlfile.parse('Full LP')
    sheet = sheet.loc[sheet.count(1)>50]  # arbitrary, anything works
    sheet = sheet.transpose()
    sheet.columns = ['used', 'combination']
    combinations = pd.DataFrame([x.split('-') for x in sheet.query('used==1')['combination']],
                               columns = ['well', 'fp'])
    combinations['well'] = 'wk' + combinations['well']
    combinations['fp'] = 'fp' + combinations['fp']
    return(combinations)
def myprint(df):
    display(HTML(df.to_html()))
    
def central(data, m=3.29):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
def rename_wk(names):
    # fix 'WK' inconsistencies
    new = names.str.lower()
    new = new.str.replace("^[^\d]*", "wk")
    new = new.str.strip()
    return new
    
def datetime_to_numeric(my_datetime):
    # returns days since base_year-01-01.
    try:
        # datetime implem
        date_numeric = (my_datetime - datetime(int(base_year),1,1)) / timedelta(days=1)
    except:
        # numpy implem
        date_numeric = (my_datetime - np.datetime64(base_year)) / np.timedelta64(1, 'D')
    return date_numeric
# Check if Excel file is already in memory (loading is slow)
try:    xl
except: xl = pd.ExcelFile('../wairakei_data/Liquid wells (version 1).xlsx')
sheetlist = [x for x in xl.sheet_names if set(x) & set('FtT(L') == set()]
print("Sheets:", ', '.join(sheetlist))
# sheets to load data from

Sheets: WK26A, WK26B, WK27 curve, WK28, WK46, WK55 curve, WK59, WK66 curve, WK67 curve, WK68, WK70, WK71, WK72, WK74, WK76, WK81 curve, WK82, WK83, WK88, WK92, WK96, WK101, WK116, w124, WK207, WK215, WK222, WK229, wk242 , wk243, wk244 , wk245 , wk247, w253, w254, wk255, wk256, w258, w259, w260, w261, w262, wk263, w264, w265, w266, w267, w268, w269, WK270, WK271, WK272, WK216, WK65, WK118, 253

sheets = ['wk255', 'wk256']
dfs = []
for sheet in sheets:
    try:
        df = xl.parse(sheet)  # select well data
        df['well'] = sheet    # label data with well name
        dfs.append(df)
    except:
        print('Failed on sheet', sheet)
df = pd.concat(dfs)
df = df[['date', 'whp', 'mf', 'h', 'well']]           # only keep certain columns
df['well'] = rename_wk(df['well'])
df['mf'] = pd.to_numeric(df['mf'], errors='coerce')   # remove 'dummy' entries
df = df.dropna(subset=['date', 'whp', 'mf'])          # remove NA
df['date_numeric'] = datetime_to_numeric(df['date']) #  yrs since base_year
df = df.reset_index(drop=True)
wells = df['well'].unique()
print(df.head())
    date        whp          mf            h   well  date_numeric

0 2008-12-18 13.028442 507.776080 1030.000000 wk255 3274.0 1 2008-12-18 14.841479 283.165920 1030.000000 wk255 3274.0 2 2008-12-18 14.939021 208.716385 1030.000000 wk255 3274.0 3 2009-04-08 13.466667 498.071714 1054.230235 wk255 3385.0 4 2009-04-08 13.800000 457.879864 1040.519288 wk255 3385.0

regression_df = df.reset_index(drop=True)
wells = regression_df['well'].unique()
print(wells)
# import and process data

[‘wk255’ ‘wk256’]

try:    fpxl
except: fpxl = pd.ExcelFile('../wairakei_data/Data for AU.xlsx')
    
fpdf = pd.read_excel(fpxl, 'calculation', header=1, usecols="D:E, J:L, N:P")
fpdf = fpdf.rename(columns={"FP15": "well", "Unnamed: 1": "fp",
                            "hf": "hf_ip", "hg": "hg_ip", "hfg": "hfg_ip",
                            "hf.1": "hf_lp", "hg.1": "hg_lp", "hfg.1": "hfg_lp"})
 # make sure it has the necessary data
fpdf = fpdf[pd.to_numeric(fpdf['hf_ip'], errors='coerce').notnull()]
for col in ['well', 'fp']:
    fpdf[col] = fpdf[col].str.lower()
fpdf[fpdf.columns] = fpdf[fpdf.columns].apply(pd.to_numeric, errors='ignore')
print(fpdf.head())
well    fp       hf_ip        hg_ip       hfg_ip       hf_lp        hg_lp  \

0 wk255 fp15 677.625311 2757.984943 2080.359632 461.792989 2691.196937
1 wk256 fp15 677.625311 2757.984943 2080.359632 461.792989 2691.196937
2 wk251 fp15 677.625311 2757.984943 2080.359632 461.792989 2691.196937
3 wk250 fp15 677.625311 2757.984943 2080.359632 461.792989 2691.196937
4 wk252 fp15 677.625311 2757.984943 2080.359632 461.792989 2691.196937

    hfg_lp  

0 2229.403949
1 2229.403949
2 2229.403949
3 2229.403949
4 2229.403949

def write_config(configpath):
    # only use if it gets lost. Will refresh file
    well_fp_map = pd.DataFrame({'well': ['wk27', 'wk242', 'wk247', 'wk253', 'wk254',
    'wk255', 'wk256', 'wk258', 'wk259', 'wk267', 'wk268', 'wk269', 'wk270', 'wk271', 'wk272'],
                                'fp':   ['fp1',  'fp14',  'fp15', 'fp16',  'fp16',
                                'fp15',  'fp15',  'fp16',  'fp16',  'fp16',  'fp16',
                                'fp15',  'fp15',  'fp14',  'fp14']},
                               columns=['well', 'fp'])
    fp_gen_map = pd.DataFrame({'fp':     ['abandoned', 'poi dry', 'direct ip', 'fp1',
                               'fp14', 'fp15', 'fp16', 'fp2', 'fp4', 'fp5', 'fp9-10'],
                               'gen_ip': [ None,       'POI',      None,       'WRK',
                               'WRK',  'THI',  'POI',  'WRK', 'WRK', 'WRK', 'WRK'   ],
                               'gen_lp': [ None,       'POI',      None,       'WRK',
                               'WRK',  'THI',  'POI',  'WRK', 'WRK', 'WRK', 'WRK'   ],
                               'gen_w':  [ None,        None,      None,       'BIN',
                               None,   None,   None,  'BIN', 'BIN', 'BIN', 'BIN'   ]},
                              columns=['fp', 'gen_ip', 'gen_lp', 'gen_w'])
    gen_constants = pd.DataFrame({'gen':    ['WRK', 'THI', 'BIN', 'POI' ],
                                  'ip':     [ True,  True,  False, True ],
                                  'lp':     [ True,  True,  False, True ],
                                  'bin':    [ False, False, True,  False],
                                  'factor': [ 9.2,   8.22,  178.9, 7.76]},  # m3/MW
                                 columns=['gen', 'ip', 'lp', 'bin', 'factor'])
                                 
    # find details of the last known operating conditions
    last_idx = regression_df.groupby('well')['date_numeric'].idxmax()
    operating_conditions = regression_df.iloc[last_idx][['well', 'whp', 'h']]
    # set constants (could use median)
    fp_constants = fpdf.groupby('fp').mean().reset_index()
    if os.path.exists(configpath):
        os.remove(configpath)
    config_writer = pd.ExcelWriter('../wairakei_data/config.xlsx')
    print("Writing config data to", configpath)
    configdata = {'well_fp_map': well_fp_map,
                  'fp_gen_map': fp_gen_map,
                  'operating_conditions': operating_conditions,
                  'fp_constants': fp_constants,
                  'gen_constants': gen_constants}
    for sheetname, df in configdata.items():
        df.to_excel(config_writer, sheetname, index=False)
    # config_writer.save() # Don't overwrite!
    return pd.ExcelFile(configpath)
try:
    config = pd.ExcelFile(configpath)
except FileNotFoundError:
    print("Warning: are you sure you want to overwrite the config file?")
#     config = write_config(configpath)
## Exploratory Analysis
import itertools
cmap = plt.get_cmap('viridis')
fig, (ax1, ax2) = plt.subplots(1,2, figsize=[14,4])
fig.tight_layout() #spreads out the plots
# left plot (not that useful tbh)
df.plot('date', 'whp', style='x', ax=ax1)
ax1.set_xlabel('date')
ax1.set_ylabel('whp')
# right plot (different colours represent time)
marker = itertools.cycle(['o', ',', '+', 'x', '*', '.'])
for well in wells:
    plt.scatter('whp', 'mf', c='date_numeric',
    data=df.loc[df['well']==well], marker=next(marker), label=well)
ax2.set_xlabel('whp')
ax2.set_ylabel('mf')
plt.legend()
plt.show()
## Set up regression data and create prediction frame for plotting

date_pred = np.arange(df['date'].min(), df['date'].max(),
        np.timedelta64(365*2, 'D').astype(datetime))
date_numeric_pred = datetime_to_numeric(date_pred)
whp_pred = np.linspace(0, 16, 1000)
well_pred = wells
pred = pd.DataFrame(list(itertools.product(date_numeric_pred, whp_pred, well_pred)),
                    columns=['date_numeric', 'whp', 'well'])
print(pred.head(3))
## Perform regression and prediction

date_numeric whp well 0 3274.0 0.000000 wk255 1 3274.0 0.000000 wk256 2 3274.0 0.016016 wk255

from statsmodels.formula.api import ols
# Not conditioned on date
model1 = ols("mf ~ well * whp", data=df)
results1 = model1.fit()
pred['mf1'] = results1.predict(pred)
# Linear fit dependent on date
model2 = ols("mf ~ well * (whp + date_numeric)", data=df)
results2 = model2.fit()
pred['mf2'] = results2.predict(pred)
# Elliptic fit dependent on date
model3 = ols("np.power(mf,2) ~ well * (np.power(whp,2) + date_numeric)", data=df)
results3 = model3.fit()
pred['mf3^2'] = results3.predict(pred)
pred.loc[pred['mf3^2'] < 0, 'mf3^2'] = np.nan    # remove invalid results
pred['mf3'] = np.sqrt(pred['mf3^2'])
print(pred.head(3))
# ===============================================================
# Set up axes
# ===============================================================
# colors

date_numeric whp well mf1 mf2 mf3^2
0 3274.0 0.000000 wk255 798.291912 2116.142850 960405.251630
1 3274.0 0.000000 wk256 524.858769 1878.420017 910600.744186
2 3274.0 0.016016 wk255 797.774653 2114.209523 960404.240037

      mf3  

0 980.002679
1 954.254025
2 980.002163

cmap = plt.get_cmap('viridis')
indices = np.linspace(0, cmap.N, len(df))
my_colors = [cmap(int(i)) for i in indices]
# subplots
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=[14,4],
        gridspec_kw={'width_ratios': [9,9,9,1]})
ax1.get_shared_y_axes().join(ax1, ax2, ax3)
ax1.set_ylim([0, 1000])
ax1.set_xlim(0, 16)
ax1.set_ylabel('Mass flow')
ax1.set_xlabel("Well head pressure")
ax1.set_title('$mf \sim whp$')
ax2.set_title('$mf \sim whp + date$')
ax3.set_title('$mf^2 \sim whp^2 + date$')
# create date colorbar
indices = np.linspace(0, cmap.N, len(date_pred))
my_colors = [cmap(int(i)) for i in indices]
norm = Normalize(np.min(df['date']).year, np.max(df['date']).year)
cb = ColorbarBase(ax4, cmap=cmap, norm=norm, orientation='vertical')
cb.set_label('Year')
linestyles = itertools.cycle(('-', '--', '-.', ':'))
marker = itertools.cycle(['o', ',', '+', 'x', '*', '.'])
# ===============================================================
# Plot data
# ===============================================================
# plot raw data points
for well in wells:
   mkr = next(marker)
   for ax in [ax1, ax2, ax3]:
       ax.scatter('whp', 'mf', c='date_numeric',
              data=df.loc[df['well']==well], marker=mkr, label=well)
 
# plot fitted curves
for well in wells:
    lty = next(linestyles)
    # model 1
    # 'data' argument filters the data to just the data from one well
    ax1.plot('whp', 'mf1', lty, data=pred[(pred['well']==well)])
   
    # models 2 & 3
    for i, date in enumerate(date_numeric_pred):
        # 'data' argument similar, for a specific prediction date in the loop
        ax2.plot('whp', 'mf2', lty,
        data=pred[(pred['well']==well) & (pred['date_numeric']==date)], c=my_colors[i])
        ax3.plot('whp', 'mf3', lty,
        data=pred[(pred['well']==well) & (pred['date_numeric']==date)], c=my_colors[i])
       
# show model selection criteria
for ax, result in zip([ax1, ax2, ax3], [results1, results2, results3]):
    ax.legend(['Adj $R^2$: %.2f' % result.rsquared_adj,
            'AIC: %.2f' % result.aic],
            handlelength=0, handletextpad=0, loc=1).legendHandles[0].set_visible(False)
             
plt.show()

4.2 Preview Data

Let’s have a look at what we get out of the pre-processing.

We begin with the regression data.

py$df %>% head %>% kable(caption="Well calibration data") %>%
  kable_styling %>% scroll_box(width="100%")
Table 4.1: Well calibration data
date whp mf h well date_numeric
2008-12-18 13.02844 507.7761 1030.000 wk255 3274
2008-12-18 14.84148 283.1659 1030.000 wk255 3274
2008-12-18 14.93902 208.7164 1030.000 wk255 3274
2009-04-08 13.46667 498.0717 1054.230 wk255 3385
2009-04-08 13.80000 457.8799 1040.519 wk255 3385
2009-04-08 14.45000 388.3241 1030.130 wk255 3385

Now, the flash plant operating conditions.

py$fpdf %>% as.data.frame %>% head %>% kable(caption="Flash plant operational parameters") %>%
  kable_styling %>% scroll_box(width="100%")
Table 4.2: Flash plant operational parameters
well fp hf_ip hg_ip hfg_ip hf_lp hg_lp hfg_lp
0 wk255 fp15 677.6253 2757.985 2080.36 461.793 2691.197 2229.404
1 wk256 fp15 677.6253 2757.985 2080.36 461.793 2691.197 2229.404
2 wk251 fp15 677.6253 2757.985 2080.36 461.793 2691.197 2229.404
3 wk250 fp15 677.6253 2757.985 2080.36 461.793 2691.197 2229.404
4 wk252 fp15 677.6253 2757.985 2080.36 461.793 2691.197 2229.404
5 wk238 fp15 677.6253 2757.985 2080.36 461.793 2691.197 2229.404

Here’s a (deterministic) example of what we expect to get out - mass flow predictions per well. But we will use JAGS to get stochastic samples rather than this regression.

Table 4.3: Mass flow prediction demo
date_numeric whp well mf1 mf2 mf3^2 mf3
3274 0.0000000 wk255 798.2919 2116.143 960405.3 980.0027
3274 0.0000000 wk256 524.8588 1878.420 910600.7 954.2540
3274 0.0160160 wk255 797.7747 2114.210 960404.2 980.0022
3274 0.0160160 wk256 524.6518 1876.548 910599.6 954.2534
3274 0.0320320 wk255 797.2574 2112.276 960401.2 980.0006
3274 0.0320320 wk256 524.4448 1874.676 910596.0 954.2515
3274 0.0480480 wk255 796.7401 2110.343 960396.1 979.9980
3274 0.0480480 wk256 524.2379 1872.804 910590.0 954.2484
3274 0.0640641 wk255 796.2229 2108.410 960389.1 979.9944
3274 0.0640641 wk256 524.0309 1870.932 910581.7 954.2440
3274 0.0800801 wk255 795.7056 2106.476 960380.0 979.9898
3274 0.0800801 wk256 523.8240 1869.060 910571.0 954.2384