Introduction to Python for Chemical Engineers

Jacob Albrecht

Principal Scientist

Product Development, Bristol-Myers Squibb

October 2, 2019

The challenges ChEs face

  • Applying fundamental knowledge requires non-linear models and calculus
  • Performing routine data manipulation that yields consistent results
  • Retrieving the work of others and sharing the results we produce

The technology gap

  • With technology, data sets are growing larger and are increasingly stored online

  • Manual processing doesn't scale well

  • Work is more collaborative, and results must be transparent and reproducible

Software fills the gap, many disparate tools exist

  • Spreadsheets and other third-party tools can't extend to all situations

  • Programming languages fill the gap to solve our unique problems, with their own challenges

  • This webinar sprang from an AIChE Engage discussion forum on computational tools for beginning engineers

Python may be a valuable tool in your toolbox

  • Very versatile, good for most needs

  • Links into some fields of research with rapidly advancing technology

  • Easy language to start learning - has been described as fun

  • Great option for self-development

Investing in Python will pay off

  • Can be applied to any ChE problem, technical or not

  • Free or low-cost learning resources are available

  • A great way to add agility to your style of work

In [1]:
print('Hello ChEs')
Hello ChEs

About the Python Programming Language

Fun facts

  • Released 1991 by Guido van Rossum

  • Named after the Monty Python comedy group

Why Python, now?

  • Large user base and mature capabilities with Python 3

  • Being taught to the next generation of engineers

  • Great language for self-paced learning

  • Already widely shared by ChEs through repositories on GitHub.com

DANGER ZONE

"Are you saying Python is 'better' than another (my) approach?"

The best programming language depends on the task!

  • Is the desired result created from the program, or is it the program itself?

  • Is it the optimal use of constrained resources (i.e. your time)?

  • Does it work with your computer hardware & OS?

  • Is it appropriate for customers/collaborators/stakeholders?

  • Can it be maintained and supported?

Why not to Python?

  • Although it's relatively easy to learn, it's still a programming language

  • Open source:

    • Only some core capabilities are well maintained

    • New capabilities can appear suddenly

    • Most support is from a volunteer community

    • Requires time spent searching and reading documentation

  • Pure Python is not well suited for HPC calculations

How to get started in 2019

Python basics

  • A scripting language: interpreted, dynamically typed, lazy

  • Pure Python code is plain text saved as .py files

  • Text file with commands that are read by a Python interpreter program and converted into computer instructions

In [4]:
pi = 3.14

2*pi
Out[4]:
6.28

Everything in Python has extra properties and functionality that can be accessed with the dot (.) operator.

In [5]:
'asdf'.upper()
Out[5]:
'ASDF'

With limited time, only some of the basic concepts are presented next.

Jupyter notebooks can help documenting and running Python code

A Jupyter notebook is a document (.ipynb file) you edit with a web browser. It consists of a single column of 'cells' that can contain either code or markdown text. Markdown adds formatting to plain text fields:

  • Enclosing words with asterisks like **bold** and *italic* makes them bold and italic

  • Wrapping text with backticks (`) like `code statements` will appear to look like code

  • For bulleted (or numbered) lists, just type - (or 1.) before each line

  • For heading formats use #, ##, etc. at the start of each line

  • Can write math expressions: typing `$y_i=\sqrt{x_i^2}$` renders as $y_i=\sqrt{x_i^2}$

  • Extra settings and features (plotting, reporting run time) are controlled by "magic commands" starting with % (a short example follows below)

  • Notebooks can be exported to plain code (.py), PDF, or slides

The notebook and the kernel (the computing engine that runs the code) can be served from your local computer or a remote server.
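
For example, a cell can begin with the `%%time` magic to report how long that cell takes to run (a minimal sketch; the timing output will vary by machine):

%%time
total = sum(range(1_000_000))  # %%time reports the run time of this cell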

Python is extended with imported packages

  • Python doesn't come with built-in capabilities for technical problems

  • ~200k packages are available

  • Key packages for scientific or engineering tasks:

    • numpy: matrices, linear algebra, some math functions

    • scipy: scientific and engineering functions

      • other scikits, e.g. scikit-image, scikit-learn

    • pandas: best for working with tables of data (data frames)

    • matplotlib: the standard plotting library; seaborn is another that adds functionality for data visualization

    • sympy: symbolic math

In [6]:
import numpy as np
np.sqrt(4)
Out[6]:
2.0

Packages are installed with package managers

Conda package manager

  • Best for managing multiple environments (see the example below)

  • Connects to packages housed on Anaconda.org

  • Installs non-python package dependencies too

$> conda install numpy
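
Creating a fresh environment works the same way; a hypothetical example (the environment name py4che is arbitrary):

$> conda create -n py4che python=3.7 numpy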

Pip package manager

  • Standard package manager

  • Connects to python packages housed on pypi.org

$> pip install numpy

Python syntax

  • Anything after a # is considered to be a comment and is ignored by the computer, triple quotes """ are used around a block of comments

  • Index counts start at [0]

  • Spaces and indentation matter! Where other languages use { }, Python uses indentation to group statements (see the sketch after this list)

  • To exponentiate use ** not ^

  • To get help for a function, type ? followed by the function name, e.g. ?np.sqrt

  • Single ' or double quotes " can both be used to denote text strings
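
A minimal sketch pulling several of these rules together:

# comments start with a hash and are ignored
values = [1, 2, 3]
squares = []
for v in values:          # indentation groups the loop body
    squares.append(v**2)  # use ** to exponentiate, not ^
squares[0]                # index counts start at [0], so this is 1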

Python data types

There are some basic data types

  • int: integer (-1, 42)
  • float: real (floating-point) number (3.141)
  • bool: boolean (True or False)
  • str: text string ('asdf')

Some special types of data:

  • None: Python's placeholder for an empty value
  • numpy.nan: a placeholder for something that is not a number
  • numpy.inf: a placeholder for infinity

Working with types:

  • The data type of a variable is usually inferred by Python
  • The data type can be determined with the type command
  • Data types can be interconverted, as in the sketch below
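
A minimal sketch of checking and converting types:

x = 3.141
type(x)   # float
int(x)    # 3, truncated toward zero
str(x)    # '3.141'
bool(0)   # False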

Data can be collected in structures

The data types can be collected into different structures. Most common are:

  • list: a list of items, one-dimensional, uses square brackets: []

  • dict: a collection of named (keyed) values, uses curly brackets: {}

  • numpy.array: a multi-dimensional matrix

  • pandas.DataFrame: a 2-d table with row and column names

Structures are very flexible: they can contain other structures, as in the sketch below.
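
A minimal sketch with a hypothetical reactor record, nesting a dict and a list inside another dict:

conditions = {'T': 300, 'P': 2.0}   # dict of named values
samples = [0.0, 5.0, 10.0]          # list of sample times
run = {'conditions': conditions,    # structures containing structures
       'samples': samples}
run['conditions']['T']              # returns 300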

numpy arrays

The numpy package is a powerful tool for working with matrices. The array is its most common data structure. One way to define an array is using a list of lists:

In [7]:
import numpy as np
m = np.array([[1,2,3],[4,5,6]])
m
Out[7]:
array([[1, 2, 3],
       [4, 5, 6]])

Indexing can be done with square brackets: [row,column]. Indexes start at zero!

In [8]:
m[0,1]
Out[8]:
2

Select the second row and all columns with the slice : operator

In [9]:
m[1,:]
Out[9]:
array([4, 5, 6])

Plotting data

In [10]:
import matplotlib.pyplot as plt  # a common plotting library
%matplotlib inline 
x = np.linspace(0,5,num = 100)
ysin = np.sin(x*np.pi)
plt.plot(x,ysin,'-o',label = 'sin')
# plot another curve:
ycos = np.cos(x*np.pi)
plt.plot(x,ycos,'.',label = 'cos')
plt.legend()
Out[10]:
<matplotlib.legend.Legend at 0x7fe4e1508a90>

Solving ChE problems with Python

Numerical Methods

  • Linear Algebra

  • Non-linear Regression

  • Integrating systems of differential equations (ODEs, PDEs, DAEs)

General tasks

  • Merging and analyzing multiple files

  • Pulling data through network connections

  • Automating other programs

Python has powerful tools for numerical methods

Well-established and cutting-edge algorithms are available.

The numpy package is ideal for matrix manipulations

The scipy package includes many capabilities for basic optimization and integration tasks.

To generate empirical models from data, scikit-learn has a wide range of models available.
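
As a small illustration of the scikit-learn interface, a minimal sketch fitting a straight line to made-up data (the numbers here are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per row
y = np.array([2.1, 3.9, 6.2, 8.1])          # made-up responses
model = LinearRegression().fit(X, y)
model.coef_, model.intercept_               # fitted slope and intercept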

Matrix operations in Python

Do some linear algebra: $$ \left[\begin{matrix} 3 & 1 \\ 1 & 2 \\ \end{matrix}\right] \left[\begin{matrix}x_0\\x_1\end{matrix}\right] = \left[\begin{matrix}9\\8\end{matrix}\right] $$

Solve for $x_0$ and $x_1$

In [11]:
import numpy as np
a = np.array([[3,1], [1,2]])
b = np.array([9,8])
x = np.linalg.solve(a, b)
x
Out[11]:
array([2., 3.])

Check that the solution is correct by recovering b using a dot product:

In [12]:
np.dot(a,x)
Out[12]:
array([9., 8.])

Curve fitting in Python

Fitting the sine-wave data

In [13]:
from scipy.optimize import curve_fit
def my_sine(x,a):
    return(np.sin(x*a))
x = np.linspace(0,5,num = 100)
ysin = np.sin(x*np.pi)
opt, cov = curve_fit(my_sine,x,ysin,p0=3)
print('Best fit: {}, covariance: {}'.format(opt,cov))
Best fit: [3.14159265], covariance: [[0.]]

Integration in Python

For example, to integrate a differential equation for a second-order reaction: $\frac{dC}{dt} = -r$ where $r = kC^2$, define a function that returns the derivative at a given time and concentration, and use the solve_ivp function:

In [14]:
from scipy.integrate import solve_ivp
def dc_dt(t,C):
    r = k*C**2
    return(-r)
k = 0.1
t_span = [0,100]  # range of time to use
C0 = [1]
soln = solve_ivp(dc_dt,t_span,C0)
plt.plot(soln.t,soln.y[0,:]);

Python can do symbolic math

Symbolic manipulations are also possible with the sympy library. For example, to integrate the same differential equation for a second-order reaction, $\frac{dC}{dt} = -r$ where $r = kC^2$, symbolic expressions can be created.

In [15]:
from sympy import symbols, Function, Eq, dsolve, Derivative
t, k, C0 = symbols('t k C0')
C = Function('C')
r = k*C(t)**2
deq = Eq(Derivative(C(t),t),-r)
deq
Out[15]:
$\displaystyle \frac{d}{d t} C{\left(t \right)} = - k C^{2}{\left(t \right)}$
In [16]:
dsolve(deq,ics={C(0):C0})
Out[16]:
$\displaystyle C{\left(t \right)} = \frac{1}{k t + \frac{1}{C_{0}}}$

Ten Problems, solved with Python

The well-known set of ten problems spanning a wide range of applications has been solved using Python's technical computing capabilities.

Interactive notebooks for the solutions are available in the Ten_Problems folder here:

https://mybinder.org/v2/gh/chepyle/Python4ChEs/master?urlpath=lab

Reference: “The Use of Mathematical Software Packages in Chemical Engineering”, Michael B. Cutlip, John J. Hwalek, Eric H. Nuttal, Mordechai Shacham, Workshop Material from Session 12, Chemical Engineering Summer School, Snowbird, Utah, Aug. 1997.

Data aggregation example

Merging multiple files collected over time or from different people is a common practice. Another task is merging information in two tables that share a common reference ID. This example will go through managing the results of a hypothetical experiment.

In this example, a 3-factor, full-factorial design of experiments was conducted. Eight experimental conditions were tested, and the results were collected in eight files. The only way to link the results to the conditions is the experiment number referenced in each file name. We need to:

  1. Collect all of the results into one table
  2. Merge the results with the input conditions
  3. Plot the response versus the three factors

Get the files

To get a list of the file names available in a folder, the standard-library glob module is useful

In [17]:
import glob
file_list = glob.glob('./ExampleData/*.xlsx')
file_list
Out[17]:
['./ExampleData/condition 1.xlsx',
 './ExampleData/condition 2.xlsx',
 './ExampleData/condition 3.xlsx',
 './ExampleData/condition 4.xlsx',
 './ExampleData/condition 5.xlsx',
 './ExampleData/condition 6.xlsx',
 './ExampleData/condition 7.xlsx',
 './ExampleData/condition 8.xlsx',
 './ExampleData/condition info.xlsx']

Read the files

Excel files can be read into Python using the pandas.read_excel function

In [18]:
import pandas as pd
expt_info  = pd.read_excel('./ExampleData/condition info.xlsx')

Inspect the sheet with experiment info:

In [19]:
expt_info
Out[19]:
Expt Temperature Pressure Concentration
0 1 20 1 50
1 2 30 1 50
2 3 20 2 50
3 4 30 2 50
4 5 20 1 100
5 6 30 1 100
6 7 20 2 100
7 8 30 2 100

Inspect one of the data files for the first condition

In [20]:
pd.read_excel('./ExampleData/condition 1.xlsx') 
Out[20]:
Time Response
0 0 0.0
1 5 1.7
2 10 2.6
3 15 3.3
4 20 4.0
5 25 4.1

Merge the files

Now loop through all of the files, and if there is a number in the file name:

  1. Store the experiment ID in a new column of the result DataFrame

  2. Collect each of the DataFrames in a list to merge later

The re (regular expression) package, standard in Python, is an extremely powerful tool to find and extract key parts of text fields.

In [21]:
import re
dfs = []
for each_file in file_list:
    file_num = re.search('([0-9])',each_file) 
    if file_num:                
        df = pd.read_excel(each_file)  
        df['Expt'] = int(file_num[0]) 
        dfs.append(df)      

Inspect the last DataFrame that was read:

In [22]:
df
Out[22]:
Time Response Expt
0 0 0.0 8
1 5 7.0 8
2 10 10.1 8
3 15 10.0 8
4 20 11.2 8
5 25 11.6 8

Concatenate each of the result tables, and look at a random sample of some of the rows:

In [23]:
df_cat = pd.concat(dfs)
print('There are {} Rows and {} Columns'.format(*df_cat.shape))
df_cat.sample(5)
There are 48 Rows and 3 Columns
Out[23]:
Time Response Expt
4 20 5.5 3
3 15 9.6 6
3 15 4.7 2
1 5 4.8 6
1 5 2.7 3

Merge the collected results with the Experiment info, joining the tables on the value in the 'Expt' column:

In [24]:
df_all = df_cat.merge(expt_info,on='Expt')
print('There are {} Rows and {} Columns'.format(*df_all.shape))
df_all.sample(5)
There are 48 Rows and 6 Columns
Out[24]:
Time Response Expt Temperature Pressure Concentration
46 20 11.2 8 30 2 100
14 10 4.2 3 20 2 50
10 20 5.1 2 30 1 50
0 0 0.0 1 20 1 50
42 0 0.0 8 30 2 100

Plot the data

For plotting the data, use the seaborn library; it has convenient functions for plotting tabular data

In [25]:
import seaborn as sns
g = sns.FacetGrid(df_all, row="Pressure", col="Concentration",
                  hue='Temperature', margin_titles=True)
g.map(sns.scatterplot,'Time','Response')
g.add_legend();

With a few lines of Python, we were able to merge 9 files and create a plot that we can use to draw conclusions. Best of all, we have the code, meaning:

  • Any new data can be added with minimal effort

  • The process can be repeated by anyone in the future to follow our steps and reproduce the results

Python is great for pulling in remote data

To access data stored remotely, some websites offer application programming interfaces (APIs) that allow people to access the service via a computer-generated request. In this example, a custom function is used to access the Environmental Protection Agency's Toxic Release Inventory database. This database contains records of toxic waste disposal, including chemical information and disposal method. The full workflow is in TRI_API.ipynb. As with any exploratory data analysis, you should get familiar with the data before you draw conclusions. This link is a good resource
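
The general pattern for such a request, sketched with the requests library (the endpoint and parameters here are hypothetical, not the actual EPA interface):

import requests
import pandas as pd
url = 'https://example.com/api/records'          # hypothetical endpoint
resp = requests.get(url, params={'year': 2015})  # send the query
resp.raise_for_status()                          # stop if the request failed
df = pd.DataFrame(resp.json())                   # tabulate a JSON list of records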

In [26]:
import hvplot.pandas
import pandas as pd
from EPA_TRI import TRI_Query
df = TRI_Query(year=['>',2014])
Downloading Records: |################################| 724705/724705 100.0% - 0s
In [27]:
df.shape
Out[27]:
(724705, 114)

The column names are a bit long, so store them in variables. Next, create a new data frame that is a subset of the full data, containing only waste management labels related to energy recovery.

In [28]:
wm_type = 'TRI_TRANSFER_QTY.TYPE_OF_WASTE_MANAGEMENT'
qty = 'TRI_TRANSFER_QTY.TOTAL_TRANSFER'
yr = 'TRI_REPORTING_FORM.REPORTING_YEAR'
chem_name = 'TRI_REPORTING_FORM.CAS_CHEM_NAME'

subset = df.loc[df[wm_type].isin(['Energy Recovery',
                                  'Transfer to Waste Broker - Energy Recovery'])]

print('Shape after filtering:',subset.shape)
Shape after filtering: (54610, 114)

Find the largest quantities used in energy recovery

In [29]:
top_chems = subset.groupby(chem_name)[qty].agg(sum).sort_values(ascending=False).head(5)
top_chems
Out[29]:
TRI_REPORTING_FORM.CAS_CHEM_NAME
METHANOL                  3.744541e+08
TOLUENE                   2.731070e+08
XYLENE (MIXED ISOMERS)    1.985653e+08
ETHYLENE                  6.031269e+07
STYRENE                   5.686213e+07
Name: TRI_TRANSFER_QTY.TOTAL_TRANSFER, dtype: float64

Plot the quantities over time:

In [30]:
top_chems_by_year = subset.loc[subset[chem_name].isin(top_chems.index)].groupby([chem_name,yr])[qty].agg(sum)
In [31]:
(top_chems_by_year/2204).hvplot(kind='bar',
                                title='Energy Recovery: Top Chemicals over Time',
                                ylabel='Metric Tons').opts(xrotation=90)

Python is great for automating other programs

Python can also be used to simulate mouse and keyboard inputs using the pywinauto library

Python is already integrated into many commercial software products; search the documentation for a "Python API"

Web browsers can be automated using the selenium library
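
A minimal selenium sketch, assuming the selenium package and a matching Chrome driver are installed:

from selenium import webdriver
driver = webdriver.Chrome()           # launch a browser (needs chromedriver on PATH)
driver.get('https://www.python.org')  # navigate to a page
print(driver.title)                   # read the page title
driver.quit()                         # close the browser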

How to learn more & get help – online resources

Try:

My Advice:

  • Failures and debugging errors are a necessary part of coding
  • Motivation is a start, but habit builds skills
  • Budget the time, and use it to solve something important to you

Learning more:

Ask questions, share problems:

This presentation

Power up with Python

In [32]:
import antigravity

https://xkcd.com/353/