Introduction to Python for Chemical Engineers

Jacob Albrecht

Principal Scientist

Product Development, Bristol-Myers Squibb

October 2, 2019

The challenges ChEs face

  • Applying fundamental knowledge requires non-linear models and calculus
  • Performing routine data manipulation that yields consistent results
  • Retrieving the work of others and sharing the results we produce

The technology gap

  • With technology, data sets are growing larger and are increasingly stored online

  • Manual processing doesn't scale well

  • Work is more collaborative, and results must be transparent and reproducible

Software fills the gap, many disparate tools exist

  • Spreadsheets and other third-party tools can't extend to all situations

  • Programming languages fill the gap to solve our unique problems, with their own challenges

  • This webinar sprang from an AIChE Engage discussion forum on computational tools for beginning engineers

Python may be a valuable tool in your toolbox

  • Very versatile, good for most needs

  • Links into some fields of research with rapidly advancing technology

  • Easy language to start learning - has been described as fun

  • Great option for self-development

Investing in Python will pay off

  • Can be applied to any ChE problem, technical or not

  • Free or low-cost learning resources are available

  • A great way to add agility to your style of work

In [1]:
print('Hello ChEs')
Hello ChEs

About the Python Programming Language

Fun facts

  • Released 1991 by Guido van Rossum

  • Named after the Monty Python comedy group

Why Python, now?

  • Large user base and mature capabilities with Python 3

  • Being taught to the next generation of engineers

  • Great language for self-paced learning

  • Already widely shared by ChEs through repositories on GitHub.com

DANGER ZONE

"Are you saying Python is 'better' than another (my) approach?"

The best programming language depends on the task!

  • Is the desired result created from the program, or is it the program itself?

  • Is it the optimal use of constrained resources (i.e. your time)?

  • Does it work with your computer hardware & OS?

  • Is it appropriate for customers/collaborators/stakeholders?

  • Can it be maintained and supported?

Why not to Python?

  • Although it's relatively easy to learn, it's still a programming language

  • Open source:

    • Only some core capabilities are well maintained

    • New capabilities can appear suddenly

    • Most support is from a volunteer community

    • Requires time spent searching and reading documentation

  • Pure Python is not well suited for HPC calculations

How to get started in 2019

Python basics

  • A scripting language: interpreted, dynamically typed, lazy

  • Pure Python code is plain text saved as .py files

  • Text file with commands that are read by a Python interpreter program and converted into computer instructions

In [4]:
pi = 3.14

2*pi
Out[4]:
6.28

Everything in Python has extra properties and functionality that can be accessed with the dot (.) operator.

In [5]:
'asdf'.upper()
Out[5]:
'ASDF'

With limited time, only some of the basic concepts are presented next.

Jupyter notebooks can help documenting and running Python code

A Jupyter notebook is a document (.ipynb file) you edit with a web browser. It consists of a single column of 'cells' that can contain either code or markdown text. Markdown adds formatting to plain text fields:

  • Enclosing words with asterisks like **bold** and *italic* makes them bold and italic

  • Wrapping text with backticks (`) like `code statements` will appear to look like code

  • For bulleted (or numbered) lists, just type - (or 1.) before each line

  • For heading formats use #, ##, etc. at the start of each line

  • Can write math expressions: typing `$y_i=\sqrt{x_i^2}$` renders as $y_i=\sqrt{x_i^2}$

  • Extra settings and features (plotting, reporting run time) are controlled by "magic commands" starting with % (a short example follows below)

  • Notebooks can be exported to plain code (.py), PDF, or slides

The notebook and the kernel (the computing engine that runs the code) can be served from your local computer or a remote server.
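
For example, a cell can begin with the `%%time` magic to report how long that cell takes to run (a minimal sketch; the timing output will vary by machine):

%%time
total = sum(range(1_000_000))  # %%time reports the run time of this cell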

Python is extended with imported packages

  • Python doesn't come with built-in capabilities for technical problems

  • ~200k packages are available

  • Key packages for scientific or engineering tasks:

    • numpy: matrices, linear algebra, some math functions

    • scipy: scientific and engineering functions

      • other scikits, e.g. scikit-image, scikit-learn

    • pandas: best for working with tables of data (data frames)

    • matplotlib: the standard plotting library; seaborn is another that adds functionality for data visualization

    • sympy: symbolic math

In [6]:
import numpy as np
np.sqrt(4)
Out[6]:
2.0

Packages are installed with package managers

Conda package manager

  • Best for managing multiple environments (see the example below)

  • Connects to packages housed on Anaconda.org

  • Installs non-python package dependencies too

$> conda install numpy
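
Creating a fresh environment works the same way; a hypothetical example (the environment name py4che is arbitrary):

$> conda create -n py4che python=3.7 numpy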

Pip package manager

  • Standard package manager

  • Connects to python packages housed on pypi.org

$> pip install numpy

Python syntax

  • Anything after a # is considered to be a comment and is ignored by the computer, triple quotes """ are used around a block of comments

  • Index counts start at [0]

  • Spaces and indentation matter! Where other languages use { }, Python uses indentation to group statements (see the sketch after this list)

  • To exponentiate use ** not ^

  • To get help for a function, type ? followed by the function name, e.g. ?np.sqrt

  • Single ' or double quotes " can both be used to denote text strings
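
A minimal sketch pulling several of these rules together:

# comments start with a hash and are ignored
values = [1, 2, 3]
squares = []
for v in values:          # indentation groups the loop body
    squares.append(v**2)  # use ** to exponentiate, not ^
squares[0]                # index counts start at [0], so this is 1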

Python data types

There are some basic data types

  • int: integer (-1, 42)
  • float: real (floating-point) number (3.141)
  • bool: boolean (True or False)
  • str: text string ('asdf')

Some special types of data:

  • None: Python's placeholder for an empty value
  • numpy.nan: a placeholder for something that is not a number
  • numpy.inf: a placeholder for infinity

Working with types:

  • The data type of a variable is usually inferred by Python
  • The data type can be determined with the type command
  • Data types can be interconverted, as in the sketch below
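
A minimal sketch of checking and converting types:

x = 3.141
type(x)   # float
int(x)    # 3, truncated toward zero
str(x)    # '3.141'
bool(0)   # False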

Data can be collected in structures

The data types can be collected into different structures. Most common are:

  • list: a list of items, one-dimensional, uses square brackets: []

  • dict: a collection of named (keyed) values, uses curly brackets: {}

  • numpy.array: a multi-dimensional matrix

  • pandas.DataFrame: a 2-d table with row and column names

Structures are very flexible: they can contain other structures, as in the sketch below.
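
A minimal sketch with a hypothetical reactor record, nesting a dict and a list inside another dict:

conditions = {'T': 300, 'P': 2.0}   # dict of named values
samples = [0.0, 5.0, 10.0]          # list of sample times
run = {'conditions': conditions,    # structures containing structures
       'samples': samples}
run['conditions']['T']              # returns 300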

numpy arrays

The numpy package is a powerful tool for working with matrices. The array is its most common data structure. One way to define an array is using a list of lists:

In [7]:
import numpy as np
m = np.array([[1,2,3],[4,5,6]])
m
Out[7]:
array([[1, 2, 3],
       [4, 5, 6]])

Indexing can be done with square brackets: [row,column]. Indexes start at zero!

In [8]:
m[0,1]
Out[8]:
2

Select the second row and all columns with the slice : operator

In [9]:
m[1,:]
Out[9]:
array([4, 5, 6])

Plotting data

In [10]:
import matplotlib.pyplot as plt  # a common plotting library
%matplotlib inline 
x = np.linspace(0,5,num = 100)
ysin = np.sin(x*np.pi)
plt.plot(x,ysin,'-o',label = 'sin')
# plot another curve:
ycos = np.cos(x*np.pi)
plt.plot(x,ycos,'.',label = 'cos')
plt.legend()
Out[10]:
<matplotlib.legend.Legend at 0x7fe4e1508a90>

Solving ChE problems with Python

Numerical Methods

  • Linear Algebra

  • Non-linear Regression

  • Integrating systems of differential equations (ODEs, PDEs, DAEs)

General tasks

  • Merging and analyzing multiple files

  • Pulling data through network connections

  • Automating other programs

Python has powerful tools for numerical methods

Well-established and cutting-edge algorithms are available.

The numpy package is ideal for matrix manipulations

The scipy package includes many capabilities for basic optimization and integration tasks.

To generate empirical models from data, scikit-learn has a wide range of models available.
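
As a small illustration of the scikit-learn interface, a minimal sketch fitting a straight line to made-up data (the numbers here are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per row
y = np.array([2.1, 3.9, 6.2, 8.1])          # made-up responses
model = LinearRegression().fit(X, y)
model.coef_, model.intercept_               # fitted slope and intercept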

Matrix operations in Python

Do some linear algebra: $$ \left[\begin{matrix} 3 & 1 \\ 1 & 2 \\ \end{matrix}\right] \left[\begin{matrix}x_0\\x_1\end{matrix}\right] = \left[\begin{matrix}9\\8\end{matrix}\right] $$

Solve for $x_0$ and $x_1$

In [11]:
import numpy as np
a = np.array([[3,1], [1,2]])
b = np.array([9,8])
x = np.linalg.solve(a, b)
x
Out[11]:
array([2., 3.])

Check that the solution is correct by recovering b using a dot product:

In [12]:
np.dot(a,x)
Out[12]:
array([9., 8.])

Curve fitting in Python

Fitting the sine-wave data

In [13]:
from scipy.optimize import curve_fit
def my_sine(x,a):
    return(np.sin(x*a))
x = np.linspace(0,5,num = 100)
ysin = np.sin(x*np.pi)
opt, cov = curve_fit(my_sine,x,ysin,p0=3)
print('Best fit: {}, covariance: {}'.format(opt,cov))
Best fit: [3.14159265], covariance: [[0.]]

Integration in Python

For example, to integrate a differential equation for a second-order reaction: $\frac{dC}{dt} = -r$ where $r = kC^2$, define a function that returns the derivative at a given time and concentration, and use the solve_ivp function:

In [14]:
from scipy.integrate import solve_ivp
def dc_dt(t,C):
    r = k*C**2
    return(-r)
k = 0.1
t_span = [0,100]  # range of time to use
C0 = [1]
soln = solve_ivp(dc_dt,t_span,C0)
plt.plot(soln.t,soln.y[0,:]);

Python can do symbolic math

Symbolic manipulations are also possible with the sympy library. For example, to integrate the same differential equation for a second-order reaction, $\frac{dC}{dt} = -r$ where $r = kC^2$, symbolic expressions can be created.

In [15]:
from sympy import symbols, Function, Eq, dsolve, Derivative
t, k, C0 = symbols('t k C0')
C = Function('C')
r = k*C(t)**2
deq = Eq(Derivative(C(t),t),-r)
deq
Out[15]:
$\displaystyle \frac{d}{d t} C{\left(t \right)} = - k C^{2}{\left(t \right)}$
In [16]:
dsolve(deq,ics={C(0):C0})
Out[16]:
$\displaystyle C{\left(t \right)} = \frac{1}{k t + \frac{1}{C_{0}}}$

Ten Problems, solved with Python

The well-known set of ten problems spanning a wide range of applications has been solved using Python's technical computing capabilities.

Interactive notebooks for the solutions are available in the Ten_Problems folder here:

https://mybinder.org/v2/gh/chepyle/Python4ChEs/master?urlpath=lab

Reference: “The Use of Mathematical Software Packages in Chemical Engineering”, Michael B. Cutlip, John J. Hwalek, Eric H. Nuttal, Mordechai Shacham, Workshop Material from Session 12, Chemical Engineering Summer School, Snowbird, Utah, Aug. 1997.

Data aggregation example

Merging multiple files collected over time or from different people is a common practice. Another task is merging information in two tables that share a common reference ID. This example will go through managing the results of a hypothetical experiment.

In this example, a 3-factor, full-factorial design of experiments was conducted. Eight experimental conditions were tested, and the results were collected in eight files. The only way to link the results to the conditions is the experiment number referenced in each file name. We need to:

  1. Collect all of the results into one table
  2. Merge the results with the input conditions
  3. Plot the response versus the three factors

Get the files

To get a list of the file names available in a folder, the standard-library glob module is useful

In [17]:
import glob
file_list = glob.glob('./ExampleData/*.xlsx')
file_list
Out[17]:
['./ExampleData/condition 1.xlsx',
 './ExampleData/condition 2.xlsx',
 './ExampleData/condition 3.xlsx',
 './ExampleData/condition 4.xlsx',
 './ExampleData/condition 5.xlsx',
 './ExampleData/condition 6.xlsx',
 './ExampleData/condition 7.xlsx',
 './ExampleData/condition 8.xlsx',
 './ExampleData/condition info.xlsx']

Read the files

Excel files can be read into Python using the pandas.read_excel function

In [18]:
import pandas as pd
expt_info  = pd.read_excel('./ExampleData/condition info.xlsx')

Inspect the sheet with experiment info:

In [19]:
expt_info
Out[19]:
Expt Temperature Pressure Concentration
0 1 20 1 50
1 2 30 1 50
2 3 20 2 50
3 4 30 2 50
4 5 20 1 100
5 6 30 1 100
6 7 20 2 100
7 8 30 2 100

Inspect one of the data files for the first condition

In [20]:
pd.read_excel('./ExampleData/condition 1.xlsx') 
Out[20]:
Time Response
0 0 0.0
1 5 1.7
2 10 2.6
3 15 3.3
4 20 4.0
5 25 4.1

Merge the files

Now loop through all of the files, and if there is a number in the file name:

  1. Store the experiment ID in a new column of the result DataFrame

  2. Collect each of the DataFrames in a list to merge later

The re (regular expression) package, standard in Python, is an extremely powerful tool to find and extract key parts of text fields.

In [21]:
import re
dfs = []
for each_file in file_list:
    file_num = re.search('([0-9])',each_file) 
    if file_num:                
        df = pd.read_excel(each_file)  
        df['Expt'] = int(file_num[0]) 
        dfs.append(df)      

Inspect the last DataFrame that was read:

In [22]:
df
Out[22]:
Time Response Expt
0 0 0.0 8
1 5 7.0 8
2 10 10.1 8
3 15 10.0 8
4 20 11.2 8
5 25 11.6 8

Concatenate each of the result tables, and look at a random sample of some of the rows:

In [23]:
df_cat = pd.concat(dfs)
print('There are {} Rows and {} Columns'.format(*df_cat.shape))
df_cat.sample(5)
There are 48 Rows and 3 Columns
Out[23]:
Time Response Expt
4 20 5.5 3
3 15 9.6 6
3 15 4.7 2
1 5 4.8 6
1 5 2.7 3

Merge the collected results with the Experiment info, joining the tables on the value in the 'Expt' column:

In [24]:
df_all = df_cat.merge(expt_info,on='Expt')
print('There are {} Rows and {} Columns'.format(*df_all.shape))
df_all.sample(5)
There are 48 Rows and 6 Columns
Out[24]:
Time Response Expt Temperature Pressure Concentration
46 20 11.2 8 30 2 100
14 10 4.2 3 20 2 50
10 20 5.1 2 30 1 50
0 0 0.0 1 20 1 50
42 0 0.0 8 30 2 100

Plot the data

For plotting the data, use the seaborn library; it has convenient functions for plotting tabular data

In [25]:
import seaborn as sns
g = sns.FacetGrid(df_all, row="Pressure", col="Concentration",
                  hue='Temperature', margin_titles=True)
g.map(sns.scatterplot,'Time','Response')
g.add_legend();

With a few lines of Python, we were able to merge 9 files and create a plot that we can use to draw conclusions. Best of all, we have the code, meaning:

  • Any new data can be added with minimal effort

  • The process can be repeated by anyone in the future to follow our steps and reproduce the results

Python is great for pulling in remote data

To access data stored remotely, some websites offer application programming interfaces (APIs) that allow people to access the service via a computer-generated request. In this example, a custom function is used to access the Environmental Protection Agency's Toxic Release Inventory database. This database contains records of toxic waste disposal, including chemical information and disposal method. The full workflow is in TRI_API.ipynb. As with any exploratory data analysis, you should get familiar with the data before you draw conclusions. This link is a good resource
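
The general pattern for such a request, sketched with the requests library (the endpoint and parameters here are hypothetical, not the actual EPA interface):

import requests
import pandas as pd
url = 'https://example.com/api/records'          # hypothetical endpoint
resp = requests.get(url, params={'year': 2015})  # send the query
resp.raise_for_status()                          # stop if the request failed
df = pd.DataFrame(resp.json())                   # tabulate a JSON list of records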

In [26]:
import hvplot.pandas
import pandas as pd
from EPA_TRI import TRI_Query
df = TRI_Query(year=['>',2014])
Downloading Records: |################################| 724705/724705 100.0% - 0s
In [27]:
df.shape
Out[27]:
(724705, 114)

The column names are a bit long, so store them in variables. Next, create a new data frame that is a subset of the full data, containing only waste management labels related to energy recovery.

In [28]:
wm_type = 'TRI_TRANSFER_QTY.TYPE_OF_WASTE_MANAGEMENT'
qty = 'TRI_TRANSFER_QTY.TOTAL_TRANSFER'
yr = 'TRI_REPORTING_FORM.REPORTING_YEAR'
chem_name = 'TRI_REPORTING_FORM.CAS_CHEM_NAME'

subset = df.loc[df[wm_type].isin(['Energy Recovery',
                                  'Transfer to Waste Broker - Energy Recovery'])]

print('Shape after filtering:',subset.shape)
Shape after filtering: (54610, 114)

Find the largest quantities used in energy recovery

In [29]:
top_chems = subset.groupby(chem_name)[qty].agg(sum).sort_values(ascending=False).head(5)
top_chems
Out[29]:
TRI_REPORTING_FORM.CAS_CHEM_NAME
METHANOL                  3.744541e+08
TOLUENE                   2.731070e+08
XYLENE (MIXED ISOMERS)    1.985653e+08
ETHYLENE                  6.031269e+07
STYRENE                   5.686213e+07
Name: TRI_TRANSFER_QTY.TOTAL_TRANSFER, dtype: float64

Plot the quantities over time:

In [30]:
top_chems_by_year = subset.loc[subset[chem_name].isin(top_chems.index)].groupby([chem_name,yr])[qty].agg(sum)
In [31]:
(top_chems_by_year/2204).hvplot(kind='bar',
                                title='Energy Recovery: Top Chemicals over Time',
                                ylabel='Metric Tons').opts(xrotation=90)

Python is great for automating other programs

Python can also be used to simulate mouse and keyboard inputs using the pywinauto library

Python is already integrated into many commercial software products; search the documentation for a "Python API"

Web browsers can be automated using the selenium library
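
A minimal selenium sketch, assuming the selenium package and a matching Chrome driver are installed:

from selenium import webdriver
driver = webdriver.Chrome()           # launch a browser (needs chromedriver on PATH)
driver.get('https://www.python.org')  # navigate to a page
print(driver.title)                   # read the page title
driver.quit()                         # close the browser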

How to learn more & get help – online resources

Try:

My Advice:

  • Failures and debugging errors are a necessary part of coding
  • Motivation is a start, but habit builds skills
  • Budget the time, and use it to solve something important to you

Learning more:

Ask questions, share problems:

This presentation

Power up with Python

In [32]:
import antigravity

https://xkcd.com/353/