Python IO¶

Input and output of Python

Python & txt (general text file)¶

Basic Python functions for reading and writing files.

Python & Saving an object¶

standard lib: JSON and Pickle

Python & Numpy¶

Input and output of NumPy arrays.

Python & CSV¶

Using the Pandas module.

Python & MatLab¶

Using scipy's io module.

This is more of a summary than a lecture. You don't need to understand everything in it; learn each part when you need it.¶

Python & txt (general text file)¶

Create a new file and write something into it.

We will mainly discuss three functions: open, write, and read.

Ref:

The official document for open

The official document for file object

Let's start with a simple example: first, look at what is already inside our file.

In [1]:

with open('./text/text.txt','r') as f:
    print(f.read())

Hello World!
1,0.235,0.645,0.457
1  0.2350000  0.6450000  0.4570000

The argument 'r' means 'read mode', so the file cannot be changed through this file object.

Now open the file and write something into it.

In [2]:

with open('./text/text.txt','w') as f:
    f.write('this is a test file.\n')

'w' means 'write mode', which creates the file or erases its existing content, and '\n' starts a new line. Now look at how the file has changed.

In [3]:

with open('./text/text.txt','r') as f:
    print(f.read())

this is a test file.

As you may notice, the former "Hello World!" text has been replaced by the new text. So what if I just want to add something to the file?

You can use the 'append' mode, namely change the argument from 'w' to 'a'.

In [4]:

with open('./text/text.txt','a') as f:
    f.write("Hello World!\n")

In [5]:

with open('./text/text.txt','r') as f:
    print(f.read())

this is a test file.
Hello World!

One may find a detailed explanation of the modes here (documentation for Python 2; Python 3 adds an "exclusive" mode, see the official documentation for more information).

You may also notice that each time we open a file, we start with "with". If you do it this way, Python closes the file automatically. Otherwise you need to close the file manually.

Never forget to close the file: an open file holds system resources, and buffered writes may not reach the disk until it is closed. You can do it this way:
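As a small illustration of the "exclusive" mode mentioned above, the sketch below opens a file with 'x', which creates the file and raises FileExistsError if it already exists. The temporary directory and the file name new_file.txt are assumptions made only so the example can run anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "new_file.txt")  # hypothetical file name

    # 'x' creates the file and fails if it already exists.
    with open(path, "x") as f:
        f.write("created once\n")

    # A second attempt with 'x' must not overwrite the existing file.
    try:
        with open(path, "x") as f:
            f.write("this line is never written\n")
        second_attempt_failed = False
    except FileExistsError:
        second_attempt_failed = True

    with open(path, "r") as f:
        content = f.read()

print(second_attempt_failed, repr(content))
```

This mode is handy when overwriting an existing result file by accident would be costly.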

In [6]:

f = open('./text/text.txt','a+')
f.write("Hello World!\n")
f.close()

f = open('./text/text.txt','r')
data = f.read()
print(data)
f.close()

this is a test file.
Hello World!
Hello World!
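To make the relation between "with" and a manual close() more concrete, here is a minimal sketch. It assumes a temporary directory instead of the './text' folder used above, and the ValueError is raised on purpose to simulate a failure in the middle of writing.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text.txt")

    # Manual version: try/finally guarantees close() runs even on error.
    f = open(path, "w")
    try:
        f.write("Hello World!\n")
    finally:
        f.close()
    closed_manually = f.closed

    # 'with' version: the file is closed when the block exits,
    # including when an exception is raised inside it.
    try:
        with open(path, "a") as g:
            g.write("Hello again!\n")
            raise ValueError("simulated failure")
    except ValueError:
        pass
    closed_by_with = g.closed

    with open(path, "r") as h:
        content = h.read()

print(closed_manually, closed_by_with)
print(content, end="")
```

Both file objects report closed, and the text written before the simulated failure is still on disk.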

You can also use the print function to write text.

In [7]:

with open('./text/text.txt', 'w') as f:
    print('Hello World!', file=f)

In [8]:

with open('./text/text.txt','r') as f:
    print(f.read())

Hello World!

The string method "format" is useful when you need to customize your output:

Official ref here

In [9]:

with open('./text/text.txt','a') as f:
    f.write('1,0.235,0.645,0.457\n')
    f.write('{:d}  {:3.7f}  {:3.7f}  {:3.7f}\n'.format(1,0.235,0.645,0.457))
with open('./text/text.txt','r') as f:
    print(f.read())

Hello World!
1,0.235,0.645,0.457
1  0.2350000  0.6450000  0.4570000

You can see the difference between the two outputs.

A self-explanatory example follows:

References can be found here (Chinese version) and here (English version).

In [10]:

a = "{:.2f} {:+.2f}  {:.0f}   {:.2%} \n".format(3.14,3.14,3.14,3.14)
b = "{:+d} {:+d} {:-d} {:-d} \n".format(+1,-1,+1,-1)
c = "{4} {3} {2} {1} {0} \n".format(1,2,3,4,5)
d = "{0:0>6d} {0:0<6d} {0:3>6d} {0} \n".format(5)
e = "{:5d} {:5d} {:5d} {:5d} {:5d} \n".format(1,2,3,4,5)
f = "{:5d} {:<5d} {:5d} {:^5d} {:5d} \n".format(1,2,3,4,5)
print(a,b,c,d,e,f)

3.14 +3.14  3   314.00% 
 +1 -1 1 -1 
 5 4 3 2 1 
 000005 500000 333335 5 
     1     2     3     4     5 
     1 2         3   4       5
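The same format specifications also work inside f-strings (available from Python 3.6), which is an alternative to the "format" method rather than something used in this notebook. A minimal sketch, where the variable names are chosen only for illustration:

```python
value = 3.14
n = 5

# The part after the colon is the same mini-language used by str.format.
a = f"{value:.2f} {value:+.2f} {value:.0f} {value:.2%}"
d = f"{n:0>6d} {n:0<6d} {n:^5d}|"
row = f"{1:d}  {0.235:3.7f}  {0.645:3.7f}  {0.457:3.7f}"

print(a)
print(d)
print(row)
```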

Now we look at reading.

There are three "read" methods, namely "read", "readline" and "readlines". See their output below:

In [11]:

with open('./text/text.txt','r') as f:
    print(f.read())
    
with open('./text/text.txt','r') as f:
    print(f.readlines())
    
with open('./text/text.txt','r') as f:
    print(f.readline())

with open('./text/text.txt','r') as f:
    for line in f:
        print(line)

Hello World!
1,0.235,0.645,0.457
1  0.2350000  0.6450000  0.4570000

['Hello World!\n', '1,0.235,0.645,0.457\n', '1  0.2350000  0.6450000  0.4570000\n']
Hello World!

Hello World!

1,0.235,0.645,0.457

1  0.2350000  0.6450000  0.4570000

read() returns the whole content as a single string.

readlines() returns a list with one string per line.

readline() returns only one line of the text.

The first two methods load all the text into memory at once, while readline() loads the text one line at a time; iterating over the file object with a for loop does the same. If your file is very big (e.g. 4 GB, or larger than your memory), read and readlines are not recommended.

For a big file, you may consider reading it this way:

In [12]:

with open('./text/text.txt','r') as f:
    line = f.readline()
    while line:
        print(line)
        line = f.readline()

Hello World!

1,0.235,0.645,0.457

1  0.2350000  0.6450000  0.4570000
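The blank lines in the output above appear because each line already ends with '\n' and print adds another one. The sketch below reads a file line by line with a for loop and strips the trailing newline before using the line. The temporary file and its two lines are assumptions that mirror the content of text.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text.txt")
    with open(path, "w") as f:
        f.write("Hello World!\n1,0.235,0.645,0.457\n")

    line_count = 0
    fields = []
    with open(path, "r") as f:
        for line in f:                 # reads one line at a time
            line = line.rstrip("\n")   # drop the trailing newline
            line_count += 1
            if "," in line:
                # Turn a comma-separated line into a list of floats.
                fields = [float(x) for x in line.split(",")]
            print(line)

print(line_count, fields)
```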

We have just introduced the very basic reading and writing features of Python. But one may notice that all the discussion above is based on the string type. If you have a list or dict, saving it to a text file is not a good choice, because you need to reconstruct the list or dict from the text.

Is there any other way of saving an object?

Python & Saving an object¶

First we're going to introduce Pickle.

Official document here

Usage of Pickle is simple:

In [13]:

import pickle

data = ['HKUST','PHYS',6810,'Rm 4402','Lift17-18']
with open('./pickle/data_pickle','wb') as f:
    pickle.dump(data,f)

In [14]:

with open('./pickle/data_pickle','rb') as f:
    data = pickle.load(f)
    for i in data:
        print(i)

HKUST
PHYS
6810
Rm 4402
Lift17-18

You can also store multiple objects in one file like this:

In [15]:

f = open('./pickle/somedata', 'wb')
pickle.dump([1, 2, 3, 4], f)
pickle.dump('hello', f)
pickle.dump({'Apple', 'Pear', 'Banana'}, f)
f.close()

f = open('./pickle/somedata', 'rb')
print(pickle.load(f))
print(pickle.load(f))
print(pickle.load(f))
f.close()

[1, 2, 3, 4]
hello
{'Pear', 'Banana', 'Apple'}

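You can see that the objects are loaded back in the order they were dumped. (The set prints in a different order because sets are unordered.)

Pickle can also save a function. Note that it stores only a reference to the function (its module and name), not the code itself, so the function must be importable when the file is loaded.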
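When you do not know how many objects a pickle file contains, one common pattern is to keep calling pickle.load until it raises EOFError. The sketch below assumes an in-memory io.BytesIO buffer in place of the './pickle/somedata' file so that it runs without any folder on disk.

```python
import io
import pickle

buffer = io.BytesIO()  # in-memory stand-in for a file opened with 'wb'
for obj in ([1, 2, 3, 4], "hello", {"Apple", "Pear", "Banana"}):
    pickle.dump(obj, buffer)

buffer.seek(0)  # go back to the start, like reopening with 'rb'
loaded = []
while True:
    try:
        loaded.append(pickle.load(buffer))
    except EOFError:  # raised when no pickled object is left
        break

print(loaded)
```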

In [16]:

import math
with open('./pickle/function','wb') as f:
    pickle.dump(math.cos,f)

In [17]:

with open('./pickle/function','rb') as f:
    Cos = pickle.load(f)
    print(Cos(0))

1.0
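A short check of what pickle actually stores for a function: the pickled bytes contain the module and function names, and loading gives back the very same function object. A lambda has no importable name, so pickling it is expected to fail. Both exception types are caught here as an assumption, because the exact type may differ between Python versions.

```python
import math
import pickle

payload = pickle.dumps(math.cos)
# The payload holds the names 'math' and 'cos', not the function's code.
stores_name_only = b"math" in payload and b"cos" in payload
restored = pickle.loads(payload)

try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(stores_name_only, restored is math.cos, lambda_pickled)
```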

As you may notice, we add a "b" to the mode when we open the file. This indicates that the data pickle saves is binary, which is not human-readable.

If a human-readable file is needed, one may consider JSON.

JSON is short for "JavaScript Object Notation", which is a common data format.

JSON only supports the None, bool, int, float, and str data types, plus lists, tuples, and dicts that contain them. Tuples are written as JSON arrays, so they are read back as lists.

Let's see some examples.

In [18]:

import json

data = {
    'name' : 'ACME',
    'shares' : 100,
    'price' : 542.23
}

with open('./pickle/data.json','w') as f:
    json.dump(data, f)

In [19]:

with open('./pickle/data.json', 'r') as f:
    data = json.load(f)
    for key,value in data.items():
        print(key,":",value)

name : ACME
shares : 100
price : 542.23

You can also use it to save a list or tuple.

More sophisticated techniques can be found at the official website
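To see how the supported types behave in a round trip, the sketch below uses json.dumps and json.loads on a small, made-up dictionary. The tuple comes back as a list and the integer key comes back as a string.

```python
import json

original = {"point": (1, 2), 1: "one", "flag": None}

text = json.dumps(original, indent=2)  # indent makes the text easier to read
restored = json.loads(text)

print(text)
print(restored)
```

If you need tuples or non-string keys back, you have to convert them yourself after loading.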

Differences between JSON and Pickle

JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format.

JSON is human-readable, while pickle is not.

JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific.

JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs).
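The difference in supported types can be seen with a set: json raises TypeError for it, while pickle handles it directly. The function encode_set below is a hypothetical helper, passed through the "default" parameter of json.dumps, that converts sets into sorted lists.

```python
import json
import pickle

fruits = {"Apple", "Pear", "Banana"}

try:
    json.dumps(fruits)
    json_ok = True
except TypeError:  # sets are not a JSON type
    json_ok = False

def encode_set(obj):
    """Hypothetical helper: turn a set into a sorted list for JSON."""
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"cannot serialize {type(obj).__name__}")

as_json = json.dumps(fruits, default=encode_set)
from_pickle = pickle.loads(pickle.dumps(fruits))

print(json_ok, as_json, from_pickle == fruits)
```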

Python & Numpy¶

The NumPy module, part of the SciPy ecosystem, is widely used in scientific computing. Along with Matplotlib, it will be the most important module we learn.

Now let's see what we can do with NumPy.

In [20]:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:

data=np.load('./numpy/data.npy')
psi6 = np.load('./numpy/psi6.npy')
plt.figure(figsize=(10,10))
ax = plt.gca()
ax.set_aspect('equal')
ax.scatter(data[:,0],data[:,1],s=5,c=psi6,cmap=plt.cm.rainbow_r)
# plt.savefig('psi6.png',dpi=200)
plt.show()

The basic object in NumPy is numpy.ndarray, an N-dimensional array. NumPy already implements the common matrix operations on ndarray (e.g. dot, transpose, etc.), which makes it really convenient to use.

One may find their website here. A MatLab user may find this tutorial useful.

Let's see how to save an array in NumPy.

NumPy provides many useful functions, both for its own array formats and for other file types.

(First import numpy as np)

np.save np.load

np.savez np.savez_compressed

np.savetxt np.loadtxt

One can find more info here

Save a single array into a 'npy' file.

In [22]:

# create a new array
m = np.random.rand(100,2)
np.save('./numpy/m.npy',m)
print(m[0:5,:])

[[0.63747681 0.27082748]
 [0.73047444 0.7092957 ]
 [0.99831309 0.4086804 ]
 [0.28199021 0.31725638]
 [0.10055243 0.81826568]]

In [23]:

l = np.load('./numpy/m.npy')
print(l[0:5,:])

[[0.63747681 0.27082748]
 [0.73047444 0.7092957 ]
 [0.99831309 0.4086804 ]
 [0.28199021 0.31725638]
 [0.10055243 0.81826568]]

Note that np.load does not leave a file open here: it reads the 'npy' file and returns an array object, so there is nothing to close.

Save multiple arrays into a 'npz' file.

In [24]:

np.savez('./numpy/ml.npz',x=m,y=l)

In [25]:

n = np.load('./numpy/ml.npz')
print(n)
print((n['x']-n['y'])[0:5,:])

<numpy.lib.npyio.NpzFile object at 0x0000014DA3B039E8>
[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]

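Unlike a 'npy' file, a 'npz' file is loaded lazily: np.load returns an NpzFile object that keeps the file open, so close it with n.close() (or use a with statement) once you are done.

The np.savez_compressed function saves the file in compressed form. It takes less disk space, and the file can still be loaded with the load function.

The loadtxt function is very handy when dealing with text files.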
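A minimal sketch of saving with np.savez_compressed and loading the result inside a with statement, so the 'npz' file is closed automatically. The small array and the temporary directory are assumptions used only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np

m = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ml.npz")
    np.savez_compressed(path, x=m, y=2 * m)

    # The NpzFile object works as a context manager and is closed on exit.
    with np.load(path) as npz:
        names = sorted(npz.files)  # names of the stored arrays
        x = npz["x"]
        y = npz["y"]

print(names)
print(y - 2 * x)
```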

In [26]:

data = np.loadtxt('./numpy/loadtxt_example.txt',skiprows=2,usecols=(0,4))
plt.figure(figsize=(7,7))
ax = plt.gca()
ax.plot(data[:,0],data[:,1])
ax.set_xlabel("Timestep")
ax.set_ylabel("Total energy")
plt.show()

One may find doc for savetxt function here.
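As a rough sketch of savetxt, the example below writes a small made-up table to a text file with a header line and reads it back with loadtxt. The column names "step" and "energy" are illustrative assumptions. savetxt prefixes the header with "# ", and loadtxt skips such comment lines by default.

```python
import os
import tempfile

import numpy as np

table = np.array([[0, 1.5], [1, 2.25], [2, 3.125]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "table.txt")
    # fmt controls the number format, delimiter the column separator.
    np.savetxt(path, table, fmt="%.4f", delimiter=",", header="step,energy")

    with open(path, "r") as f:
        first_line = f.readline().rstrip("\n")

    loaded = np.loadtxt(path, delimiter=",")

print(first_line)
print(loaded)
```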

Python & CSV¶

Although Python contains a csv module in its standard library, it is relatively low level and inconvenient to use. Here we introduce Pandas, which is part of the SciPy ecosystem.
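For comparison, this is roughly what reading a CSV looks like with the standard csv module. The two data rows are copied from the UK rainfall table used in this section, and io.StringIO stands in for a real file. Every value arrives as a string, so type conversion is left to you.

```python
import csv
import io

raw = "Water Year,Rain (mm) Oct-Sep\n1980/81,1182\n1981/82,1098\n"

reader = csv.DictReader(io.StringIO(raw))  # one dict per row, keyed by header
rows = list(reader)

# Values are strings; convert them by hand.
rain = [int(row["Rain (mm) Oct-Sep"]) for row in rows]

print(rows[0]["Water Year"], rain)
```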

official website here

The name Pandas is derived from "panel data" and is also a play on "Python data analysis".

Reference here

Now let's see the basic part of the module.

This is the standard way of importing pandas.

In [27]:

import pandas as pd

In [28]:

df = pd.read_csv('./csv/uk_rain_2014.csv', header=0)

df.head(n) gives you the first n rows of the dataframe.

In [29]:

df.head(5)

Out[29]:

	Water Year	Rain (mm) Oct-Sep	Outflow (m3/s) Oct-Sep	Rain (mm) Dec-Feb	Outflow (m3/s) Dec-Feb	Rain (mm) Jun-Aug	Outflow (m3/s) Jun-Aug
0	1980/81	1182	5408	292	7248	174	2212
1	1981/82	1098	5112	257	7316	242	1936
2	1982/83	1156	5701	330	8567	124	1802
3	1983/84	993	4265	391	8905	141	1078
4	1984/85	1182	5364	217	5813	343	4313

df.tail(n) gives you the last n rows of the dataframe.

In [30]:

df.tail(5)

Out[30]:

	Water Year	Rain (mm) Oct-Sep	Outflow (m3/s) Oct-Sep	Rain (mm) Dec-Feb	Outflow (m3/s) Dec-Feb	Rain (mm) Jun-Aug	Outflow (m3/s) Jun-Aug
28	2008/09	1139	4941	268	6690	323	3189
29	2009/10	1103	4738	255	6435	244	1958
30	2010/11	1053	4521	265	6593	267	2885
31	2011/12	1285	5500	339	7630	379	5261
32	2012/13	1090	5329	350	9615	187	1797

You can change the column labels by using df.columns

In [31]:

df.columns = ['Year','rain_octsep', 'outflow_octsep',
              'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']

df.head(5)

Out[31]:

	Year	rain_octsep	outflow_octsep	rain_decfeb	outflow_decfeb	rain_junaug	outflow_junaug
0	1980/81	1182	5408	292	7248	174	2212
1	1981/82	1098	5112	257	7316	242	1936
2	1982/83	1156	5701	330	8567	124	1802
3	1983/84	993	4265	391	8905	141	1078
4	1984/85	1182	5364	217	5813	343	4313

You can change the display format of the data and get descriptive statistics:

In [32]:

pd.options.display.float_format = '{:,.3f}'.format
df.describe()

Out[32]:

	rain_octsep	outflow_octsep	rain_decfeb	outflow_decfeb	rain_junaug	outflow_junaug
count	33.000	33.000	33.000	33.000	33.000	33.000
mean	1,129.000	5,019.182	325.364	7,926.545	237.485	2,439.758
std	101.900	658.588	69.995	1,692.800	66.168	1,025.914
min	856.000	3,479.000	206.000	4,578.000	103.000	1,078.000
25%	1,053.000	4,506.000	268.000	6,690.000	193.000	1,797.000
50%	1,139.000	5,112.000	309.000	7,630.000	229.000	2,142.000
75%	1,182.000	5,497.000	360.000	8,905.000	280.000	2,959.000
max	1,387.000	6,391.000	484.000	11,486.000	379.000	5,261.000

And you can access the columns in two ways:

In [33]:

df['rain_octsep'][0:5]

Out[33]:

0    1182
1    1098
2    1156
3     993
4    1182
Name: rain_octsep, dtype: int64

In [34]:

df.rain_octsep[28:]

Out[34]:

28    1139
29    1103
30    1053
31    1285
32    1090
Name: rain_octsep, dtype: int64

It is also easy to filter rows with boolean conditions.

In [35]:

df[df.rain_octsep < 1000]

Out[35]:

	Year	rain_octsep	outflow_octsep	rain_decfeb	outflow_decfeb	rain_junaug	outflow_junaug
3	1983/84	993	4265	391	8905	141	1078
8	1988/89	976	4330	309	6465	200	1440
15	1995/96	856	3479	245	5515	172	1439

In [36]:

df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)]

Out[36]:

	Year	rain_octsep	outflow_octsep	rain_decfeb	outflow_decfeb	rain_junaug	outflow_junaug
15	1995/96	856	3479	245	5515	172	1439
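A small sketch of how such boolean filters are built. The four rows are copied from the tables above, and the names dry and low_flow are chosen only for illustration. Each comparison gives a boolean Series (a mask), and masks are combined with & (and), | (or) and ~ (not). The parentheses in In [36] are needed because & binds more tightly than <.

```python
import pandas as pd

sample = pd.DataFrame({
    "Year": ["1980/81", "1983/84", "1988/89", "1995/96"],
    "rain_octsep": [1182, 993, 976, 856],
    "outflow_octsep": [5408, 4265, 4330, 3479],
})

dry = sample["rain_octsep"] < 1000          # boolean mask, one value per row
low_flow = sample["outflow_octsep"] < 4000

both = sample[dry & low_flow]    # rows where both conditions hold
either = sample[dry | low_flow]  # rows where at least one holds
not_dry = sample[~dry]           # rows where the first condition fails

print(both)
```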

If we need to search for a string in the dataframe:

In [37]:

df[df.Year.str.startswith('199')]

Out[37]:

	Year	rain_octsep	outflow_octsep	rain_decfeb	outflow_decfeb	rain_junaug	outflow_junaug
10	1990/91	1022	4418	305	7120	216	1923
11	1991/92	1151	4506	246	5493	280	2118
12	1992/93	1130	5246	308	8751	219	2551
13	1993/94	1162	5583	422	10109	193	1638
14	1994/95	1110	5370	484	11486	103	1231
15	1995/96	856	3479	245	5515	172	1439
16	1996/97	1047	4019	258	5770	256	2102
17	1997/98	1169	4953	341	7747	285	3206
18	1998/99	1268	5824	360	8771	225	2240
19	1999/00	1204	5665	417	10021	197	2166

Use df.loc to select specific rows and columns by label.

In [38]:

df.loc[11,['Year','rain_octsep']]

Out[38]:

Year           1991/92
rain_octsep       1151
Name: 11, dtype: object

Now I want to compare the rainfall of Hong Kong (HK) and the UK.

In [39]:

hkdf = pd.read_csv('./csv/hk_weather_data.csv',header=0)

In [40]:

hkdf.head(5)

Out[40]:

	Year	Avg Pressure(100P)	Max Temp	Avg Temp(H)	Avg Temp	Avg Temp(L)	Min Temp	Rainfall(mm)	sunshine(hr)
0	1961	1,012.600	34.200	25.600	22.900	20.800	7.300	2,232.400	1,981.600
1	1962	1,013.200	35.500	25.800	22.700	20.400	6.000	1,741.000	2,395.400
2	1963	1,013.400	35.600	26.500	23.300	20.900	7.100	901.100	2,469.700
3	1964	1,012.700	33.900	25.700	22.900	20.500	7.000	2,432.100	2,029.600
4	1965	1,012.800	33.400	25.900	23.100	20.900	7.300	2,352.600	1,990.700

Combine the two datasets and plot the rainfall.

First pick the columns we need.

In [41]:

hk_rainfall = hkdf.loc[:,['Year','Rainfall(mm)']]
hk_rainfall = hk_rainfall[(hk_rainfall.Year<=2012)&(hk_rainfall.Year>=1980)]
hk_rainfall.columns = ['Year','rainfall']
uk_rainfall = df.loc[:,['Year','rain_octsep']]

The 'Year' column of uk_rainfall is of string type (e.g. '1980/81'), so we need to convert it to an integer by keeping the first four characters.

In [42]:

def str2int(year):
    year = int(year[:4])
    return year

In [43]:

print(type(uk_rainfall.Year[0]))
uk_rainfall.Year = uk_rainfall.Year.apply(str2int)

<class 'str'>

In [44]:

type(uk_rainfall.Year[0])

Out[44]:

numpy.int64
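The same conversion can also be written without a helper function by using the vectorized string accessor. This is an alternative sketch on a small made-up Series, not the approach used in the cells above.

```python
import pandas as pd

years = pd.Series(["1980/81", "1981/82", "1982/83"], name="Year")

# .str[:4] slices every string; .astype(int) converts the result.
as_int = years.str[:4].astype(int)

print(as_int.tolist())
```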

Now we merge the two dataframes.

In [45]:

hk_uk_data = uk_rainfall.merge(hk_rainfall,on='Year')

In [46]:

hk_uk_data.head(5)

Out[46]:

	Year	rain_octsep	rainfall
0	1980	1182	1,710.600
1	1981	1098	1,659.500
2	1982	1156	3,247.500
3	1983	993	2,893.800
4	1984	1182	2,017.000
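By default merge performs an inner join, so only years present in both dataframes are kept. The sketch below uses a few values from the tables above to show the difference from how='outer', which keeps every year and fills the missing values with NaN.

```python
import pandas as pd

uk = pd.DataFrame({"Year": [1980, 1981, 1982],
                   "rain_octsep": [1182, 1098, 1156]})
hk = pd.DataFrame({"Year": [1981, 1982, 1983],
                   "rainfall": [1659.5, 3247.5, 2893.8]})

inner = uk.merge(hk, on="Year")               # default: how='inner'
outer = uk.merge(hk, on="Year", how="outer")  # keep years from both sides

print(inner)
print(outer)
```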

And we can get a plot easily.

In [47]:

hk_uk_data.plot(x='Year',y=['rain_octsep','rainfall'],figsize=(7,7))
plt.show()

Finally we save the new dataframe to a csv file.

In [48]:

hk_uk_data.to_csv('./csv/hk_uk_rain.csv')
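Note that to_csv also writes the row index as an unnamed first column by default. A minimal sketch with a made-up two-row dataframe, where to_csv is called without a path so that it returns the text instead of writing a file:

```python
import io

import pandas as pd

frame = pd.DataFrame({"Year": [1980, 1981], "rainfall": [1710.6, 1659.5]})

with_index = frame.to_csv()                # index written as first column
without_index = frame.to_csv(index=False)  # only the data columns

reloaded = pd.read_csv(io.StringIO(without_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Passing index=False avoids an extra "Unnamed: 0" column when the file is read back with read_csv.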

Python & MatLab¶

MatLab is an important tool for physics students.

The io module from scipy provides many useful APIs for reading data, including an API for MatLab files. Conversely, you can also use the Python API from within MatLab.

Let's see.

One may find reference here

We will mainly use three functions:

loadmat: load a MatLab format file.

savemat: save a file in MatLab format.

whosmat: list what's inside a mat file.

In [49]:

import scipy.io as sio
# import numpy as np
# import matplotlib.pyplot as plt
# %matplotlib inline

Make sure you have imported all the modules (uncomment the lines above if numpy and matplotlib are not imported yet).

Let's see whosmat first:

In [50]:

sio.whosmat('./mat/voro.mat')

Out[50]:

[('vx', (2, 579), 'double'),
 ('vy', (2, 579), 'double'),
 ('x', (200, 1), 'double'),
 ('y', (200, 1), 'double')]

Now load the mat file.

In [51]:

mat = sio.loadmat('./mat/voro.mat')
x = mat['x']
y = mat['y']
vx = mat['vx']
vy = mat['vy']

Get a figure using matplotlib.

In [52]:

plt.figure(figsize=(10,10))
plt.gca()
plt.plot(x,y,'bo',vx,vy)
plt.xlim(0,1)
plt.ylim(0,1)
plt.show()

Finally, we save a random array in MatLab format.

In [53]:

sio.savemat('./mat/np_rand.mat',{'position':np.random.rand(100,2)})

In [54]:

sio.whosmat('./mat/np_rand.mat')

Out[54]:

[('position', (100, 2), 'double')]
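A short round-trip sketch with savemat and loadmat. The file name demo.mat and the arrays are made up for illustration. Two details are worth knowing: loadmat returns a dict that also contains metadata keys starting with "__", and a 1-D array comes back as a 2-D row, since MatLab has no 1-D arrays.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

position = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.mat")  # hypothetical file name
    sio.savemat(path, {"position": position, "steps": np.array([1, 2, 3])})
    loaded = sio.loadmat(path)

# Skip metadata keys such as '__header__' and '__version__'.
user_keys = sorted(k for k in loaded if not k.startswith("__"))
steps_shape = loaded["steps"].shape

print(user_keys, steps_shape)
```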