
Writing Beautiful, Idiomatic Python

Learn how to write beautiful, idiomatic Python code that will improve readability and performance.

Purpose

These are some idiomatic, Pythonic ways to write code, based on a video by Raymond Hettinger. Under each major section, you will see two sub-sections: Don't do this and Do this. Code under Don't do this is discouraged and, to borrow Jeff Knupp's adjective, harmful. Code under Do this is the encouraged, beautiful and idiomatic way to write the same thing. As you will see, however, some examples are included for their speed rather than their style.

Additional idiomatic Python syntax has been added, while some examples from the original video were left out (we will try to find alternative working examples).

Looping over a range of numbers

The key is to avoid writing out a list of numbers by hand. Use the range function instead; it makes your code more concise and is more memory efficient because it produces values on demand rather than storing them all.

Don't do this

In [1]:
for i in [0, 1, 2, 3, 4, 5]:
    print(i ** 2)
0
1
4
9
16
25

Do this

In [2]:
for i in range(6):
    print(i ** 2)
0
1
4
9
16
25

Looping over a collection

Avoid using an index to access the elements of a list when you only need the elements themselves.

Don't do this

In [3]:
names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']

for i in range(len(names)):
    print(names[i])
john
jane
jeremy
janice
joyce
jonathan

Do this

In [4]:
for name in names:
    print(name)
john
jane
jeremy
janice
joyce
jonathan

Looping backwards

The key here is to avoid the awkward -1 values and nested function calls (look at how many pairs of parentheses are involved). Use reversed to make your code more elegant.

Don't do this

In [5]:
for i in range(len(names) - 1, -1, -1):
    print(names[i])
jonathan
joyce
janice
jeremy
jane
john

Do this

In [6]:
for name in reversed(names):
    print(name)
jonathan
joyce
janice
jeremy
jane
john

Looping over a collection and indices

The key here is to use enumerate which will return the index with the element.

Don't do this

In [7]:
for i in range(len(names)):
    print(i, names[i])
0 john
1 jane
2 jeremy
3 janice
4 joyce
5 jonathan

Do this

In [8]:
for i, name in enumerate(names):
    print(i, name)
0 john
1 jane
2 jeremy
3 janice
4 joyce
5 jonathan

Looping over two collections

The key is to avoid accessing elements by indices and having to work out which list is shorter. Use zip to iterate over the two lists; the iteration stops at the end of the shorter list.

Don't do this

In [9]:
names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']
colors = ['red', 'green', 'blue', 'orange', 'purple', 'pink']

n = min(len(names), len(colors))
for i in range(n):
    print(names[i], colors[i])
john red
jane green
jeremy blue
janice orange
joyce purple
jonathan pink

Do this

In [10]:
for name, color in zip(names, colors):
    print(name, color)
john red
jane green
jeremy blue
janice orange
joyce purple
jonathan pink
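
The two lists above happen to be the same length. As a minimal sketch of the truncation behavior, the example below pairs a longer list with a shorter one; the names some_names and few_colors are made up for illustration. itertools.zip_longest is the standard-library alternative when the leftover items should be kept.

```python
from itertools import zip_longest

some_names = ['john', 'jane', 'jeremy']
few_colors = ['red', 'green']  # hypothetical shorter list

# zip stops as soon as the shorter input is exhausted
pairs = list(zip(some_names, few_colors))
print(pairs)

# zip_longest keeps going and pads the missing values with fillvalue
padded = list(zip_longest(some_names, few_colors, fillvalue=None))
print(padded)
```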

Flattening data

Here, we need to flatten a list of lists into one list. Notice that the second discouraged approach, which uses extend, is actually the fastest (faster than both encouraged approaches), even though setting up x and looping spans 3 lines. This example is debatable: it trades idiomatic Python for speed.

Don't do this

In [11]:
data = [list(range(10000000)) for _ in range(10)]
In [12]:
%%time
x = []
for arr in data:
    for val in arr:
        x.append(val)
len(x)
CPU times: user 8.37 s, sys: 271 ms, total: 8.64 s
Wall time: 8.69 s
Out[12]:
100000000
In [13]:
%%time
x = []
for arr in data:
    x.extend(arr)
len(x)
CPU times: user 867 ms, sys: 328 ms, total: 1.19 s
Wall time: 1.2 s
Out[13]:
100000000

Do this

In [14]:
%%time
x = [val for arr in data for val in arr]
len(x)
CPU times: user 2.21 s, sys: 297 ms, total: 2.5 s
Wall time: 2.52 s
Out[14]:
100000000
In [15]:
%%time

import itertools

x = itertools.chain.from_iterable(data)
len(list(x))
CPU times: user 1.43 s, sys: 317 ms, total: 1.75 s
Wall time: 1.75 s
Out[15]:
100000000

defaultdict

The key is to avoid checking whether a key exists in the dictionary and initializing its value when it does not. A defaultdict creates the value for a missing key on first access. The same grouping can also be done with itertools.groupby, which requires the input to be sorted by the grouping key first.

Don't do this

In [16]:
names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']

d = {}
for name in names:
    key = len(name)
    if key not in d:
        d[key] = []
    d[key].append(name)
    
print(d)
{4: ['john', 'jane'], 6: ['jeremy', 'janice'], 5: ['joyce'], 8: ['jonathan']}

Do this

In [17]:
from collections import defaultdict

d = defaultdict(list)
for name in names:
    key = len(name)
    d[key].append(name)

print(d)
defaultdict(<class 'list'>, {4: ['john', 'jane'], 6: ['jeremy', 'janice'], 5: ['joyce'], 8: ['jonathan']})
In [18]:
import itertools

key = len  # group names by their length
d = {k: list(g) for k, g in itertools.groupby(sorted(names, key=key), key)}
print(d)
{4: ['john', 'jane'], 5: ['joyce'], 6: ['jeremy', 'janice'], 8: ['jonathan']}

Map, filter, reduce

Don't do this

In [19]:
data = [i for i in range(10000000)]
In [20]:
%%time
x = []
for val in data:
    x.append(val * 2)
    
y = []
for val in x:
    if val % 2 == 0:
        y.append(val)
        
z = 0
for val in y:
    z = z + val
    
print(z)
99999990000000
CPU times: user 3.23 s, sys: 144 ms, total: 3.38 s
Wall time: 3.38 s

Do this

In [21]:
%%time
from functools import reduce

x = map(lambda val: val * 2, data)
x = filter(lambda val: val % 2 == 0, x)
x = reduce(lambda val1, val2: val1 + val2, x)

print(x)
99999990000000
CPU times: user 2.41 s, sys: 13.4 ms, total: 2.42 s
Wall time: 2.42 s

Dictionary default values

Use the dictionary .get method with a supplied default value.

Don't do this

In [22]:
d = {
    'username': 'jdoe'
}

is_authorized = False
if 'auth_token' in d:
    is_authorized = True
    
print(is_authorized)
False

Do this

In [23]:
is_authorized = d.get('auth_token', False)

print(is_authorized)
False
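
One difference is worth noting. The two approaches agree here only because the key is absent: when the key is present, .get returns the stored value rather than True. The sketch below uses a made-up token value to show this, along with two ways of getting a strict boolean.

```python
user = {'username': 'jdoe', 'auth_token': 'abc123'}  # hypothetical token value

token = user.get('auth_token', False)
print(token)  # the stored value, not True

is_authorized = 'auth_token' in user       # strict boolean: is the key present?
has_token = bool(user.get('auth_token'))   # strict boolean: present and truthy?
print(is_authorized, has_token)
```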

ChainMap

The key is to avoid copying and updating dictionaries just to override values; ChainMap takes care of this. Notice how the discouraged approach copies d1 and then updates it with d2, while ChainMap lists d2 first, followed by d1, because earlier mappings take precedence on lookup. This argument order can feel backwards at first.

Don't do this

In [24]:
d1 = {'color': 'red', 'user': 'jdoe'}
d2 = {'color': 'blue', 'first_name': 'john', 'last_name': 'doe'}

d = d1.copy()
d.update(d2)

for k, v in d.items():
    print(k, v)
color blue
user jdoe
first_name john
last_name doe

Do this

In [25]:
from collections import ChainMap

d1 = {'color': 'red', 'user': 'jdoe'}
d2 = {'color': 'blue', 'first_name': 'john', 'last_name': 'doe'}

d = ChainMap(d2, d1)
for k, v in d.items():
    print(k, v)
color blue
user jdoe
first_name john
last_name doe
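
The sketch below shows two behaviors that set ChainMap apart from the copy-and-update approach: nothing is copied, so lookups see later changes to the underlying dictionaries, and writes go to the first mapping only. The names defaults, overrides and settings are illustrative.

```python
from collections import ChainMap

defaults = {'color': 'red', 'user': 'jdoe'}
overrides = {'color': 'blue'}

settings = ChainMap(overrides, defaults)  # the first mapping is searched first
print(settings['color'])

# No copy is made, so a change to an underlying dict is visible immediately
defaults['user'] = 'asmith'
print(settings['user'])

# Writes affect only the first mapping
settings['theme'] = 'dark'
print(overrides)
print(defaults)
```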

Counter

Like defaultdict(int), Counter treats a missing key as having a count of 0. Note how we get rid of the check for whether a key exists. Counter can also count the items of an iterable directly, as the last example shows.

Don't do this

In [26]:
names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']

d = {}
for name in names:
    key = len(name)
    if key not in d:
        d[key] = 0
    d[key] = d[key] + 1
    
print(d)
{4: 2, 6: 2, 5: 1, 8: 1}

Do this

In [27]:
names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']

d = defaultdict(int)

for name in names:
    key = len(name)
    d[key] = d[key] + 1

print(d)
defaultdict(<class 'int'>, {4: 2, 6: 2, 5: 1, 8: 1})
In [28]:
from collections import Counter

d = Counter()
for name in names:
    key = len(name)
    d[key] = d[key] + 1
    
print(d)
Counter({4: 2, 6: 2, 5: 1, 8: 1})
In [29]:
d = Counter(map(len, names))
print(d)
Counter({4: 2, 6: 2, 5: 1, 8: 1})
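
Counter also offers helpers that a plain dictionary lacks. The sketch below uses the same names list to show most_common, and to show that looking up a missing key reports 0 without adding the key. The variable length_counts is an illustrative name.

```python
from collections import Counter

names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']
length_counts = Counter(map(len, names))

# The two most frequent name lengths, as (length, count) pairs
print(length_counts.most_common(2))

# A missing key reports a count of 0 and is not inserted
print(length_counts[10])
print(10 in length_counts)
```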

Ignoring tuples

When unpacking a tuple, avoid naming a variable for a value you will not use. By convention, an underscore signals that the value is ignored.

Don't do this

In [30]:
def get_info():
    return 'John', 'Doe', 28

fname, lname, tmp = get_info()
print(fname, lname)
John Doe

Do this

In [31]:
def get_info():
    return 'John', 'Doe', 28

fname, lname, _ = get_info()
print(fname, lname)
John Doe
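
When several trailing values are unwanted, a starred underscore can collect them all. This is a small sketch; get_record is a made-up function that returns more fields than the caller needs.

```python
def get_record():
    # Hypothetical function returning five fields
    return 'John', 'Doe', 28, 'Gainesville', 'VA'

fname, lname, *_ = get_record()  # *_ absorbs every remaining value
print(fname, lname)
```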

namedtuple

The key here is to avoid accessing tuple elements by indices, since the indices carry no meaning. Instead, use namedtuple and access the elements by a meaningful name.

Don't do this

In [32]:
scores = [80, 90, 95, 88, 99, 93]

students = [(name, score) for name, score in zip(names, scores)]
for student in students:
    print('{} {}'.format(student[0], student[1]))
john 80
jane 90
jeremy 95
janice 88
joyce 99
jonathan 93

Do this

In [33]:
from collections import namedtuple

Student = namedtuple('Student', 'name score')

students = [Student(name, score) for name, score in zip(names, scores)]
for student in students:
    print('{} {}'.format(student.name, student.score))
john 80
jane 90
jeremy 95
janice 88
joyce 99
jonathan 93

Unpacking sequences

The key is to avoid long code that obscures a single intention. In the discouraged approach, we receive a tuple, store it in s, and then use a separate line to pull out each value. In the encouraged approach, the tuple is unpacked neatly in one line.

Don't do this

In [34]:
def get_student():
    return 'john', 'doe', 88

s = get_student()
first_name = s[0]
last_name = s[1]
score = s[2]

print(first_name, last_name, score)
john doe 88

Do this

In [35]:
first_name, last_name, score = get_student()

print(first_name, last_name, score)
john doe 88

String concatenation

The key here is to avoid writing too much code just to concatenate strings. In the discouraged approach, note how we have to add logic to decide when to append a comma. In the encouraged approach, the for loop is gone and so is the comma logic.

Don't do this

In [36]:
s = ''
for i, name in enumerate(names):
    s += name
    if i < len(names) - 1:
        s += ', '

s
Out[36]:
'john, jane, jeremy, janice, joyce, jonathan'

Do this

In [37]:
', '.join(names)
Out[37]:
'john, jane, jeremy, janice, joyce, jonathan'

Updating sequences

There is not much difference between the discouraged and encouraged approaches here. However, removing an element by value rather than by index states the intent more clearly.

Don't do this

In [38]:
names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']

del names[0]
print(names)

names.pop(0)
print(names)

names.insert(0, 'jerry')
print(names)
['jane', 'jeremy', 'janice', 'joyce', 'jonathan']
['jeremy', 'janice', 'joyce', 'jonathan']
['jerry', 'jeremy', 'janice', 'joyce', 'jonathan']

Do this

In [39]:
names = ['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan']

names.remove('john')
print(names)

names.pop(0)
print(names)

names.insert(0, 'jerry')
print(names)
['jane', 'jeremy', 'janice', 'joyce', 'jonathan']
['jeremy', 'janice', 'joyce', 'jonathan']
['jerry', 'jeremy', 'janice', 'joyce', 'jonathan']
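
When most updates happen at the front of the sequence, collections.deque is worth considering. list.pop(0) and list.insert(0, ...) shift every remaining element, while popleft and appendleft on a deque run in constant time. Here is a minimal sketch with the same names; name_queue is an illustrative variable.

```python
from collections import deque

name_queue = deque(['john', 'jane', 'jeremy', 'janice', 'joyce', 'jonathan'])

name_queue.remove('john')       # remove by value, as with a list
first = name_queue.popleft()    # O(1) removal from the front
name_queue.appendleft('jerry')  # O(1) insertion at the front

print(first)
print(list(name_queue))
```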

decorators

The key here is to use the lru_cache decorator to cache the results of pure functions (the same arguments always give the same result, with no side effects), especially if they are expensive to call. Note how each call to add takes about 600 milliseconds. With the lru_cache decorator, the first call takes just as long, but subsequent calls with the same argument take microseconds.

Don't do this

In [40]:
def add(n):
    return sum([i for i in range(n)])
In [41]:
%%time
add(10000000)
CPU times: user 476 ms, sys: 133 ms, total: 609 ms
Wall time: 609 ms
Out[41]:
49999995000000
In [42]:
%%time
add(10000000)
CPU times: user 477 ms, sys: 130 ms, total: 606 ms
Wall time: 606 ms
Out[42]:
49999995000000

Do this

In [43]:
from functools import lru_cache

@lru_cache(maxsize=32)
def add(n):
    return sum([i for i in range(n)])
In [44]:
%%time
add(10000000)
CPU times: user 489 ms, sys: 129 ms, total: 618 ms
Wall time: 618 ms
Out[44]:
49999995000000
In [45]:
%%time
add(10000000)
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs
Out[45]:
49999995000000
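
To see what the cache is doing, the decorated function exposes cache_info() and cache_clear(). The sketch below uses a smaller n than the timings above so that it runs quickly; slow_sum is an illustrative name.

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def slow_sum(n):
    # Stand-in for an expensive, deterministic computation
    return sum(range(n))

slow_sum(100000)  # first call: computed (a miss)
slow_sum(100000)  # second call: served from the cache (a hit)
info = slow_sum.cache_info()
print(info)

slow_sum.cache_clear()  # drop all cached results
print(slow_sum.cache_info())
```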

Reading a file

The key here is to use a context manager to manage resources.

Don't do this

In [46]:
f = open('README.md')
try:
    data = f.read()
    print(len(data))
finally:
    f.close()
1499

Do this

In [47]:
with open('README.md') as f:
    data = f.read()
    print(len(data))
1499

Deleting a file

The key here is to avoid the try/except code and favor a context manager approach.

Don't do this

In [48]:
import os

try:
    os.remove('test.tmp')
except OSError:
    pass

Do this

In [49]:
from contextlib import suppress

with suppress(OSError):
    os.remove('test.tmp')
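
OSError covers many failures, including permission errors. If the only case to ignore is a file that does not exist, a narrower exception keeps other problems visible. In the sketch below, the temporary directory and the missing_path variable are assumptions made so the example is safe to run anywhere.

```python
import os
import tempfile
from contextlib import suppress

# Build a path that is known not to exist (illustrative file name)
missing_path = os.path.join(tempfile.mkdtemp(), 'test.tmp')

# Only a missing file is ignored; a PermissionError would still propagate
with suppress(FileNotFoundError):
    os.remove(missing_path)

print(os.path.exists(missing_path))
```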

List vs generator comprehensions

The key here is to avoid looping over elements and storing results by hand. Instead, use a list comprehension or a generator expression. A list comprehension (note the brackets) evaluates eagerly and returns a list, while a generator expression (note the parentheses) evaluates lazily, producing values only as they are consumed.

Don't do this

In [50]:
results = []
for i in range(10):
    s = i ** 2
    results.append(s)
total = sum(results)
print(total)
285

Do this

In [51]:
total = sum([i ** 2 for i in range(10)])
print([i ** 2 for i in range(10)])
print(total)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
285
In [52]:
total = sum((i ** 2 for i in range(10)))
print((i ** 2 for i in range(10)))
print(total)
<generator object <genexpr> at 0x11d32c048>
285
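
A rough way to see the difference in memory use is sys.getsizeof, which reports the size of the container object itself (not of the integers it refers to). Exact numbers vary by Python version and platform, so none are shown here. The sketch also shows that a generator can be consumed only once.

```python
import sys

squares_list = [i ** 2 for i in range(100000)]  # all values built up front
squares_gen = (i ** 2 for i in range(100000))   # values produced on demand

print(sys.getsizeof(squares_list))
print(sys.getsizeof(squares_gen))

# A generator is exhausted after one pass
print(sum(squares_gen))
print(sum(squares_gen))  # nothing left, so this sums an empty sequence
```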

Filtering lists

Use a list comprehension to filter out values, not a for loop.

Don't do this

In [53]:
nums = []
for i in range(100):
    if i % 2 == 0:
        nums.append(i)
print(nums)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]

Do this

In [54]:
nums = [i for i in range(100) if i % 2 == 0]
print(nums)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]

Clarify function calls with keyword arguments

When passing arguments to a function, try to associate the values with the parameter names.

Don't do this

In [55]:
def format_information(first_name, last_name, age):
    return '{} {} is {} years old'.format(first_name, last_name, age)

format_information('John', 'Doe', 28)
Out[55]:
'John Doe is 28 years old'

Do this

In [56]:
format_information(first_name='John', last_name='Doe', age=28)
Out[56]:
'John Doe is 28 years old'
In [57]:
format_information(**{
    'first_name': 'John',
    'last_name': 'Doe',
    'age': 28
})
Out[57]:
'John Doe is 28 years old'
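
A function can also require keyword arguments from its callers by placing a bare * in the parameter list. The variant below is a sketch; format_information_kw is a made-up name, chosen so it does not replace the function above.

```python
def format_information_kw(*, first_name, last_name, age):
    # Parameters after the bare * can only be passed by keyword
    return '{} {} is {} years old'.format(first_name, last_name, age)

message = format_information_kw(first_name='John', last_name='Doe', age=28)
print(message)

try:
    format_information_kw('John', 'Doe', 28)  # a positional call is rejected
except TypeError as err:
    print('TypeError:', err)
```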

Simultaneous state updates

The key here is to make your code more concise and avoid nuisance variables. In the discouraged approach, you create temporary variables so that each new value is computed from the old x and y. In the encouraged approach, the right-hand side is evaluated in full before any assignment happens, so all updates occur in one coherent line.

Don't do this

In [58]:
def update_x(x):
    return x + 1

def update_y(y):
    return y + 1

x = 3
y = 4
dx = 4
dy = 5

tmp_x = x + dx
tmp_y = y + dy
tmp_dx = update_x(x)
tmp_dy = update_y(y)

x = tmp_x
y = tmp_y
dx = tmp_dx
dy = tmp_dy

print(x, y, dx, dy)
7 9 4 5

Do this

In [59]:
x = 3
y = 4
dx = 4
dy = 5

x, y, dx, dy = (x + dx, y + dy, update_x(x), update_y(y))

print(x, y, dx, dy)
7 9 4 5

Single line function declarations

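If you have a one-line function, you can write it as a lambda instead of a def. Be aware that PEP 8 recommends def whenever the function is bound to a name, as it is here; lambda is best suited to short functions passed inline, such as a key argument.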

Don't do this

In [60]:
def add_one(x):
    return x + 1

add_one(3)
Out[60]:
4

Do this

In [61]:
add_one = lambda x: x + 1

add_one(3)
Out[61]:
4

Generator functions

Avoid building a large number of values or objects up front, as they all take up memory at once. Use yield inside a function to produce values as they are needed. Generator functions are more space efficient and, as the timings below suggest, often faster.

Don't do this

In [62]:
%%time
def generate_sequential_numbers(n):
    nums = []
    for i in range(n):
        nums.append(i)
    return nums

sum(generate_sequential_numbers(10000000))
CPU times: user 794 ms, sys: 114 ms, total: 908 ms
Wall time: 907 ms
Out[62]:
49999995000000

Do this

In [63]:
%%time
def generate_sequential_numbers(n):
    for i in range(n):
        yield i

sum(generate_sequential_numbers(10000000))
CPU times: user 470 ms, sys: 1.54 ms, total: 472 ms
Wall time: 471 ms
Out[63]:
49999995000000

Or do this

In [64]:
%%time

generate_sequential_numbers = lambda n: (i for i in range(n))
sum(generate_sequential_numbers(10000000))
CPU times: user 463 ms, sys: 1.55 ms, total: 464 ms
Wall time: 463 ms
Out[64]:
49999995000000

Dictionary comprehension

Here, we want to create two dictionaries: index-to-word i2w and word-to-index w2i. In the discouraged approach, we create two empty dictionaries, loop with enumerate, and set each key-value pair; that takes 5 lines of code. In the encouraged approach, two dictionary comprehensions declare and populate the dictionaries in two lines.

In [65]:
words = ['i', 'like', 'to', 'eat', 'pizza', 'and', 'play', 'tennis']

Don't do this

In [66]:
i2w = {}
w2i = {}
for i, word in enumerate(words):
    i2w[i] = word
    w2i[word] = i
    
print(i2w)
print(w2i)
{0: 'i', 1: 'like', 2: 'to', 3: 'eat', 4: 'pizza', 5: 'and', 6: 'play', 7: 'tennis'}
{'i': 0, 'like': 1, 'to': 2, 'eat': 3, 'pizza': 4, 'and': 5, 'play': 6, 'tennis': 7}

Do this

In [67]:
i2w = {i: word for i, word in enumerate(words)}
w2i = {word: i for i, word in enumerate(words)}

print(i2w)
print(w2i)
{0: 'i', 1: 'like', 2: 'to', 3: 'eat', 4: 'pizza', 5: 'and', 6: 'play', 7: 'tennis'}
{'i': 0, 'like': 1, 'to': 2, 'eat': 3, 'pizza': 4, 'and': 5, 'play': 6, 'tennis': 7}

Set comprehension

A set comprehension avoids the explicit for loop. When no transformation of the elements is needed, set(words) is shorter still.

In [68]:
words = ['i', 'like', 'to', 'eat', 'pizza', 'and', 'play', 'tennis']

Don't do this

In [69]:
vocab = set()
for word in words:
    vocab.add(word)
    
print(vocab)
{'tennis', 'pizza', 'i', 'and', 'to', 'eat', 'play', 'like'}

Do this

In [70]:
vocab = {word for word in words}
print(vocab)
{'tennis', 'pizza', 'i', 'and', 'to', 'eat', 'play', 'like'}

Chained comparison operators

Comparisons joined with and, like the one below, can be written more clearly as a single chained comparison.

Don't do this

In [71]:
x = 10
y = 15
z = 20

if x <= y and y <= z:
    print('hi')
hi

Do this

In [72]:
if x <= y <= z:
    print('hi')
hi

Falsy and truthy

It's enough to use the variable to test for falsy or truthy.

Don't do this

In [73]:
is_male = True

if is_male == True:
    print('is male is true')
is male is true

Do this

In [74]:
if is_male:
    print('is male is true')
is male is true

Ternary operator

Python has no ?: operator, but its conditional expression (value_if_true if condition else value_if_false) serves as the ternary operator.

Don't do this

In [75]:
is_male = True

if is_male:
    gender = 'male'
else:
    gender = 'female'
    
print(gender)
male

Do this

In [76]:
gender = 'male' if is_male else 'female'
print(gender)
male

String interpolation

Note how we have to pass name in twice. If we use named placeholders, we only have to pass it in once. Also note the use of an f-string and Template.

Don't do this

In [77]:
name = 'John'
food = 'pizza'
sport = 'tennis'

sentence = '{} likes to eat {}. {} likes to play {}.'.format(name, food, name, sport)
print(sentence)
John likes to eat pizza. John likes to play tennis.

Do this

In [78]:
name = 'John'
food = 'pizza'
sport = 'tennis'

# variable substitution
sentence = '{name} likes to eat {}. {name} likes to play {}.'.format(food, sport, name=name)
print(sentence)

# f-string
sentence = f'{name} likes to eat {food}. {name} likes to play {sport}.'
print(sentence)

# string template
from string import Template
sentence = Template('$name likes to eat $food. $name likes to play $sport.')
print(sentence.substitute(name=name, food=food, sport=sport))
John likes to eat pizza. John likes to play tennis.
John likes to eat pizza. John likes to play tennis.
John likes to eat pizza. John likes to play tennis.

Don't Repeat Yourself (DRY)

It's easier to do '-'*15 to produce 15 consecutive dashes, than to type them all out.

Don't do this

In [79]:
print('---------------')
---------------

Do this

In [80]:
print('-'*15)
---------------

Double underscores, dunders, __str__

Exploit dunder methods when doing object-oriented programming in Python. In particular, override __str__ to give the object a print-friendly representation.

Don't do this

In [81]:
class Student():
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name
        
student = Student('John', 'Doe')
print(student)
<__main__.Student object at 0x11d334eb8>

Do this

In [82]:
class Student():
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name
        
    def __str__(self):
        return f'{self.first_name} {self.last_name}'
        
student = Student('John', 'Doe')
print(student)
John Doe
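
__str__ is used by print and str(), but containers and the interactive prompt display objects with __repr__. The sketch below defines both; the class name StudentRecord is illustrative, chosen so it does not replace the Student class above.

```python
class StudentRecord:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def __str__(self):
        # Friendly form, used by print() and str()
        return f'{self.first_name} {self.last_name}'

    def __repr__(self):
        # Unambiguous form, used by repr(), containers and the REPL
        return f'StudentRecord({self.first_name!r}, {self.last_name!r})'

student = StudentRecord('John', 'Doe')
print(student)
print([student])
```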

Combinations

In [83]:
symbols = ['A', 'B', 'C', 'D']

Don't do this

In [84]:
combinations = []
for i, symbol_i in enumerate(symbols):
    for j, symbol_j in enumerate(symbols):
        if i < j:
            tup = symbol_i, symbol_j
            combinations.append(tup)
print(combinations)
[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

Do this

In [85]:
from itertools import combinations

# combinations(symbols, 2) yields each pair of distinct positions once,
# so no extra filter is needed; a new name avoids shadowing the import
pairs = combinations(symbols, 2)
print(list(pairs))
[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

Cycling

In [86]:
colors = ['red', 'green', 'blue']

Don't do this

In [87]:
color_sequence = []
index = 0
for i in range(10):
    color_sequence.append(colors[index])
    index += 1
    if index == 3:
        index = 0
print(color_sequence)
['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green', 'blue', 'red']

Do this

In [88]:
from itertools import cycle

color_cycle = cycle(colors)
color_sequence = (next(color_cycle) for _ in range(10))
print(list(color_sequence))
['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green', 'blue', 'red']
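
itertools.islice offers another way to take a fixed number of items from the endless cycle, without calling next by hand. A minimal sketch with the same colors:

```python
from itertools import cycle, islice

colors = ['red', 'green', 'blue']

# Take the first 10 items from an infinite cycle
color_sequence = list(islice(cycle(colors), 10))
print(color_sequence)
```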

Product

In [89]:
a = ['cat', 'dog', 'frog']
b = ['red', 'green', 'blue']
c = ['big', 'small']

Don't do this

In [90]:
product_list = []
for animal in a:
    for color in b:
        for size in c:
            tup = animal, color, size
            product_list.append(tup)
print(product_list)
[('cat', 'red', 'big'), ('cat', 'red', 'small'), ('cat', 'green', 'big'), ('cat', 'green', 'small'), ('cat', 'blue', 'big'), ('cat', 'blue', 'small'), ('dog', 'red', 'big'), ('dog', 'red', 'small'), ('dog', 'green', 'big'), ('dog', 'green', 'small'), ('dog', 'blue', 'big'), ('dog', 'blue', 'small'), ('frog', 'red', 'big'), ('frog', 'red', 'small'), ('frog', 'green', 'big'), ('frog', 'green', 'small'), ('frog', 'blue', 'big'), ('frog', 'blue', 'small')]

Do this

In [91]:
from itertools import product

product_list = product(a, b, c)
print(list(product_list))
[('cat', 'red', 'big'), ('cat', 'red', 'small'), ('cat', 'green', 'big'), ('cat', 'green', 'small'), ('cat', 'blue', 'big'), ('cat', 'blue', 'small'), ('dog', 'red', 'big'), ('dog', 'red', 'small'), ('dog', 'green', 'big'), ('dog', 'green', 'small'), ('dog', 'blue', 'big'), ('dog', 'blue', 'small'), ('frog', 'red', 'big'), ('frog', 'red', 'small'), ('frog', 'green', 'big'), ('frog', 'green', 'small'), ('frog', 'blue', 'big'), ('frog', 'blue', 'small')]
In [92]:
list_of_list = [a, b, c]
product_list = product(*list_of_list)
print(list(product_list))
[('cat', 'red', 'big'), ('cat', 'red', 'small'), ('cat', 'green', 'big'), ('cat', 'green', 'small'), ('cat', 'blue', 'big'), ('cat', 'blue', 'small'), ('dog', 'red', 'big'), ('dog', 'red', 'small'), ('dog', 'green', 'big'), ('dog', 'green', 'small'), ('dog', 'blue', 'big'), ('dog', 'blue', 'small'), ('frog', 'red', 'big'), ('frog', 'red', 'small'), ('frog', 'green', 'big'), ('frog', 'green', 'small'), ('frog', 'blue', 'big'), ('frog', 'blue', 'small')]

Enumerations

If you are working with enumerations, use the enum module. In the example below, students may be part-time, half-time or full-time. If we simply declare these states as plain variables, they may be overwritten and they carry no context. If we use IntEnum instead, the members cannot be reassigned once declared, and each one carries its name.

Don't do this

In [93]:
PART_TIME = 1
HALF_TIME = 2
FULL_TIME = 3

Do this

In [94]:
from enum import IntEnum

class StudentType(IntEnum):
    PART_TIME = 1
    HALF_TIME = 2
    FULL_TIME = 3
    
print(StudentType.PART_TIME)
print(StudentType.HALF_TIME)
print(StudentType.FULL_TIME)
StudentType.PART_TIME
StudentType.HALF_TIME
StudentType.FULL_TIME
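
The sketch below illustrates both points with the same StudentType definition: assigning to a member raises an error, and a raw value can be turned back into a named member. The variable member is an illustrative name.

```python
from enum import IntEnum

class StudentType(IntEnum):
    PART_TIME = 1
    HALF_TIME = 2
    FULL_TIME = 3

try:
    StudentType.PART_TIME = 10  # members cannot be reassigned
except AttributeError as err:
    print('AttributeError:', err)

# Look up a member by value and read its name
member = StudentType(3)
print(member.name, member.value)

# IntEnum members still compare like integers
print(StudentType.FULL_TIME > StudentType.PART_TIME)
```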

Filtering files

We can filter strings more concisely with fnmatch. Notice that the second discouraged example needs a function, two for loops and an if statement.

Don't do this

In [95]:
# Example 1

files = ['one.txt', 'two.py', 'three.txt', 'four.py', 'five.scala', 'six.java', 'seven.py']

py_files = filter(lambda f: f.endswith('.py'), files)
print(list(py_files))
['two.py', 'four.py', 'seven.py']
In [96]:
# Example 2

import os

def traverse(path):
    for basepath, directories, files in os.walk(path):
        for f in files:
            if f.endswith('.ipynb'):
                yield os.path.join(basepath, f)

ipynb_files = traverse('../')
len(list(ipynb_files))
Out[96]:
38

Do this

In [97]:
# Example 1

import fnmatch

fnmatch.filter(files, '*.py')
Out[97]:
['two.py', 'four.py', 'seven.py']
In [98]:
# Example 2

ipynb_files = fnmatch.filter(
    (f for basepath, directories, files in os.walk('../') for f in files),
    '*.ipynb')

len(list(ipynb_files))
Out[98]:
38
In [99]:
# Example 2, even better

import pathlib

ipynb_files = pathlib.Path('../').glob('**/*.ipynb')

len(list(ipynb_files))
Out[99]:
38

Saving objects

In the example below, pickle is a fine way to save an object, but shelve offers a dictionary-like store for saving multiple objects in one central location.

Don't do this

In [100]:
import pickle

object_1 = 'pretend some big object 1'
object_2 = 'pretend some big object 2'
data = {
    'object_1': object_1,
    'object_2': object_2,
}

pickle.dump(data, open('data.p', 'wb')) 

data = pickle.load(open('data.p', 'rb'))
print(data['object_1'])
print(data['object_2'])
pretend some big object 1
pretend some big object 2

Do this

In [101]:
import shelve

with shelve.open('data') as s:
    s['object_1'] = object_1
    s['object_2'] = object_2
    
with shelve.open('data') as s:
    print(s['object_1'])
    print(s['object_2'])
pretend some big object 1
pretend some big object 2

Pandas

When operating on Pandas dataframes, avoid for loops and favor the apply function and, better still, NumPy vectorization.

In [102]:
import numpy as np
import pandas as pd

np.random.seed(37)

def get_df():
    N = 10000
    M = 50
    
    get_x = lambda x: np.random.normal(x, 1, N).reshape(-1, 1)
    get_y = lambda x: np.full(N, -1).reshape(-1, 1)
    

    X = np.hstack([get_x(x) if x < M - 1 else get_y(x) for x in range(M)])
    columns=[f'X{i}' if i < M - 1 else 'y' for i in range(M)]
    
    return pd.DataFrame(
        X,
        columns=columns
    )
In [103]:
df = get_df()

Don't do this

Standard for loop.

In [104]:
%%time
for row in range(len(df)):
    total = np.sum(df.iloc[row][0:df.shape[1] - 1])
    y = 1 if total > 1175 else 0
    df['y'].iloc[row] = y
CPU times: user 4.6 s, sys: 8.55 ms, total: 4.61 s
Wall time: 4.61 s

Pandas iterrows.

In [105]:
%%time
for i, r in df.iterrows():
    total = np.sum(r[0:df.shape[1] - 1])
    y = 1 if total > 1175 else 0
    df['y'].iloc[i] = y
CPU times: user 4.6 s, sys: 23.6 ms, total: 4.63 s
Wall time: 4.65 s

Do this

Pandas apply.

In [106]:
%%time
df['y'] = df.apply(lambda r: 1 if np.sum(r[0:df.shape[1] - 1]) > 1175 else 0, axis=1)
CPU times: user 2.02 s, sys: 17.2 ms, total: 2.04 s
Wall time: 2.05 s

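NumPy vectorization. The approach below uses 3 lines to be clear about the intention: the row sums are computed in a single vectorized call, and only the final thresholding uses a list comprehension. The running time drops to the millisecond scale.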

In [107]:
%%time

f = lambda s: 1 if s > 1175 else 0
s = df[[c for c in df.columns if c != 'y']].values.sum(axis=1)
df['y'] = [f(val) for val in s]
CPU times: user 8.62 ms, sys: 1.7 ms, total: 10.3 ms
Wall time: 9.21 ms