Thursday, 22 March 2018

Toy Examples: How os.walk Works

One of the many things a programmer needs to do is walk a directory tree and do stuff to the files and folders in it, even if it’s just list them. This is a recursive exercise and those can be mind-bending to code, and if done badly can mess up all sorts of low-level things. It’s best left to the kind of people who have actually read Knuth. Fortunately someone on the Python project did, and they gave us os.walk. (It’s in the os module, and is called walk().)

Unfortunately, the almost identical explanations of how to use os.walk mostly miss the point. All of them - that I’ve found - print out a directory and file listing. Which is not what I wanted to do with it.

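When you execute os.walk(starting_point), for a directory called ‘starting_point’, what you get back is a generator. Each time the loop asks it for the next item, it hands over a triple describing one directory, beginning with starting_point itself:

- the path for starting_point
- a list of the names of the subdirectories of starting_point
- a list of the names of the files in starting_point.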

os.walk works in a loop. Outside a loop, it doesn’t do much. Here’s how to use it in Python:

import os

for current_directory, subdirectories, files in os.walk(starting_point):
    pass  # do stuff here

What happens? On the first pass through the loop, os.walk hands back a triple like this:

current_directory = starting_point
subdirectories .... of starting_point
files ... in starting_point
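To see that first triple on its own, here is a minimal sketch that builds a throwaway directory and asks os.walk for just one step. The helper name first_triple and the names ‘sub’ and ‘note.txt’ are made up for the illustration; os.walk, next() and tempfile are standard Python.

```python
import os
import tempfile

def first_triple(starting_point):
    """Return only the first triple that os.walk produces (hypothetical helper)."""
    walker = os.walk(starting_point)  # a generator; nothing has been handed over yet
    return next(walker)               # ask for one step, which is what a for loop does

# Build a tiny throwaway tree: one subdirectory and one file.
with tempfile.TemporaryDirectory() as starting_point:
    os.mkdir(os.path.join(starting_point, "sub"))
    open(os.path.join(starting_point, "note.txt"), "w").close()

    current_directory, subdirectories, files = first_triple(starting_point)
    print(current_directory == starting_point)  # True
    print(subdirectories, files)                # ['sub'] ['note.txt']
```

Notice that the two lists hold bare names, not full paths.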

If you want to do something to all the files in the starting_point directory, you loop over them inside the os.walk loop, like this:

for file in files:
    pass  # do stuff to file here
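As one possible version of ‘do stuff to file’, here is a sketch that adds up the sizes of all the files under a starting point. The function name total_size and the test files are my own invention for the example. The thing to watch is that files contains bare names, so each one has to be joined to current_directory before you can do anything with it.

```python
import os
import tempfile

def total_size(starting_point):
    """Add up the size in bytes of every file under starting_point (hypothetical helper)."""
    total = 0
    for current_directory, subdirectories, files in os.walk(starting_point):
        for file in files:
            # files holds bare names, so build the full path first
            full_path = os.path.join(current_directory, file)
            total += os.path.getsize(full_path)
    return total

# Try it on a throwaway tree: 5 bytes at the top, 3 bytes one level down.
with tempfile.TemporaryDirectory() as starting_point:
    os.mkdir(os.path.join(starting_point, "sub"))
    with open(os.path.join(starting_point, "top.bin"), "wb") as f:
        f.write(b"12345")
    with open(os.path.join(starting_point, "sub", "deep.bin"), "wb") as f:
        f.write(b"123")

    size = total_size(starting_point)
    print(size)  # 8
```

The inner loop never touches subdirectories. The outer loop gets to them on its own.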

If you want to do something to the subdirectories, unless it’s to list them, don’t. Wait for a moment, because...

On the second pass through the loop, os.walk steps one directory down the tree, like this:

current_directory = first_subdirectory_in_starting_point
subdirectories .... of first_subdirectory_in_starting_point
files ... in first_subdirectory_in_starting_point

Now you can ‘do stuff to file’ for the files in first_subdirectory_in_starting_point.

What happens if there’s a subdirectory in first_subdirectory? On the next pass through the loop, os.walk will hand back:

current_directory = first_subdirectory_of_first_subdirectory_in_starting_point
subdirectories .... of first_subdirectory_of_first_subdirectory_in_starting_point
files ... in first_subdirectory_of_first_subdirectory_in_starting_point
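The order of those visits can be checked with a small sketch. It builds a tree with directories ‘a’, ‘a/b’ and ‘c’ (names invented for the example) and records where os.walk goes. Which of ‘a’ and ‘c’ comes first depends on the order the operating system lists them in, so I have not shown an expected output, but ‘a/b’ should always come straight after ‘a’.

```python
import os
import tempfile

def visit_order(starting_point):
    """List the directories in the order os.walk reaches them (hypothetical helper)."""
    return [
        os.path.relpath(current_directory, starting_point)
        for current_directory, subdirectories, files in os.walk(starting_point)
    ]

with tempfile.TemporaryDirectory() as starting_point:
    os.makedirs(os.path.join(starting_point, "a", "b"))  # creates a and a/b
    os.mkdir(os.path.join(starting_point, "c"))

    order = visit_order(starting_point)
    print(order)  # starting point shows as '.', then the directories below it
```

os.walk goes all the way down one branch before it moves on to the next sibling, which is why the loop body only ever needs to deal with the files in front of it.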

Why don’t you do anything to the directories? Because os.walk uses that list to decide where to go next. If you rename a subdirectory, or take away its read permission, before os.walk has walked to it, os.walk will go looking for the old name or be refused entry, and by default it ignores the error and silently skips that directory.

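If you want to mess with the subdirectories themselves, the chances are you need to run os.walk in the reverse direction, bottom-up, by passing topdown=False. That way each directory’s subdirectories have already been walked by the time you get to touch them.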
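Here is a sketch of the bottom-up approach, assuming the ‘mess’ you want to make is renaming every subdirectory. The function name rename_subdirectories and the ‘_old’ suffix are invented for the example; topdown=False is a real os.walk parameter.

```python
import os
import tempfile

def rename_subdirectories(starting_point, suffix="_old"):
    """Append suffix to every directory below starting_point (hypothetical helper)."""
    # topdown=False: a directory's triple arrives only after everything
    # beneath it has been walked, so renaming its subdirectories is safe.
    for current_directory, subdirectories, files in os.walk(starting_point, topdown=False):
        for name in subdirectories:
            old_path = os.path.join(current_directory, name)
            os.rename(old_path, old_path + suffix)

with tempfile.TemporaryDirectory() as starting_point:
    os.makedirs(os.path.join(starting_point, "a", "b"))
    os.mkdir(os.path.join(starting_point, "c"))

    rename_subdirectories(starting_point)

    renamed_ok = os.path.isdir(os.path.join(starting_point, "a_old", "b_old"))
    print(renamed_ok)  # True
```

The starting point itself is left alone, because it never appears in anybody’s subdirectories list.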

The toy example of a directory listing just doesn’t expose the inner workings clearly enough. It can leave you thinking you have to do stuff with the directories as well as the files, but you don’t, of course.

Someone who has worked a lot with recursive Python functions before needing os.walk will grok it fairly quickly. They will read the description, look at the examples, match that against the way they know list-returning Python functions have worked in the past, and say after a moment ‘Oh, sure, it does this and that, and you always have to use it in a loop’.

Catch is, walking a directory tree is one of the first things a programmer wants to do. And grokking recursive-return functions like os.walk is not simple. Being able to picture a recursive process is one of those big-jump differences between code-bashers and actual programmers. A toy example isn’t going to cut it.
