Using sub-tasks to define a group of similar tasks
Objectives:
- Explain how to create a group of sub-tasks
- Explain what extra configuration a sub-task definition requires
- Adapt an existing task to turn it into a sub-task generator
Let's have another look at the doit file we created at the end of the last lesson:
%load_ext doitmagic
%%doit
# automatic_variables.py
def task_reformat_temperature_data():
    """Reformats the raw temperature data file for easier analysis"""
    return {
        'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s'],
        'file_dep': ['UK_Tmean_data.txt'],
        'targets': ['UK_Tmean_data.reformatted.txt'],
    }

def task_reformat_sunshine_data():
    """Reformats the raw sunshine data file for easier analysis"""
    return {
        'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s'],
        'file_dep': ['UK_Sunshine_data.txt'],
        'targets': ['UK_Sunshine_data.reformatted.txt'],
    }
-- reformat_temperature_data
-- reformat_sunshine_data
Notice that our two tasks share the same action and differ only in their dependencies and targets.
When we want to run a large number of very similar tasks, we can make use of a doit feature called ‘sub-tasks’:
%%doit
# sub_tasks.py
data_sets = ['Tmean', 'Sunshine']
def task_reformat_data():
    """Reformats all raw files for easier analysis"""
    for data_type in data_sets:
        yield {
            'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s'],
            'file_dep': ['UK_{}_data.txt'.format(data_type)],
            'targets': ['UK_{}_data.reformatted.txt'.format(data_type)],
            'name': 'UK_{}_data.txt'.format(data_type),
        }
-- reformat_data:UK_Sunshine_data.txt
. reformat_data:UK_Tmean_data.txt
In this example, the function defining the task doesn’t return a single task. Instead it returns a Python generator object, which then yields a number of sub-tasks. Has anyone heard of generators in Python?
If not, here is a small demonstration:
def not_a_generator():
    for i in range(5):
        return i
not_a_generator()
0
def is_a_generator():
    for i in range(5):
        yield i
is_a_generator()
<generator object is_a_generator at 0x3052780>
def is_a_generator():
    for i in range(5):
        yield i

g = is_a_generator()
for x in g:
    print(x)
0
1
2
3
4
I’m not going to go into lots of detail about generators in this lesson, but the essential thing to remember is that a function uses return to return a single output, whilst a generator uses yield to return a sequence of outputs in order. When doit finds a generator that yields task dictionaries, it creates a series of sub-tasks.
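To see what doit receives from a generator-style task function, the generator can be consumed by hand. The sketch below reuses the task_reformat_data function from sub_tasks.py and simply collects the yielded dictionaries in a list; it does not call doit itself, and the sub_tasks variable is introduced here only for illustration.

```python
data_sets = ['Tmean', 'Sunshine']

def task_reformat_data():
    """Reformats all raw files for easier analysis"""
    for data_type in data_sets:
        yield {
            'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s'],
            'file_dep': ['UK_{}_data.txt'.format(data_type)],
            'targets': ['UK_{}_data.reformatted.txt'.format(data_type)],
            'name': 'UK_{}_data.txt'.format(data_type),
        }

# Consume the generator by hand: one task dictionary per data set.
sub_tasks = list(task_reformat_data())
for sub_task in sub_tasks:
    print(sub_task['name'], '->', sub_task['targets'][0])
```

Each dictionary in the list has the same shape as the dictionary an ordinary task function returns, plus the extra name key.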
Take a look at the output of our file again. All the tasks generated by our new generator start with the same name: reformat_data, which is taken from the name of the generator. After this part, which is called the basename, comes a colon followed by the sub-task name. Notice that we explicitly gave each sub-task a name by setting the name key in the task dictionary.
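The naming rule can be written out in a few lines. This is only an illustration of how the displayed names are put together, assuming the basename is the function name with the task_ prefix removed; full_task_names is a hypothetical helper and not part of doit.

```python
def task_reformat_data():
    """Yields one minimal sub-task dictionary per data set (actions omitted)."""
    for data_type in ['Tmean', 'Sunshine']:
        yield {'name': 'UK_{}_data.txt'.format(data_type)}

def full_task_names(task_function):
    """Hypothetical helper: build 'basename:name' strings for each sub-task."""
    prefix = 'task_'
    # The basename is the function name without the leading 'task_'.
    basename = task_function.__name__[len(prefix):]
    return ['{}:{}'.format(basename, sub_task['name'])
            for sub_task in task_function()]

names = full_task_names(task_reformat_data)
print(names)
```

A full name of this form is also how a single sub-task is referred to elsewhere, for example when asking doit on the command line to run only reformat_data:UK_Tmean_data.txt.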
What would happen if we didn’t set sub-task names?
%%doit
# sub_tasks_no_name.py
data_sets = ['Tmean', 'Sunshine']
def task_reformat_data():
    """Reformats all raw files for easier analysis"""
    for data_type in data_sets:
        yield {
            'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s'],
            'file_dep': ['UK_{}_data.txt'.format(data_type)],
            'targets': ['UK_{}_data.reformatted.txt'.format(data_type)],
        }
ERROR: Task 'reformat_data' must contain field 'name' or 'basename'. {'file_dep': ['UK_Tmean_data.txt'], 'targets': ['UK_Tmean_data.reformatted.txt'], 'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s']}
Doit tells us that the task must define a name. This is because tasks can depend directly on other tasks, so each task must have a unique name by which it can be referenced as a dependency.
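As a sketch of why unique names matter, the doit file below adds a task that depends on another task through doit's task_dep field. The task_report function and its echo action are hypothetical and exist only for illustration; the dependency refers to one sub-task by its full basename:name.

```python
data_sets = ['Tmean', 'Sunshine']

def task_reformat_data():
    """Reformats all raw files for easier analysis"""
    for data_type in data_sets:
        yield {
            'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s'],
            'file_dep': ['UK_{}_data.txt'.format(data_type)],
            'targets': ['UK_{}_data.reformatted.txt'.format(data_type)],
            'name': 'UK_{}_data.txt'.format(data_type),
        }

def task_report():
    """Hypothetical task that should only run after the Tmean sub-task."""
    return {
        'actions': ['echo "temperature data has been reformatted"'],
        # task_dep lists other tasks by name; a sub-task is addressed
        # as 'basename:name', which is why each name must be unique.
        'task_dep': ['reformat_data:UK_Tmean_data.txt'],
    }
```

If two sub-tasks shared a name, a reference like the one in task_dep would be ambiguous.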
Now look at the reformatted data:
!tail UK_Tmean_data.reformatted.txt
2012-03-01,6.4
2012-04-01,8.3
2012-05-01,11.3
2012-06-01,13.7
2012-07-01,15.7
2012-08-01,15.7
2012-09-01,13.3
2012-10-01,10.5
2012-11-01,7.0
2012-12-01,5.3
The last data point in the file is from December 2012, so we probably ought to re-download our raw data. This is a task we will probably end up doing rather a lot, so we should let doit take care of it:
%%doit
# download_temp_data.py
import datetime
from doit.tools import timeout
data_sets = ['Tmean', 'Sunshine']
def task_get_temp_data():
    """Downloads the raw temperature data from the Met Office"""
    return {
        'actions': ['wget -O %(targets)s http://www.metoffice.gov.uk/climate/uk/datasets/Tmean/ranked/UK.txt'],
        'targets': ['UK_Tmean_data.txt'],
    }

def task_reformat_data():
    """Reformats all raw files for easier analysis"""
    for data_type in data_sets:
        yield {
            'actions': ['python reformat_weather_data.py %(dependencies)s > %(targets)s'],
            'file_dep': ['UK_{}_data.txt'.format(data_type)],
            'targets': ['UK_{}_data.reformatted.txt'.format(data_type)],
            'name': 'UK_{}_data.txt'.format(data_type),
        }
. get_temp_data
-- reformat_data:UK_Sunshine_data.txt
. reformat_data:UK_Tmean_data.txt
--2014-04-05 12:08:16-- http://www.metoffice.gov.uk/climate/uk/datasets/Tmean/ranked/UK.txt
Resolving www.metoffice.gov.uk (www.metoffice.gov.uk)... 23.63.99.234, 23.63.99.216
Connecting to www.metoffice.gov.uk (www.metoffice.gov.uk)|23.63.99.234|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25576 (25K) [text/plain]
Saving to: ‘UK_Tmean_data.txt’
0K .......... .......... .... 100% 2.15M=0.01s
2014-04-05 12:08:16 (2.15 MB/s) - ‘UK_Tmean_data.txt’ saved [25576/25576]
We’ve added a new task that downloads the latest version of the temperature data from the UK Met Office website, so doit followed our instructions and downloaded the file. It then went on to our reformat_data task. Since the sunshine hours data hasn’t changed, it isn’t reformatted. However, there is now a new version of the mean temperature file, so doit automatically recreated the UK_Tmean_data.reformatted.txt file:
!tail UK_Tmean_data.reformatted.txt
2013-03-01,5.1
2013-04-01,7.0
2013-05-01,10.0
2013-06-01,12.8
2013-07-01,14.5
2013-08-01,14.4
2013-09-01,12.4
2013-10-01,9.2
2013-11-01,5.7
2013-12-01,3.9
The reformatted file now contains all the data from 2013.
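The decision doit made here, to skip the sunshine sub-task but re-run the temperature one, rests on noticing that a file dependency has changed. The following is a simplified sketch of that idea using an MD5 checksum of the file contents. It is not doit's actual implementation, and file_checksum, has_changed and the saved_checksums dictionary are hypothetical names used only for this illustration.

```python
import hashlib
import os
import tempfile

def file_checksum(path):
    """Return the MD5 hex digest of a file's contents."""
    with open(path, 'rb') as handle:
        return hashlib.md5(handle.read()).hexdigest()

def has_changed(path, saved_checksums):
    """Compare the current checksum with the one saved after the last run."""
    return saved_checksums.get(path) != file_checksum(path)

# Demonstration with a temporary stand-in for UK_Tmean_data.txt.
work_dir = tempfile.mkdtemp()
raw_file = os.path.join(work_dir, 'UK_Tmean_data.txt')
with open(raw_file, 'w') as handle:
    handle.write('2012-12-01,5.3\n')

# Record the checksum as it was at the end of the "last run".
saved_checksums = {raw_file: file_checksum(raw_file)}
unchanged_result = has_changed(raw_file, saved_checksums)

# Simulate a fresh download that brings in newer data.
with open(raw_file, 'a') as handle:
    handle.write('2013-12-01,3.9\n')
changed_result = has_changed(raw_file, saved_checksums)

print(unchanged_result, changed_result)
```

In this sketch an unchanged file means the dependent task can be skipped, whilst a changed file means its targets need to be rebuilt.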
Challenge:
Edit the download_temp_data.py file and make use of sub-tasks to download both the temperature and the sunshine data.