“making Gantt charts is a form of self-care” (Colin J. Carlson, @ColinJCarlson, November 26, 2018)
If you’ve ever had to explain how all the elements of a big, multi-part project come together, you’ve probably at least considered making something like a Gantt chart. A Gantt chart is a horizontal bar plot with time as the x-axis, illustrating the time required for different activities within a larger project. The basic design is named for turn-of-the-20th-Century American engineer and management consultant Henry Gantt, though examples from Poland and Germany predate Gantt’s original charts.
I’ve just spent more time than I care to admit squinting at draft Gantt charts for a proposal that’s going in soonish, and I’m happy to report that actually making the chart, and making it look nice, was not the hardest part of the process. (That would be, um, figuring out how to fit everything in the proposed project into the allotted funding period.) As you might expect, I did it in R, taking full advantage of the tidyverse packages — as you might not expect, I also used that ancient nemesis of modern data science, Microsoft Excel.
I got to this approach via the simple expedient of Googling “gantt chart r”, which found me a Stack Overflow thread on the topic. The most popular response to that thread ran through a number of options drawing on specialized packages that in multiple cases had nothing whatsoever to do with Gantt charts or scheduling. I found what I wanted in a less popular answer, which suggested using horizontal lines with extra-large width settings in a bog-standard ggplot2 figure. Taking advantage of R’s base ability to parse dates, you can sketch out a project timeline in a delimited text file, and then create a plot from it in just a few lines of code.
Far and away the easiest approach for setting up a delimited text file is, in this context, to use a spreadsheet program — and MS Excel is the one I have on every computer I use regularly. So I cracked open Excel:
You can download this example as a comma-separated text file here. To make the line-graph-as-Gantt work, you need to provide one line for each activity to go in the chart, with the name of the activity and its starting and ending dates as separate columns. If you want to color-code activities to reflect higher-level organization in the project, you’ll want a column for that. Finally, if you want to put multiple activities on the same line — if, say, one activity follows directly from another — you should have a column that uniquely identifies every activity. (The way I’m setting things up, two activities with the same name under “Activity” will end up on the same line of the Gantt chart, but you need the additional identifier to keep them separate.)
For this hypothetical, I’ve sketched out a three-year project, with several components. It features behavioral and morphological studies of the focal organism, the sequencing and assembly of a reference genome, and collection of population genomic data to identify loci associated with various phenotypes, as well as analysis of population structure and environmental associations. There are also elements that enhance the “broader impacts” of the project — there will be a community science project in which interested members of the public help take observations of the study organism, and the PI will lead a research course for undergraduate students that coincides with the field season.
This outline implies a set of activities, as listed in the schedule file. I’m assuming three project years starting in 2019 and running January to December, with fieldwork mostly happening in the summer, while the community science project can run continuously. Given that, I entered start and end times for each item in yyyy.mm.dd format, which Excel recognizes and mostly doesn’t mess with when it exports the delimited text file. (Date formatting is a place where you may get tripped up, though — check that as a first step for any debugging you need to do.) When I have a rough schedule of items figured out, I save the file as comma-separated text in my working directory. (File, Save As; then selecting “CSV” from the File Format field.)
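If you would rather skip the spreadsheet, a file with the same layout can be written straight from R. This is only a sketch: it covers the first six rows of the example schedule, and the object names (`sched`, `csv_path`, `check`) and the temporary file location are illustrative assumptions, not part of the original workflow.

```r
# Sketch: build a few schedule rows in R instead of a spreadsheet.
# Column names follow what read.csv() reports for the example file;
# `sched`, `csv_path`, and `check` are hypothetical names.
sched <- data.frame(
  Item = 1:6,
  Activity = c("Community scientist observations",
               "Behavioral field experiments",
               "Behavioral field experiments",
               "Behavioral field experiments",
               "Reference genome assembly",
               "Sampling for DNA and morphology"),
  Project.element = c(rep("Behavioral observations", 4),
                      rep("Genomic data", 2)),
  Start = c("2019.01.01", "2019.05.01", "2020.05.01",
            "2021.05.01", "2019.02.01", "2020.05.01"),
  End = c("2021.12.31", "2019.08.31", "2020.08.31",
          "2021.08.31", "2020.03.31", "2020.08.31"),
  stringsAsFactors = FALSE
)

# Write to a temporary directory here; in practice this would be
# "gantt.csv" in the working directory
csv_path <- file.path(tempdir(), "gantt.csv")
write.csv(sched, csv_path, row.names = FALSE)

# Read it back to confirm the round trip keeps the columns intact
check <- read.csv(csv_path, header = TRUE)
```

Keeping the dates as plain yyyy.mm.dd strings means nothing reinterprets them on the way to disk.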
Now I can take it to R. I open a terminal and load up the tidyverse packages:
library("tidyverse")
Then I load the text file containing the schedule data. I also set up two vectors containing the unique names of activities in the schedule file, and the unique names of the project elements. These will come in handy for controlling the display order of the items in the Gantt chart.
# Read the schedule exported from the spreadsheet
gantt <- read.csv("gantt.csv", header = TRUE)

# Activity names, in the order they should appear from top to bottom
acts <- c("Community scientist observations",
          "Undergraduate field course",
          "Behavioral field experiments",
          "Reference genome assembly",
          "Sampling for DNA and morphology",
          "Genomic sequencing of field samples",
          "Reference genome annotation",
          "Genotype-phenotype association",
          "Landscape genomic analyses")

# Project elements, in the order they should appear in the color key
els <- c("Behavioral observations",
         "Genomic data",
         "Analyses",
         "Broader impacts",
         "Publication preparation")
I check to make sure the schedule data is read in as expected:
head(gantt)
Which should give me
  Item                         Activity         Project.element      Start        End
1    1 Community scientist observations Behavioral observations 2019.01.01 2021.12.31
2    2     Behavioral field experiments Behavioral observations 2019.05.01 2019.08.31
3    3     Behavioral field experiments Behavioral observations 2020.05.01 2020.08.31
4    4     Behavioral field experiments Behavioral observations 2021.05.01 2021.08.31
5    5        Reference genome assembly            Genomic data 2019.02.01 2020.03.31
6    6  Sampling for DNA and morphology            Genomic data 2020.05.01 2020.08.31
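Since date formatting is the likeliest place to get tripped up, a quick check along these lines can flag entries that will not parse. It is a minimal sketch with made-up values: the `starts` vector is hypothetical, and its last entry is deliberately in the wrong format.

```r
# Hypothetical start dates; the last one is not in yyyy.mm.dd format
starts <- c("2019.01.01", "2019.05.01", "5/1/2020")

# as.Date() returns NA for anything that does not match the format string
parsed <- as.Date(starts, format = "%Y.%m.%d")

# List the entries that failed to parse so they can be fixed in the file
bad <- starts[is.na(parsed)]
bad
```

Running the same check on the Start and End columns of the schedule, before reshaping, should catch anything the spreadsheet quietly reformatted.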
To work neatly with ggplot, I need the start and end dates to be in a single column, with another column to identify whether they’re the start or end date. This is the job of the gather() function. I’m also going to convert the Activity and Project.element columns into factors, with levels defined to control the order in which they appear in the chart. This is where those two vectors of activity and project element names come in: the ordering of names in those vectors determines the ordering of levels in the factors.
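With tidyverse notation, I can do this all in a single piped statement:
# Stack the Start and End columns (columns 4 and 5) into state/date pairs,
# parse the dates, and set factor levels to control display order
g.gantt <- gather(gantt, "state", "date", 4:5) %>%
  mutate(date = as.Date(date, "%Y.%m.%d"),
         Activity = factor(Activity, rev(acts)),
         Project.element = factor(Project.element, els))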
And check the results:
head(g.gantt)
Which should give
  Item                         Activity         Project.element state       date
1    1 Community scientist observations Behavioral observations Start 2019-01-01
2    2     Behavioral field experiments Behavioral observations Start 2019-05-01
3    3     Behavioral field experiments Behavioral observations Start 2020-05-01
4    4     Behavioral field experiments Behavioral observations Start 2021-05-01
5    5        Reference genome assembly            Genomic data Start 2019-02-01
6    6  Sampling for DNA and morphology            Genomic data Start 2020-05-01
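The reshaping step uses gather(), which current tidyr documentation marks as superseded in favor of pivot_longer(). A roughly equivalent sketch with pivot_longer() is below, run on a hypothetical two-row stand-in for the schedule (`gantt_demo` and `long_demo` are illustrative names). One difference to expect: the rows come out interleaved by item rather than with all the start dates first, which makes no difference to the plot.

```r
library(tidyr)
library(dplyr)

# Two hypothetical rows standing in for the schedule file
gantt_demo <- data.frame(
  Item = 1:2,
  Activity = c("Community scientist observations",
               "Behavioral field experiments"),
  Project.element = "Behavioral observations",
  Start = c("2019.01.01", "2019.05.01"),
  End = c("2021.12.31", "2019.08.31"),
  stringsAsFactors = FALSE
)

# Stack Start and End into one date column, labelled by a state column,
# then parse the yyyy.mm.dd strings into Date objects
long_demo <- gantt_demo %>%
  pivot_longer(cols = c(Start, End),
               names_to = "state", values_to = "date") %>%
  mutate(date = as.Date(date, "%Y.%m.%d"))

long_demo
```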
With that all set up, I can create my first attempt at a chart.
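# Thick lines stand in for Gantt bars; grouping by Item keeps each activity
# period as its own segment (ggplot2 3.4.0 and later prefer linewidth = 10)
ggplot(g.gantt, aes(date, Activity, color = Project.element, group = Item)) +
  geom_line(size = 10) +
  labs(x = "Project year", y = NULL, title = "Project timeline")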
This will produce the chart in R’s standard graphics output window, which will probably need some adjusting to get the aspect ratio right. You could also wrap it in a command to write to pdf or another graphical file format. With the aspect ratio set appropriately, it should look something like this:
That’s a fairly presentable chart just with ggplot2 defaults. You can also define your own set of color codings and give a proper name to the color key:
# One color per project element, in the same order as the levels in els
actcols <- c("#548235", "#2E75B6", "#BF9000", "#7030A0", "#cd6600")

ggplot(g.gantt, aes(date, Activity, colour = Project.element, group = Item)) +
  geom_line(size = 10) +
  scale_color_manual(values = actcols, name = "Project component") +
  labs(x = "Project year", y = NULL, title = "Project timeline")
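Writing the chart to a file rather than the graphics window, as mentioned above, is one way to pin down the aspect ratio. Here is a sketch using ggsave() on a small made-up data set; the `demo` data frame, the 10 by 4 inch size, and the output path are all assumptions to adjust, and `linewidth` assumes ggplot2 3.4.0 or later (older versions use `size`).

```r
library(ggplot2)

# Hypothetical two-activity schedule, already in long format
demo <- data.frame(
  Item = c(1, 1, 2, 2),
  Activity = rep(c("Fieldwork", "Analysis"), each = 2),
  date = as.Date(c("2019-05-01", "2019-08-31",
                   "2019-09-01", "2019-12-31"))
)

p <- ggplot(demo, aes(date, Activity, group = Item)) +
  geom_line(linewidth = 10) +   # assumes ggplot2 >= 3.4.0
  labs(x = "Date", y = NULL)

# Width and height are in inches by default; a wide, short page
# tends to suit a Gantt chart
out_file <- file.path(tempdir(), "gantt_demo.pdf")
ggsave(out_file, plot = p, width = 10, height = 4)
```

Saving the plot object explicitly, rather than relying on the last plot displayed, keeps the export reproducible when the script is rerun.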
You might also want a background grid to better track what the dates are within the chart, and you might want to set the time scale in terms of generic project years. These can be accomplished by specifying a theme (I like theme_gray() for this, and I specify a 14-pt base font) and by setting a specific x-axis scale, which is doable with some help from a base function called seq.Date(). Given a first and last date and some interval (in days, weeks, months, quarters, or years) this function returns a vector of dates between the first and last one, spaced at the specified interval. For instance:
seq.Date(as.Date("2019-01-01"), as.Date("2021-12-31"), "quarter")
 [1] "2019-01-01" "2019-04-01" "2019-07-01" "2019-10-01" "2020-01-01" "2020-04-01" "2020-07-01"
 [8] "2020-10-01" "2021-01-01" "2021-04-01" "2021-07-01" "2021-10-01"
Incorporating that into a scale_x_date() function produces the abstracted year scale I wanted:
actcols <- c("#548235", "#2E75B6", "#BF9000", "#7030A0", "#cd6600")

ggplot(g.gantt, aes(date, Activity, colour = Project.element, group = Item)) +
  geom_line(size = 10) +
  scale_color_manual(values = actcols, name = "Project component") +
  labs(x = "Project year", y = NULL, title = "Project timeline") +
  scale_x_date(breaks = seq.Date(as.Date("2019-01-01"), as.Date("2021-12-31"), "quarter"),
               labels = c(1, "", "", "", 2, "", "", "", 3, "", "", "")) +
  theme_gray(base_size = 14)
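The hand-typed labels vector works for a three-year project, but it has to be retyped if the dates change. As a sketch of one alternative, the labels can be derived from the breaks themselves; the names `qbreaks`, `first_year`, and `year_labels` are illustrative, not from the original code.

```r
# Quarterly breaks, as in the scale above
qbreaks <- seq.Date(as.Date("2019-01-01"), as.Date("2021-12-31"), "quarter")

# Project year = calendar year of the break, counted from the first year
first_year <- as.integer(format(min(qbreaks), "%Y"))
project_year <- as.integer(format(qbreaks, "%Y")) - first_year + 1L

# Label only the January break of each year; leave other quarters blank
year_labels <- ifelse(format(qbreaks, "%m") == "01",
                      as.character(project_year), "")
year_labels
```

These two vectors could then be passed as `breaks = qbreaks, labels = year_labels` in the scale_x_date() call.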
Finally, there’s the interesting problem that some activities really contribute to multiple project elements — that community science program is a major source of behavioral observations, but it’s also a nifty way to get the public engaged in the scientific work, which contributes to broader impacts. The obvious way to communicate this graphically would be to fill that item with two colors, one striped over the other. However, ggplot doesn’t like textured fills like that — and even if it did, the graphic we’ve built is based on really thick lines, not polygons! To get around this, I came up with a truly awful hack: drawing individual stripes across the bar for “Community scientist observations.”
First, I needed a new data object with start and end dates for each stripe to go across the “Community scientist observations” bar. This is again accomplishable using seq.Date(), first to produce a series of start dates, and then to produce a matching series of end dates. If I specify y-axis positions so that the lines run across the “Community scientist observations” bar and offset the start and end dates by a set period, each line will be slanted in parallel. Here’s what that looks like in code:
# 52 stripes, one every 3 weeks; each runs from the bottom edge of the bar
# (y = 8.55) to the top edge (y = 9.45), ending 17 days after it starts
demosurv <- data.frame(
  state = rep(c("Start", "End"), each = 52),
  date = c(seq.Date(as.Date("2019-01-01"), as.Date("2021-12-12"), "3 week"),
           seq.Date(as.Date("2019-01-18"), as.Date("2021-12-31"), "3 week")),
  Activity = rep(c(8.55, 9.45), each = 52),
  Project.element = "Broader impacts",
  Item = rep(42:93, 2))
I got to that with some trial and error: figuring out what the y-axis positions needed to be (the values under Activity), and how many lines I needed to get a nice striping pattern. Having created this demosurv data frame object, I could add it to the previous code as a new plot layer:
actcols <- c("#548235", "#2E75B6", "#BF9000", "#7030A0", "#cd6600")

ggplot(g.gantt, aes(date, Activity, colour = Project.element, group = Item)) +
  geom_line(size = 10) +
  geom_line(data = demosurv, size = 0.5) +
  scale_color_manual(values = actcols, name = "Project component") +
  labs(x = "Project year", y = NULL, title = "Project timeline") +
  scale_x_date(breaks = seq.Date(as.Date("2019-01-01"), as.Date("2021-12-31"), "quarter"),
               labels = c(1, "", "", "", 2, "", "", "", 3, "", "", "")) +
  theme_gray(base_size = 14)
And that code produces the chart at the top of this very post.
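The stripe data frame hard-codes the number of stripes (52) and the Item identifiers (42 to 93), which is where the trial and error comes in. A sketch of one way to reduce that: compute the date sequences first and derive the counts from them. The names `stripe_start`, `stripe_end`, `n_stripes`, and `stripes` are illustrative, and starting the identifiers at 42 simply mirrors the code above.

```r
# Start and end dates for each stripe, spaced three weeks apart
stripe_start <- seq.Date(as.Date("2019-01-01"), as.Date("2021-12-12"), "3 week")
stripe_end   <- seq.Date(as.Date("2019-01-18"), as.Date("2021-12-31"), "3 week")

# The two sequences must pair up one-to-one
n_stripes <- length(stripe_start)
stopifnot(length(stripe_end) == n_stripes)

stripes <- data.frame(
  state = rep(c("Start", "End"), each = n_stripes),
  date = c(stripe_start, stripe_end),
  # Bottom and top edges of the bar at y position 9
  Activity = rep(c(8.55, 9.45), each = n_stripes),
  Project.element = "Broader impacts",
  # Identifiers must not collide with Item values in the schedule;
  # starting at 42 is an assumption carried over from the original
  Item = rep(seq(42, length.out = n_stripes), 2)
)
```

With this version, changing the project dates or the stripe spacing only means editing the two seq.Date() calls.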
Happy scheduling! Here’s hoping that in 2019 all your proposal review panelists are sympathetic, and all your funding periods are sufficient to complete the proposed work.