
By Cally Guerin

In recent conversations with PhD candidates, I have been reminded of how difficult it can be to manage all the data generated by empirical research. It’s great when dozens of people are willing to be interviewed for your project; when you receive a 90% response rate to the survey; when the chance timing of your fieldwork generates much more material than expected; or when serendipity in the laboratory leads to a vast increase in usable results. On the one hand, it’s a gift to have so much material to work with; on the other, it’s easy to feel swamped by all that data and wonder how on earth you will find your way around it, let alone analyse and write about it. The storage and management of data is one of the key aspects of any research project, and finding ways to do this effectively sets researchers up to write about it later on.

The Australian Code for the Responsible Conduct of Research (The Code) has lots to say about the management of research data. Section 2 of The Code specifies that research data and primary materials must be retained for at least 5 years after any publication resulting from that data, but this may be much longer if the data applies to clinical trials or has heritage value. All data must be stored in a secure location. If the research generates very large files, or is particularly sensitive in terms of its confidentiality, intellectual property or potential commercialisation, it may be necessary to make special arrangements. However, for many doctoral writers, ‘secure storage’ just means a password-protected file on their university computer that is backed up on the university’s system. If the research involves other kinds of artefacts or material objects, a locked cabinet in a locked university office is often enough, although more secure storage may sometimes be required. The data also needs to be in a ‘safe’ place, where it is not at risk of damage from flooding, for example (when you have time, let me tell you about what happens when the basement holding all the central computing facilities meets a burst water pipe).

Reliable record-keeping goes hand in hand with storage concerns. Keeping careful records might seem boringly pedantic at the time, but it will be invaluable later on, when the data needs to be found again quickly. It will also ensure that the researcher can describe how the data was collected, organised and stored.

In the process of saving data and putting it into various folders, the first level of analysis is already occurring. The choices at this stage about what categories to employ and how to assign items to the folders will later inform how that data and the relationships between various items are perceived; the names of those folders and subfolders may even become the headings and subheadings of chapters and sections. Of course, it’s possible to change the categories as the researcher’s familiarity with the material develops.

One of the important messages here seems to be “use long and informative names for files” – a year or two later it can be hard to remember what that cryptic notation means… But it’s not only important for the current project: the origins and labelling of data can also be crucial if other researchers later use that data for further research.
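For those who keep their data on a computer, one way to make “long and informative” names consistent is to generate them from a few fixed pieces of information. The short Python sketch below is only an illustration: the function names, the choice of elements (collection date, project, description, version) and the example values are all assumptions made for this post, not a convention prescribed by The Code or by any particular university.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lower-case the text and replace each run of other characters with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_data_filename(collected_on: date, project: str, description: str,
                        version: int, extension: str) -> str:
    """Build a descriptive file name (hypothetical convention, for illustration).

    The ISO date comes first so that files sort chronologically, followed by
    the project, a description of the item, and a zero-padded version number.
    """
    parts = [
        collected_on.isoformat(),   # e.g. 2014-03-05
        slugify(project),
        slugify(description),
        f"v{version:02d}",          # e.g. v02
    ]
    return "_".join(parts) + "." + extension.lstrip(".").lower()


# Example values are invented for illustration only.
print(build_data_filename(date(2014, 3, 5), "Supervision Study",
                          "Interview 07 transcript", 2, ".docx"))
```

A name built this way still makes sense a year or two later, and to other researchers: it says when the item was collected, which project it belongs to, what it is, and which version it is.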

Lots of universities have helpful templates to create an ordered, sensible approach to the task of managing data. You can find examples on the Australian National Data Service website. Just as ethics applications force researchers to think through what they want to do and why, these templates ensure careful consideration of what information is likely to be gathered and how it can be organised in order to be retrieved when the time comes to analyse and write about it. These templates usually include sections covering things like the forms of data to be collected; considerations about file naming conventions and version control; ownership of data; durability of storage forms; access rights; and retention and sharing of data. A systematic process at the early stages will make the writing much more straightforward when the time comes.

What have you learnt about good data management? Do you have some useful advice about ways to avoid the problems that others have encountered? How has this impacted on your writing about that data?