This tutorial is in progress and will be updated with additional examples and graphics.
Fundamentals
The technology is less important than the planning, however…
Excel can cause issues, so use it with caution
IT vs information professionals ⇒ IT + information professionals
Different types of expertise and experience – IT can recommend software based on institutional licensing agreements, cost, and hardware requirements, but may have limited experience using the software within the context of research projects. Information professionals (i.e. librarians, archivists, research facilitators) are more likely to have a working knowledge of implementing different software or systems for data management, and of the legal, ethical, and practical restrictions surrounding information in scholarly work.
Ideally consult with both together!
Portability of data is the most important factor
Avoid proprietary file formats whenever possible – some instruments will require that you use their specific file format, but frequently you can convert the data afterward to a nonproprietary format. The ADS Guides to Good Practice include recommendations for file formats that are both accessible and suitable for preservation.
Avoid database systems and software that do not provide an easy way to export ALL of the data in a structured format. Expect to have to switch systems at some point during your project, and run tests to make sure you won't have to re-enter data, re-run experiments, or lose anything if you do switch. Test both exporting and importing before committing to a system!
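One way to run such a test is a simple round trip: export your records, re-import them, and compare. The sketch below uses Python's standard csv module with invented artifact records; the field names are illustrative only.

```python
import csv
import io

# Hypothetical records as they exist in the current system.
records = [
    {"artifact_id": "A-001", "trench": "T1", "material": "ceramic"},
    {"artifact_id": "A-002", "trench": "T1", "material": "lithic"},
]

# Export to CSV (in memory here; in practice, write to a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["artifact_id", "trench", "material"])
writer.writeheader()
writer.writerows(records)

# Re-import and check that every field survived the round trip.
buffer.seek(0)
reimported = list(csv.DictReader(buffer))
assert reimported == records, "Round trip lost or altered data!"
```

A real test should also cover awkward values: accented characters, commas inside fields, empty cells, and very long text.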
Planning
Plan for publication and reuse from the start
Funding requirements – funders frequently require data management plans that detail (or briefly describe) how you will store, process, share/disseminate, and archive the data you collect. These plans also prompt you to consider any sensitive aspects of your data and to plan how you will protect them appropriately
ADS Guides to Good Practice – a massive resource with a lot of detail, but easy to navigate to your specific use case. The pages are regularly updated to address emerging file formats, and the types of data covered are constantly expanding.
Your starting point should be determining your unit of analysis – that is, what does a record represent in your database, and how granular do your records need to be to answer your research questions?
Artifact vs. trench – store data about an artifact in an artifact record and data about a trench in a trench record; do not combine them into a single table
Tables are structured so that columns are the “fields” and each row is a single “record”
An artifact record can reference the specific trench where the artifact came from without having to reproduce all of the data in that trench's record within the Trench Table. This is achieved through unique identifiers. What an identifier is or how it is structured is less important than ensuring that identifiers are completely unique and that every new record is assigned one.
Each record in a database should have a unique identifier.
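The idea can be sketched with two tiny tables in SQLite (the table and field names here are invented for illustration): the artifact record stores only the trench's unique identifier, and a join retrieves the rest of the trench's data without duplicating it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE trench (
    trench_id TEXT PRIMARY KEY,      -- unique identifier for each trench
    excavator TEXT,
    opened    TEXT
)""")
con.execute("""CREATE TABLE artifact (
    artifact_id TEXT PRIMARY KEY,    -- unique identifier for each artifact
    material    TEXT,
    trench_id   TEXT REFERENCES trench(trench_id)  -- a link, not a copy
)""")
con.execute("INSERT INTO trench VALUES ('T1', 'A. Smith', '2023-06-01')")
con.execute("INSERT INTO artifact VALUES ('A-001', 'ceramic', 'T1')")

# The artifact row holds only the trench's identifier; a join pulls in
# the rest of the trench record on demand.
row = con.execute("""SELECT a.artifact_id, t.excavator
                     FROM artifact a JOIN trench t USING (trench_id)""").fetchone()
print(row)  # ('A-001', 'A. Smith')
```

If the excavator's name ever needs correcting, it is fixed once in the trench record rather than in every artifact row.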
Normalization and other terms…
It can be helpful to know some of the jargon surrounding data management, but understanding the concepts and why they are important is far more useful.
Normalization
Definition: Database normalization is the process of organizing the attributes of a database to reduce or eliminate data redundancy (the same data stored in different places).
Significance: Redundancy not only wastes time because you are entering the same data in multiple places, it also increases the chances of introducing errors into the dataset.

Primary key
Definition: The primary key of a relational table holds a unique value that identifies a single record in the table.
Significance: Primary keys are used to connect tables so that you do not have to enter redundant information (see above), but even more importantly, they are the unique identifiers that allow you to attach all kinds of information to a physical object. They are the bedrock of a database and its analyses.

Foreign key
Definition: A foreign key links records in one table to another by matching the primary key in that other table.
Significance: This is the field that stores the unique identifier for a record in a different table – it is how you retrieve all of the other information that would have been redundant to enter in this table.

Relationship
Definition: How different tables are linked to each other – there are multiple options, including one-to-one, one-to-many, and many-to-many.
Significance: If you are building out a database, think through what type of relationship you will need. For example, an artifact can only come from one trench, but a trench will most likely contain multiple artifacts. Similarly, a petrography record can only be related to a single sherd.

Controlled vocabulary
Definition: A standardized set of words or phrases.
Significance: Think dropdown lists – these are great ways to speed up data entry and to ensure that words are spelled the same way each time. Published controlled vocabularies or thesauri can support research across datasets and collections; they also remove a lot of the work that goes into building a vocabulary and promote collaboration.
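As a concrete illustration of controlled vocabularies (with a foreign key doing the enforcement), the sketch below stores a hypothetical vocabulary of material terms as its own SQLite table, so only listed terms can be entered; all names and terms are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Hypothetical controlled vocabulary stored as its own table.
con.execute("CREATE TABLE material_vocab (term TEXT PRIMARY KEY)")
con.executemany("INSERT INTO material_vocab VALUES (?)",
                [("ceramic",), ("lithic",), ("metal",), ("glass",)])

# The artifact field references the vocabulary, so only listed terms are accepted.
con.execute("""CREATE TABLE artifact (
    artifact_id TEXT PRIMARY KEY,
    material    TEXT REFERENCES material_vocab(term)
)""")

con.execute("INSERT INTO artifact VALUES ('A-001', 'ceramic')")  # allowed
try:
    con.execute("INSERT INTO artifact VALUES ('A-002', 'cermaic')")  # typo rejected
except sqlite3.IntegrityError:
    print("misspelled term rejected")
```

Dedicated database software (FileMaker, Access, specialist recording systems) provides the same guarantee through dropdown lists; the principle is identical.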
Fields and what goes into them
What form does the data need to be in when it is collected and stored so that you can run analyses later?
Type of data field (text, numbers, etc.)
Integer and decimal fields – consider precision and whether leading/trailing zeroes need to be stored (numeric fields will drop them)
Consider whether you should use a controlled vocabulary
If you want to be able to query or filter easily
If you want to speed up data entry
If you have multiple people entering data (or really just know that consistency isn’t easy for anyone)
If you are collecting data across multiple sessions/seasons (see above regarding consistency)
If your data can be categorized into a vocabulary
Spatial data – record the instrument/source, since it determines the accuracy of your coordinates
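The leading-zero and precision points are easy to demonstrate in Python (the identifier here is invented):

```python
from decimal import Decimal

# An identifier stored as a number silently loses its leading zeroes...
as_number = int("007")
print(as_number)  # 7 – the identifier "007" is gone
# ...so identifiers with meaningful formatting belong in text fields.

# Binary floats can also misrepresent recorded precision; a decimal type
# (or a text field with documented precision and units) preserves exact values.
float_sum = 0.1 + 0.2
exact_sum = Decimal("0.1") + Decimal("0.2")
print(float_sum)  # 0.30000000000000004
print(exact_sum)  # 0.3
```

The same happens inside spreadsheet and database numeric fields, which is why the field type has to be chosen before data entry starts, not after.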
Mock up a dataset and try to run the analyses you want to use in the future on your real data – can you? Or do you need to store the data differently?
Gotchas! (Or tips to avoid headaches)
Excel + dates
Different versions of Excel use different systems for calculating the serial number that Excel frequently converts your dates to, so it is easy to introduce errors unintentionally, and back-calculating to a formatted date may not always be possible.
To avoid this: format the column as text before entering any dates
Other issues to watch out for: any identifiers or values that might be interpreted as dates may also be converted to a serial number by Excel unless you preformat the column as text. For example, an identifier of 1-12 may be interpreted as January 12th even though there isn't a year.
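If you do end up with Excel serial numbers, the conversion can be sketched in Python. In the default 1900 date system, serial 1 is 1900-01-01; because Excel wrongly treats 1900 as a leap year, counting days from 1899-12-30 matches Excel's serials for dates after February 1900. Legacy Mac workbooks use a 1904 system offset by 1,462 days, which is exactly why the same serial can decode to two different dates.

```python
from datetime import date, timedelta

def from_excel_1900(serial: int) -> date:
    # 1900 date system: valid for serials after Excel's phantom 1900-02-29.
    return date(1899, 12, 30) + timedelta(days=serial)

def from_excel_1904(serial: int) -> date:
    # Legacy Mac 1904 date system: serial 0 = 1904-01-01.
    return date(1904, 1, 1) + timedelta(days=serial)

# The same serial number, two different dates depending on the workbook:
print(from_excel_1900(45000))  # 2023-03-15
print(from_excel_1904(45000))  # 2027-03-16
```

This is recovery of last resort; preformatting the column as text avoids the problem entirely.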
Dates in general (how granular are you actually going to be? circa? text field, periods/ranges, etc)
Be realistic from the start
Consider using a chronology or resource like PeriodO so that your period or chronological designation is tied to a specific publication/source
Do not store information through design (e.g. highlights, color-coded text, or groupings)
This type of formatting is frequently lost when you convert to different file types
It is easy to lose, or forget to document, what the formatting indicates, and it may simply not be interpretable by others
It can also be an accessibility issue for collaborators
Data backups – ideally automated, but don't forget that a data backup should not rely on a specific service or software continuing to operate (you need to be able to get the data out)
Data entry differences – let’s be honest, nobody reads the manual
Don't rely on a guide, a manual, or a single training session
Include units in the field name
Include a description of the field, with an example of how the data should be entered, next to the field
Regularly review data entered by others and hold discussions to ensure that everyone is on the same page
Measurement diagrams for the win
Quick sketches or standardized diagrams that show where a measurement was taken facilitate consistency and running analyses later
Sensitive data – flag fields that contain or may contain sensitive data from the start and plan how to handle them prior to publishing and/or depositing in a long-term preservation repository
Images, files, and attachments moved or stored in cloud-syncing applications like Dropbox, Google Photos, or Box
Databases frequently store a reference to the physical location of a file on a drive, not the actual file. If a file is moved, or if a cloud storage system changes its conventions, the database will not be able to find the file without work on your part.
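A periodic audit can catch broken references before they become a crisis. This sketch (the paths are invented stand-ins for whatever your database's attachment field holds) simply checks which stored paths still resolve to files:

```python
from pathlib import Path

# Hypothetical file paths as stored in a database's attachment field.
stored_paths = [
    "photos/A-001_front.jpg",
    "photos/A-001_back.jpg",
    "scans/T1_plan.pdf",
]

# Report every reference that no longer resolves to an actual file,
# e.g. after files were moved or a sync service reorganized folders.
missing = [p for p in stored_paths if not Path(p).is_file()]
for p in missing:
    print(f"Broken reference: {p}")
```

Running a check like this after any reorganization, and before any deposit, makes the cleanup a small chore instead of a forensic exercise.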
File naming and storage conventions
Keep an archive of raw data that is never edited – that is, make a copy of each file before doing any processing or running analyses
Make it clear what has been processed (and document how it was processed)
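A small helper like the sketch below (the function name and paths are invented) copies a raw file into an archive folder and makes the copy read-only, so the archived original stays untouched while you process working copies:

```python
import shutil
import stat
from pathlib import Path

def archive_raw(source: Path, archive_dir: Path) -> Path:
    """Copy a raw data file into the archive and make the copy read-only."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name
    shutil.copy2(source, target)  # copy2 preserves timestamps
    # Remove all write permissions so the archived copy cannot be edited.
    target.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    return target
```

Pairing this with a plain-text note recording when and how each working copy was processed covers the documentation half of the tip.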
Consider whether you will be incorporating data from other sources and whether it should be stored differently (e.g. if you are also going to deposit that data, do you have permission or the rights to deposit it?)