Clean datasets have similar properties and look the same, while "dirty" datasets are messy in their own ways. Knowing what clean data looks like and how to clean data is an important skill for helping researchers make their data FAIR (findable, accessible, interoperable, and reusable).
In this webinar, you will learn to identify the components of a clean and tidy dataset and describe the steps needed to process a "dirty" dataset. With these components identified, you will be able to tidy your own data and provide guidance to researchers.
You’ll see, in action, common data issues solved by carrying out data transformation and pivoting operations. You’ll also learn the steps needed to break down observational units into separate tables (“normalize” data) so they can be efficiently stored in databases.
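As a rough illustration of the pivoting and normalizing described above (a minimal sketch, not the presenters' material), the Python below reshapes a small "wide" table into a "long" one and then splits it into two tables. The column names, values, and the helper functions `pivot_longer` and `normalize` are hypothetical and exist only for this example.

```python
# Hypothetical "wide" data: one row per patient, one column per visit.
wide_rows = [
    {"patient_id": 1, "sex": "F", "visit_1": 120, "visit_2": 118},
    {"patient_id": 2, "sex": "M", "visit_1": 135, "visit_2": 130},
]


def pivot_longer(rows, id_columns, names_to, values_to):
    """Turn every non-id column into its own (name, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            # Carry the identifying columns onto each new row.
            long_row = {key: row[key] for key in id_columns}
            long_row[names_to] = column
            long_row[values_to] = value
            long_rows.append(long_row)
    return long_rows


def normalize(long_rows, key, entity_columns):
    """Split rows into an entity table (one row per key) and a measurement table."""
    entities = {}
    measurements = []
    for row in long_rows:
        # Attributes of the entity are stored once per key value.
        entities[row[key]] = {key: row[key], **{c: row[c] for c in entity_columns}}
        # Measurements keep the key so they can be joined back later.
        measurements.append({k: v for k, v in row.items() if k not in entity_columns})
    return list(entities.values()), measurements


long_rows = pivot_longer(wide_rows, ["patient_id", "sex"], "visit", "systolic_bp")
patients, readings = normalize(long_rows, "patient_id", ["sex"])
```

Here `patients` holds one row per patient, and `readings` holds one row per patient visit, linked back to `patients` through `patient_id`.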
This webinar is a companion to Clean & Tidy Data: Getting Started with Spreadsheet Data. Each webinar stands alone, and the two complement each other. Getting Started will show you best practices for beginning to work with medical data.
Special Note: This webinar is approved for the “under construction” Advanced Level of the Data Services Specialization. A Basic Level Data Services Specialization Certification is currently available.
At the end of the webinar, participants will be able to:
- Identify the components of a clean and tidy dataset
- Apply knowledge of the components of a clean and tidy dataset to cleaning data
- Identify the steps of normalizing data
Medical librarians and other health information professionals who provide or plan to provide data services. Familiarity with browsing and editing spreadsheets is helpful.
Anne M. Brown is an Assistant Professor in Data Services, University Libraries at Virginia Tech and affiliate faculty member in the Department of Biochemistry and Academy of Integrated Science. She is the author or co-author of a number of publications and presentations on data-related and data literacy topics.
Daniel Chen is a graduate student in Genetics, Bioinformatics, and Computational Biology at Virginia Tech. His research is focused on data science education and pedagogy in the medical and biomedical sciences. He is the author of Pandas for Everyone: Python Data Analysis and a number of other data science learning materials.
Note: This registration is for the Livestream only and does not offer MLA contact hours. If you are a LILRC member health sciences/hospital librarian, please email Sally Stieglitz at email@example.com to arrange viewing with a unique access code for MLA contact hours. MLA contact hours are not applicable to the MLA Consumer Health Information Specialization.
This program is not being recorded.
For questions, please email Eliscia Cirrone at firstname.lastname@example.org.