A while back I was asked this question, and it provoked an almost subterranean thought process in me. While I can’t promise a deep answer to this question, at least an answer has emerged.
For me, librarianship itself is all about guiding people to knowledge. I love the sherpa meme, and would be honored to call myself a data sherpa.
[The image above appears to be no longer actively used by Open Knowledge Foundation, possibly due to other companies using this tagline as well. I am just referring to it in an academ-y fair use-y way here.]
So, a data librarian is not necessarily someone who collects data and puts it on a shelf or on a server (although that certainly can be done by data librarians). For me, the central role of data librarians (as compared to data archivists, data curators, data analysts, data scientists, and other professions with the data-prefix) is that of data navigator or data guide. We help people find and use the data they need, using the librarian side of our skills to understand our user communities and craft solutions to their particular needs. That requires knowing the data landscape, having the hard skills to crunch the data itself, and having the soft skills to adapt our services to our environment.
As has always been the case in librarianship, the balance among the different things we offer changes over time. Data availability is certainly increasing in the general sense, so the “finding data” part of the equation has changed to one that requires more understanding of what kinds of deep analysis are made possible by combining disparate datasets and tools in possibly unexpected ways. Finding the population of a country over time has never been easier, but trying to understand how economic and environmental factors may have contributed to that population change is now a question that permits more sophisticated answers if we bring more and better data to bear.
The tools at our disposal have changed as well. Rather than being dependent on a preinstalled application (for example on a data CD), users expect to extract data into their own preferred analysis platforms and then be able to serve their results back to end users via interactive web interfaces of their own creation. It is amazing that this is possible and is getting easier all the time. But it also means that we cannot stand pat and continue to offer the same data resources of previous decades as if they are everyone’s sine qua non or ne plus ultra or [insert your alma mater‘s latin cliché here].
What else distinguishes a data librarian? Many data scientists may know the data landscape and know about data analysis and be applying those skills in service of a particular community. I would also argue that it is the values of librarianship that are important. Specifically, the commitments to open, shared resources and to educating the community on their use are critical. This is why I consider many of the things I do that others may not see as “librarian-like” — such as teaching data literacy, or sharing tutorial videos about statistical software — to be some of my most valuable and core work. What I value is this openness and sharing that offers the promise to every person that they can continue to learn and develop their skills, and themselves. I hope that my work as a data librarian helps enable that, and I am glad and privileged to be part of both a local work environment and an international community that supports those goals.
Another new release from ICPSR that is too interesting not to mention. The U.S. County-Level Natality and Mortality Data, 1915-2007 has nearly a century of detailed data on births and infant deaths for those looking for long-term patterns.
Rutgers University Libraries Data Services Workshop Series (New Brunswick)
This Fall, Ryan Womack, Data Librarian, will offer a series of workshops on statistical software, data visualization, and data management, as part of the Rutgers University Libraries Data Services. A detailed calendar and descriptions of each workshop are below. This semester each workshop topic will be repeated twice, once at the Library of Science and Medicine on Busch Campus, and once at Alexander Library on College Ave. These sessions will be identical except for location. Sessions will run approximately 3 hours. Workshops with multiple parts will divide the time into thirds. For example, the first SPSS, Stata, and SAS workshop (running from 12-3 pm) would start with SPSS at 12 pm, Stata at 1 pm, and SAS at 2 pm. You are free to come only to those segments that interest you. There is no need to register, just come!
Location: The Library of Science and Medicine (LSM on Busch) workshops will be held in the Conference Room on the 1st floor of LSM on Wednesdays from 12 to 3 pm. The Alexander Library (College Ave) workshops will be held in room 413 of the Scholarly Communication Center (4th floor of Alexander Library) on Thursdays from 1:10 to 4:10 pm.
For both locations, you are encouraged to bring your own laptop to work in your native environment. Alternatively, at Alexander Library, you can use a library desktop computer instead of your own laptop. At LSM, we will have laptops available to borrow for the session if you don’t bring your own. Room capacity is 25 in both locations, first come, first served.
If you can’t make the workshops, or would like a preview or refresher, screencast versions of many of the presentations are already available at http://libguides.rutgers.edu/data and https://youtube.com/librarianwomack. Additional screencasts are continually being added to this series. Note that the “special topics” [Time Series, Survival Analysis, and Big Data] are no longer offered in person, but are available via screencast.
Calendar of workshops
| LSM Conference Room (Wednesdays, 12 noon – 3 pm) | Workshop | Alexander Room 413 (Thursdays, 1:10 – 4:10 pm) |
|---|---|---|
| September 21 | Introduction to SPSS, Stata, and SAS | September 22 |
| September 28 | Introduction to R | September 29 |
| October 5 | Data Visualization in R | October 6 |
| October 19 | Introduction to Data Management | October 13 |
Description of Workshops:
§ Introduction to SPSS, Stata, and SAS (September 21 or September 22) provides overviews of these three popular commercial statistical software programs, covering the basics of navigation, loading data, graphics, and elementary descriptive statistics and regression using a sample dataset. If you are already using these packages with some degree of success, you may find these sessions too basic for you.
- SPSS is widely used statistical software with strengths in survey analysis and other social science disciplines. Copies of the workshop materials, a screencast, and additional SPSS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208425. SPSS is made available by OIRT at a discounted academic rate, currently $100/academic year. Find it at software.rutgers.edu. SPSS is also available in campus computer labs and via the Apps server (see below).
- Stata is flexible and allows relatively easy access to programming features. It is popular in economics among other areas. Copies of the workshop materials, a screencast, and additional Stata resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208427. Stata is made available by OIRT via campus license with no additional charge to install for Rutgers users. Find it at software.rutgers.edu.
- SAS is a powerful and long-standing system that handles large data sets well, and is popular in the pharmaceutical industry, among other applications. Copies of the workshop materials, a screencast, and additional SAS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208423. SAS is made available by OIRT at a discounted academic rate, currently $100/academic year. Find it at software.rutgers.edu. SAS is also available in campus computer labs, online via the SAS University Edition cloud service, and via the Apps server (see below).
Note: Accessing software via apps.rutgers.edu
§ Introduction to R (September 28 or September 29) – This session provides a three-part orientation to the R programming environment. R is freely available, open source statistical software that has been widely adopted in the research community. Due to its open nature, thousands of additional packages have been created by contributors to implement the latest statistical techniques, making R a very powerful tool. No prior knowledge is assumed. The three parts cover:
- Statistical Techniques: getting around in R, descriptive statistics, regression, significance tests, working with packages
- Graphics: comparison of graphing techniques in base R, lattice, and ggplot2 packages
- Data Manipulation: data import and transformation, additional methods for working with large data sets, and plyr and other packages useful for data manipulation
Additional R resources, including handouts, scripts, and screencast versions of the workshops, can be found here: http://libguides.rutgers.edu/data_R
R is freely downloadable from http://r-project.org
§ Data Visualization in R (October 5 or October 6) discusses principles for effective data visualization, and demonstrates techniques for implementing these using R. Some prior familiarity with R is assumed (packages, structure, syntax), but the presentation can be followed without this background. The three parts are:
- Principles & Use in lattice and ggplot2: discusses classic principles of data visualization (Tufte, Cleveland) and illustrates them with the use of the lattice and ggplot2 packages. Some of the material here overlaps with Intro to R, pt 2, but at a higher level.
- Miscellany of Methods: illustrates a wide range of specific graphics for different contexts
- 3-D, Interactive, and Big Data: presentation of 3-D data, interactive data exploration, and techniques for large datasets. Relevant packages such as shiny and tessera are explored.
Additional R resources can be found here: http://libguides.rutgers.edu/data_R
R is freely downloadable from http://r-project.org
§ Introduction to Data Management (October 13 or October 19) covers:
- Best Practices for Managing Your Data – methods to organize, describe, backup, and archive your research data in order to ensure its future usability and accessibility. Developing good habits for handling your data from the start will save time and frustration later, and increase the ultimate impact of your research.
- Data Management Plans, Data Sharing and Archiving – targeted to researchers who need to write data management plans (DMPs) and share their data as part of their grant application, research and publication process. Reviews DMP guidelines, checklists, and general advice, along with options for sharing and permanently archiving research data.
- Reproducible Research – covers the growing movement to make the products of research accessible and usable by others in order to verify, replicate, and extend research findings. Reviews how to plan research; create publications, code, and data in open, reusable formats; and maximize the impact of shared research findings.
Additional data management resources, including presentation slides, can be found here: http://libguides.rutgers.edu/datamanagement
§ Special Topics
Note that the following special topics are no longer covered by in-person workshops, but are available via screencast.
- Time Series in R: review of commands and techniques for basic time series analysis in R. Screencast at https://www.youtube.com/playlist?list=PLCj1LhGni3hOA2q0sfDNKBH9WIlLxXkbn and scripts at http://libguides.rutgers.edu/data_R
- Survival Analysis in R: review of commands and techniques for basic survival analysis in R. Scripts at http://libguides.rutgers.edu/data_R. Screencast at https://www.youtube.com/playlist?list=PLCj1LhGni3hOON9isnuVYIL8dNwkvwqr9.
- Big Data in Brief: an introduction to some of the techniques and software environments used to work with big data, with pointers to resources for further learning at http://libguides.rutgers.edu/bigdata. Screencast at https://www.youtube.com/playlist?list=PLCj1LhGni3hMNhIdrvz1F5-JHIWi1qdX1
I am not regularly posting about interesting datasets as much as I used to. But this High School Longitudinal Study [2009-2013] from ICPSR is fascinating, dealing as it does with the following questions:
- How do parents, teachers, counselors, and students construct choice sets for students, and how are these related to students’ characteristics, attitudes, and behavior?
- How do students select among secondary school courses, postsecondary institutions, and possible careers?
- How do parents and students plan financing for postsecondary experiences? What sources inform these plans?
- What factors influence students’ decisions about taking STEM courses and following through with STEM college majors? Why are some students underrepresented in STEM courses and college majors?
- How do students’ plans vary over the course of high school, and how do decisions in 9th grade impact students’ high school trajectories? When students are followed up in the spring of 11th grade and later, their planning and decision-making in 9th grade may be linked to subsequent behavior.
Subscribe to announcements from ICPSR to learn about more datasets and resources like this.
This post is also slightly off topic – not a data announcement, workshop, video, etc. But it does contain one specific instance of something that everyone should be thinking about – data backup. Everyone knows the rule of three – keep at least three copies of your precious files and make sure at least one of them is offsite in case of disaster.
I needed to develop a new routine for my home computer backup after deciding to seize control of my system back from SpiderOak. I had been using that for a while, but then upgraded to SpiderOak One, and my incremental backups seemed to take forever, with the SpiderOak process constantly using lots of CPU and seemingly not accomplishing much. [This is all on Linux as usual]. I realized that I understood very little of what the client was actually doing, and since the client was unresponsive, could no longer rely on it to actually be able to back up and retrieve my files. I decided to go completely manual so that I would know exactly what my backup status was and what was happening.
Part 0 of my personal rule of three is that all of my family’s home machines get an rsync run periodically (i.e., whenever I remember) to back up their contents to the main home server.
Part 1 is a local backup to an internal hard drive on the server. I leave this drive unmounted most of the time, then mount it and rsync the main drive to it. The total file size is about 600 GB right now, partly because I do not really de-dupe anything or get rid of old stuff. Also, I don’t have a high volume of video files to worry about at this point.
Part 2 is a similar rsync backup to a portable hard drive [encrypted]. I have two drives that I swap and carry back and forth to work every couple of weeks or so. I have decided that I don’t really like frequent automated backup, because I’d be more worried about spreading a problem like accidental deletion of files, or a virus, before the problem is discovered. I can live with the loss of a couple of weeks of my “machine learning” if disaster truly strikes.
But what about Part 3? I wanted to go really offsite, and not pay a great deal for the privilege. I have grown more comfortable with AWS as I learn more about it, and so after some trial and error, devised this scheme…
On the server, I tar and compress my files, then encrypt them, taking checksums along the way:
tar -cjvf mystuff.tar.bz2 /home/mystuff
sha256sum mystuff.tar.bz2 > mystuffsha
gpg -c --sign mystuff.tar.bz2
sha256sum mystuff.tar.bz2.gpg > mystuffgpgsha
This takes some time to run, and generates some big files, but it is doable. I actually do this in three parts because I have three main directories on my system.
Then we need to get it up to the cloud. Here is where file transfer really slows down. I guess it is around 6 days total wait time for all of my files to transfer, although I do it in pieces. The files need to be small enough that a breakdown in the process will not lose too much work, but large enough so that you don’t have thousands of files to keep track of. I do this to split the data into 2GB chunks:
split -b 2147483648 mystuff.tar.bz2.gpg
Now we have to upload it. I want to get the data into AWS Glacier since it is cheap, and this is a backup just for emergencies. Now Glacier does have a direct command line interface, but it requires the use of long IDs and is just fussy in terms of accepting slow uploads over a home cable modem. Fortunately, getting data into S3 is easier and more reliable. And S3 allows you to set a lifecycle policy that will automatically transfer your data from S3 to Glacier after a set amount of time. So the extra cost you incur for, say, letting your data sit in S3 for a day is really pretty small. I guess you could do this with a wildcard or a loop, but I just have a long shell file with each of my S3 commands on a separate line. This requires you to install the AWS CLI on your system.
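As an illustration, such an S3-to-Glacier transition can be configured from the CLI. The bucket name and rule ID below are placeholders, and the one-day delay is just an example:

```shell
# Hypothetical lifecycle rule: transition everything in the bucket to
# Glacier one day after upload. Bucket name is a placeholder.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "to-glacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 1, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
    --bucket your_unique_bucket_name_here \
    --lifecycle-configuration file://lifecycle.json
```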
aws s3 cp xaa s3://your_unique_bucket_name_here
aws s3 cp xab s3://your_unique_bucket_name_here
I just run that with a simple shell command that dumps any messages to a file.
sh -xv my_special_shell_script.sh > special_output
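For what it's worth, the long shell file could instead be replaced by a simple glob loop over the split chunks (the bucket name is again a placeholder):

```shell
# Hypothetical loop over the split chunks (named xaa, xab, ... by split):
# upload each one, stopping at the first failure so a flaky connection
# doesn't silently skip chunks.
for chunk in x??; do
    aws s3 cp "$chunk" s3://your_unique_bucket_name_here/ || break
done
```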
And, voila…days later your files will be in the cloud. You can choose a storage region that will put the files on the other side of the planet from you if you think that will be more reliable.
To bring the files back down, you must request through the AWS interface that the files be restored from Glacier to S3, then download from S3, then use “cat” to fuse them together, and in general reverse all the other steps to decrypt, untar, checksum, and so on. It worked for me on small-scale tests, but I guess I should try it on my entire archive at least once to make sure this really works.
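In outline, the restore path reverses each step above. This is a sketch, not a tested recipe, and the bucket and file names are placeholders matching the archive steps earlier:

```shell
# Hypothetical restore: download the chunks, reassemble with cat, verify
# the checksum, decrypt, verify again, and unpack.
aws s3 cp s3://your_unique_bucket_name_here/ . --recursive
cat x?? > mystuff.tar.bz2.gpg
sha256sum -c mystuffgpgsha        # compare against the stored checksum
gpg -o mystuff.tar.bz2 -d mystuff.tar.bz2.gpg
sha256sum -c mystuffsha
tar -xjvf mystuff.tar.bz2
```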
At least with this method, I know exactly what is in the cloud, how and when it got there, and how to get it back. And it looks like it will only run me about $6 a month.
Well, it has been a long time in coming, but I have finally finished converting my Data Visualization workshop series to a screencast video version. See this YouTube playlist for the complete series, and the materials at GitHub. This is the long version of the in-person 3-hour workshop. The video series goes into even more detail, starting from a history of major developments in visualization, to various implementations of specific graphs, interactive visualizations, web viz, big data, and more.
I also have some ideas for some more up-to-date add-ins that I will probably record as lagniappe videos over the next few weeks. Those didn’t quite fit into the existing sequence of videos.
Well, this is off topic for the theme of the blog, but I felt the urge to record an expurgated version of my recent installation of Linux on a new Dell XPS 13 (9350), for the potential edification of the populace. I will try to make it brief, by my standards 🙂 [but I guess I failed..] This is not in any way an objective post, but I am just blowing off steam and letting my opinions fly. Please skip over if you are looking for actual educational material…
As tweeted earlier, a Pi day discount tempted me into the purchase of a 13″ Dell XPS (model 9350). I had been following Dell primarily due to their Linux support via the Developer Edition series, and had been tempted on many occasions in the past.
I should also mention that dating back to the fin-de-siècle, I have been a Linux user. Although I have occasionally strayed away, for most of the time, Linux computers have formed the core of my computing infrastructure. In Linux, I can do what I want to do, rather than simply obey the instructions of other OSes. In recent years, I converted from the Fedora sect to become a Debian adherent, and I have been very satisfied with that choice.
Still, the computer I bought was NOT a Developer’s Edition, but a new Windows 10 machine. In the past, I have usually put Linux onto either very standard or slightly older hardware, and didn’t have FEAR that it would not work. My working laptop recently has been a leftover 2010 Macbook Pro running Debian only, and it has no real issues, but it is running hot and noisy. Since this XPS laptop was brand new with the latest technology, I have to confess that for a moment or two I even considered leaving Windows on the machine and using it in dual boot mode. But two minutes in Windows 10 erased any of those doubts. Seriously, why would anyone voluntarily remain in that depressing environment if they had the possibility of escape?
So, I committed to putting Debian on the machine as its sole operating system, and began Googling to get ready. I learned that the Dell-rebranded Broadcom wireless card was not supported, except in bleeding-edge kernels, and was not very good anyway. I also learned that Intel wireless cards worked easily with Linux. Thanks to Dell for putting a wonderful service manual online and for not minding users operating on their own hardware (unlike the fruit-themed gadget company). I ordered an Intel wireless card. Due to a bit of carelessness, I picked a 9260 instead of the 9265 model. The 9265 is supported natively in the kernel, whereas the 9260 requires a download of Intel proprietary drivers [more on this later]. But, in spite of my nerves, popping open and disassembling my brand new laptop was a piece of cake, and it went back together as good as new. I am liking Dell from the hardware perspective.
Then I prepared a USB boot stick to install Debian 8 (Jessie). I had to fiddle about a bit with the UEFI/BIOS settings to get the Dell to boot from the stick, but eventually made it happen. Then I went through a couple of abortive installation attempts because of the aforementioned wireless drivers, which needed to be loaded from a second USB stick. I am sorry I can’t document it completely here; only more fiddling, until I found the right combination that would recognize one USB as the boot media and one USB as the supplemental driver files, allowed me to proceed. During that phase, I began to worry that I had gone too far by buying a slim fancy device without an ethernet port, but I survived. I would still lean towards getting a computer with a real ethernet port in the future, just for safety. It turns out the Linux drivers for USB-C to Ethernet adapters are reported to be fussy too.
On to the next complication… Did I mention that I not only like Linux and Debian, but prefer XFCE as my desktop? Because I am old school and don’t care for eye candy at all. It was the horrible broken experience of GNOME 3 that drove me away from Fedora. I just want a desktop that stays out of my way and does the work (is that some antiquated colonialist mindset? perhaps, but I think it is still OK to exploit a computer, right?). Anyway, XFCE has been my go-to for the last 4-5 years. I respect Linux Mint/Cinnamon too for their attempt to correct the awful GNOME decisions that were forced on unsuspecting users. But XFCE has done the job for me. So, I was willing to work overtime to get XFCE as my desktop.
Now, I have done Debian/XFCE installs on a number of desktops, and my Macbook too. But somehow, the Debian 8 XFCE install (at the time of writing) had one major issue. I finished the installation, but could only log in to the XFCE desktop as root, not as a user. There was some kind of weird permissions issue, or some problem with the install scripts. I am experienced in Linux, but not expert, so extensive Googling on this topic failed to resolve the issue. What did work was to do a standard Debian install with Cinnamon as the default desktop, and only after that was working did I install XFCE. That worked like a charm. Hopefully someone with more knowledge of what could cause this will fix it for future generations.
I also had to try a few different configurations before getting my preferred setup of encrypted hard drive and swap space, with an unencrypted /boot partition. I wish there were a better-documented path for this too. Somehow encryption is still considered a slightly exotic option, when it shouldn’t be.
Ok, so now I am excited because I have a working Debian/XFCE install on my new laptop. My hand-installed wireless is working, and everything is looking up. BUT, I have NO SOUND, and it appears that the lack of sound is also causing any standard (e.g., YouTube) videos to play too fast. I take a deep breath. More googling reveals that this is an issue with the Dell 9350 model’s audio, and that future kernels will handle it. But my Debian 8 kernel does not handle it, and I cannot use my expensive laptop to watch my favorite YouTube videos!!!! I use all of my experience in “taming my dog of desire” to reconcile myself to the situation. I can use my laptop as a wonderful distraction-free zone to code and write wonderful things. What do I need sound for? After all, Plato and Muhammad both condemned music. Yes, what do I need sound for? Ok, I will live without sound on my laptop 😦
But, after getting everything else configured to my liking, I was ready to keep experimenting. Is that not the whole point of Linux? To experiment, to control your own working environment? Not to blindly obey when a popup window says, “You must update now”, or “You must click ‘accept’ to continue”, or “Operation not permitted”. Right, this is Linux, so let’s go!
In practice, what that meant was that I attempted a full upgrade from Debian 8 (jessie) to Debian 9 (stretch), even though 9 is not yet stable. What was my motivation? Well, to confess, at least 90% of the motivation was to get that audio working. Because if this is the kind of world where we can’t listen to music on our laptops, is that the kind of world that we want to live in? We want our music, and we want control of our computers too!
Now, the Debian instructions are very clear and very good, and after editing my apt sources, I was able to update and upgrade, and within a very, very short period of time, had my entire OS running a very current set of applications with crystal clear audio and video. I now have pretty close to my full suite of applications (R, RStudio, Mathematica, Claws-mail, LaTeX, LyX, Gummi, all of the old favorites…). Now I am content! I then had to go and customize my desktop settings and browser to a very dark theme and plaster my laptop with some stickers to make it seem more like my own. Too much, probably, but it is a small thing that makes me happy 🙂 My family says I’m crazy, but I get that all the time anyway…
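For the record, the jessie-to-stretch upgrade itself boils down to retargeting the apt sources and running a full upgrade. Roughly, as root and after a full backup (a sketch, not a substitute for the official release notes):

```shell
# Sketch of the jessie-to-stretch upgrade: point the apt sources at the
# new release, then update the package lists and upgrade in two passes.
sed -i 's/jessie/stretch/g' /etc/apt/sources.list
apt-get update
apt-get upgrade
apt-get dist-upgrade
```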
The HUE data covers seven cities (Baltimore, Boston, Brooklyn, Chicago, Cincinnati, Manhattan, and Philadelphia) for the years 1830 to 1930, provides detailed disease reports by location, and includes information on in-street sewer and water sanitation systems. Pretty good bet those were lead pipes.
This semester, the Libraries will offer a workshop covering:
- Best Practices for Managing Your Data
- Data Management Plans, Data Sharing and Archiving
- Reproducible Research
The workshop will repeat in two locations on:
- Monday, March 7, 12-1:30 pm in the Library of Science and Medicine Conference Room (1st Floor)
- Tuesday, March 8, 1:10 to 2:40 pm in the Alexander Library Teleconference Lecture Hall (4th floor)
The two sessions are identical – no need to come to both.
The first part of the session will focus on Best Practices for Managing Your Data.
- We discuss methods to organize, describe, backup, and archive your research data in order to ensure its future usability and accessibility. Developing good habits for handling your data from the start will save time and frustration later, and increase the ultimate impact of your research.
The second part covers Data Management Plans, Data Sharing and Archiving.
- This portion is targeted to researchers who need to write data management plans (DMPs) and share their data as part of their grant application, research and publication process. It reviews DMP guidelines, checklists, and general advice, along with options for sharing and permanently archiving research data.
The third part discusses Reproducible Research.
- We cover the growing movement to make the products of research accessible and usable by others in order to verify, replicate, and extend research findings. We review how to plan research; create publications, code, and data in open, reusable formats; and maximize the impact of shared research findings.
No need to register, just come for what you are interested in.
Additional data management resources, including presentation slides, can be found here: http://libguides.rutgers.edu/datamanagement