Data Science in Mongolia in May

I am delighted to announce that I have been invited to present on several data science topics at the Mongolian National Statistics Office (Монгол Улсын Үндэсний Статистикийн Хороо) and the National University of Mongolia (Монгол Улсын Их Сургууль) from May 8th to 10th.

I will post more detailed commentary after the visit, but I am excited by the opportunity to meet colleagues, learn about the Mongolian environment, and explore the potential for collaboration on data issues.

Survival Analysis in R video available

As promised earlier, the “special topic” material on Survival Analysis is now available on YouTube in lieu of in-person sessions.  Take a look at the Survival Analysis in R Playlist.

Survival analysis deals with data in which some observations are incomplete, known as censored data.  A typical example is studying the time until failure of a part in engineering, or failure of a part of the human body in medicine (colloquially known as “disease”).  We usually have accurate data on when failures occur up to the end of the study.  But some subjects survive without failure until the end of the study, and we do not know how much longer they would have lasted.  The methods of survival analysis account for this partial uncertainty in the data.  R can deal with almost all necessary aspects of survival analysis, but requires some mixing and matching of packages to get the best results, as shown in the videos.
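To make the idea of accounting for censoring concrete, here is the standard Kaplan-Meier estimate of the survival function, with a small worked example.  This is a textbook formula offered as an illustration, and the five-part dataset is invented for the arithmetic.  Here t_i are the observed failure times, d_i is the number of failures at t_i, and n_i is the number of subjects still at risk just before t_i.

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

% Hypothetical data: 5 parts, study ends at t = 10,
% failures at t = 2, 5, 8; two parts still working at t = 10 (censored).
\hat{S}(8) = \left(1 - \tfrac{1}{5}\right)\left(1 - \tfrac{1}{4}\right)\left(1 - \tfrac{1}{3}\right)
           = \tfrac{4}{5} \cdot \tfrac{3}{4} \cdot \tfrac{2}{3}
           = \tfrac{2}{5}
```

The two censored parts count in every risk set n_i, so they contribute the information that they lasted at least 10 time units, and the estimate stays at 0.4 through the end of the study rather than being forced to zero.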

As always, my YouTube videos are fueled by music behind the scenes.  Giving a throwback shoutout to Public Image Limited, some holiday Twice, plus the usual Mongolian suspects.

Statistical Software and Data Workshops, Spring 2017

Rutgers University Libraries Data Services Workshop Series (New Brunswick)

Spring 2017

In Spring 2017, Ryan Womack, Data Librarian, will repeat the series of workshops on statistical software, data visualization, and reproducible research as part of the Rutgers University Libraries Data Services.   A detailed calendar and descriptions of each workshop are below.  This semester each workshop topic will be repeated twice, once at the Library of Science and Medicine on Busch Campus, and once at Alexander Library on College Ave.  These sessions will be identical except for location. Sessions will run approximately 3 hours.  Workshops in parts will divide the time in thirds.  For example, the first SPSS, Stata, and SAS workshop (running from 12-3 pm) would start with SPSS at 12 pm, Stata at 1 pm, and SAS at 2 pm.  You are free to come only to those segments that interest you.  There is no need to register, just come!

Logistics

Location: The Library of Science and Medicine (LSM on Busch) workshops will be held in the Conference Room on the 1st floor of LSM on Mondays from 12 to 3 pm.  The Alexander Library (College Ave) workshops will be held in room 413 of the Scholarly Communication Center (4th floor of Alexander Library) on Tuesdays from 1:10 to 4:10 pm.

For both locations, you are encouraged to bring your own laptop to work in your native environment.  Alternatively, at Alexander Library, you can use a library desktop computer instead of your own laptop.  At LSM, we will have laptops available to borrow for the session if you don’t bring your own.  Room capacity is 25 in both locations, first come, first served.

If you can’t make the workshops, or would like a preview or refresher, screencast versions of many of the presentations are already available at http://libguides.rutgers.edu/data and https://youtube.com/librarianwomack. Additional screencasts are continually being added to this series.  Note that the “special topics” [Time Series, Survival Analysis, and Big Data] are no longer offered in person, but are available via screencast.

Calendar of workshops

Monday (LSM), 12 noon – 3 pm | Workshop | Tuesday (Alexander), 1:10 pm – 4:10 pm
January 23 | Introduction to SPSS, Stata, and SAS | January 24
January 30 | Introduction to R | January 31
February 6 | Data Visualization in R | February 7
February 13 | Reproducible Research | February 14

Description of Workshops:

§ Introduction to SPSS, Stata, and SAS (January 23 or January 24) provides overviews of these three popular commercial statistical software programs, covering the basics of navigation, loading data, graphics, and elementary descriptive statistics and regression using a sample dataset.  If you are already using these packages with some degree of success, you may find these sessions too basic for you.

  • SPSS is widely used statistical software with strengths in survey analysis and other social science disciplines.  Copies of the workshop materials, a screencast, and additional SPSS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208425. SPSS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SPSS is also available in campus computer labs and via the Apps server (see below).
  • Stata is flexible and allows relatively easy access to programming features.  It is popular in economics among other areas.  Copies of the workshop materials, a screencast, and additional Stata resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208427. Stata is made available by OIRT via campus license with no additional charge to install for Rutgers users.  Find it at software.rutgers.edu.
  • SAS is a powerful and long-standing system that handles large data sets well, and is popular in the pharmaceutical industry, among other applications. Copies of the workshop materials, a screencast, and additional SAS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208423. SAS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SAS is also available in campus computer labs, online via the SAS University Edition cloud service, and via the Apps server (see below).

Note: Accessing software via apps.rutgers.edu

SPSS, SAS, Stata, and R are available for remote access on apps.rutgers.edu.  apps.rutgers.edu does not require any software installation, but you must activate the service first at netid.rutgers.edu.

 

§ Introduction to R (January 30 or January 31) – This session provides a three-part orientation to the R programming environment.  R is freely available, open source statistical software that has been widely adopted in the research community.  Due to its open nature, thousands of additional packages have been created by contributors to implement the latest statistical techniques, making R a very powerful tool.  No prior knowledge is assumed. The three parts cover:

  • Statistical Techniques: getting around in R, descriptive statistics, regression, significance tests, working with packages
  • Graphics:  comparison of graphing techniques in base R, lattice, and ggplot2 packages
  • Data Manipulation:  data import and transformation, additional methods for working with large data sets, also plyr and other packages useful for manipulation.

Additional R resources, including handouts, scripts, and screencast versions of the workshops, can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org

 

§ Data Visualization in R  (February 6 or February 7) discusses principles for effective data visualization, and demonstrates techniques for implementing these using R.  Some prior familiarity with R is assumed (packages, structure, syntax), but the presentation can be followed without this background.  The three parts are:

  • Principles & Use in lattice and ggplot2: discusses classic principles of data visualization (Tufte, Cleveland) and illustrates them with the use of the lattice and ggplot2 packages.  Some of the material here overlaps with Intro to R, pt 2, but at a higher level.
  • Miscellany of Methods: illustrates a wide range of specific graphics for different contexts
  • 3-D, Interactive, and Big Data: presentation of 3-D data, interactive exploration of data, and techniques for large datasets. Relevant packages such as shiny and tessera are explored.

Additional R resources can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org

 

§ Reproducible Research (February 13 or February 14) covers the growing movement to make the products of research accessible and usable by others in order to verify, replicate, and extend research findings.  The session reviews how to plan research, how to create publications, code, and data in open, reusable formats, and how to maximize the impact of shared research findings.  Examples in LaTeX and R Markdown are discussed, along with platforms for reusability such as the Open Science Framework.

Additional resources on reproducible research and data management, including presentation slides, can be found here: http://libguides.rutgers.edu/datamanagement

 

§ Special Topics

Note that the following special topics are no longer covered by in-person workshops, but are available via screencast.

What is a Data Librarian?

A while back I was asked this question, and it provoked an almost subterranean thought process in me.  While I can’t promise a deep answer to this question, at least an answer has emerged.

For me, librarianship itself is all about guiding people to knowledge.  I love the sherpa meme, and would be honored to call myself a data sherpa.

8413860823_c33de33839_o

[The image above appears to be no longer actively used by Open Knowledge Foundation, possibly due to other companies using this tagline as well.  I am just referring to it in an academ-y fair use-y way here.]

So, a data librarian is not necessarily someone who collects data and puts it on a shelf or on a server (although that certainly can be done by data librarians).  For me, the central role of data librarians (as compared to data archivists, data curators, data analysts, data scientists, and other professions with the data-prefix) is that of data navigator or data guide. We help people find and use the data they need, using the librarian side of our skills to understand our user communities and craft solutions to their particular needs.  That requires knowing the data landscape, having the hard skills to crunch the data itself, and having the soft skills to adapt our services to our environment.

As has always been the case in librarianship, the balance among the different things we offer changes over time. Data availability is certainly increasing in the general sense, so the “finding data” part of the equation has changed to one that requires more understanding of what kinds of deep analysis are made possible by combining disparate datasets and tools in possibly unexpected ways.  Finding the population of a country over time has never been easier, but trying to understand how economic and environmental factors may have contributed to that population change is now a question that permits more sophisticated answers if we bring more and better data to bear.

The tools at our disposal have changed as well.  Rather than being dependent on a preinstalled application (for example on a data CD), users expect to extract data into their own preferred analysis platforms and then be able to serve their results back to end users via interactive web interfaces of their own creation.  It is amazing that this is possible and is getting easier all the time.  But it also means that we cannot stand pat and continue to offer the same data resources of previous decades as if they are everyone’s sine qua non or ne plus ultra or [insert your alma mater’s Latin cliché here].

What else distinguishes a data librarian?  Many data scientists may know the data landscape and know about data analysis and be applying those skills in service of a particular community.  I would also argue that it is the values of librarianship that are important. Specifically, the commitment to open, shared resources, and to educating the community on their use are critical.  This is why I consider many of the things I do that others may not see as “librarian-like” — such as teaching data literacy, or sharing tutorial videos about statistical software — to be some of my most valuable and core work.  What I value is this openness and sharing that offers the promise to every person that they can continue to learn and develop their skills, and themselves.  I hope that my work as a data librarian helps enable that, and I am glad and privileged to be part of both a local work environment and an international community that supports those goals.

 

 

 

U.S. County-Level Natality and Mortality Data, 1915-2007

Another new release from ICPSR that is too interesting not to mention.  The U.S. County-Level Natality and Mortality Data, 1915-2007 has nearly a century of detailed data on births and infant deaths for those looking for long-term patterns.

 

National Crime Victimization Survey, Concatenated File, 1992-2015

The National Crime Victimization Survey is published every year, but the Concatenated File 1992-2015 allows easy multi-year comparisons of data.  From ICPSR, try it out!

Statistical Software and Data Workshops, Fall 2016

Rutgers University Libraries Data Services Workshop Series (New Brunswick)

Fall 2016

This Fall, Ryan Womack, Data Librarian, will offer a series of workshops on statistical software, data visualization, and data management, as part of the Rutgers University Libraries Data Services.   A detailed calendar and descriptions of each workshop are below.  This semester each workshop topic will be repeated twice, once at the Library of Science and Medicine on Busch Campus, and once at Alexander Library on College Ave.  These sessions will be identical except for location. Sessions will run approximately 3 hours.  Workshops in parts will divide the time in thirds.  For example, the first SPSS, Stata, and SAS workshop (running from 12-3 pm) would start with SPSS at 12 pm, Stata at 1 pm, and SAS at 2 pm.  You are free to come only to those segments that interest you.  There is no need to register, just come!

Logistics

Location: The Library of Science and Medicine (LSM on Busch) workshops will be held in the Conference Room on the 1st floor of LSM on Wednesdays from 12 to 3 pm.  The Alexander Library (College Ave) workshops will be held in room 413 of the Scholarly Communication Center (4th floor of Alexander Library) on Thursdays from 1:10 to 4:10 pm.

For both locations, you are encouraged to bring your own laptop to work in your native environment.  Alternatively, at Alexander Library, you can use a library desktop computer instead of your own laptop.  At LSM, we will have laptops available to borrow for the session if you don’t bring your own.  Room capacity is 25 in both locations, first come, first served.

If you can’t make the workshops, or would like a preview or refresher, screencast versions of many of the presentations are already available at http://libguides.rutgers.edu/data and https://youtube.com/librarianwomack. Additional screencasts are continually being added to this series.  Note that the “special topics” [Time Series, Survival Analysis, and Big Data] are no longer offered in person, but are available via screencast.

Calendar of workshops

Wednesday (LSM), 12 noon – 3 pm | Workshop | Thursday (Alexander), 1:10 pm – 4:10 pm
September 21 | Introduction to SPSS, Stata, and SAS | September 22
September 28 | Introduction to R | September 29
October 5 | Data Visualization in R | October 6
October 19 | Introduction to Data Management | October 13

 

Description of Workshops:

§ Introduction to SPSS, Stata, and SAS (September 21 or September 22) provides overviews of these three popular commercial statistical software programs, covering the basics of navigation, loading data, graphics, and elementary descriptive statistics and regression using a sample dataset.  If you are already using these packages with some degree of success, you may find these sessions too basic for you.

  • SPSS is widely used statistical software with strengths in survey analysis and other social science disciplines.  Copies of the workshop materials, a screencast, and additional SPSS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208425. SPSS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SPSS is also available in campus computer labs and via the Apps server (see below).
  • Stata is flexible and allows relatively easy access to programming features.  It is popular in economics among other areas.  Copies of the workshop materials, a screencast, and additional Stata resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208427. Stata is made available by OIRT via campus license with no additional charge to install for Rutgers users.  Find it at software.rutgers.edu.
  • SAS is a powerful and long-standing system that handles large data sets well, and is popular in the pharmaceutical industry, among other applications. Copies of the workshop materials, a screencast, and additional SAS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208423. SAS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SAS is also available in campus computer labs, online via the SAS University Edition cloud service, and via the Apps server (see below).

Note: Accessing software via apps.rutgers.edu

SPSS, SAS, Stata, and R are available for remote access on apps.rutgers.edu.  apps.rutgers.edu does not require any software installation, but you must activate the service first at netid.rutgers.edu.

 

§ Introduction to R (September 28 or September 29) – This session provides a three-part orientation to the R programming environment.  R is freely available, open source statistical software that has been widely adopted in the research community.  Due to its open nature, thousands of additional packages have been created by contributors to implement the latest statistical techniques, making R a very powerful tool.  No prior knowledge is assumed. The three parts cover:

  • Statistical Techniques: getting around in R, descriptive statistics, regression, significance tests, working with packages
  • Graphics:  comparison of graphing techniques in base R, lattice, and ggplot2 packages
  • Data Manipulation:  data import and transformation, additional methods for working with large data sets, also plyr and other packages useful for manipulation.

Additional R resources, including handouts, scripts, and screencast versions of the workshops, can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org

 

§ Data Visualization in R  (October 5 or October 6) discusses principles for effective data visualization, and demonstrates techniques for implementing these using R.  Some prior familiarity with R is assumed (packages, structure, syntax), but the presentation can be followed without this background.  The three parts are:

  • Principles & Use in lattice and ggplot2: discusses classic principles of data visualization (Tufte, Cleveland) and illustrates them with the use of the lattice and ggplot2 packages.  Some of the material here overlaps with Intro to R, pt 2, but at a higher level.
  • Miscellany of Methods: illustrates a wide range of specific graphics for different contexts
  • 3-D, Interactive, and Big Data: presentation of 3-D data, interactive exploration of data, and techniques for large datasets. Relevant packages such as shiny and tessera are explored.

Additional R resources can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org

 

§ Introduction to Data Management (October 13 or October 19) covers

  • Best Practices for Managing Your Data – methods to organize, describe, backup, and archive your research data in order to ensure its future usability and accessibility.  Developing good habits for handling your data from the start will save time and frustration later, and increase the ultimate impact of your research.
  • Data Management Plans, Data Sharing and Archiving – targeted to researchers who need to write data management plans (DMPs) and share their data as part of their grant application, research and publication process.  Reviews DMP guidelines, checklist, and general advice, along with options for sharing and permanently archiving research data.
  • Reproducible Research – covers the growing movement to make the products of research accessible and usable by others in order to verify, replicate, and extend research findings.  Reviews how to plan research, to create publications, code, and data in open, reusable formats, and maximize the impact of shared research findings.

Additional data management resources, including presentation slides, can be found here: http://libguides.rutgers.edu/datamanagement

 

 

§ Special Topics

Note that the following special topics are no longer covered by in-person workshops, but are available via screencast.

 

High School Longitudinal Study

I am not regularly posting about interesting datasets as much as I used to.  But this High School Longitudinal Study [2009-2013] from ICPSR is fascinating, dealing as it does with the following questions:

  • How do parents, teachers, counselors, and students construct choice sets for students, and how are these related to students’ characteristics, attitudes, and behavior?
  • How do students select among secondary school courses, postsecondary institutions, and possible careers?
  • How do parents and students plan financing for postsecondary experiences? What sources inform these plans?
  • What factors influence students’ decisions about taking STEM courses and following through with STEM college majors? Why are some students underrepresented in STEM courses and college majors?
  • How do students’ plans vary over the course of high school, and how do decisions in 9th grade impact students’ high school trajectories? When students are followed up in the spring of 11th grade and later, their planning and decision-making in 9th grade may be linked to subsequent behavior.

Subscribe to announcements from ICPSR to learn about more datasets and resources like this.

A manual backup routine using AWS

This post is also slightly off topic – not a data announcement, workshop, video, etc.  But it does contain one specific instance of something that everyone should be thinking about – data backup.  Everyone knows the rule of three – keep at least three backups of your precious files and make sure at least one of them is offsite in case of disaster.

I needed to develop a new routine for my home computer backup after deciding to seize control of my system back from SpiderOak.  I had been using that for a while, but then upgraded to SpiderOak One, and my incremental backups seemed to take forever, with the SpiderOak process constantly using lots of CPU and seemingly not accomplishing much.  [This is all on Linux as usual].  I realized that I understood very little of what the client was actually doing, and since the client was unresponsive, could no longer rely on it to actually be able to back up and retrieve my files.  I decided to go completely manual so that I would know exactly what my backup status was and what was happening.

Part 0 of my personal rule of three is that all of my family’s home machines get an rsync run periodically (i.e., whenever I remember) to back up their contents to the main home server.

Part 1 is a local backup to an internal hard drive on the server.  I leave this drive unmounted most of the time, then mount it and rsync the main drive to it.  The total file size is about 600 GB right now, partly because I do not really de-dupe anything or get rid of old stuff.  Also, I don’t have a high volume of video files to worry about at this point.

Part 2 is a similar rsync backup to a portable hard drive [encrypted].  I have two drives that I swap and carry back and forth to work every couple of weeks or so.  I have decided that I don’t really like frequent automated backup, because I’d be more worried about spreading a problem like accidental deletion of files, or a virus, before the problem is discovered.  I can live with the loss of a couple of weeks of my “machine learning” if disaster truly strikes.

But what about Part 3?  I wanted to go really offsite, and not pay a great deal for the privilege.  I have grown more comfortable with AWS as I learn more about it, and so after some trial and error, devised this scheme…

On the server, I tar and zip my files, then encrypt them, taking checksums along the way:

tar -cvf mystuff.tar /home/mystuff

bzip2 mystuff.tar

sha256sum mystuff.tar.bz2 > mystuffsha

gpg -c --sign mystuff.tar.bz2

sha256sum mystuff.tar.bz2.gpg > mystuffgpgpsha

This takes some time to run, and generates some big files, but it is doable.  I actually do this in three parts because I have three main directories on my system.

Then we need to get it up to the cloud.  Here is where file transfer really slows down.  I guess it is around 6 days total wait time for all of my files to transfer, although I do it in pieces.  The files need to be small enough that a breakdown in the process will not lose too much work, but large enough so that you don’t have thousands of files to keep track of.  I do this to split the data into 2GB chunks:

split -b 2147483648 mystuff.tar.bz2.gpg
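The split-and-rejoin round trip can be rehearsed on a small throwaway file before committing days of upload time to it.  This is a minimal sketch: sample.gpg, the .part. prefix, and the 2000-byte chunk size are stand-ins chosen for the demo, not the real archive or settings.

```shell
#!/bin/sh
# Rehearse split + cat on a small throwaway file (all names are hypothetical).
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# 5000 random bytes stand in for mystuff.tar.bz2.gpg
head -c 5000 /dev/urandom > sample.gpg

# 2000-byte chunks with an explicit prefix: sample.gpg.part.aa, .ab, .ac
split -b 2000 sample.gpg sample.gpg.part.
chunk_count=$(ls sample.gpg.part.* | wc -l)

# The glob expands in sorted order, which matches split's aa, ab, ac suffixes
cat sample.gpg.part.* > rejoined.gpg

# cmp exits 0 only if the two files are byte-for-byte identical
if cmp -s sample.gpg rejoined.gpg; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "chunks: $chunk_count, round trip: $roundtrip"
```

Giving split a prefix is optional.  Without one, the chunks are named xaa, xab, and so on, as in the upload commands in this post.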

Now we have to upload it.  I want to get the data into AWS Glacier since it is cheap, and this is a backup just for emergencies.  Now Glacier does have a direct command line interface, but it requires the use of long IDs and is just fussy in terms of accepting slow uploads over a home cable modem.  Fortunately, getting data into S3 is easier and more reliable.  S3 also lets you set a lifecycle policy that automatically transfers your data from S3 to Glacier after a set amount of time.  So the extra cost you incur for, say, letting your data sit in S3 for a day is really pretty small.  I guess you could do this with regular expressions, but I just have a long shell file with each of my S3 commands on a separate line.  This requires you to install the AWS CLI on your system.

aws s3 cp xaa s3://your_unique_bucket_name_here

aws s3 cp xab s3://your_unique_bucket_name_here
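A file of near-identical upload lines like these can also be generated with a loop instead of typed by hand.  The sketch below is one way to do that, under assumptions: the empty files xaa, xab, and xac are stand-ins for real chunks, and the bucket name is the same placeholder used above.  It only writes the script and does not contact AWS.

```shell
#!/bin/sh
# Generate one "aws s3 cp" line per chunk (stand-in files, placeholder bucket).
set -eu
BUCKET="your_unique_bucket_name_here"
workdir=$(mktemp -d)
cd "$workdir"

# Empty stand-ins for the chunks that split produces
touch xaa xab xac

# Start with an empty script, then append one upload command per chunk
: > my_special_shell_script.sh
for chunk in x??; do
    printf 'aws s3 cp %s s3://%s\n' "$chunk" "$BUCKET" >> my_special_shell_script.sh
done

line_count=$(wc -l < my_special_shell_script.sh)
cat my_special_shell_script.sh
```

Keeping the commands in a plain file, rather than running the loop directly, leaves a record of exactly what was sent and makes it easy to rerun only the lines that failed.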

I just run that with a simple shell command that dumps any messages to a file.

sh -xv my_special_shell_script.sh > special_output 2>&1

And, voilà…days later your files will be in the cloud.  You can choose an AWS region that will put the files on the other side of the planet from you if you think that will be more reliable.
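The automatic S3-to-Glacier transfer described earlier is configured as a rule on the bucket.  Here is a minimal sketch, assuming the AWS CLI's s3api put-bucket-lifecycle-configuration command.  The rule ID and the one-day delay are illustrative choices, not the settings actually in use, and the apply command is only printed because it needs credentials and a real bucket.

```shell
#!/bin/sh
# Write a lifecycle rule that moves objects to Glacier (illustrative values).
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# One rule, applied to every object (empty prefix), transition after 1 day
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "backup-to-glacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 1, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF

# Not executed here: applying the rule requires the AWS CLI and credentials
echo "aws s3api put-bucket-lifecycle-configuration --bucket your_unique_bucket_name_here --lifecycle-configuration file://lifecycle.json"
```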

To bring the files back down, you must request through the AWS interface for the files to be brought back from Glacier to S3, then download from S3, then use “cat” to fuse them together, and in general reverse all the other steps to decrypt, untar, checksum and such.  It worked for me on small scale tests, but I guess I should try it on my entire archive at least once to make sure this really works.
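The first restore steps, fusing the chunks and checking them against the checksum recorded before upload, can be sketched as follows.  This is a rehearsal under assumptions: a short text file stands in for the encrypted archive, and the decrypt and untar steps appear only as comments because they need the real passphrase and archive.

```shell
#!/bin/sh
# Rehearse the "cat, then verify" part of a restore with stand-in data.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Backup side (stand-in): an "encrypted archive", its checksum, its chunks
printf 'pretend this is encrypted archive data\n' > mystuff.tar.bz2.gpg
sha256sum mystuff.tar.bz2.gpg > mystuffgpgpsha
split -b 16 mystuff.tar.bz2.gpg      # produces xaa, xab, xac
rm mystuff.tar.bz2.gpg               # simulate losing the local copy

# Restore side, step 1: fuse the downloaded chunks in suffix order
cat x?? > mystuff.tar.bz2.gpg

# Step 2: compare against the checksum taken before upload
if sha256sum -c mystuffgpgpsha; then
    verify_status=ok
else
    verify_status=failed
fi
echo "verification: $verify_status"

# Remaining steps with a real archive (not run here):
#   gpg --output mystuff.tar.bz2 --decrypt mystuff.tar.bz2.gpg
#   sha256sum -c mystuffsha
#   tar -xjvf mystuff.tar.bz2
```

Verifying before decrypting means a corrupted or missing chunk shows up as a checksum failure, instead of as a confusing gpg or tar error later on.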

At least with this method, I know exactly what is in the cloud, how and when it got there, and how to get it back.  And it looks like it will only run me about $6 a month.

Data Visualization and R

Well, it has been a long time in coming, but I have finally finished converting my Data Visualization workshop series to a screencast video version.  See this YouTube playlist for the complete series, and the materials at GitHub.  This is the long version of the in-person 3-hour workshop.  The video series goes into even more detail, starting from a history of major developments in visualization, to various implementations of specific graphs, interactive visualizations, web viz, big data, and more.

I also have some ideas for some more up-to-date add-ins that I will probably record as lagniappe videos over the next few weeks.  Those didn’t quite fit into the existing sequence of videos.

The energy to complete these videos came from several musical sources, of which I would credit Harmogu and Linton Kwesi Johnson as leading lights.