Statistical Software and Data Workshops, Fall 2017

New Brunswick Libraries Data Workshop Series

Fall 2017

This Fall, Ryan Womack, Data Librarian, will offer a series of workshops on statistical software, data visualization, and reproducible research as part of New Brunswick Libraries Data Management Services.   A detailed calendar and descriptions of each workshop are below.  This semester each workshop topic will be repeated twice, once at the Library of Science and Medicine on Busch Campus, and once at Alexander Library on College Ave.  These sessions will be identical except for location. Sessions will run approximately 3 hours.  Workshops in parts will divide the time in thirds.  For example, the first SPSS, Stata, and SAS workshop (running from 12-3 pm) would start with SPSS at 12 pm, Stata at 1 pm, and SAS at 2 pm.  You are free to come only to those segments that interest you.  There is no need to register, just come!

Logistics

Location: The Library of Science and Medicine (LSM on Busch) workshops will be held in the Conference Room on the 1st floor of LSM on Wednesdays from 12 to 3 pm.  The Alexander Library (College Ave) workshops will be held in room 413 of the Scholarly Communication Center (4th floor of Alexander Library) on Tuesdays from 1:10 to 4:10 pm.

For both locations, you are encouraged to bring your own laptop to work in your native environment.  Alternatively, at Alexander Library, you can use a library desktop computer instead of your own laptop.  At LSM, we will have laptops available to borrow for the session if you don’t bring your own.  Room capacity is 25 in both locations, first come, first served.

If you can’t make the workshops, or would like a preview or refresher, screencast versions of many of the presentations are already available at http://libguides.rutgers.edu/data and https://youtube.com/librarianwomack. Additional screencasts are continually being added to this series.  Note that the “special topics” [Time Series, Survival Analysis, and Big Data] are no longer offered in person, but are available via screencast.

Calendar of workshops

Tuesday (Alexander)   Workshop                                Wednesday (LSM)
1:10 pm – 4:10 pm                                             12 noon – 3 pm

September 12          Introduction to SPSS, Stata, and SAS    September 13
September 19          Introduction to R                       September 20
September 26          Data Visualization in R                 September 27
October 3             Reproducible Research                   October 18

Description of Workshops:

§ Introduction to SPSS, Stata, and SAS (September 12 or September 13) provides overviews of these three popular commercial statistical software programs, covering the basics of navigation, loading data, graphics, and elementary descriptive statistics and regression using a sample dataset.  If you are already using these packages with some degree of success, you may find these sessions too basic for you.

  • SPSS is widely used statistical software with strengths in survey analysis and other social science disciplines.  Copies of the workshop materials, a screencast, and additional SPSS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208425. SPSS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SPSS is also available in campus computer labs and via the Apps server (see below).
  • Stata is flexible and allows relatively easy access to programming features.  It is popular in economics among other areas.  Copies of the workshop materials, a screencast, and additional Stata resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208427. Stata is made available by OIRT via campus license with no additional charge to install for Rutgers users.  Find it at software.rutgers.edu.
  • SAS is a powerful and long-standing system that handles large data sets well, and is popular in the pharmaceutical industry and health sciences, among other applications. Copies of the workshop materials, a screencast, and additional SAS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208423. SAS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SAS is also available in campus computer labs, online via the SAS University Edition cloud service, and via the Apps server (see below).

Note: Accessing software via apps.rutgers.edu

SPSS, SAS, Stata, and R are available for remote access on apps.rutgers.edu.  Using apps.rutgers.edu does not require any software installation, but you must activate the service first at netid.rutgers.edu.

 

§ Introduction to R (September 19 or September 20) – This session provides a three-part orientation to the R programming environment.  R is freely available, open source statistical software that has been widely adopted in the research community.  Due to its open nature, thousands of additional packages have been created by contributors to implement the latest statistical techniques, making R a very powerful tool.  No prior knowledge is assumed. The three parts cover:

  • Statistical Techniques: getting around in R, descriptive statistics, regression, significance tests, working with packages
  • Graphics:  comparison of graphing techniques in base R, lattice, and ggplot2 packages
  • Data Manipulation:  data import and transformation, additional methods for working with large data sets, also dplyr and other packages from the tidyverse useful for manipulation.

Additional R resources, including handouts, scripts, and screencast versions of the workshops, can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org

 

§ Data Visualization in R  (September 26 or September 27) discusses principles for effective data visualization, and demonstrates techniques for implementing these using R.  Some prior familiarity with R is assumed (packages, structure, syntax), but the presentation can be followed without this background.  The three parts are:

  • Principles & Use in lattice and ggplot2: discusses classic principles of data visualization (Tufte, Cleveland) and illustrates them with the use of the lattice and ggplot2 packages.  Some of the material here overlaps with Intro to R, pt 2, but at a higher level.
  • Miscellany of Methods: illustrates a wide range of specific graphics for different contexts
  • 3-D, Interactive, and Big Data: presentation of 3-D data, interactive exploration of data, and techniques for large datasets. Relevant packages such as shiny and tessera are explored.

Additional R resources can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org


 

§ Reproducible Research (October 3 or October 18) covers the growing movement to make the products of research accessible and usable by others in order to verify, replicate, and extend research findings.  This session reviews how to plan research, how to create publications, code, and data in open, reusable formats, and how to maximize the impact of shared research findings.  Examples in LaTeX and R Markdown are discussed, along with platforms for reusability such as the Open Science Framework.

Additional resources on reproducible research and data management, including presentation slides, can be found here: http://libguides.rutgers.edu/datamanagement

 

§ Special Topics

Note that the special topics (Time Series, Survival Analysis, and Big Data) are no longer covered by in-person workshops, but are available via screencast.


PirateBox for Data Literacy


Distributing files via PirateBox

A PirateBox is a wireless router that has been reconfigured to serve as a local fileserver.  The PirateBox project develops software to do this.  PirateBox makes it easy to share files with anyone within range of the router, and also supports a local anonymous discussion board for those within range.

I did this for the TP-Link TL-MR3040, a commonly used piece of hardware for PirateBoxes.  The MR3040 is small and battery powered, so you can easily carry it in your pocket to places that have no electricity or internet.  The file system goes on a removable USB, so it is easy to set up by just copying stuff from your computer to the USB.

Configuration is not really that difficult if you follow the instructions here at the PirateBox project site.

To customize your SSID (the name of your wireless device) and Home Page you can follow these instructions.

Workshops hosted on PirateBox

I have used my PirateBox to share workshop slides, articles, and code.  Up to now this has been a small supplement to my normal workshops, but it could have a larger role.

What I would like to do is to create a self-contained training environment that would not depend on the vagaries of local configuration and connectivity issues.  Following a “train the trainer” model, one could build a PirateBox with an entire data literacy course running off of web pages on the PirateBox.  The PirateBox could include all necessary software (a complete R installation, for example) and a collection of supporting documents, datasets, code, and any other information. This material could also be mirrored to/from a regular website, but the portable and self-contained aspect of the PirateBox opens up many possibilities.

So the trainer could walk into a room anywhere in the world (for example a small Mongolian town – сум) with their PirateBox and lead a workshop based on materials that reside in their entirety on the PirateBox.  The trainer could then leave the PirateBox behind so that those in the community could continue to work with the materials and any additional modules.  They could adapt, repurpose, and create their own materials too. So the PirateBox can support ongoing learning, far beyond the limits of one-shot workshops.

These are not especially new ideas, and even as I type people are surely hacking wireless routers and other devices to perform other advanced functions.  Doubtless the technology will continue to develop.  But for now, the PirateBox software allows one to do interesting work with less than $50 in hardware and a couple of hours in setup time.  Who knows? One can dream of hordes of data literacy pirates emerging from this simple technology.

Data Science in Mongolia – Маш их сайн! (very good!)

While it is the subject for another blog post or another blog, Mongol culture has long held my fascination. Thanks to a series of fortunate events, I had the opportunity to bring some of my favorite interests (data, statistics, R, Mongolian) all together to form an unforgettable experience in May 2017.  During one week in Ulaanbaatar, I visited three of the oldest, largest, and most important Mongolian universities, as well as the nerve center of Mongolian data, the National Statistics Office.

National Statistics Office of Mongolia

The first and most intensive event was my invitation to present two days of workshops on Data Science at the National Statistics Office of Mongolia (Монгол Улсын Үндэсний статистикийн хороо). On May 8 and May 9,  I delivered all-day presentations and interactive training on Data Visualization, Big Data, Reproducible Research, and Data Literacy. The presentation slides in English and accompanying Mongolian translation are available here.

 


We covered lots of ground, and I was also able to learn about the data environment in Mongolia and some of NSO’s data dissemination efforts such as the 1212.mn data portal. The facilities at NSO were superb, and the audience of 33, consisting primarily of government data professionals from the NSO and other Mongolian agencies, was an outstanding group. It was truly a privilege to be able to work with them.  An article (in Mongolian) about the event is here.

In particular, I would like to thank Ch. Davaasuren (Research and Development Director of the Mongolian Marketing Consulting Group) for arranging the event, L. Myagmarsuren (Director of Information Technology at NSO) for hosting it, and A. Ariunzaya (Chair of NSO) for the invitation.

These three can be seen at the opening of the event [at the link below], along with me and my poor Mongolian – уучлаарай (sorry!).  I promise it will improve!

NSO Data Science opening remarks

The event was greatly enriched by sponsorship from IASSIST, the International Association for Social Science Information Services and Technology.  IASSIST is developing outreach efforts to areas around the world, and provided translation services and lunch for workshop participants.  We had two days of delicious хуушуур (huushur), сүүтэй цай (milk tea), and other Mongolian specialties at Modern Nomads.  Joining IASSIST is a great way to get in touch with a worldwide network of data professionals!

 


National University of Mongolia

On Wednesday, May 10, I spoke on Data Literacy to approximately 70 students of statistics at the National University of Mongolia (Монгол Улсын Их Сургууль). Even though the talk started at 7:40 am, students were attentive and asked probing questions. Clearly, they are the future of data science, and they were very curious about career trends and the nature of the work and skills required.  I am sure they will succeed if they remain as focused as they were that day!  Амжилт хүсьё! (Wishing you success!)

 


Thanks to D. Amarjargal for inviting me, and B. Myagmarsuren for translating!

Mongolian University of Life Sciences

On Thursday, May 11, I traveled to the southern side of Ulaanbaatar to speak at the Mongolian University of Life Sciences (Хөдөө Аж Ахуйн Их Сургууль), giving two presentations on Big Data and Data Visualization to a group of approximately 20 faculty of the School of Economics and Business.

 

 


The faculty here were very welcoming and discussed many issues in applying big data and visualization techniques to their work.  Many thanks to P. Munktuhya (Head of the Department of Economics, Statistics, and Mathematical Modeling) for arranging the event, to G. Ganzorig (Senior Lecturer in Agricultural and Applied Economics) for translation, and to B. Baasansukh (Dean of the School of Economics and Business) for the invitation.

I was also able to have a very informative and positive meeting with Ts. Sukhtulga (Chief of Administration and International Affairs) to discuss possibilities for cooperation with Rutgers University.  An article (in Mongolian) about my visit appeared here.  I really regretted not having more time to spend here!


 

 

Mongolian University of Science and Technology

My final talk, on Friday, May 12, was at the Mongolian University of Science and Technology (Шинжлэх Ухаан, Технологийн Их Сургууль), where I spoke on Big Data, Reproducible Research, and Data Visualization, hitting highlights from my earlier presentations during the week.

 


Approximately 40 faculty and students from MUST’s School of Business Administration and Humanities attended.  Once again, the audience was attentive and questioning up until the end, even though the talk was held late on Friday afternoon.  I was very impressed by the curiosity and dedication of the Mongolian academic community here, and throughout my trip.

At MUST, I would like to thank J. Oyuntungalag (Professor of Technology Management) for arranging the talk.  I also enjoyed a good meeting with U. Batbaatar and P. Jargaltuya of the Office of International Affairs and Cooperation.

On Friday, I was also able to spend some time at the Mongolian Marketing Consulting Group’s offices to learn more about how they conduct polling, market research, and other data collection, thanks to the hospitality of Ch. Davaasuren.

It was such a memorable and rewarding experience that I must once again thank those who made it possible: Ch. Davaasuren, who helped throughout the week, and especially M. Bayarmaa, who worked tirelessly to organize many aspects of the week’s events and, behind the scenes, to keep things running smoothly.

I can only hope that this is the start of a long and productive collaboration with the Mongolian data world.

Би цагийг гайхалтай сайхан өнгөрөөсөн! (I had a glorious time!)

More Mongolian Universities

Now I will also be speaking at the Mongolian University of Life Sciences (ХААИС) on Thursday, May 11, and at Mongolian University of Science and Technology (ШУТИС) on Friday, May 12.  That gives me the chance to post their logos here:

I will be posting presentation materials and some reflections later on.

Mongolia GIS

While researching my upcoming Mongolia trip, I was amazed to discover a treasure trove of Mongolian data already at Rutgers.  Christopher Free, a quantitative ecologist, studies Mongolian fisheries, and is an R and GIS expert to boot.  He has compiled a one-stop archive for Mongolian GIS data and R courses that use Mongolian fish data as examples [much better than sports statistics!].

These are exactly the kinds of global connections I am delighted to make!

Traveling with a light digital footprint

Prepping for destination UB, I have decided to travel with low-cost digital devices with a more limited feature set, rather than leaving my electronic life exposed to loss, search, or seizure.  To that end, I am taking with me a low-cost Mobal Phone [sic] and a cheap laptop, albeit reconfigured to run a premium operating system, Debian.

The laptop is suitably tagged with the things that keep me running these days, R, Rutgers, and D Music.  The keyboard has been “Mongolized” as well.

I will report back later on whether this travel gear can get the job done.


Data Science in Mongolia in May


I am delighted to announce that I have been invited to present on several data science topics at the Mongolian National Statistics Office (Монгол Улсын Үндэсний Статистикийн Хороо) and the National University of Mongolia (Монгол Улсын Их Сургууль) from May 8th to 10th.

I will post more detailed commentary after the visit, but I am excited by the opportunity to meet colleagues and learn about the Mongolian environment, and explore the potential for collaboration on data issues.


Survival Analysis in R video available

As promised earlier, the “special topic” material on Survival Analysis is now available on YouTube in lieu of in-person sessions.  Take a look at the Survival Analysis in R Playlist.

Survival analysis deals with data that may have truncated observations, called censored data.  A typical example is studying the time until failure of a part in engineering, or failure of a part of the human body in medicine (colloquially known as “disease”).  We usually have accurate data on when the problem occurs for failures that happen before the end of the study.  But some subjects survive without failure until the end of the study, and we are uncertain just how long they would have lasted before failing.  The methods of survival analysis account for this partial uncertainty in the data.  R can deal with almost all necessary aspects of survival analysis, but requires some mixing and matching of packages to get the best results, as shown in the videos.
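To make the idea of censoring concrete, here is a minimal sketch of the Kaplan-Meier estimator, one standard survival analysis method. It is written in plain Python as a language-neutral illustration, not in R and not taken from the videos. The function name and the six-part dataset are made up for this example. It assumes that when a failure and a censoring share the same time, the censored subject is still counted as at risk for that failure.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time a failure was seen.

    durations: how long each subject was followed (to failure or end of study)
    observed:  True if the failure was seen, False if the subject was censored
    """
    failures = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # everyone who leaves the risk set at time t

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Probability of surviving past t, given survival up to t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Failed and censored subjects both leave the risk set after time t.
        at_risk -= exits[t]
    return curve


# Hypothetical data: six parts followed for up to 10 months.
# Parts still working when the study ends are censored (observed = False).
durations = [2, 3, 3, 5, 10, 10]
observed = [True, True, False, True, False, False]

for t, s in kaplan_meier(durations, observed):
    print(f"month {t}: estimated survival {s:.3f}")
```

The censored parts never trigger a drop in the curve, but they still count in the at-risk denominator for as long as they were observed. That is how the method uses partial information instead of discarding it.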

As always, my YouTube videos are fueled by music behind the scenes.  Giving a throwback shoutout to Public Image Limited, some holiday Twice, plus the usual Mongolian suspects.

Statistical Software and Data Workshops, Spring 2017

Rutgers University Libraries Data Services Workshop Series (New Brunswick)

Spring 2017

In Spring 2017, Ryan Womack, Data Librarian, will repeat the series of workshops on statistical software, data visualization, and reproducible research as part of the Rutgers University Libraries Data Services.   A detailed calendar and descriptions of each workshop are below.  This semester each workshop topic will be repeated twice, once at the Library of Science and Medicine on Busch Campus, and once at Alexander Library on College Ave.  These sessions will be identical except for location. Sessions will run approximately 3 hours.  Workshops in parts will divide the time in thirds.  For example, the first SPSS, Stata, and SAS workshop (running from 12-3 pm) would start with SPSS at 12 pm, Stata at 1 pm, and SAS at 2 pm.  You are free to come only to those segments that interest you.  There is no need to register, just come!

Logistics

Location: The Library of Science and Medicine (LSM on Busch) workshops will be held in the Conference Room on the 1st floor of LSM on Mondays from 12 to 3 pm.  The Alexander Library (College Ave) workshops will be held in room 413 of the Scholarly Communication Center (4th floor of Alexander Library) on Tuesdays from 1:10 to 4:10 pm.

For both locations, you are encouraged to bring your own laptop to work in your native environment.  Alternatively, at Alexander Library, you can use a library desktop computer instead of your own laptop.  At LSM, we will have laptops available to borrow for the session if you don’t bring your own.  Room capacity is 25 in both locations, first come, first served.

If you can’t make the workshops, or would like a preview or refresher, screencast versions of many of the presentations are already available at http://libguides.rutgers.edu/data and https://youtube.com/librarianwomack. Additional screencasts are continually being added to this series.  Note that the “special topics” [Time Series, Survival Analysis, and Big Data] are no longer offered in person, but are available via screencast.

Calendar of workshops

Monday (LSM)     Workshop                                Tuesday (Alexander)
12 noon – 3 pm                                           1:10 pm – 4:10 pm

January 23       Introduction to SPSS, Stata, and SAS    January 24
January 30       Introduction to R                       January 31
February 6       Data Visualization in R                 February 7
February 13      Reproducible Research                   February 14

Description of Workshops:

§ Introduction to SPSS, Stata, and SAS (January 23 or January 24) provides overviews of these three popular commercial statistical software programs, covering the basics of navigation, loading data, graphics, and elementary descriptive statistics and regression using a sample dataset.  If you are already using these packages with some degree of success, you may find these sessions too basic for you.

  • SPSS is widely used statistical software with strengths in survey analysis and other social science disciplines.  Copies of the workshop materials, a screencast, and additional SPSS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208425. SPSS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SPSS is also available in campus computer labs and via the Apps server (see below).
  • Stata is flexible and allows relatively easy access to programming features.  It is popular in economics among other areas.  Copies of the workshop materials, a screencast, and additional Stata resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208427. Stata is made available by OIRT via campus license with no additional charge to install for Rutgers users.  Find it at software.rutgers.edu.
  • SAS is a powerful and long-standing system that handles large data sets well, and is popular in the pharmaceutical industry, among other applications. Copies of the workshop materials, a screencast, and additional SAS resources can be found here: http://libguides.rutgers.edu/content.php?pid=115296&sid=1208423. SAS is made available by OIRT at a discounted academic rate, currently $100/academic year.  Find it at software.rutgers.edu.  SAS is also available in campus computer labs, online via the SAS University Edition cloud service, and via the Apps server (see below).

Note: Accessing software via apps.rutgers.edu

SPSS, SAS, Stata, and R are available for remote access on apps.rutgers.edu.  Using apps.rutgers.edu does not require any software installation, but you must activate the service first at netid.rutgers.edu.

 

§ Introduction to R (January 30 or January 31) – This session provides a three-part orientation to the R programming environment.  R is freely available, open source statistical software that has been widely adopted in the research community.  Due to its open nature, thousands of additional packages have been created by contributors to implement the latest statistical techniques, making R a very powerful tool.  No prior knowledge is assumed. The three parts cover:

  • Statistical Techniques: getting around in R, descriptive statistics, regression, significance tests, working with packages
  • Graphics:  comparison of graphing techniques in base R, lattice, and ggplot2 packages
  • Data Manipulation:  data import and transformation, additional methods for working with large data sets, also plyr and other packages useful for manipulation.

Additional R resources, including handouts, scripts, and screencast versions of the workshops, can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org

 

§ Data Visualization in R  (February 6 or February 7) discusses principles for effective data visualization, and demonstrates techniques for implementing these using R.  Some prior familiarity with R is assumed (packages, structure, syntax), but the presentation can be followed without this background.  The three parts are:

  • Principles & Use in lattice and ggplot2: discusses classic principles of data visualization (Tufte, Cleveland) and illustrates them with the use of the lattice and ggplot2 packages.  Some of the material here overlaps with Intro to R, pt 2, but at a higher level.
  • Miscellany of Methods: illustrates a wide range of specific graphics for different contexts
  • 3-D, Interactive, and Big Data: presentation of 3-D data, interactive exploration of data, and techniques for large datasets. Relevant packages such as shiny and tessera are explored.

Additional R resources can be found here: http://libguides.rutgers.edu/data_R

R is freely downloadable from http://r-project.org

 

§ Reproducible Research (February 13 or February 14) covers the growing movement to make the products of research accessible and usable by others in order to verify, replicate, and extend research findings.  This session reviews how to plan research, how to create publications, code, and data in open, reusable formats, and how to maximize the impact of shared research findings.  Examples in LaTeX and R Markdown are discussed, along with platforms for reusability such as the Open Science Framework.

Additional resources on reproducible research and data management, including presentation slides, can be found here: http://libguides.rutgers.edu/datamanagement

 

§ Special Topics

Note that the special topics (Time Series, Survival Analysis, and Big Data) are no longer covered by in-person workshops, but are available via screencast.

What is a Data Librarian?

A while back I was asked this question, and it provoked an almost subterranean thought process in me.  While I can’t promise a deep answer to this question, at least an answer has emerged.

For me, librarianship itself is all about guiding people to knowledge.  I love the sherpa meme, and would be honored to call myself a data sherpa.

[Image]

[The image above appears to be no longer actively used by Open Knowledge Foundation, possibly due to other companies using this tagline as well.  I am just referring to it in an academ-y fair use-y way here.]

So, a data librarian is not necessarily someone who collects data and puts it on a shelf or on a server (although that certainly can be done by data librarians).  For me, the central role of data librarians (as compared to data archivists, data curators, data analysts, data scientists, and other professions with the data-prefix) is that of data navigator or data guide. We help people find and use the data they need, using the librarian side of our skills to understand our user communities and craft solutions to their particular needs.  That requires knowing the data landscape, having the hard skills to crunch the data itself, and having the soft skills to adapt our services to our environment.

As has always been the case in librarianship, the balance among the different things we offer changes over time. Data availability is certainly increasing in the general sense, so the “finding data” part of the equation has changed to one that requires more understanding of what kinds of deep analysis are made possible by combining disparate datasets and tools in possibly unexpected ways.  Finding the population of a country over time has never been easier, but trying to understand how economic and environmental factors may have contributed to that population change is now a question that permits more sophisticated answers if we bring more and better data to bear.

The tools at our disposal have changed as well.  Rather than being dependent on a preinstalled application (for example on a data CD), users expect to extract data into their own preferred analysis platforms and then be able to serve their results back to end users via interactive web interfaces of their own creation.  It is amazing that this is possible and is getting easier all the time.  But it also means that we cannot stand pat and continue to offer the same data resources of previous decades as if they are everyone’s sine qua non or ne plus ultra or [insert your alma mater’s Latin cliché here].

What else distinguishes a data librarian?  Many data scientists may know the data landscape and know about data analysis and be applying those skills in service of a particular community.  I would also argue that it is the values of librarianship that are important. Specifically, the commitment to open, shared resources, and to educating the community on their use, is critical.  This is why I consider many of the things I do that others may not see as “librarian-like” (such as teaching data literacy, or sharing tutorial videos about statistical software) to be some of my most valuable and core work.  What I value is this openness and sharing that offers the promise to every person that they can continue to learn and develop their skills, and themselves.  I hope that my work as a data librarian helps enable that, and I am glad and privileged to be part of both a local work environment and an international community that supports those goals.