Category Archives: Thoughts

PirateBox for Data Literacy


Distributing files via PirateBox

A PirateBox is a wireless router that has been reconfigured to serve as a local fileserver.  The PirateBox project develops the software to do this.  A PirateBox makes it easy to share files with anyone within range of the router, and it also hosts a local anonymous discussion board.

I set one up on the TP-Link TL-MR3040, a commonly used piece of hardware for PirateBoxes.  The MR3040 is small and battery powered, so you can easily carry it in your pocket to places that have no electricity or internet.  The file system lives on a removable USB drive, so it is easy to set up by just copying content from your computer to the drive.

Configuration is not really that difficult if you follow the instructions here at the PirateBox project site.

To customize your SSID (the name of your wireless network) and home page, you can follow these instructions.

Workshops hosted on PirateBox

I have used my PirateBox to share workshop slides, articles, and code.  Up to now this has been a small supplement to my normal workshops, but it could have a larger role.

What I would like to do is to create a self-contained training environment that would not depend on the vagaries of local configuration and connectivity issues.  Following a “train the trainer” model, one could build a PirateBox with an entire data literacy course running off of web pages on the PirateBox.  The PirateBox could include all necessary software (a complete R installation, for example) and a collection of supporting documents, datasets, code, and any other information. This material could also be mirrored to/from a regular website, but the portable and self-contained aspect of the PirateBox opens up many possibilities.

So a trainer could walk into a room anywhere in the world (for example a small Mongolian district town, or sum) with their PirateBox and lead a workshop based on materials that reside in their entirety on the PirateBox.  They could then leave the PirateBox behind so that those in the community could continue to work with the materials and any additional modules.  The community could adapt, repurpose, and create their own materials too.  So the PirateBox can support ongoing learning, far beyond the limits of one-shot workshops.

These are not especially new ideas, and even as I type people are surely hacking wireless routers and other devices to perform other advanced functions.  Doubtless the technology will continue to develop.  But for now, the PirateBox software allows one to do interesting work with less than $50 in hardware and a couple of hours in setup time.  Who knows? One can dream of hordes of data literacy pirates emerging from this simple technology.


Traveling with a light digital footprint

Prepping for destination UB, I have decided to travel with low-cost digital devices with a more limited feature set, rather than leaving my electronic life exposed to loss, search, or seizure.  To that end, I am taking with me a low-cost Mobal Phone [sic] and a cheap laptop, albeit reconfigured to run a premium operating system, Debian.

The laptop is suitably tagged with the things that keep me running these days: R, Rutgers, and D Music.  The keyboard has been “Mongolized” as well.

I will report back later on whether this travel gear can get the job done.


What is a Data Librarian?

A while back I was asked this question, and it provoked an almost subterranean thought process in me.  While I can’t promise a deep answer to this question, at least an answer has emerged.

For me, librarianship itself is all about guiding people to knowledge.  I love the sherpa meme, and would be honored to call myself a data sherpa.

[Image: Open Knowledge Foundation graphic with tagline]

[The image above appears to be no longer actively used by Open Knowledge Foundation, possibly due to other companies using this tagline as well.  I am just referring to it in an academ-y fair use-y way here.]

So, a data librarian is not necessarily someone who collects data and puts it on a shelf or on a server (although that certainly can be done by data librarians).  For me, the central role of data librarians (as compared to data archivists, data curators, data analysts, data scientists, and other professions with the data- prefix) is that of data navigator or data guide. We help people find and use the data they need, using the librarian side of our skills to understand our user communities and craft solutions to their particular needs.  That requires knowing the data landscape, having the hard skills to crunch the data itself, and having the soft skills to adapt our services to our environment.

As has always been the case in librarianship, the balance among the different things we offer changes over time. Data availability is certainly increasing in the general sense, so the “finding data” part of the equation has changed to one that requires more understanding of what kinds of deep analysis are made possible by combining disparate datasets and tools in possibly unexpected ways.  Finding the population of a country over time has never been easier, but trying to understand how economic and environmental factors may have contributed to that population change is now a question that permits more sophisticated answers if we bring more and better data to bear.

The tools at our disposal have changed as well.  Rather than being dependent on a preinstalled application (for example on a data CD), users expect to extract data into their own preferred analysis platforms and then be able to serve their results back to end users via interactive web interfaces of their own creation.  It is amazing that this is possible and is getting easier all the time.  But it also means that we cannot stand pat and continue to offer the same data resources of previous decades as if they are everyone’s sine qua non or ne plus ultra or [insert your alma mater’s Latin cliché here].

What else distinguishes a data librarian?  Many data scientists may know the data landscape, know about data analysis, and apply those skills in service of a particular community.  I would argue that it is the values of librarianship that distinguish us.  Specifically, the commitment to open, shared resources, and to educating the community on their use, is critical.  This is why I consider many of the things I do that others may not see as “librarian-like” — such as teaching data literacy, or sharing tutorial videos about statistical software — to be some of my most valuable and core work.  What I value is this openness and sharing that offers the promise to every person that they can continue to learn and develop their skills, and themselves.  I hope that my work as a data librarian helps enable that, and I am glad and privileged to be part of both a local work environment and an international community that supports those goals.


A manual backup routine using AWS

This post is also slightly off topic – not a data announcement, workshop, video, etc.  But it does contain one specific instance of something that everyone should be thinking about – data backup.  Everyone knows the rule of three – keep at least three backups of your precious files and make sure at least one of them is offsite in case of disaster.

I needed to develop a new routine for my home computer backup after deciding to seize control of my system back from SpiderOak.  I had been using that for a while, but then upgraded to SpiderOak One, and my incremental backups seemed to take forever, with the SpiderOak process constantly using lots of CPU and seemingly not accomplishing much.  [This is all on Linux as usual.]  I realized that I understood very little of what the client was actually doing, and since the client was unresponsive, I could no longer rely on it to actually back up and retrieve my files.  I decided to go completely manual so that I would know exactly what my backup status was and what was happening.

Part 0 of my personal rule of three is that all of my family’s home machines get an rsync run periodically (i.e., whenever I remember) to back up their contents to the main home server.
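For what it’s worth, that periodic run is just an rsync pull over SSH.  A minimal sketch (the hostname and destination path here are made up):

rsync -av kids-laptop:/home/ /srv/backups/kids-laptop/home/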

Part 1 is a local backup to an internal hard drive on the server.  I leave this drive unmounted most of the time, then mount it and rsync the main drive to it.  The total file size is about 600 GB right now, partly because I do not really de-dupe anything or get rid of old stuff.  Also, I don’t have a high volume of video files to worry about at this point.
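In practice that looks roughly like the following, assuming the backup drive shows up as /dev/sdb1 (your device name and mount point will differ):

# mount the normally-unmounted backup drive
sudo mount /dev/sdb1 /mnt/backup
# mirror the main drive onto it
sudo rsync -av /home/ /mnt/backup/home/
# unmount again so the drive stays offline the rest of the time
sudo umount /mnt/backup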

Part 2 is a similar rsync backup to a portable hard drive [encrypted].  I have two drives that I swap and carry back and forth to work every couple of weeks or so.  I have decided that I don’t really like frequent automated backup, because I’d be more worried about spreading a problem like accidental deletion of files, or a virus, before the problem is discovered.  I can live with the loss of a couple of weeks of my “machine learning” if disaster truly strikes.
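The portable drives follow the same pattern, just with a LUKS unlock step first.  A sketch, assuming the drive appears as /dev/sdc1 and is already LUKS-formatted:

sudo cryptsetup luksOpen /dev/sdc1 portable_backup   # prompts for the passphrase
sudo mount /dev/mapper/portable_backup /mnt/portable
sudo rsync -av /home/ /mnt/portable/home/
sudo umount /mnt/portable
sudo cryptsetup luksClose portable_backup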

But what about Part 3?  I wanted to go really offsite, and not pay a great deal for the privilege.  I have grown more comfortable with AWS as I learn more about it, and so after some trial and error, devised this scheme…

On the server, I tar and zip my files, then encrypt them, taking checksums along the way:

tar -cvf mystuff.tar /home/mystuff

bzip2 mystuff.tar

sha256sum mystuff.tar.bz2 > mystuffsha

gpg -c --sign mystuff.tar.bz2

sha256sum mystuff.tar.bz2.gpg > mystuffgpgsha

This takes some time to run, and generates some big files, but it is doable.  I actually do this in three parts because I have three main directories on my system.
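If you wanted to script those passes, it is just a loop over the top-level directories.  A sketch with made-up directory names:

for d in documents projects photos; do
    tar -cvf "$d.tar" "/home/mystuff/$d"
    bzip2 "$d.tar"
    sha256sum "$d.tar.bz2" > "$d.sha"
    gpg -c --sign "$d.tar.bz2"
    sha256sum "$d.tar.bz2.gpg" > "$d.gpg.sha"
done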

Then we need to get it up to the cloud.  Here is where file transfer really slows down.  I guess it is around 6 days total wait time for all of my files to transfer, although I do it in pieces.  The files need to be small enough that a breakdown in the process will not lose too much work, but large enough so that you don’t have thousands of files to keep track of.  I do this to split the data into 2GB chunks:

split -b 2147483648 mystuff.tar.bz2.gpg

Now we have to upload it.  I want to get the data into AWS Glacier since it is cheap, and this is a backup just for emergencies.  Now Glacier does have a direct command line interface, but it requires the use of long IDs and is just fussy in terms of accepting slow uploads over a home cable modem.  Fortunately, getting data into S3 is easier and more reliable.  And S3 allows you to set a lifecycle policy that will automatically transfer your data from S3 to Glacier after a set amount of time.  So the extra cost you incur for, say, letting your data sit in S3 for a day is really pretty small.  I guess you could generate the upload commands with a loop or a glob, but I just have a long shell file with each of my S3 commands on a separate line.  This requires you to install the Amazon CLI on your system.

aws s3 cp xaa s3://your_unique_bucket_name_here

aws s3 cp xab s3://your_unique_bucket_name_here

I just run that with a simple shell command that dumps any messages to a file.

sh -xv my_special_shell_script.sh > special_output 2>&1

And, voila…days later your files will be in the cloud.  You can choose an AWS region that will put the files on the other side of the planet from you if you think that will be more reliable.
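For the S3-to-Glacier handoff mentioned above, the lifecycle rule can be set in the web console, or from the CLI with something like the sketch below (the bucket name is the same placeholder as above, and you should check the JSON shape against the current AWS documentation):

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-to-glacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": 1, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket your_unique_bucket_name_here --lifecycle-configuration file://lifecycle.json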

To bring the files back down, you must request through the AWS interface that the files be restored from Glacier to S3, then download them from S3, then use “cat” to fuse them together, and in general reverse all the other steps to decrypt, untar, checksum and such.  It worked for me on small-scale tests, but I guess I should try it on my entire archive at least once to make sure this really works.
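For my own reference, the restore path looks roughly like this (bucket, key, and file names are placeholders, and the restore request can also be made through the web console):

# ask S3 to stage an archived piece back out of Glacier for a few days
aws s3api restore-object --bucket your_unique_bucket_name_here --key xaa --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}'
# once the restore completes (hours later), download the pieces and reassemble
aws s3 cp s3://your_unique_bucket_name_here/xaa .
aws s3 cp s3://your_unique_bucket_name_here/xab .
cat xa* > mystuff.tar.bz2.gpg
# verify, decrypt, decompress, and unpack
sha256sum -c mystuffgpgsha
gpg -d mystuff.tar.bz2.gpg > mystuff.tar.bz2
bunzip2 mystuff.tar.bz2
tar -xvf mystuff.tar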

At least with this method, I know exactly what is in the cloud, how and when it got there, and how to get it back.  And it looks like it will only run me about $6 a month.

Installing Debian/XFCE (Linux) on Dell XPS 13 9350

Well, this is off topic for the theme of the blog, but I felt the urge to record an expurgated version of my recent installation of Linux on a new Dell XPS 13 (9350), for the potential edification of the populace.  I will try to make it brief, by my standards 🙂  [but I guess I failed..]  This is not in any way an objective post, but I am just blowing off steam and letting my opinions fly.  Please skip over if you are looking for actual educational material…

As tweeted earlier, a Pi Day discount tempted me into the purchase of a 13″ Dell XPS (model 9350).  I had been following Dell primarily due to their Linux support via the Developer Edition line, and had been tempted on many occasions in the past.

I should also mention that dating back to the fin-de-siècle, I have been a Linux user.  Although I have occasionally strayed away, for most of the time, Linux computers have formed the core of my computing infrastructure.  In Linux, I can do what I want to do, rather than simply obey the instructions of other OSes.  In recent years, I converted from the Fedora sect to become a Debian adherent, and I have been very satisfied with that choice.

Still, the computer I bought was NOT a Developer Edition, but a new Windows 10 machine.  In the past, I have usually put Linux onto either very standard or slightly older hardware, and didn’t have FEAR that it would not work.  My working laptop recently has been a leftover 2010 MacBook Pro running only Debian; it has no real issues, but it runs hot and noisy.  Since this XPS laptop was brand new with the latest technology, I have to confess that for a moment or two I even considered leaving Windows on the machine and using it in dual-boot mode.  But two minutes in Windows 10 erased any of those doubts.  Seriously, why would anyone voluntarily remain in that depressing environment if they had the possibility of escape?

So, I committed to putting Debian on the machine as its sole operating system, and began Googling to get ready.  I learned that the Dell-rebranded Broadcom wireless card was not supported except in bleeding-edge kernels, and was not very good anyway, while Intel wireless cards worked easily with Linux.  Thanks are due to Dell for putting a wonderful service manual online and for not minding users operating on their own hardware (unlike the fruit-themed gadget company).  I ordered an Intel wireless card.  Due to a bit of carelessness, I picked a 9260 instead of the 9265 model.  The 9265 is supported natively in the kernel, whereas the 9260 requires a download of Intel proprietary drivers [more on this later].  But, in spite of my nerves, popping open and disassembling my brand-new laptop was a piece of cake, and it went back together as good as new.  I am liking Dell from the hardware perspective.

Then, I prepared a USB boot stick to install Debian 8 (Jessie).  I had to fiddle about a bit with the UEFI/BIOS settings to get the Dell to boot from the stick, but eventually made it happen.  Then I went through a couple of abortive installation attempts because of the aforementioned wireless drivers, which needed to be loaded from a second USB stick.  I am sorry I can’t document it completely here; it took more fiddling until I found the right combination that recognized one USB stick as the boot media and the other as the source of the supplemental driver files.  During that phase, I began to worry that I had gone too far by buying a slim fancy device without an ethernet port, but I survived.  I would still lean towards getting a computer with a real ethernet port in the future, just for safety.  It turns out the Linux drivers for USB-C-to-Ethernet adapters are reported to be fussy too.
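For anyone retracing my steps, preparing the sticks themselves is the easy part.  A rough sketch, where the ISO filename and mount point are placeholders and /dev/sdX is whatever your USB stick shows up as (check with lsblk before you overwrite the wrong disk!):

# write the Debian installer image to the boot stick
sudo dd if=debian-8-amd64-netinst.iso of=/dev/sdX bs=4M
sync
# copy the Intel wireless firmware package onto a second, FAT-formatted stick for the installer to find
cp firmware-iwlwifi_*.deb /media/usb-firmware/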

On to the next complication… Did I mention that I not only like Linux and Debian, but prefer XFCE as my desktop?  Because I am old school and don’t care for eye candy at all.  It was the horrible broken experience of GNOME 3 that drove me away from Fedora.  I just want a desktop that stays out of my way and does the work (is that some antiquated colonialist mindset? perhaps, but I think it is still OK to exploit a computer, right?).  Anyway, XFCE has been my go-to for the last 4-5 years.  I respect Linux Mint/Cinnamon too for their attempt to correct the awful GNOME decisions that were forced on unsuspecting users.  But XFCE has done the job for me.  So, I was willing to work overtime to get XFCE as my desktop.

Now, I have done Debian/XFCE installs on a number of desktops, and my MacBook too.  But somehow, the Debian 8 XFCE install (at the time of writing) had one major issue.  I finished the installation, but could only log in to the XFCE desktop as root, not as a regular user.  There was some kind of weird permissions issue, or some problem with the install scripts.  I am experienced in Linux, but not expert, so extensive Googling on this topic failed to resolve the issue.  What did work was to do a standard Debian install with Cinnamon as the default desktop.  Only after that was working did I install XFCE.  That worked like a charm.  Hopefully someone with more knowledge of what could cause this will fix it for future generations.
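For reference, the “install XFCE afterwards” step is just a package install on top of the working Cinnamon system; something like the following, after which you pick the Xfce session at the login screen:

sudo apt-get update
sudo apt-get install task-xfce-desktop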

I also had to try a few different configurations before getting my preferred setup of encrypted hard drive and swap space, while leaving an unencrypted /boot partition.  I wish there were a better-documented path for this too.  Somehow encryption is still considered to be a slightly exotic option, when it shouldn’t be.

Ok, so now I am excited because I have a working Debian/XFCE install on my new laptop.  My hand-installed wireless is working, and everything is looking up.  BUT, I have NO SOUND, and it appears that the lack of sound is also causing any standard (e.g., YouTube) videos to play too fast.  I take a deep breath.  More googling reveals that this is an issue with the Dell 9350 model’s audio, and that future kernels will handle it.  But my Debian 8 kernel does not handle it, and I cannot use my expensive laptop to watch my favorite YouTube videos!!!!  I use all of my experience in “taming my dog of desire” to reconcile myself to the situation.  I can use my laptop as a wonderful distraction-free zone to code and write wonderful things.  What do I need sound for?  After all, Plato and Muhammad both condemned music.  Yes, what do I need sound for?  Ok, I will live without sound on my laptop 😦

But, after getting everything else configured to my liking, I was ready to keep experimenting.  Is that not the whole point of Linux?  To experiment, to control your own working environment?   Not to blindly obey when a popup window says, “You must update now”, or “You must click ‘accept’ to continue”, or “Operation not permitted”.  Right, this is Linux, so let’s go!

In practice, what that meant was that I attempted a full upgrade from Debian 8 (jessie) to Debian 9 (stretch), even though 9 is not yet stable.  What was my motivation?  Well, to confess, at least 90% of the motivation was to get that audio working.  Because if this is the kind of world where we can’t listen to music on our laptops, is that the kind of world that we want to live in?  We want our music, and we want control of our computers too!

Now, the Debian instructions are very clear and very good, and after editing my apt sources, I was able to update and upgrade, and within a very, very short period of time, had my entire OS running a very current set of applications with crystal clear audio and video.  I now have pretty close to my full suite of applications (R, RStudio, Mathematica, Claws-mail, LaTeX, LyX, Gummi, all of the old favorites…).  Now I am content!  I then had to go and customize my desktop settings and browser to a very dark theme and plaster my laptop with some stickers to make it seem more like my own.  Too much, probably, but it is a small thing that makes me happy 🙂  My family says I’m crazy, but I get that all the time anyway…
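For anyone else tempted to make the jump, the upgrade boils down to pointing apt at stretch and dist-upgrading.  Read the official release notes first, but the core of it is something like:

# switch the apt sources from jessie to stretch
sudo sed -i 's/jessie/stretch/g' /etc/apt/sources.list
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade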


PLE (Personal Learning Environment)

I am responding to this post by my colleague, Francesca Giannetti, from the RUteaching blog that I just became aware of.  The discussion there is about the PLE, or personal learning environment.  I think it’s interesting to reflect on, because most of the time I am just marinating in my PLE rather than conscious of it.

I find that I use different communication mechanisms for different purposes, and they have different time scales and impacts on my thinking.  Enough generalities, let’s jump in!

  • Listservs and mailing lists – mostly keeping up on “official” business, whether from work or my membership organizations (IASSIST, American Statistical Association).  I experience this as a continuous flow of incremental information.
  • RSS – I am committed to this antique technology because I find it is there just when I want it.  A feed reader is like a faithful servant that doesn’t bother you when you are busy with other things, but keeps everything ready for you when you return.  I monitor my journal TOCs, academic blogs, job postings, and some general news via RSS.  Since Google Reader went belly up, I have been using Liferea for Linux with satisfaction.
  • Blogs  – Interesting to skim, but there are only a few of regular value for me, such as R-bloggers to keep me up to date on new R developments.
  • Email – this is problematic because I am like many who struggle with e-mail.  My e-mail queue becomes my de facto to-do list too often, and I often fail to prioritize the right things.  Too many different kinds of activities end up in e-mail.  I definitely do not like to read long e-mails.
  • PDFs – for long-term thoughtful academic reading, I prefer to accumulate folders full of interesting PDF articles and books, and then devour them on quiet days.
  • Books – My print books tend to be heavy tomes on math, statistics, or literature.  After seeing these in piles for a couple of years making me feel guilty, I feel the necessary pressure to plow through them.  Books express some of my grander (and more unrealistic?) ambitions.
  • Websites and bookmarks – I track these, try to organize my bookmarks, keep them in LibGuides, and more, but I find that I don’t return to websites nearly as often as I think I will.  This is primarily because my list of substantive readings in the other formats above is plenty enough to fill all of my time.  So, unless a website is in my face, or a really good aggregator, like R-bloggers, it tends not to get my attention.
  • Videos – I have appreciated some video tutorials (in addition to producing a few of my own), especially in structured courses on Coursera and other sites.  But YouTube is an oceanic resource, and no matter what I go to it for, I always end up in K-pop, so it is dangerous from a productivity perspective [has anyone developed a “serious-only” filter for YouTube?].  I dislike sitting through an hour-long video talk with no associated text or slides to guide some skimming ahead.  If necessary, I will download the video and play it at double speed in VLC (see the sketch after this list).
  • Conference Talks and Presentations – These tend to be more useful for me as a continual update to the current landscape in librarianship, helping to set directions for more focused reading and exploration. Again, I prefer skimmable formats to be made available, even for things I might be attending in person.
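As an aside, the “double speed in VLC” trick from the Videos item above is just a download plus a playback-rate flag; a sketch with a placeholder URL and filename:

youtube-dl -o talk.mp4 'https://www.example.com/some-long-talk'
vlc --rate 2.0 talk.mp4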

Benford’s Law

I somehow skipped through life up to now without encountering Benford’s Law.  Now that I have, I am flabbergasted that it is not more widely known, or maybe I’ve just been hanging with the wrong crowd.  Here’s a hint.  If I have a set of measurements, like the populations of countries, or a list of atomic weights, how often would you expect the first digit of the measurement to be 1?  Well, it can’t start with a zero, but any of the other 9 digits is possible, right?  If you think the answer is 1/9, think again. The Wikipedia article, MathWorld, and this DataGenetics blog post are good starting points to understand why.  It turns out this is useful in many areas of data analysis.
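To make the hint concrete: Benford’s Law says the probability that the leading digit is d equals log10(1 + 1/d), so a leading 1 shows up about 30% of the time.  A quick one-liner to tabulate it:

# print the Benford probability for each possible leading digit
awk 'BEGIN { for (d = 1; d <= 9; d++) printf "P(first digit = %d) = %.3f\n", d, log(1 + 1/d)/log(10) }'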

Government Shutdown => Data Devolution

Many critical government websites that deliver data are offline due to the self-inflicted crisis of the government shutdown.  The Census, National Center for Education Statistics, Bureau of Economic Analysis, and many others are down.  I am already seeing the students impacted by this, and I can only imagine the hidden cost of the bad decisions currently being made on the basis of inadequate information.

Further proof of the U.S.’s backpedaling away from developed world status, IMO.  I would hope some sanity is restored before we are all the way back in a state of nature, because as we know…

“In such condition there is no place for industry, because the fruit thereof is uncertain: and consequently no culture of the earth; no navigation, nor use of the commodities that may be imported by sea; no commodious building; no instruments of moving and removing such things as require much force; no knowledge of the face of the earth; no account of time; no arts; no letters; no society; and which is worst of all, continual fear, and danger of violent death; and the life of man, solitary, poor, nasty, brutish, and short.” (Hobbes, Leviathan)

Knowledge (Census and others), the account of time (NIST), arts and letters (Library of Congress) are already temporarily blotted out.   You can monitor further loss of content on this page from U of Wisconsin Data and Information Services.

Data Dream Team?

Attending IASSIST 2013 was very therapeutic, and I have returned from Germany invigorated and with many new thoughts about improving data services.

One thing I now wonder about is what I would do if I could design a dream lineup for a Data Services team, assuming that I had 4 or 5 staff lines at my disposal to hire from scratch.  What would the ideal configuration look like?  My thoughts are primarily about an academic library setting similar to my own, but this would be an interesting exercise in other settings too.

Is it hierarchical (a head with subordinates)?  A team of equals?  Are responsibilities cleanly divided or shared?

Some technical skills that are required: statistical, mathematical, and engineering software (R, SAS, SPSS, Matlab, Mathematica, etc.), GIS, qualitative analysis, data visualization, scripting languages (Python, Java), and database skills.  How are these divided among positions?

We also need knowledge of public data sources, plus outreach and instruction skills both in person and via electronic media.  Research data management also requires one-on-one people skills to negotiate data acquisition and provide advice across many disciplines.  Is it better to split these along functional lines (RDM specialist vs. Public Services Data Librarian) or along subject lines (e.g., a science data librarian and a social sciences data librarian each handle both instruction and individual RDM work in their respective disciplines)?  Does digital humanities fit in here, or is it a separate issue?

So here’s the thought exercise: List the five members of your dream data team…