What are the use cases of a Jupyter Notebook in data science?
- many use cases (my favourite is the case documented by Netflix).
- many folks arguing notebooks are insecure, unmaintainable tooling that should be stripped of its use cases.
- many reasons why you may still ultimately deploy a notebook as part of a productionised process, secure enough & maintainable enough to have meaningful positive impact on your team’s workflow.
Notebooks are primarily useful to data scientists because they need an interface for documenting, and sharing documentation of, their first exploratory interactions with a data-set that will be modelled using machine learning. The most common notebook tooling is Jupyter; in the winner-takes-all world of software development, where the critical mass of community size tends to select a single dominant tool, it is also the one most likely to grind the others into the dust (and it is the one discussed in the articles by Netflix, NetSecurity, and Nature at the links above). I’ve presented on Jupyter many times and still use it day-to-day as a handy place to quickly spin up a Python kernel that will accept commands as basic as
```python
for i in range(3):
    for j in range(3):
        print((i, j))
```
… for instance when I need to double-check some syntax or perhaps send a test HTTP request to some API with which I’m familiarising myself.
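As a sketch of that second use case, here is the kind of throwaway cell I mean, using only the standard library. The endpoint is a placeholder (httpbin.org simply echoes requests back, which makes it handy for checking the shape of a request before writing real client code):

```python
import json
import urllib.request

# Build a test POST request to a hypothetical echo endpoint.
req = urllib.request.Request(
    "https://httpbin.org/post",
    data=json.dumps({"ping": "hello"}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# In a notebook cell you would then just run:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The point is not the code itself but the feedback loop: you tweak the cell and re-run it until the API responds the way you expect.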
In terms of productionisable use cases, though — ones that are deployed and meaningful for a data science team — the most important is surely teaching. Jupyter notebooks run according to the following principles are a great shareable resource: they offer a straightforward, interactive way to give the data scientists you are onboarding a shortcut to immersed familiarity with some data-set and/or the ML algorithm that models it:
- run them in the cloud, e.g. on SageMaker or Azure Notebooks or AIPN
- This enables serverless operation, clustering, and horizontal & vertical scaling.
- It is also vastly more configurable with respect to security than running on your own laptop or local hardware (e.g. you can specify VPCs and security groups).
- manage their dependencies effectively
- This means upgrading wherever possible, e.g. to guard against vulnerabilities that creep into open-source components such as Python, scikit-learn, or nbextensions.
- It also means having a clear plan for which packages/libraries won’t be upgraded, because you depend on the specific functionality or syntax of a legacy version, e.g. when patching is preferred.
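On the cloud point, a minimal sketch of the security-relevant knobs when launching a SageMaker notebook instance from Python. All names, IDs, and ARNs below are placeholders, not real resources:

```python
# Security-relevant configuration for a SageMaker notebook instance.
# Placing the instance in your own subnet with explicit security groups,
# and disabling direct internet access, keeps traffic inside the VPC.
notebook_config = {
    "NotebookInstanceName": "onboarding-notebook",
    "InstanceType": "ml.t3.medium",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "SubnetId": "subnet-0abc1234",        # place it inside your VPC
    "SecurityGroupIds": ["sg-0def5678"],  # restrict inbound/outbound traffic
    "DirectInternetAccess": "Disabled",   # force traffic through the VPC
}

# In a real session you would then call (requires AWS credentials):
# import boto3
# boto3.client("sagemaker").create_notebook_instance(**notebook_config)
```

Nothing equivalent exists out of the box when a notebook runs on someone's laptop, which is the configurability gap the bullet above is pointing at.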
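The "clear plan" for dependencies can itself be made executable. Here is a minimal sketch of a version-audit helper; the pinned packages and version numbers are hypothetical examples of such a policy, not recommendations:

```python
# Hypothetical pin policy: packages deliberately held back (we rely on
# legacy behaviour) vs. packages we insist are at least this recent.
PINNED = {"scikit-learn": "0.24.2"}   # do NOT upgrade past this
MINIMUM = {"notebook": "6.4.0"}       # upgrade to at least this

def parse(v):
    """Crude version tuple for comparison, e.g. '1.2.3' -> (1, 2, 3)."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def audit(installed):
    """Return human-readable warnings for a {package: version} mapping."""
    warnings = []
    for name, ver in installed.items():
        if name in PINNED and parse(ver) > parse(PINNED[name]):
            warnings.append(f"{name} {ver} exceeds pin {PINNED[name]}")
        if name in MINIMUM and parse(ver) < parse(MINIMUM[name]):
            warnings.append(f"{name} {ver} below minimum {MINIMUM[name]}")
    return warnings

# In practice you would feed it the real environment, e.g.:
# from importlib.metadata import version
# audit({pkg: version(pkg) for pkg in [*PINNED, *MINIMUM]})
```

Running a check like this in CI turns the upgrade plan from a wiki page into something the team cannot silently drift away from.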