Reading discussion - Scientific Python, IPython, Jupyter

Point 1

“IPython is built and developed with the general principle of making life easier for scientists and students, but I believe that there are certain dangers in this as a result of my personal experience. If too much computation is abstracted away, then one can take the easy way out and miss out on learning fundamental concepts. Nowadays, one can simply call the appropriate methods in scikitlearn to do machine learning. In fact, my manager at my internship told me that a lot of folks who are trying to get into data science these days don’t even understand the algorithms/methods that they are using. Fifteen years ago, when scikitlearn didn’t exist, one needed to code the actual machine learning algorithm if he/she wanted to use it. This process forces the coder to deepen his/her fundamental understanding of the algorithm itself. The fact that we don’t need to do garbage collection in Python is a time-saver, but we miss out on the opportunity to learn about how our data is stored in memory. In fact, it’s possible for a newbie who has only coded in Python to not even be aware of what garbage collection is! In contrast, a C++ programmer has to become familiar with garbage collection and malloc by necessity. Actually, UC Berkeley’s entire computer science curriculum has embraced Python wholeheartedly; it is used in most of the upper division courses that I have taken. As a result, I haven’t used C and C++ enough. This has caused my understanding of pointers and memory allocation, which I believe are still important for any computer programmer to know, to remain minimal. One counter-argument would be to assert that teachers should insist that students learn the fundamental knowledge. In CS 189, the machine learning class here at Berkeley, we had to implement linear regression and a neural network in Numpy and Scipy in order to understand how they work. 
But, people often just want to finish a task in the most efficient way possible and will certainly find the most efficient way of doing so, especially if they are just searching online for the answers (through Stack Overflow, Quora, other online forums, etc.)”

Discussion

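A good antidote to the "just call the library" habit is Data Science from Scratch, by Joel Grus, which implements its algorithms in plain Python before reaching for libraries.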
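
In that from-scratch spirit, here is a minimal sketch of simple least-squares linear regression written with nothing but the standard library. It is not taken from the book; the function name `fit_line` and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Writing the two sums out by hand makes it clear what a library call such as a `fit` method is computing in the simplest case.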

The Online Python Tutor is an excellent tool for visualizing the structure of algorithms in a variety of languages. We can even use it live in the notebook thanks to the tutormagic extension, which I’ve installed (you should go ahead and install it locally as well, as per the instructions on the site):

In [12]:
%load_ext tutormagic
The tutormagic extension is already loaded. To reload it, use:
  %reload_ext tutormagic
In [13]:
%%tutor -l python3
x = [1,2,3]
y = x
x.append('hi')
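
What the visualization shows can also be checked directly in code. The sketch below uses the same three lines and then adds an explicit copy for contrast; the variable names `same_object`, `z`, and `copy_unchanged` are introduced here for illustration.

```python
x = [1, 2, 3]
y = x              # y is a second name for the same list object, not a copy
x.append('hi')

same_object = y is x   # identity check: both names refer to one object
print(same_object, y)  # y reflects the append made through x

z = list(x)        # an explicit (shallow) copy is a distinct object
x.append('bye')
copy_unchanged = (z == [1, 2, 3, 'hi'])
print(z is x, copy_unchanged)
```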

This is the quicksort algorithm as it used to be described in an old version of the Wikipedia page (today the descriptions are more complicated):

function quicksort(array)
     var list less, greater
     if length(array) <= 1
         return array
     select and remove a pivot value pivot from array
     for each x in array
         if x <= pivot then append x to less
         else append x to greater
     return concatenate(quicksort(less), pivot, quicksort(greater))

We can turn this into Python and visualize it:

In [14]:
%%tutor -l python3 -h 600
def qsort(lst):
    """Return a sorted copy of the input list."""
    if len(lst) <= 1: return lst[:]  # return a copy, as the docstring promises
    pivot, rest   = lst[0], lst[1:]
    less_than     = [ lt for lt in rest if lt < pivot ]
    greater_equal = [ ge for ge in rest if ge >= pivot ]
    return qsort(less_than) + [pivot] + qsort(greater_equal)

qsort([3, 10, -9, 1, 7])
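
The version above allocates new lists at every level of recursion. For contrast, here is a minimal sketch of an in-place variant that uses the Lomuto partition scheme. This is a swapped-in technique, not the one in the pseudocode above, and the name `qsort_inplace` is an assumption for illustration.

```python
import random

def qsort_inplace(lst, lo=0, hi=None):
    """Sort lst[lo..hi] in place (Lomuto partition, last element as pivot)."""
    if hi is None:
        hi = len(lst) - 1
    if lo >= hi:
        return  # zero or one element: nothing to do
    pivot = lst[hi]
    i = lo  # boundary: everything left of i is < pivot
    for j in range(lo, hi):
        if lst[j] < pivot:
            lst[i], lst[j] = lst[j], lst[i]
            i += 1
    lst[i], lst[hi] = lst[hi], lst[i]  # move the pivot to its final position
    qsort_inplace(lst, lo, i - 1)
    qsort_inplace(lst, i + 1, hi)

data = [3, 10, -9, 1, 7]
qsort_inplace(data)
print(data)  # → [-9, 1, 3, 7, 10]
```

The trade-off is the usual one: the list-building version is easier to read and leaves its input alone, while the in-place version avoids the extra allocations at the cost of mutating its argument.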

The Python dis module lets you disassemble Python bytecode, which is what the interpreter actually executes (the exact opcodes and offsets depend on the Python version, so your output may differ from the listing below):

In [15]:
import dis
dis.dis("""
x = [1,2,3]
y = x
x.append('hi')
""")
  2           0 LOAD_CONST               0 (1)
              2 LOAD_CONST               1 (2)
              4 LOAD_CONST               2 (3)
              6 BUILD_LIST               3
              8 STORE_NAME               0 (x)

  3          10 LOAD_NAME                0 (x)
             12 STORE_NAME               1 (y)

  4          14 LOAD_NAME                0 (x)
             16 LOAD_ATTR                2 (append)
             18 LOAD_CONST               3 ('hi')
             20 CALL_FUNCTION            1
             22 POP_TOP
             24 LOAD_CONST               4 (None)
             26 RETURN_VALUE
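
The same information is available programmatically through `dis.get_instructions`, which is handy when you want to inspect bytecode rather than just print it. A minimal sketch follows; the variable names are illustrative, and the opcode names printed will vary with the interpreter version.

```python
import dis

source = """
x = [1,2,3]
y = x
x.append('hi')
"""

# Compile the snippet as a module and walk its instructions.
code = compile(source, "<example>", "exec")
instructions = list(dis.get_instructions(code))

opnames = [instr.opname for instr in instructions]
print(opnames)

# Module-level assignments show up as STORE_NAME instructions.
stored_names = [instr.argval for instr in instructions
                if instr.opname == "STORE_NAME"]
print(stored_names)
```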

XTensor

xtensor is a C++ library meant for numerical analysis with multi-dimensional array expressions. Here is a live demo.

Point 2

“Python has its drawbacks. For instance, Python has still not developed well-constructed function for parallel and distributed computing. And also what concerns me is that while lowering the standard for access to data analysis, Python may also lower the standard for preciseness and strictness of academic research. In jupyter notebook, we still have problem inserting academic citations. Anyone can post their research or article on platform like GitHub. The publication is no longer a very formal process as before. With the rapid rise of interactive computing systems, this issue also requires people’s attention.”

Discussion

  • Dask: an excellent library for distributed computing in Python.
  • For numerically-oriented parallel computing, mpi4py provides Python access to the complete MPI API.
  • ipyparallel: no longer very actively developed, but still interesting in certain contexts. If anyone is interested in how ipyparallel and mpi4py can be combined to interactively steer and introspect parallel codes, see me at office hours.
  • Preprints: the arXiv. In-progress, non-peer-reviewed research hasn’t been the death of physics, and these ideas are now picking up momentum in other areas: bioRxiv and the OSF Preprints framework.
  • Citations: not a completely solved problem, but T. Kluyver’s cite2c is a step in the right direction, and we’re working on improvements here.
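
The parallel libraries listed above are third-party, but the standard library also ships a basic map-style interface in `concurrent.futures`. The sketch below uses a thread pool only to show the pattern; `square` and `parallel_map` are hypothetical names. Note that in standard CPython builds threads mainly help with I/O-bound work, so CPU-bound code would typically use a process pool or one of the libraries above instead.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    """Stand-in for a task worth running concurrently."""
    return n * n

def parallel_map(func, items, max_workers=4):
    """Apply func to each item using a pool of worker threads.

    Executor.map returns results in input order, regardless of
    which worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))

results = parallel_map(square, range(6))
print(results)  # → [0, 1, 4, 9, 16, 25]
```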

Point 3

“All in all, the article makes some good points (and to be fair it is out-of-date), but I think it makes a weak/strawman case when comparing Python to C/C++/Fortran/Mathematica/Matlab, because the former three are non-interactive, lower-level, the latter two are high-level interactive, and base Python (as mentioned here, without mentioning IPython engine/shell, much less Jupyter notebooks) is high-level and non-interactive. So really the only competitor that comes to mind for high-level non-interactive (at least with extensions) languages (that existed at the time, i.e. not Julia, which hasn’t even had a stable release yet anyway) is R, but comparisons with R are mostly ignored and/or glossed over and/or inaccurate/unfair. (E.g. with regards to data visualization, statistical algorithms, and most importantly data manipulation, for which Python is absolutely terrible/not even worth using without pandas, and even then Python still doesn’t have any super-killer like dplyr on top of R’s better built-in features for data management/manipulation).”

Discussion

This image is particularly telling:

  • There’s certainly nothing quite like the tidyverse in Python. Note that when the article was written, dplyr (and much of the tidyverse) didn’t exist (its initial commit is dated Oct 28, 2012). But it’s certainly true that the R machinery is exceptionally powerful, and many of its tools interoperate with power and elegance.

In today’s data science, research, and industry environments, both R and Python play a role. Each has areas of particular strength and each has weaknesses; in some areas they overlap enough to be nearly interchangeable, while in others they compete and feed each other. In the end, a good scientist should know more than one tool, and know when to pick the most appropriate one for the job. I hope this course will teach you enough Python to know when to use it, and when not to!

Point 4

“I hate the fact the Statistics department here seems set on using R in most classes instead of Python along with packages like NumPy, Sympy, etc, and having taken Stat 133 I have seen time wasted teaching the unintuitive R and RStudio to make R markdown files instead of converting to Python. I didn’t have any issues with these readings, Python is much less of a pain than high-level languages used in the CS department and I think should replace R and Matlab in statistics, numerical analysis, and engineering classes.”

Discussion

See above 😀!