Lab 5

Some remarks

  1. Repositories and directories for this course: Most course materials are in the course repo.

    We recommend that you clone the repo into a directory called stat159-f17-reference. First, move into your directory for this course (e.g. stat159) and clone the repo:

    git clone https://github.com/berkeley-stat159-f17/stat159-f17.git stat159-f17-reference

    Then copy the contents of the repo into a new directory called stat159-f17-work:

    cp -r stat159-f17-reference stat159-f17-work

    You can now edit notebooks and other files in the stat159-f17-work directory. When we add course materials, run git pull in the stat159-f17-reference directory and then copy the new files into stat159-f17-work.

  2. Absolute paths vs relative paths:

    Repositories are meant to be shared. If your code uses a path to data that looks like /Users/username/repo/data.csv, will it run on another computer? How can we change the path so that the code runs from inside the repo directory?
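
One way to look at the difference, sketched with the standard-library pathlib module. The path is the one from the question, written with a leading slash; the variable names are illustrative only, and PurePosixPath is used so that the result does not depend on the operating system running the example.

```python
from pathlib import PurePosixPath

# An absolute path hard-codes one machine's directory layout.
absolute_path = PurePosixPath("/Users/username/repo/data.csv")

# A relative path is resolved against the current working directory,
# so it keeps working wherever the repo is cloned, as long as the
# code is run from inside the repo directory.
relative_path = PurePosixPath("data.csv")

print(absolute_path.is_absolute())  # True
print(relative_path.is_absolute())  # False
```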

File IO

This section is lightly adapted from the Python documentation.

Opening a file

The function open returns a file object, and is most commonly used with two arguments: open(filename, mode).

In [2]:
f = open('workfile', 'w')

The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; 'r' will be assumed if it’s omitted.
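
As a rough illustration of how the 'w', 'a' and 'r' modes differ, the sketch below writes to a throwaway file in a temporary directory. The file name modes_demo.txt is made up for this example, and the with statement it uses is covered later in this lab.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    # 'w' creates the file, or erases it if it already exists.
    with open(path, "w") as f:
        f.write("first line\n")

    # 'a' keeps the existing contents and adds to the end.
    with open(path, "a") as f:
        f.write("second line\n")

    # 'r' is the default mode: read only.
    with open(path) as f:
        after_append = f.read()

    # Opening with 'w' again discards everything written so far.
    with open(path, "w") as f:
        f.write("replaced\n")

    with open(path) as f:
        after_rewrite = f.read()

print(repr(after_append))   # 'first line\nsecond line\n'
print(repr(after_rewrite))  # 'replaced\n'
```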

Normally, files are opened in text mode, which means that you read and write strings from and to the file, and those strings are encoded in a specific encoding. If encoding is not specified, the default is platform dependent (see open). Appending 'b' to the mode opens the file in binary mode: the data is then read and written as bytes objects. This mode should be used for all files that don’t contain text.

In text mode, the default when reading is to convert platform-specific line endings (\n on Unix, \r\n on Windows) to just \n. When writing in text mode, the default is to convert occurrences of \n back to platform-specific line endings. This behind-the-scenes modification to file data is fine for text files, but will corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files.
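
The following sketch shows the line-ending translation on a small file. It assumes a made-up file name (newline_demo.txt) and writes the bytes itself, so the same data can be read back once in binary mode and once in text mode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "newline_demo.txt")  # hypothetical file name

    # Write raw bytes, including one Windows-style \r\n line ending.
    with open(path, "wb") as f:
        f.write(b"line1\r\nline2\n")

    # Binary mode returns the bytes exactly as stored.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode decodes to str and translates \r\n to \n while reading.
    with open(path, "r", encoding="ascii") as f:
        text = f.read()

print(raw)         # b'line1\r\nline2\n'
print(repr(text))  # 'line1\nline2\n'
```

If the bytes had been image data rather than text, the same translation would have silently changed them, which is why binary mode matters for such files.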

It is good practice to use the with keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point. Using with is also much shorter than writing equivalent try-finally blocks:

In [3]:
with open('workfile') as f:
    read_data = f.read()
f.closed
Out[3]:
True

If you’re not using the with keyword, then you should call f.close() to close the file and immediately free up any system resources used by it. If you don’t explicitly close a file, Python’s garbage collector will eventually destroy the object and close the open file for you, but the file may stay open for a while. Another risk is that different Python implementations will do this clean-up at different times.

After a file object is closed, either by a with statement or by calling f.close(), attempts to use the file object will fail with a ValueError:

In [ ]:
f.close()
f.read()
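
A minimal sketch of what that failure looks like in CPython, catching the exception so that the example runs to completion. The file name closed_demo.txt is made up for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "closed_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("some text\n")

    with open(path) as f:
        contents = f.read()
    # The with block has ended, so f is closed at this point.

    try:
        f.read()
    except ValueError as err:
        error_message = str(err)

print(f.closed)       # True
print(error_message)
```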

Exercise: Write the equivalent logic of the with statement with try-finally blocks

In [ ]:
# Your code here

Methods for file objects

First, let’s create a file object for example.txt

In [12]:
f = open("lab5-files/example.txt", "r")

Reading

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode). size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory. Otherwise, at most size characters (in text mode) or size bytes (in binary mode) are read and returned. If the end of the file has been reached, f.read() will return an empty string (''):

In [13]:
print(f.read())
This is a temporary text file.
We'll parse this file.

In [14]:
f.read()
Out[14]:
''
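
The cells above read the whole file at once. A small sketch of the size argument, reading a file in fixed-size chunks until the empty string signals the end, might look like this (the file name and contents are made up for the example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunks_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("abcdefghij")

    chunks = []
    with open(path) as f:
        while True:
            chunk = f.read(4)  # at most 4 characters per call
            if chunk == "":    # empty string: end of file reached
                break
            chunks.append(chunk)

print(chunks)  # ['abcd', 'efgh', 'ij']
```

Reading in chunks keeps memory use bounded, which avoids the problem with files larger than memory mentioned above.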

f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline:

In [15]:
f = open("lab5-files/example.txt", "r")
f.readline()

Out[15]:
'This is a temporary text file.\n'
In [16]:
f.readline()
Out[16]:
"We'll parse this file.\n"
In [17]:
f.readline()
Out[17]:
''

For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:

In [21]:
f = open("lab5-files/example.txt", "r")
for line in f:
    print(line, end='')
This is a temporary text file.
We'll parse this file.

If you want to read all the lines of a file in a list you can also use list(f) or f.readlines().
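
A short sketch of both options, using a temporary copy of the two-line example text so that it runs on its own (the file name is an assumption of this example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lines_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("This is a temporary text file.\nWe'll parse this file.\n")

    with open(path) as f:
        lines_from_list = list(f)

    with open(path) as f:
        lines_from_readlines = f.readlines()

# Both approaches keep the trailing newline on each line.
print(lines_from_list)
print(lines_from_list == lines_from_readlines)  # True

# Strip the newlines if they are not wanted.
stripped = [line.rstrip("\n") for line in lines_from_list]
print(stripped)
```

Unlike looping over the file object, both calls hold every line in memory at once.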

Writing

Now let’s create a new file to write to

In [22]:
f = open("our_file.txt", "w")

f.write(string) writes the contents of string to the file, returning the number of characters written:

In [23]:

f.write('This is a test\n')
Out[23]:
15

Other types of objects need to be converted – either to a string (in text mode) or a bytes object (in binary mode) – before writing them:

In [25]:
value = ('the answer', 42)
s = str(value)  # convert the tuple to string
f.write(s)
f.close()

f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
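
A minimal sketch of f.tell() in binary mode, where the returned number is a plain byte offset. The file name and contents are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tell_demo.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"0123456789abcdef")

    with open(path, "rb") as f:
        start = f.tell()       # nothing read yet
        f.read(5)
        after_read = f.tell()  # five bytes consumed
        f.read()
        at_end = f.tell()      # position equals the file size

print(start, after_read, at_end)  # 0 5 16
```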

To change the file object’s position, use f.seek(offset, from_what). The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point:

In [40]:
f = open('our_file.txt', 'rb+')
f.write(b'0123456789abcdef')
Out[40]:
16
In [41]:
f.seek(5)      # Go to the 6th byte in the file

Out[41]:
5
In [42]:
f.read(1)
Out[42]:
b'5'
In [43]:
f.seek(-3, 2)  # Go to the 3rd byte before the end

Out[43]:
48
In [44]:
f.read(1)

Out[44]:
b'4'

In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the one exception is seeking to the very end of the file with seek(0, 2)), and the only valid offset values are zero and those returned by f.tell(). Any other offset value produces undefined behaviour.
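
In practice this means that, in text mode, a position obtained from f.tell() is best treated as a bookmark to hand back to f.seek(). A small sketch under that assumption, with a made-up file name and contents:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "seek_text_demo.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\nsecond\nthird\n")

    with open(path, "r", encoding="utf-8") as f:
        f.readline()          # consume 'first\n'
        bookmark = f.tell()   # opaque value: only pass it back to seek()
        second = f.readline()
        f.readline()          # consume 'third\n'

        f.seek(bookmark)      # return to the saved position
        second_again = f.readline()

        f.seek(0, 2)          # the one allowed end-relative seek
        at_end = f.read()

print(repr(second), repr(second_again))  # 'second\n' 'second\n'
print(repr(at_end))                      # ''
```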

Other file types

You will often encounter data that is not stored in plain text files. In particular, data is often compressed. For the second homework you’ll need to read a file that has been compressed with GNU Gzip. You can open such files in Python in much the same way as ordinary text files.
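
As a hedged sketch, the standard-library gzip module provides gzip.open, which returns a file object that supports the same reading and writing methods described above. The example creates its own small compressed file (the name example.txt.gz is made up) rather than assuming anything about the homework data.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt.gz")  # hypothetical file name

    # 'wt' writes text; gzip.open defaults to binary mode otherwise.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("This is a temporary text file.\nWe'll parse this file.\n")

    # 'rt' reads text and decompresses transparently.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    # The file on disk is compressed: it starts with the gzip magic bytes.
    with open(path, "rb") as f:
        magic = f.read(2)

print(lines)
print(magic)  # b'\x1f\x8b'
```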

Calisthenics

Exception handling

Using a try-except-finally block, write a function which takes in a list of numbers and returns a list of all the elements up until the first negative number.

In [ ]:
# Type your code here

Quantiles

Write a function to compute the median of a list of numbers

In [1]:
# Type your code here

Now write a function to compute the \(p^\text{th}\) percentile

In [2]:
# Type your code here

File I/O

Write a function which creates a file with \(n\) numbered lines

In [3]:
# Type your code here

Write a function which appends to that file an extra \(m\) lines

In [4]:
# Type your code here