Standard Library II

Start in JupyterHub

functools

https://docs.python.org/3/library/functools.html

Some useful features of functools include the ability to automatically cache value as well as to automatically generate rich comparison methods from a single definition (see MyFraction). wraps can be used when defining function wrappers in decorators to copy the original function’s name and docstring.

# https://en.wikipedia.org/wiki/Levenshtein_distance
def naive_distance(text1, text2):
    
    def compute(ofst1, ofst2):
        if min(ofst1, ofst2) == 0:
            return max(ofst1, ofst2)
        
        return min(
            compute(ofst1 - 1, ofst2) + 1,
            compute(ofst1, ofst2 - 1) + 1,
            compute(ofst1 - 1, ofst2 - 1) + (text1[ofst1 - 1] != text2[ofst2 - 1]),
        )
        
    return compute(len(text1), len(text2))
%%timeit
# huge number of redundant calls
naive_distance("shellfish", "selfish")
179 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
from functools import lru_cache

def distance(text1, text2):
    
    # automatically cache results
    @lru_cache(None)
    def compute(ofst1, ofst2):
        if min(ofst1, ofst2) == 0:
            return max(ofst1, ofst2)
        
        return min(
            compute(ofst1 - 1, ofst2) + 1,
            compute(ofst1, ofst2 - 1) + 1,
            compute(ofst1 - 1, ofst2 - 1) + (text1[ofst1 - 1] != text2[ofst2 - 1]),
        )

    return compute(len(text1), len(text2))
%%timeit
distance("shellfish", "fellshish")
99.2 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

contextlib

https://docs.python.org/3/library/contextlib.html

contextlib provides basic context managers that you would have to write yourself otherwise. @contextmanager can be used to define a context manager using generator syntax. Take a look at nullcontext, closing, suppress and the redirect_* functions.

Assignment SL7

  1. Unbounded Cache lru_cache(None) speichert beliebig viele Elemente anstatt sich nur die zuletzt angefragten Elemente zu merken. Schreiben Sie einen Decorator unbounded_cache, der genau dasselbe macht wie lru_cache(None). unbounded_cache soll ein dict nutzen, um bereits berechnete Funktionswerte zu speichern und zurückzugeben, ohne dass die gewrappte Funktion aufgerufen wird. Es ist okay, wenn Ihre Version keine Keyword-Argumente unterstützt.

def unbounded_cache(f):
    cache = {}
    @functools.wraps(f)
    def wrapper(*args):
        try:
            return cache[args]
        except KeyError:
            result = f(*args)
            cache[args] = result
            return result
    return wrapper
  1. @contextmanager Im Kurs haben wir den Kontextmanager string_style als Klasse implementiert. Implementieren Sie string_style erneut unter Verwendung von @contextmanager aus contextlib.

from contextlib import contextmanager

@contextmanager
def string_style(new_style):
    global _STRING_REPRESENTATION
    old_style = _STRING_REPRESENTATION
    _STRING_REPRESENTATION = new_style
    try:
        yield
    finally:
        _STRING_REPRESENTATION = old_style

pathlib

https://docs.python.org/3/library/pathlib.html

pathlib provides path objects that represent file system paths. This makes directory structures more intuitive to navigate in Python. Take a close look at glob and rglob since these can recursively enumerate files in a folder hierarchy. An an example, we can use pathlib to merge multiple folders containing files with overlapping names. This is especially useful for photos.

argparse

https://docs.python.org/3/library/argparse.html

Library for argument parsing. See pathlib example code.

re

https://docs.python.org/3/library/re.html

Also see https://regexr.com/

In this example, we want to parse a list of ranges from a string. A range including index 1 to 5 is written as 1-5, but we also want to allow arbitrary spacing, e.g. 1 -    5.

import re
# match one or more digits
re.match("[0-9]+", "1")
<re.Match object; span=(0, 1), match='1'>
# this one does not match
re.match("[0-9]+", "")
# match one or more digits, a hyphen, and one or more digits again
re.match("[0-9]+-[0-9]+", "1-5-")
<re.Match object; span=(0, 3), match='1-5'>
# force the match to take up the entire string
re.match("^[0-9]+-[0-9]+$", "1-5-")
# same as above, but more readable
re.fullmatch("[0-9]+-[0-9]+", "1-5-")
# we do not account for spaces yet
re.fullmatch("[0-9]+-[0-9]+", "1 - 5")
# \s* matches zero or more whitespace characters
re.fullmatch("[0-9]+\s*-\s*[0-9]+", "1 - 5")
<re.Match object; span=(0, 5), match='1 - 5'>
# we still do not account for spaces at the beginning and end of the string
re.fullmatch("[0-9]+\s*-\s*[0-9]+", " 1 - 5 ")
# this is easily fixable
re.fullmatch("\s*[0-9]+\s*-\s*[0-9]+\s*", " 1 - 5 ")
<re.Match object; span=(0, 7), match=' 1 - 5 '>
# parentheses denote capture groups that can be read later
re.fullmatch("\s*([0-9]+)\s*-\s*([0-9]+)\s*", " 1 - 5 ").groups()
('1', '5')
# capture groups can be named for better readability
re.fullmatch("\s*(?P<from>[0-9]+)\s*-\s*(?P<to>[0-9]+)\s*", " 1 - 5 ").groupdict()
{'from': '1', 'to': '5'}
# "?" means the previous token is optional
re.fullmatch("Aufgaben?", "Aufgabe")
<re.Match object; span=(0, 7), match='Aufgabe'>
re.fullmatch("Aufgaben?", "Aufgaben")
<re.Match object; span=(0, 8), match='Aufgaben'>
# regex is case-sensitive by default
re.fullmatch("Aufgaben?", "aufgaben")
# use the "I" flag for case-insensitive matching
re.fullmatch("Aufgaben?", "aufgaben", re.I)
<re.Match object; span=(0, 8), match='aufgaben'>
# make upper limit optional so we can write unbounded ranges like "5-"
# the addition of parentheses introduced another group, but we only want 2 groups
re.fullmatch("\s*(?P<from>[0-9]+)\s*-(\s*(?P<to>[0-9]+))?\s*", "1-5").groups()
('1', '5', '5')
# use "?:" to denote a non-capturing group
re.fullmatch("\s*(?P<from>[0-9]+)\s*-(?:\s*(?P<to>[0-9]+))?\s*", "1-5").groups()
('1', '5')
# also allows single numbers
re.fullmatch("\s*(?P<from>[0-9]+)\s*(?:-(?:\s*(?P<to>[0-9]+))?)?\s*", "1").groups()
('1', None)
# ranges can be separated by "," or ";"
indices = "1,2-4;6,10-"
shitty_indices = "    1      ;2-      4   , 6,10   -"
invalid_indices = "4;;7"
# no capture groups this time, but match multiple ranges using "*"
range_format = r"\s*[0-9]+\s*(?:-(?:\s*[0-9]+)?\s*)?"
multiple_ranges = f"{range_format}(?:[,;]{range_format})*"
# matches
re.fullmatch(multiple_ranges, indices)
<re.Match object; span=(0, 11), match='1,2-4;6,10-'>
# still matches
re.fullmatch(multiple_ranges, shitty_indices)
<re.Match object; span=(0, 34), match='    1      ;2-      4   , 6,10   -'>
# this does not match
re.fullmatch(multiple_ranges, invalid_indices)

The next step is parsing the actual ranges. We could use split and multiple iterations of strip, but that makes detecting illegal strings very tedious. Instead, fullmatch is used to check the format beforehand and finditer enumerates all matches for us and automatically extracts the relevant parts.

from collections import namedtuple

IndexRange = namedtuple('IndexRange', ['start', 'end'])
import re

RANGE_FORMAT = r"\s*[0-9]+\s*(?:-(?:\s*[0-9]+)?\s*)?"
MULTIPLE_RANGES_FORMAT = f"{RANGE_FORMAT}([,;]{RANGE_FORMAT})*"
# make sure the range is terminated by ";" or "," or the end of the string
TERMINATED_RANGE_WITH_LABELS = r"\s*(?:(?P<index>[0-9]+)|(?P<start>[0-9]+)\s*-\s*(?P<end>[0-9]+)?)\s*?(?:[,;]|$)"

def parse_index_ranges(ranges):
    if re.fullmatch(MULTIPLE_RANGES_FORMAT, ranges) is None:
        raise ValueError("Illegal format")
    out = []
    for match in re.finditer(TERMINATED_RANGE_WITH_LABELS, ranges):
        groups = match.groupdict()
        if groups["index"] is not None:
            index = int(groups["index"])
            out.append(IndexRange(index, index + 1))
        else:
            start = int(groups["start"])
            if groups["end"] is not None:
                end = int(groups["end"]) + 1
            else:
                end = None
            out.append(IndexRange(start, end))
    return out
parse_index_ranges("    1      ;2-      4   , 6,10   -")
[IndexRange(start=1, end=2),
 IndexRange(start=2, end=5),
 IndexRange(start=6, end=7),
 IndexRange(start=10, end=None)]
# use ranges to slice a sequence
def index_range_to_slice(index_range):
    return slice(index_range.start, index_range.end)
even_numbers = list(range(0, 50, 3))
[n for r in parse_index_ranges(indices) for n in even_numbers[index_range_to_slice(r)]]
[3, 6, 9, 12, 18, 30, 33, 36, 39, 42, 45, 48]

Note: Sometimes, re ist not the right tool for the job, especially when dealing with non-regular grammars. Consider using urllib for URLs, a JSON library for JSON data, and some SGML/XML library for HTML/XHTML. A famous quote goes

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.”
Now they have two problems. 

Assignment SL8 🧑‍🏫

  1. Ranges als String Schreiben Sie eine Funktion, die die mit Regex geparste Liste von Ranges wieder in einen String umwandelt.

def _range_as_string(r):
    if r.end is None:
        return f"{r.start}-"
    if r.start == r.end - 1:
        return f"{r.start}"
    return f"{r.start}-{r.end - 1}"
    
def ranges_as_string(ranges):
    return ",".join(_range_as_string(r) for r in ranges)
ranges = parse_index_ranges("    1      ;2-      4   , 6,10   -")
ranges_as_string(ranges)
'1,2-4,6,10-'
  1. Punktzahl finden Sie erhalten eine Textdatei im unten dargestellen Format. Schreiben Sie eine Funktion, die die Gesamtpunktzahl aus der Datei extrahiert und als int zurückgibt.

Bewertung:

Aufgabe 1:
schön, 2 Punkte

Aufgabe 2:
nichts abgegeben
0 Punkte

Aufgabe 3:
literally bester Code, 8 Punkte

Gesamtpunktzahl: 10 Punkte

Musterlösungen sind im Reader!
import re

def find_total_points(file):
    # file is expected to be small, read everything at once
    content = file.read()
    match = re.search(r"Gesamtpunktzahl:\s*(?P<points>[+-]?\s*[0-9]+)\s*Punkte?", content)
    if match is None:
        raise ValueError("Total points not found")
    points = match.group("points").replace(" ", "")
    return int(points)
  1. Gebrochene Punktzahl finden Wie vorher, aber die Gesamtpunktzahl kann jetzt eine Kommazahl sein, also 2.3 oder 2,3. Passen Sie Ihre Funktion so an, dass sie damit zurechtkommt.

import re

def find_total_points_float(file):
    content = file.read()
    match = re.search(r"Gesamtpunktzahl:\s*(?P<points>[+-]?\s*[0-9]+(?:[,.][0-9]+)?)\s*Punkte?", content)
    if match is None:
        raise ValueError("Total points not found")
    points = match.group("points").replace(" ", "").replace(",", ".")
    return float(points)
import io

find_total_points_float(io.StringIO("""\
Bewertung:

Aufgabe 1:
schön, 2 Punkte

Aufgabe 2:
nichts abgegeben
0 Punkte

Aufgabe 3:
literally bester Code, 8 Punkte

Gesamtpunktzahl: - 10.3 Punkte

Musterlösungen sind im Reader!
"""))
-10.3

subprocess/ctypes

In some cases, you might not be able to find an appropriate library for your needs. Python still allows you to reuse existing code by providing interfaces to the shell and to C code. Using subprocess or ctypes might still be easier than writing code yourself.

https://docs.python.org/3/library/subprocess.html

https://docs.python.org/3/library/ctypes.html

Assignment LF6 🧑‍🏫

  1. Wörter zählen mit Sonderzeichen Schreiben Sie eine Variante von count_words aus SL1, die mittels regulärer Ausdrücke Sonderzeichen aus dem String entfernt, bevor die Wörter gezählt werden.