This is a write-up of what I did during the 2018 core sprint, hosted by Microsoft in Redmond Washington. For the second year in a row, I was invited. I am a long-time core developer. I joined python-dev when it was an invitation only mailing list. However, I have not been an active developer for many years. In the past, other personal commitments (children, running a large family farm in SK, etc) kept me for having time to work on Python. More recently, I have a bit more free time and so I'm very grateful to be invited.
First, I will explain the state of Python, as I see it. Python has become a hugely popular programming language. Any large software company likely depends on in some fashion to run their business. Originally, Python was popular for scripting (Guido's original motivation) or as a prototyping language. However, since software is extremely expensive to develop, Python has moved out of these areas and into "production". Companies like Facebook, Microsoft, Dropbox, and Google all use Python to run some of their production software systems.
This popularity is great but it has put pressure on the language and on language implementations (let's call them Python VMs, PyPy and IronPython are examples) to do more and do it with less CPU and memory resources. Programmers like Python for its high level nature and friendliness but sometimes run into performance issues. My goal at the sprint was to focus on some of the performance problems and to brain-storm with other core developers about possible solutions.
There are two common ways that Python gets used. First, it can be used for short running "scripts" or tools. In those cases, the startup time of the Python VM is significant because it can be a large fraction of the total execution time. A lot of work has been done to optimize Python 3 but in this regard, it is still lagging behind Python 2.7. I would like to finally surpass 2.7 and hopefully even leave it in the dust.
A second way Python is used is for long-running programs that act as servers. One example is a web application or some other service that is long lived and serves multiple network requests. For example, a web server that implements a dynamic web application. In those cases, startup time probably irrelevant. Instead, what matters is the execution speed of the program, the amount of memory used (some applications can easily chew up gigabytes of RAM) and the amount of OS load (e.g. page faults) generated by the program.
There were a number of ideas discussed with the goal of decreased startup time. They have two common themes: reduce the time needed to load module data (currently bytecode serialized by the marshal module, stored in .pyc files), or reduce the amount module data loaded.
For reducing the load time, there are a few ideas: analyze the marshal module and see if optimizations can be made. Yury suggested that the pickle module might actually be faster than marshal. If so, swapping out marshal for pickle is easy to implement.
The experimental frozen modules idea is promising but it will take some work before the feature is ready to merge (assuming we can decide on a design). With the patch as proposed, the frozen data is arch specific (unlike .pyc files). Making them arch specific gives some speed advantage. In the long term, loading might become even faster if the data stored in the data segment is "close" to the format of Python objects in RAM. In that case, since the data segment is already memory-mapped into the process space, only a quick "fix up" pass would be needed. E.g. fix object pointers that reference parts of the Python runtime.
My preferred design for the frozen modules is to allow putting the data in DLLs and then allow zero or more of them to be specified. You could use a command line option or an env var. The only thing embedded in the Python executable by default would be importlib (currently embedded using an older frozen format). That would allow a lot of flexibility. For users who want fat executables that contain everything (like Go), we could allow all need modules to be embedded. The user would have the option of adding more DLLs to a special frozen module path. So, they might have one DLL for stdlib, one for base 3rd party libraries, and one for each of their apps. If the user wants to load everything from .pyc files, just like current Python, they can not provide any DLLs.
Another idea to improve startup was my lazy module (i.e. multi-part module bytecode) scheme. It a crazy idea, probably not so likely to work in practice. However, it was useful to brainstorm about how it could work and what are some of the issues with it. I have a poor cell phone picture of my whiteboard scratchings. The theory is that many programs don't use the majority of globals, functions and class definitions that are defined in modules they import. So, keep those objects as unmarshalled code (maybe even leave it on the disk) until we know that we need it.
My proof of concept implementation of this idea has to modify the LOAD_GLOBAL opcode. If a global is loaded that has not yet been unmarshalled, that load operation must be trapped and the execution of the defining statement happens then. Doing this in a way that doesn't modify the module semantics is challenging. At minimum, we need modules to op-in to this behavior, e.g. by declaring:
__lazy_safe__ = True
at the global scope. Another problem is there are quite a lot of top-level statements that result in code execution at module import time. E.g. decorators, deriving from a base class with a meta-class hook, keyword arguments, etc. The behavior of import statement would be changed so that they all happen lazily. E.g. import foo would not execute the body of the foo until something actually touches a property of that module.
I enjoyed discussion of this idea with other core developers. I remember a lively discussion with Larry, Yury, Barry and Eric Snow in particular. Because of their deep knowledge of Python, they were able to quickly understand the general idea and to poke holes in it. E.g. that won't work because of X, where X is a deep and tricky detail of Python internals. While discussing the idea, Larry suggested an idea based on his previous experience with PHP. What if we can defer the unmarshal of code objects? E.g. when loading the code for a module, leave the code objects as serialized strings. The theory is that many functions get defined but not all of them get executed. It should be relatively easy to trap access to the code object and "wake" it up on demand. I am excited to try implementing this idea. It sounds easy implement, safe and could have nearly the same benefit of my more complicated lazy module idea.
While discussing Larry's PR, there was understandably some confusion about what kind of objects are being loaded from the data segment. E.g. if modules do platform specific initialization, would the frozen module still be cross-platform? The answer is yes. With proposed PR, the objects frozen are the code objects for module bodies. We are only removing the unmarshal step for loading .pyc data. The frozen data is equivalent to what is stored in the .pyc file. It is stored in a arch specific format however, to speed up loading.
It doesn't take a big leap to imagine getting rid of the module body execution step and dumping the module global state instead of the module code objects. That's interesting as it could drastically improve startup time. It would be similar to the "unexec" hack used by some dynamic languages. E.g. save the current heap image and then re-exec it on startup. If you do that, the time taken to execute things like global regex patterns would be shifted to compile time rather than program load/run time.
This is a crazy idea but perhaps worth thinking about. I like to think about crazy ideas. Does it make sense to compile regex every time you start a command-line tool? I would love the ability to pre-compute some things. I think you would need a mechanism to declare which code runs at compile time and what at import time. I seem to recall that Common Lisp systems have similar kinds of declarations and so we could look there for prior art. Maybe the syntax could look like the following:
PAT = __compiled__(re.compile(r'...'))
The basic idea is to selectively do some evaluation of code at compile time, rather than doing everyting at run-time. That raises a lot of potential issues. Implementing something like Larry's PR would open the door to experiments.
In addition to trying to decrease startup time, I also focused on trying to improve Python's performance by making it easier for alternative Python VMs to implement a compatible extension module API. The Design a better C API project is being lead by Victor. You should read that page if you are interested. I will explain my motivations for spending time on this effort. As stated in the intro, Python has become a hugely popular language. It would be nice if we can make it perform better. Companies like Facebook, Google, and Microsoft seem to be willing to put in effort in trying to improve things. One strategy is to put improvements and optimizations into CPython. Personally, I think focusing all optimization effort on CPython is a bad idea. In general, high performance comes with other trade-offs. For example, a poorer ability to debug and trace program execution. Or, a more complicated and harder to maintain implementation. So, I think it would be better if high performance experiments could be done in alternative implementations. That would be projects like PyPy, IronPython, and others.
Alternative Python VMs face a number of problems. Perhaps most serious is the fact that Python as a language is poorly specified (i.e. do whatever CPython does) and CPython is evolving quickly. More on that below regarding my "core Python idea". Another serious problem is that to be a useful Python VM, you likely need to support C extension modules. The PyPy developers have done an amazing job to make a higher performing VM that is also extremely language compatible with pure Python code. However, without being able to support the huge collection of C extension modules that CPython has, adoption has been limited. PyPy has gone the next step and built a highly compatible extension API (another amazing feat) but using the API gives up a lot of the performance benefit of PyPy.
The current Python API has been highly successful (look at all the useful extensions in the world). But, it exposes too many details about the internals of the CPython VM. VMs like PyPy have to work very hard to give extensions the same API. One of the goals of the new API project is to hide some of these details and make it possible for compatible extensions to run with good performance in PyPy while still being compatible with CPython. In short, we want an API that modules can use and perform well in both CPython and other VMs.
As with any language change, introducing compatibility problems can have huge costs for the Python ecosystem. No one wants another 2to3 experience. So, we have to proceed carefully. Victor addresses this on the readthedocs site linked above. My preferred plan is as follows. Introduce new APIs that extension modules can op-in to use. We would provide shim code so that these updated modules could still compile with old versions of CPython. Modules using the old API would continue to compile and work. Modules using the new API can expect better performance if used from VMs like PyPy. In the long term, we might warn if old APIs are being used. However, I don't see any immediate future where we would stop supporting the old API. Perhaps there could be optional CPython features that would require that no old APIs are used (e.g. my tagged pointers implementing fixed int experiment).
A relatively easy API change to make is to change the PyObject structure into an "opaque" type. That means from within extension modules, it is not possible to "see" the structure members. The PyObject structure is small: it contains ob_refcnt and ob_type. Most code does not interact with ob_refcnt directly. So, changing the few places that modify it to use C99 inline functions (_Py_GET_REFCNT, _Py_SET_REFCNT) instead is quite easy. For ob_type, it is used in a lot more places but changing the code to use a new Py_TP macro is also quite easy. There is quite a lot of code churn for the Py_TP changes, see my issue #34704 in the bug tracker.
Aside from good program design (i.e. do not reveal the implementation details of data structures), this change provides some specific benefits. For alternative VMs, implementing these fields on every object exposed via the extension API could be a big burden. If the VM uses non-ref counting GC internally, they might not have a ob_refcnt field and putting one in every object just chews up memory. For ob_type, the type structure might not map one-to-one from Python language type (e.g. dict) to underlying VM type. E.g. the VMs could use specialized types in certain places for better performance. So, allowing the VM a "hook" in the form of _Py_TYPE could give them more freedom of implementation strategy.
I think doing an opaque type without using inline functions could be quite challenging. However since we can now rely on C99 inline functions, it is not hard to make low (or zero) overhead data hiding APIs. We should make use of this C99 feature.
Another thing an opaque PyObject allows is the idea of using "tagged pointers" to represent certain data types. Tagged pointers are a common implementation trick for dynamic language VMs. Many implement fixed size integers using the trick. Pointer tags abuse the fact that on real machines, pointers are always aligned to some machine word boundary. E.g. on AMD64, pointers are aligned on 8 byte boundaries. So, the bottom bits of all real pointers are zeros. Those bits can be "stolen" by the Python VM and then the rest of the word can hold other data. E.g. for fixed ints, you could steal the low order bit to denote a tagged int. The rest of the bits (63 on AMD64) would be used for a 2^63 integer.
While at the sprint, I greatly enjoyed discussing issues related to the tagged pointers and other things with Dino Viehland. Dino has a wealth of language implementation experience as he has worked on IronPython, CLR and the Pyjion JIT. I am a hobbyist and so getting to pick the brain of a real expert is a treat. BTW, I got to meet Anders Hejlsberg one night. That was cool. In college my intro computer science courses used Turbo Pascal. I loved that IDE: clean, fast and powerful. Later, I had some experience with Delphi. For GUI programming on Windows, Delphi was amazing. For the little I know of C#, it is an excellent language. So, I have great respect for Anders Hejlsberg in terms of programming language implementation.
To prove the tagged pointer idea works with Python, I implemented an experimental version in my tagged_int branch of cpython. As of yet, I'm not getting a net performance win (INCREF/DECREF and Py_TP() are more expensive). Some early benchmark data is below (PGO build, fixedint5 is version with tagged pointers for fixed ints).
$ ./python -m perf compare_to -G ../cpython-profile-tagged-off/base4.json fixedint5.json --min-speed 5 Slower (24): - pickle_list: 3.06 us +- 0.04 us -> 3.74 us +- 0.03 us: 1.22x slower (+22%) - pickle_dict: 22.2 us +- 0.1 us -> 26.2 us +- 0.2 us: 1.18x slower (+18%) - raytrace: 501 ms +- 5 ms -> 565 ms +- 6 ms: 1.13x slower (+13%) - crypto_pyaes: 113 ms +- 1 ms -> 126 ms +- 0 ms: 1.12x slower (+12%) - logging_silent: 210 ns +- 4 ns -> 234 ns +- 3 ns: 1.11x slower (+11%) - telco: 6.00 ms +- 0.09 ms -> 6.68 ms +- 0.14 ms: 1.11x slower (+11%) - float: 111 ms +- 2 ms -> 123 ms +- 1 ms: 1.11x slower (+11%) - nbody: 122 ms +- 1 ms -> 135 ms +- 2 ms: 1.10x slower (+10%) - mako: 17.1 ms +- 0.1 ms -> 18.8 ms +- 0.1 ms: 1.10x slower (+10%) - json_dumps: 12.3 ms +- 0.2 ms -> 13.5 ms +- 0.1 ms: 1.10x slower (+10%) - scimark_monte_carlo: 103 ms +- 2 ms -> 113 ms +- 1 ms: 1.10x slower (+10%) - pickle_pure_python: 467 us +- 3 us -> 508 us +- 6 us: 1.09x slower (+9%) - logging_format: 10.2 us +- 0.1 us -> 11.1 us +- 2.2 us: 1.09x slower (+9%) - chameleon: 9.27 ms +- 0.09 ms -> 10.1 ms +- 0.1 ms: 1.09x slower (+9%) - sqlalchemy_imperative: 30.4 ms +- 0.8 ms -> 32.9 ms +- 0.9 ms: 1.08x slower (+8%) - django_template: 122 ms +- 2 ms -> 131 ms +- 2 ms: 1.08x slower (+8%) - sympy_str: 184 ms +- 2 ms -> 198 ms +- 5 ms: 1.07x slower (+7%) - unpickle_pure_python: 368 us +- 5 us -> 394 us +- 9 us: 1.07x slower (+7%) - sympy_expand: 426 ms +- 10 ms -> 452 ms +- 12 ms: 1.06x slower (+6%) - sympy_sum: 90.4 ms +- 0.6 ms -> 96.0 ms +- 1.0 ms: 1.06x slower (+6%) - regex_compile: 181 ms +- 7 ms -> 192 ms +- 7 ms: 1.06x slower (+6%) - scimark_lu: 173 ms +- 6 ms -> 182 ms +- 5 ms: 1.05x slower (+5%) : - genshi_xml: 62.7 ms +- 0.8 ms -> 66.1 ms +- 0.8 ms: 1.05x slower (+5%) - pickle: 9.11 us +- 0.13 us -> 9.59 us +- 0.06 us: 1.05x slower (+5%) Faster (2): - unpack_sequence: 49.1 ns +- 0.7 ns -> 45.0 ns +- 1.3 ns: 1.09x faster (-8%) - scimark_sparse_mat_mult: 3.75 ms +- 0.05 ms -> 3.47 ms +- 0.05 ms: 1.08x faster (-8%) Benchmark hidden because not significant (29): 2to3, chaos, deltablue, dulwich_log, fannkuch, genshi_text, go, hexiom, html5lib, json_loads, logging_simple, meteor_contest, nqueens, pathlib, pidigits, python_startup, python_startup_no_site, regex_dna, regex_effbot, regex_v8, richards, scimark_fft, scimark_sor, spectral_norm, sqlite_synth, sympy_integrate, tornado_http, unpickle, unpickle_list Ignored benchmarks (5) of ../cpython-profile-tagged-off/base4.json: sqlalchemy_declarative, xml_etree_generate, xml_etree_iterparse, xml_etree_parse, xml_etree_process Ignored benchmarks (4) of fixedint5.json: xml_etree_pure_python_generate, xml_etree_pure_python_iterparse, xml_etree_pure_python_parse, xml_etree_pure_python_process
One problematic feature of the current extension API is that a number of heavily used functions return a borrowed reference to a PyObject. In CPython that is done to avoid the cost and trouble of doing an INCREF/DECREF on it. If the reference is only used for a "short time" (sometimes not so clear what exactly is safe), the callee does not call INCREF and the caller does not call DECREF. Victor has a more detailed description on the readthedocs site.
While discussing some language implementation challenges with Dino and Carl Shapiro, Carl made an interesting comment about borrowed references. If I understand his explanation correctly, the borrowed reference APIs are not a big challenge for his VM. His VM uses a compacting garbage collector and so I would have expected some trouble. Instead, Carl says that he must already implement support for weak references. So, he just creates a weak reference before returning the borrowed reference. This design has a bonus feature. If can be expensive to box and/or create a handle for an object returned to an extension. By creating the weakref and keeping it around a bit, the VM can sometimes avoid repeatedly doing this operation for an object that is returned frequently.
If it turns out the borrowed reference APIs are okay, that would be great. Fixing their usage in extension modules is non-trivial. In general, you cannot use an automated tool since you have to manually add the DECREF call in the correct spot. Also, for CPython, there would be a performance cost to switching all extensions to the strong reference APIs. Extensions can switch on their own as I believe there are already APIs (e.g. PyObject_GetItem rather than PyList_GetItem).
Carl suggests that the PyModule_GetDict is a problem. VMs might want to implement module namespaces using something other than dict. So, forcing them to give up their "dict" is not friendly. Instead, extensions can use other existing APIS like PyObject_SetAttr.
Python takes a large performance hit by using dicts as namespace (module globals, instance properties). I had some discussion with Carl and Dino about other strategies. We could implement something like fast globals, using a slot based data structure tied to modules. If code is executed in a different namespace, we would fix up the slot offsets or switch to name based global set/get. I think instance attributes can be optimized similarly using the "hidden classes" design. We dicussed changing the frames so that instead of referring to 'globals' and 'builtins', they simply refer to a module object. Then, the module can implement its own get/set operations for global variables. Calling PyModule_GetDict() could return a proxy object that looks like a dict but modifies the module behind it.
Data types allocated on the data segment are also an issue. Ideally, all objects, including types will be allocated by the VM runtime. That gives them the ability to place them in the correct area for GC to work best. So, Carl would like to see PyType_Ready go away and people use PyType_FromSpec or a similar API. PyType_FromSpec currently has a few issues. However, I don't think those issues are too hard to fix. Moving extensions to move from statically allocated types to PyType_FromSpec might be a little trouble. Maybe it can be mostly automated by code fixing tools?
Fixing Python's extension API is not a matter of throwing everything out and starting over. Instead, the API we already have contains a mostly good API hidden under a large and not so good API. So, we mostly just need to deprecate the bad stuff, provide a few new APIs (e.g. Py_TP(), Py_SET_TYPE(), etc), and we will mostly be where we want. Trying to stick with exactly the API we currently have is not good for the evolution of Python. We need to let alternative implementations bloom and for them to be successful, they need to support the extension modules that people depend on.
This is an idea that occurred to me during the sprint. It is not a new one. The idea of the stdlib being separate has been around for years. With the goal of allowing alternative Python VMs to bloom, I think we should revisit this idea and see what can be done. Most programming language implementations strive to be self-hosted. I don't think that is ever in the future for CPython. Still, we can take some steps to become closer to being self-hosted. Currently, we have a giant "ball of mud". Whatever is in the cpython git repo is CPython. That's not so great.
I would like to see things split into a core and non-core part. Everything absolutely needed to get the VM up and running would be part of core. Everything else can go into the non-core part. As an example, I don't think the read-eval-print (REPL) needs to be in the core. Instead, it can be implemented in pure Python, perhaps with some accelerator extension modules, and live in the non-core part. For day-to-day use, the core and non-core parts would be re-assembled and so for end users, at least initially, they would see no difference in how Python behaves. In the long term, we could consider ideas like a separate versioning and release schedule for the core.
This separation would achieve at least two useful things. First, alternative implementations will have a smaller set of functionality to implement. If they implement the core then ideally there would be pure Python implementations for most of the non-core parts. For extension modules that wrap libraries, we should use something like CFFI rather than extension modules if feasible. Wrapping a library with CFFI should make it usable by multiple Python VMs.
Second, it will make it more clear that a certain new feature is core or non-core. We should try to reduce new core features if they can be accomplished some way in the existing language. PyPy struggles to keep up with the fast evolution of CPython. The dataclasses is a good example of something that was done as a non-core (pure Python) implementation. Now, having a pure Python implementation doesn't preclude you from providing a C extension that accelerates it (e.g. cpickle vs pickle). However, the behavior of the library should be defined by the pure Python version. That prevents weird implementation specific quirks from sneaking into the language "spec".
I haven't been closely following the development of asyncio but I think it could be another example of the benefit of the core Python idea. With that design, I suspect large parts of the asyncio package could be moved to non-core. Then, what is left will become easier to review, document, test and refine. Alternative async libraries like Trio would get to use the same core functionally as asyncio. I don't know enough to know how Trio and asyncio compare but it would seem that friendly competition is healthy. The current asyncio is trying to do two jobs: first provide the needed low-level features to efficiently implement async stuff. Second, it is trying to provide a friendly and high-level API. The former probably needs to be in core. The latter is probably better in non-core.