The Internal Structure of Python Eggs

STOP! This is not the first document you should read!

Eggs and their Formats

A “Python egg” is a logical structure embodying the release of a specific version of a Python project, comprising its code, resources, and metadata. There are multiple formats that can be used to physically encode a Python egg, and others can be developed. However, a key principle of Python eggs is that they should be discoverable and importable. That is, it should be possible for a Python application to easily and efficiently find out what eggs are present on a system, and to ensure that the desired eggs’ contents are importable.

There are two basic formats currently implemented for Python eggs:

  1. .egg format: a directory or zipfile containing the project’s code and resources, along with an EGG-INFO subdirectory that contains the project’s metadata

  2. .egg-info format: a file or directory placed adjacent to the project’s code and resources, that directly contains the project’s metadata.

Both formats can include arbitrary Python code and resources, including static data files, package and non-package directories, Python modules, C extension modules, and so on. But each format is optimized for different purposes.

The .egg format is well-suited to distribution and the easy uninstallation or upgrades of code, since the project is essentially self-contained within a single directory or file, unmingled with any other projects’ code or resources. It also makes it possible to have multiple versions of a project simultaneously installed, such that individual programs can select the versions they wish to use.

The .egg-info format, on the other hand, was created to support backward-compatibility, performance, and ease of installation for system packaging tools that expect to install all projects’ code and resources to a single directory (e.g. site-packages). Placing the metadata in that same directory simplifies the installation process, since it isn’t necessary to create .pth files or otherwise modify sys.path to include each installed egg.

Its disadvantage, however, is that it provides no support for clean uninstallation or upgrades, and of course only a single version of a project can be installed to a given directory. Thus, support from a package management tool is required. (This is why setuptools’ “install” command refers to this type of egg installation as “single-version, externally managed”.) Also, they lack sufficient data to allow them to be copied from their installation source. easy_install can “ship” an application by copying .egg files or directories to a target location, but it cannot do this for .egg-info installs, because there is no way to tell what code and resources belong to a particular egg – there may be several eggs “scrambled” together in a single installation location, and the .egg-info format does not currently include a way to list the files that were installed. (This may change in a future version.)

Code and Resources

The layout of the code and resources is dictated by Python’s normal import layout, relative to the egg’s “base location”.

For the .egg format, the base location is the .egg itself. That is, adding the .egg filename or directory name to sys.path makes its contents importable.

For the .egg-info format, however, the base location is the directory that contains the .egg-info, and thus it is the directory that must be added to sys.path to make the egg importable. (Note that this means that the “normal” installation of a package to a sys.path directory is sufficient to make it an “egg” if it has an .egg-info file or directory installed alongside of it.)

Project Metadata

If eggs contained only code and resources, there would of course be no difference between them and any other directory or zip file on sys.path. Thus, metadata must also be included, using a metadata file or directory.

For the .egg format, the metadata is placed in an EGG-INFO subdirectory, directly within the .egg file or directory. For the .egg-info format, metadata is stored directly within the .egg-info directory itself.

The minimum project metadata that all eggs must have is a standard Python PKG-INFO file, named PKG-INFO and placed within the metadata directory appropriate to the format. Because it’s possible for this to be the only metadata file included, .egg-info format eggs are not required to be a directory; they can just be a .egg-info file that directly contains the PKG-INFO metadata. This eliminates the need to create a directory just to store one file. This option is not available for .egg formats, since setuptools always includes other metadata. (In fact, setuptools itself never generates .egg-info files, either; the support for using files was added so that the requirement could easily be satisfied by other tools, such as distutils).

In addition to the PKG-INFO file, an egg’s metadata directory may also include files and directories representing various forms of optional standard metadata (see the section on Standard Metadata, below) or user-defined metadata required by the project. For example, some projects may define a metadata format to describe their application plugins, and metadata in this format would then be included by plugin creators in their projects’ metadata directories.

Filename-Embedded Metadata

To allow introspection of installed projects and runtime resolution of inter-project dependencies, a certain amount of information is embedded in egg filenames. At a minimum, this includes the project name, and ideally will also include the project version number. Optionally, it can also include the target Python version and required runtime platform if platform-specific C code is included. The syntax of an egg filename is as follows:

name ["-" version ["-py" pyver ["-" required_platform]]] "." ext

The “name” and “version” should be escaped using pkg_resources functions safe_name() and safe_version() respectively then using to_filename(). Note that the escaping is irreversible and the original name can only be retrieved from the distribution metadata. For a detailed description of these transformations, please see the “Parsing Utilities” section of the pkg_resources manual.

The “pyver” string is the Python major version, as found in the first 3 characters of sys.version. “required_platform” is essentially a distutils get_platform() string, but with enhancements to properly distinguish Mac OS versions. (See the get_build_platform() documentation in the “Platform Utilities” section of the pkg_resources manual for more details.)

Finally, the “ext” is either .egg or .egg-info, as appropriate for the egg’s format.

Normally, an egg’s filename should include at least the project name and version, as this allows the runtime system to find desired project versions without having to read the egg’s PKG-INFO to determine its version number.

Setuptools, however, only includes the version number in the filename when an .egg file is built using the bdist_egg command, or when an .egg-info directory is being installed by the install_egg_info command. When generating metadata for use with the original source tree, it only includes the project name, so that the directory will not have to be renamed each time the project’s version changes.

This is especially important when version numbers change frequently, and the source metadata directory is kept under version control with the rest of the project. (As would be the case when the project’s source includes project-defined metadata that is not generated from by setuptools from data in the setup script.)

Standard Metadata

In addition to the minimum required PKG-INFO metadata, projects can include a variety of standard metadata files or directories, as described below. Except as otherwise noted, these files and directories are automatically generated by setuptools, based on information supplied in the setup script or through analysis of the project’s code and resources.

Most of these files and directories are generated via “egg-info writers” during execution of the setuptools egg_info command, and are listed in the egg_info.writers entry point group defined by setuptools’ own setup.py file.

Project authors can register their own metadata writers as entry points in this group (as described in the setuptools manual under “Adding new EGG-INFO Files”) to cause setuptools to generate project-specific metadata files or directories during execution of the egg_info command. It is up to project authors to document these new metadata formats, if they create any.

.txt File Formats

Files described in this section that have .txt extensions have a simple lexical format consisting of a sequence of text lines, each line terminated by a linefeed character (regardless of platform). Leading and trailing whitespace on each line is ignored, as are blank lines and lines whose first nonblank character is a # (comment symbol). (This is the parsing format defined by the yield_lines() function of the pkg_resources module.)

All .txt files defined by this section follow this format, but some are also “sectioned” files, meaning that their contents are divided into sections, using square-bracketed section headers akin to Windows .ini format. Note that this does not imply that the lines within the sections follow an .ini format, however. Please see an individual metadata file’s documentation for a description of what the lines and section names mean in that particular file.

Sectioned files can be parsed using the split_sections() function; see the “Parsing Utilities” section of the pkg_resources manual for for details.

Dependency Metadata

requires.txt

This is a “sectioned” text file. Each section is a sequence of “requirements”, as parsed by the parse_requirements() function; please see the pkg_resources manual for the complete requirement parsing syntax.

The first, unnamed section (i.e., before the first section header) in this file is the project’s core requirements, which must be installed for the project to function. (Specified using the install_requires keyword to setup()).

The remaining (named) sections describe the project’s “extra” requirements, as specified using the extras_require keyword to setup(). The section name is the name of the optional feature, and the section body lists that feature’s dependencies.

Note that it is not normally necessary to inspect this file directly; pkg_resources.Distribution objects have a requires() method that can be used to obtain Requirement objects describing the project’s core and optional dependencies.

setup_requires.txt

Much like requires.txt except represents the requirements specified by the setup_requires parameter to the Distribution.

depends.txt – Obsolete, do not create!

This file follows an identical format to requires.txt, but is obsolete and should not be used. The earliest versions of setuptools required users to manually create and maintain this file, so the runtime still supports reading it, if it exists. The new filename was created so that it could be automatically generated from setup() information without overwriting an existing hand-created depends.txt, if one was already present in the project’s source .egg-info directory.

namespace_packages.txt – Namespace Package Metadata

A list of namespace package names, one per line, as supplied to the namespace_packages keyword to setup(). Please see the manuals for setuptools and pkg_resources for more information about namespace packages.

entry_points.txt – “Entry Point”/Plugin Metadata

This is a “sectioned” text file, whose contents encode the entry_points keyword supplied to setup(). All sections are named, as the section names specify the entry point groups in which the corresponding section’s entry points are registered.

Each section is a sequence of “entry point” lines, each parseable using the EntryPoint.parse classmethod; please see the pkg_resources manual for the complete entry point parsing syntax.

Note that it is not necessary to parse this file directly; the pkg_resources module provides a variety of APIs to locate and load entry points automatically. Please see the setuptools and pkg_resources manuals for details on the nature and uses of entry points.

The scripts Subdirectory

This directory is currently only created for .egg files built by the setuptools bdist_egg command. It will contain copies of all of the project’s “traditional” scripts (i.e., those specified using the scripts keyword to setup()). This is so that they can be reconstituted when an .egg file is installed.

The scripts are placed here using the distutils’ standard install_scripts command, so any #! lines reflect the Python installation where the egg was built. But instead of copying the scripts to the local script installation directory, EasyInstall writes short wrapper scripts that invoke the original scripts from inside the egg, after ensuring that sys.path includes the egg and any eggs it depends on. For more about script wrappers, see the section below on Installation and Path Management Issues.

Zip Support Metadata

native_libs.txt

A list of C extensions and other dynamic link libraries contained in the egg, one per line. Paths are /-separated and relative to the egg’s base location.

This file is generated as part of bdist_egg processing, and as such only appears in .egg files (and .egg directories created by unpacking them). It is used to ensure that all libraries are extracted from a zipped egg at the same time, in case there is any direct linkage between them. Please see the Zip File Issues section below for more information on library and resource extraction from .egg files.

eager_resources.txt

A list of resource files and/or directories, one per line, as specified via the eager_resources keyword to setup(). Paths are /-separated and relative to the egg’s base location.

Resource files or directories listed here will be extracted simultaneously, if any of the named resources are extracted, or if any native libraries listed in native_libs.txt are extracted. Please see the setuptools manual for details on what this feature is used for and how it works, as well as the Zip File Issues section below.

zip-safe and not-zip-safe

These are zero-length files, and either one or the other should exist. If zip-safe exists, it means that the project will work properly when installed as an .egg zipfile, and conversely the existence of not-zip-safe means the project should not be installed as an .egg file. The zip_safe option to setuptools’ setup() determines which file will be written. If the option isn’t provided, setuptools attempts to make its own assessment of whether the package can work, based on code and content analysis.

If neither file is present at installation time, EasyInstall defaults to assuming that the project should be unzipped. (Command-line options to EasyInstall, however, take precedence even over an existing zip-safe or not-zip-safe file.)

Note that these flag files appear only in .egg files generated by bdist_egg, and in .egg directories created by unpacking such an .egg file.

top_level.txt – Conflict Management Metadata

This file is a list of the top-level module or package names provided by the project, one Python identifier per line.

Subpackages are not included; a project containing both a foo.bar and a foo.baz would include only one line, foo, in its top_level.txt.

This data is used by pkg_resources at runtime to issue a warning if an egg is added to sys.path when its contained packages may have already been imported.

(It was also once used to detect conflicts with non-egg packages at installation time, but in more recent versions, setuptools installs eggs in such a way that they always override non-egg packages, thus preventing a problem from arising.)

SOURCES.txt – Source Files Manifest

This file is roughly equivalent to the distutils’ MANIFEST file. The differences are as follows:

  • The filenames always use / as a path separator, which must be converted back to a platform-specific path whenever they are read.

  • The file is automatically generated by setuptools whenever the egg_info or sdist commands are run, and it is not user-editable.

Although this metadata is included with distributed eggs, it is not actually used at runtime for any purpose. Its function is to ensure that setuptools-built source distributions can correctly discover what files are part of the project’s source, even if the list had been generated using revision control metadata on the original author’s system.

In other words, SOURCES.txt has little or no runtime value for being included in distributed eggs, and it is possible that future versions of the bdist_egg and install_egg_info commands will strip it before installation or distribution. Therefore, do not rely on its being available outside of an original source directory or source distribution.

Other Technical Considerations

Zip File Issues

Although zip files resemble directories, they are not fully substitutable for them. Most platforms do not support loading dynamic link libraries contained in zipfiles, so it is not possible to directly import C extensions from .egg zipfiles. Similarly, there are many existing libraries – whether in Python or C – that require actual operating system filenames, and do not work with arbitrary “file-like” objects or in-memory strings, and thus cannot operate directly on the contents of zip files.

To address these issues, the pkg_resources module provides a “resource API” to support obtaining either the contents of a resource, or a true operating system filename for the resource. If the egg containing the resource is a directory, the resource’s real filename is simply returned. However, if the egg is a zipfile, then the resource is first extracted to a cache directory, and the filename within the cache is returned.

The cache directory is determined by the pkg_resources API; please see the set_cache_path() and get_default_cache() documentation for details.

The Extraction Process

Resources are extracted to a cache subdirectory whose name is based on the enclosing .egg filename and the path to the resource. If there is already a file of the correct name, size, and timestamp, its filename is returned to the requester. Otherwise, the desired file is extracted first to a temporary name generated using mkstemp(".$extract",target_dir), and then its timestamp is set to match the one in the zip file, before renaming it to its final name. (Some collision detection and resolution code is used to handle the fact that Windows doesn’t overwrite files when renaming.)

If a resource directory is requested, all of its contents are recursively extracted in this fashion, to ensure that the directory name can be used as if it were valid all along.

If the resource requested for extraction is listed in the native_libs.txt or eager_resources.txt metadata files, then all resources listed in either file will be extracted before the requested resource’s filename is returned, thus ensuring that all C extensions and data used by them will be simultaneously available.

Extension Import Wrappers

Since Python’s built-in zip import feature does not support loading C extension modules from zipfiles, the setuptools bdist_egg command generates special import wrappers to make it work.

The wrappers are .py files (along with corresponding .pyc and/or .pyo files) that have the same module name as the corresponding C extension. These wrappers are located in the same package directory (or top-level directory) within the zipfile, so that say, foomodule.so will get a corresponding foo.py, while bar/baz.pyd will get a corresponding bar/baz.py.

These wrapper files contain a short stanza of Python code that asks pkg_resources for the filename of the corresponding C extension, then reloads the module using the obtained filename. This will cause pkg_resources to first ensure that all of the egg’s C extensions (and any accompanying “eager resources”) are extracted to the cache before attempting to link to the C library.

Note, by the way, that .egg directories will also contain these wrapper files. However, Python’s default import priority is such that C extensions take precedence over same-named Python modules, so the import wrappers are ignored unless the egg is a zipfile.

Installation and Path Management Issues

Python’s initial setup of sys.path is very dependent on the Python version and installation platform, as well as how Python was started (i.e., script vs. -c vs. -m vs. interactive interpreter). In fact, Python also provides only two relatively robust ways to affect sys.path outside of direct manipulation in code: the PYTHONPATH environment variable, and .pth files.

However, with no cross-platform way to safely and persistently change environment variables, this leaves .pth files as EasyInstall’s only real option for persistent configuration of sys.path.

But .pth files are rather strictly limited in what they are allowed to do normally. They add directories only to the end of sys.path, after any locally-installed site-packages directory, and they are only processed in the site-packages directory to start with.

This is a double whammy for users who lack write access to that directory, because they can’t create a .pth file that Python will read, and even if a sympathetic system administrator adds one for them that calls site.addsitedir() to allow some other directory to contain .pth files, they won’t be able to install newer versions of anything that’s installed in the systemwide site-packages, because their paths will still be added after site-packages.

So EasyInstall applies two workarounds to solve these problems.

The first is that EasyInstall leverages .pth files’ “import” feature to manipulate sys.path and ensure that anything EasyInstall adds to a .pth file will always appear before both the standard library and the local site-packages directories. Thus, it is always possible for a user who can write a Python-read .pth file to ensure that their packages come first in their own environment.

Second, when installing to a PYTHONPATH directory (as opposed to a “site” directory like site-packages) EasyInstall will also install a special version of the site module. Because it’s in a PYTHONPATH directory, this module will get control before the standard library version of site does. It will record the state of sys.path before invoking the “real” site module, and then afterwards it processes any .pth files found in PYTHONPATH directories, including all the fixups needed to ensure that eggs always appear before the standard library in sys.path, but are in a relative order to one another that is defined by their PYTHONPATH and .pth-prescribed sequence.

The net result of these changes is that sys.path order will be as follows at runtime:

  1. The sys.argv[0] directory, or an empty string if no script is being executed.

  2. All eggs installed by EasyInstall in any .pth file in each PYTHONPATH directory, in order first by PYTHONPATH order, then normal .pth processing order (which is to say alphabetical by .pth filename, then by the order of listing within each .pth file).

  3. All eggs installed by EasyInstall in any .pth file in each “site” directory (such as site-packages), following the same ordering rules as for the ones on PYTHONPATH.

  4. The PYTHONPATH directories themselves, in their original order

  5. Any paths from .pth files found on PYTHONPATH that were not eggs installed by EasyInstall, again following the same relative ordering rules.

  6. The standard library and “site” directories, along with the contents of any .pth files found in the “site” directories.

Notice that sections 1, 4, and 6 comprise the “normal” Python setup for sys.path. Sections 2 and 3 are inserted to support eggs, and section 5 emulates what the “normal” semantics of .pth files on PYTHONPATH would be if Python natively supported them.

For further discussion of the tradeoffs that went into this design, as well as notes on the actual magic inserted into .pth files to make them do these things, please see also the following messages to the distutils-SIG mailing list:

Script Wrappers

EasyInstall never directly installs a project’s original scripts to a script installation directory. Instead, it writes short wrapper scripts that first ensure that the project’s dependencies are active on sys.path, before invoking the original script. These wrappers have a #! line that points to the version of Python that was used to install them, and their second line is always a comment that indicates the type of script wrapper, the project version required for the script to run, and information identifying the script to be invoked.

The format of this marker line is:

"# EASY-INSTALL-" script_type ": " tuple_of_strings "\n"

The script_type is one of SCRIPT, DEV-SCRIPT, or ENTRY-SCRIPT. The tuple_of_strings is a comma-separated sequence of Python string constants. For SCRIPT and DEV-SCRIPT wrappers, there are two strings: the project version requirement, and the script name (as a filename within the scripts metadata directory). For ENTRY-SCRIPT wrappers, there are three: the project version requirement, the entry point group name, and the entry point name. (See the “Automatic Script Creation” section in the setuptools manual for more information about entry point scripts.)

In each case, the project version requirement string will be a string parseable with the pkg_resources modules’ Requirement.parse() classmethod. The only difference between a SCRIPT wrapper and a DEV-SCRIPT is that a DEV-SCRIPT actually executes the original source script in the project’s source tree, and is created when the “setup.py develop” command is run. A SCRIPT wrapper, on the other hand, uses the “installed” script written to the EGG-INFO/scripts subdirectory of the corresponding .egg zipfile or directory. (.egg-info eggs do not have script wrappers associated with them, except in the “setup.py develop” case.)

The purpose of including the marker line in generated script wrappers is to facilitate introspection of installed scripts, and their relationship to installed eggs. For example, an uninstallation tool could use this data to identify what scripts can safely be removed, and/or identify what scripts would stop working if a particular egg is uninstalled.