containertree

Why is tagging container trees useful? If each node stores which containers have that package, we can parse the tree to extract data or calculate similarity. Let’s start by showing how to export package data. This feature additionally requires the pandas module. Here are a few ways to install pandas:

pip install pandas
pip install containertree[analysis]
apt-get install -y python3-pandas

We will likely want to run some kind of analysis over container packages, and the first step is to extract a data frame. To do that, first create your tree and add some containers to it.

from containertree.tree import ContainerAptTree

# Initialize a tree with packages from one container, then add others
apt = ContainerAptTree('singularityhub/sregistry-cli', tag='singularityhub/sregistry-cli')
apt.update('library/debian', tag='library/debian')
apt.update('library/ubuntu', tag='library/ubuntu')

Here is how to export a pandas data frame with the packages:

df = apt.export_vectors()
df.head()
                              adduser  apt  autoconf  automake  autotools-dev  \
singularityhub/sregistry-cli      1.0  1.0       1.0       1.0            1.0   
library/ubuntu                    1.0  1.0       NaN       NaN            NaN   
library/debian                    1.0  1.0       NaN       NaN            NaN  

The rows represent the containers, and the columns the packages. A value of NaN indicates the package isn’t installed in the container, and 1.0 indicates that it is. Here is how to fill in 0 for the NaN values, if you prefer.

df = df.fillna(0)
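Once the missing values are filled, each row is a binary vector, which makes similarity easy to compute. The following is a minimal sketch, not part of containertree: the small data frame is a hand-built stand-in for the output of export_vectors() (its values are illustrative), and the jaccard helper is a hypothetical name. Jaccard similarity is one reasonable choice for presence/absence data, but others would work too.

```python
import pandas as pd

# Hypothetical stand-in for the output of export_vectors(); values are illustrative
df = pd.DataFrame(
    {
        "adduser": [1.0, 1.0, 1.0],
        "apt": [1.0, 1.0, 1.0],
        "autoconf": [1.0, None, None],
        "bzip2": [None, 1.0, None],
    },
    index=["singularityhub/sregistry-cli", "library/ubuntu", "library/debian"],
)

# Treat missing packages as absent, then convert to booleans
present = df.fillna(0).astype(bool)


def jaccard(vectors, a, b):
    """Shared packages divided by packages found in either container."""
    shared = (vectors.loc[a] & vectors.loc[b]).sum()
    union = (vectors.loc[a] | vectors.loc[b]).sum()
    return shared / union if union else 0.0


score = jaccard(present, "library/ubuntu", "library/debian")
print(round(score, 3))
```

In this toy data the two library containers share two of the three packages found in either one, so the score is 2/3.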

You can optionally subset to a particular set of tags, either including only a specific set:

df = apt.export_vectors(include_tags=['library/debian'])
df.head()
                adduser  apt  base-files  base-passwd  bash  bsdutils  \
library/debian      1.0  1.0         1.0          1.0   1.0       1.0   

or skipping specific containers:

df = apt.export_vectors(skip_tags=['library/debian'])
df.head()
                              adduser  apt  autoconf  automake  autotools-dev  \
singularityhub/sregistry-cli      1.0  1.0       1.0       1.0            1.0   
library/ubuntu                    1.0  1.0       NaN       NaN            NaN 

Or using a regular expression to filter the tags (useful for collection names, such as finding all containers in the “library” namespace):

df = apt.export_vectors(regexp_tags="^library")
df.head()
                adduser  apt  base-files  base-passwd  bash  bsdutils  bzip2  \
library/ubuntu      1.0  1.0         1.0          1.0   1.0       1.0    1.0   
library/debian      1.0  1.0         1.0          1.0   1.0       1.0    NaN 
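To check which tags a pattern would select before exporting, you can try it on the tag names directly. This sketch uses only the standard re module. It assumes the pattern is applied as a search against each tag, since how containertree applies it internally is not shown here. With a leading ^ anchor, search and match behave the same way.

```python
import re

# The tags used for the tree in the examples above
tags = ["singularityhub/sregistry-cli", "library/debian", "library/ubuntu"]

# "^library" only matches tags that start with "library"
pattern = re.compile("^library")
matched = [tag for tag in tags if pattern.search(tag)]

print(matched)
```

Without the ^ anchor, a search would also select any tag that merely contains "library" somewhere in its name.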

If you want more detailed features, you can include package versions:

df = apt.export_vectors(include_versions=True)
df.head()
                             adduser-v3.115  adduser-v3.116ubuntu1  \
library/debian                           1.0                    NaN   
singularityhub/sregistry-cli             1.0                    NaN   
library/ubuntu                           NaN                    1.0   
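With versions included, each column name combines a package and a version, joined by "-v" in the output above. Here is a small sketch for splitting those names back apart, for example to see which packages appear in more than one version. The split_feature helper is hypothetical (not part of containertree) and is only a heuristic: it splits on the last "-v", so it would mis-split a version string that itself contains "-v".

```python
from collections import defaultdict


def split_feature(column):
    """Split 'name-vVERSION' into (name, version); version is None if absent."""
    name, sep, version = column.rpartition("-v")
    if not sep:
        return column, None
    return name, version


# Column names taken from the versioned outputs shown in this document
columns = ["adduser-v3.115", "adduser-v3.116ubuntu1", "six-v1.10.0"]

# Collect the versions seen for each package
versions = defaultdict(list)
for column in columns:
    name, version = split_feature(column)
    versions[name].append(version)

for name, found in versions.items():
    print(name, found)
```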

The same can be done for pip (Python) package trees:

from containertree import ContainerPipTree
pip = ContainerPipTree('singularityhub/container-tree', tag='singularityhub/container-tree')

pip.export_vectors()
                               configobj  mercurial  pip  setuptools  six  \
singularityhub/container-tree        1.0        1.0  1.0         1.0  1.0   

                               wheel  
singularityhub/container-tree    1.0

And with versions!

pip.export_vectors(include_versions=True)
                               configobj-v5.0.6  mercurial-v4.0  pip-v18.1  \
singularityhub/container-tree               1.0             1.0        1.0   

                               setuptools-v40.6.3  six-v1.10.0  wheel-v0.32.3  
singularityhub/container-tree                 1.0          1.0            1.0