author     Sam Thursfield <sam.thursfield@codethink.co.uk>  2014-07-24 16:32:21 +0100
committer  Sam Thursfield <sam.thursfield@codethink.co.uk>  2014-09-11 18:42:57 +0100
commit     0ee1b57965ceeb94a47fcb6898acb96ce91e3cf9 (patch)
tree       be636f287377873231ad40e4cc012937358f096a /import
parent     ccb5a47915da9f0d5ab25d33d6d21409e7f898d0 (diff)
download   morph-0ee1b57965ceeb94a47fcb6898acb96ce91e3cf9.tar.gz
Add import/ tools
This is a generic tool which uses metadata from foreign packaging systems to create morphologies. So far it supports RubyGems, but it should be extensible to other packaging systems. It should be considered 'beta' quality right now.
Diffstat (limited to 'import')
-rw-r--r--    import/README               100
-rw-r--r--    import/README.rubygems       36
-rw-r--r--    import/importer_base.py      72
-rw-r--r--    import/main.py              691
-rwxr-xr-x    import/rubygems.to_chunk    264
-rwxr-xr-x    import/rubygems.to_lorry    163
-rw-r--r--    import/rubygems.yaml         45
7 files changed, 1371 insertions, 0 deletions
diff --git a/import/README b/import/README
new file mode 100644
index 00000000..3ac7997d
--- /dev/null
+++ b/import/README
@@ -0,0 +1,100 @@
+How to use the Baserock Import Tool
+===================================
+
+The tool helps you generate Baserock build instructions by importing metadata
+from a foreign packaging system.
+
+The process it follows is this:
+
+1. Pick a package from the processing queue.
+2. Find its source code, and generate a suitable .lorry file.
+3. Make it available as a local Git repo.
+4. Check out the commit corresponding to the requested version of the package.
+5. Analyse the source tree and generate a suitable chunk .morph to build the
+ requested package.
+6. Analyse the source tree and generate a list of dependencies for the package.
+7. Enqueue any new dependencies, and repeat.
+
+Once the queue is empty:
+
+8. Generate a stratum .morph for the package(s) the user requested.
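+
+As a rough sketch of a typical run (assuming the tool is invoked through the
+main.py entry point in this directory, and using 'chef' purely as an example
+Gem name):
+
+    python main.py rubygems chef
+
+By default, .lorry files are written under ./lorries, morphologies under
+./definitions, and Git checkouts under ./checkouts; run the tool with --help
+for the corresponding options.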
+
+The tool is not magic. It can be taught the conventions for each packaging
+system, but these will not work in all cases. When an import fails, the tool
+continues to the next package, so that a single run does as many imports as
+possible.
+
+For imports that could not be done automatically, you will need to write an
+appropriate .lorry or .morph file manually and rerun the tool. It will resume
+processing where it left off.
+
+It's possible to teach the code about more conventions, but it is only
+worthwhile to do that for common patterns.
+
+
+Package-system specific code and data
+-------------------------------------
+
+For each supported packaging system, there should be an xxx.to_lorry program
+and an xxx.to_chunk program. These should write a .lorry file and a chunk
+.morph file, respectively, to stdout.
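+
+For example, the RubyGems implementation added here is called as follows
+(the exact argument conventions may differ for other packaging systems):
+
+    rubygems.to_lorry GEM_NAME              # .lorry (JSON) on stdout
+    rubygems.to_chunk SOURCE_DIR GEM_NAME   # chunk .morph (YAML) on stdout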
+
+Each packaging system can have static data saved in a .yaml file, for known
+metadata that the programs cannot discover automatically.
+
+The following field should be honoured by most packaging systems:
+`known-source-uris`. It maps package names to source URIs.
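+
+For example, the rubygems.yaml file in this directory contains entries such
+as:
+
+    known-source-uris:
+      json: https://github.com/flori/json
+      hoe: https://github.com/seattlerb/hoe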
+
+
+Help with .lorry generation
+---------------------------
+
+The simplest fix is to add the package's source URI to the `known-source-uris`
+dict in the static metadata.
+
+If you write a .lorry file by hand, be sure to fill in the `x-products-YYY`
+field. The 'x' prefix marks the field as an extension to the .lorry format,
+and YYY is the name of the packaging system, e.g. 'rubygems'. The field should
+contain a list of the packages whose source code this repository provides.
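+
+As a sketch, a hand-written .lorry for a hypothetical Gem might look like
+this (the project name, URL and Gem name are placeholders):
+
+    {
+        "some-project": {
+            "type": "git",
+            "url": "git://example.com/some-project",
+            "x-products-rubygems": ["some-gem"]
+        }
+    }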
+
+
+Help with linking package version to Git tag
+--------------------------------------------
+
+Some projects do not tag releases.
+
+Currently, you must create a tag in the local checkout for the tool to continue.
+In future, the Lorry tool should be extended to handle creation of missing
+tags, so that they are propagated to the project Trove. The .lorry file would
+need to contain a dict mapping product version number to commit SHA1.
+
+If you are in a hurry, you can use the `--use-master-if-no-tag` option. Instead
+of raising an error, the tool will then use whatever the `master` ref of the
+component repository points to.
+
+
+Help with chunk .morph generation
+---------------------------------
+
+If you create a chunk morph by hand, you must add some extra fields:
+
+ - `x-build-dependencies-YYY`
+ - `x-runtime-dependencies-YYY`
+
+Each of these is a dict mapping dependency names to versions. For example:
+
+ x-build-dependencies-rubygems: {}
+ x-runtime-dependencies-rubygems:
+ hashie: 2.1.2
+ json: 1.8.1
+ mixlib-log: 1.6.0
+ rack: 1.5.2
+
+All dependencies will be included in the resulting stratum. Those which are build
+dependencies of other components will be added to the relevant 'build-depends'
+field.
+
+These fields are non-standard extensions to the morphology format.
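+
+For reference, each imported package ends up as a chunk entry in the
+generated stratum along these lines (all values below are placeholders):
+
+    - name: hashie-2.1.2
+      repo: git://example.com/hashie
+      ref: <commit SHA1>
+      unpetrify-ref: v2.1.2
+      morph: strata/my-stratum/hashie-2.1.2.morph
+      build-depends: []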
+
+For more package-system specific information, see the relevant README file,
+e.g. README.rubygems for RubyGem imports.
diff --git a/import/README.rubygems b/import/README.rubygems
new file mode 100644
index 00000000..4b3b7721
--- /dev/null
+++ b/import/README.rubygems
@@ -0,0 +1,36 @@
+Here is some information I have learned while importing RubyGem packages into
+Baserock.
+
+First, beware that RubyGem .gemspec files are actually normal Ruby programs,
+and are executed when loaded. A Bundler Gemfile is also a Ruby program, and
+could run arbitrary code when loaded.
+
+The Standard Case
+-----------------
+
+Most Ruby projects provide one or more .gemspec files, which describe the
+runtime and development dependencies of the Gem.
+
+Using the .gemspec file and the `gem build` command it is possible to create
+the .gem file. It can then be installed with `gem install`.
+
+Note that use of `gem build` is discouraged by its own help file in favour
+of using Rake, but there is much less standardisation among Rakefiles and they
+may introduce requirements on Hoe, rake-compiler, Jeweler or other tools.
+
+The 'development' dependency set includes everything useful to test, document,
+and create a Gem of the project. All we want to do is create a Gem, which I'll
+refer to as 'building'.
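+
+For the standard case, the rubygems.to_chunk tool generates build
+instructions roughly like the following, where GEM and GEM-VERSION stand for
+the Gem's name and full name (name plus version):
+
+    build-commands:
+    - gem build GEM.gemspec
+    install-commands:
+    - mkdir -p "$DESTDIR/$(gem environment home)"
+    - gem install --install-dir "$DESTDIR/$(gem environment home)"
+      --bindir "$DESTDIR/$PREFIX/bin" --ignore-dependencies --local ./GEM-VERSION.gem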
+
+
+Gem with no .gemspec
+--------------------
+
+Some Gems choose not to include a .gemspec, like [Nokogiri]. In the case of
+Nokogiri, and others, [Hoe] is used, which adds Rake tasks that create the Gem.
+The `gem build` command cannot be used in these cases.
+
+You may be able to use the `rake gem` command instead of `gem build`.
+
+[Nokogiri]: https://github.com/sparklemotion/nokogiri/blob/master/Y_U_NO_GEMSPEC.md
+[Hoe]: http://www.zenspider.com/projects/hoe.html
diff --git a/import/importer_base.py b/import/importer_base.py
new file mode 100644
index 00000000..7775aa4a
--- /dev/null
+++ b/import/importer_base.py
@@ -0,0 +1,72 @@
+# Base class for import tools written in Python.
+#
+# Copyright (C) 2014 Codethink Limited
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program; if not, write to the Free Software Foundation, Inc.,
+# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+
+import logging
+import os
+import sys
+
+
+class ImportException(Exception):
+ pass
+
+
+class ImportExtension(object):
+ '''A base class for import extensions.
+
+    An import extension should subclass this class and add a
+    ``process_args`` method.
+
+ Note that it is not necessary to subclass this class for import extensions.
+ This class is here just to collect common code.
+
+ '''
+
+ def __init__(self):
+ self.setup_logging()
+
+ def setup_logging(self):
+ '''Direct all logging output to MORPH_LOG_FD, if set.
+
+ This file descriptor is read by Morph and written into its own log
+ file.
+
+ This overrides cliapp's usual configurable logging setup.
+
+ '''
+ log_write_fd = int(os.environ.get('MORPH_LOG_FD', 0))
+
+ if log_write_fd == 0:
+ return
+
+ formatter = logging.Formatter('%(message)s')
+
+ handler = logging.StreamHandler(os.fdopen(log_write_fd, 'w'))
+ handler.setFormatter(formatter)
+
+ logger = logging.getLogger()
+ logger.addHandler(handler)
+ logger.setLevel(logging.DEBUG)
+
+ def process_args(self, args):
+ raise NotImplementedError()
+
+ def run(self):
+ try:
+ self.process_args(sys.argv[1:])
+ except ImportException as e:
+            sys.stderr.write('ERROR: %s\n' % e.message)
+ sys.exit(1)
diff --git a/import/main.py b/import/main.py
new file mode 100644
index 00000000..8d3194af
--- /dev/null
+++ b/import/main.py
@@ -0,0 +1,691 @@
+#!/usr/bin/python
+# Import foreign packaging systems into Baserock
+#
+# Copyright (C) 2014 Codethink Limited
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program; if not, write to the Free Software Foundation, Inc.,
+# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+
+import cliapp
+import morphlib
+import networkx
+
+import contextlib
+import copy
+import json
+import logging
+import os
+import sys
+import time
+
+from logging import debug
+
+
+@contextlib.contextmanager
+def cwd(path):
+ old_cwd = os.getcwd()
+ try:
+ os.chdir(path)
+ yield
+ finally:
+ os.chdir(old_cwd)
+
+
+class LorrySet(object):
+ '''Manages a set of .lorry files.
+
+ The structure of .lorry files makes the code a little more confusing than
+ I would like. A lorry "entry" is a dict of one entry mapping name to info.
+ A lorry "file" is a dict of one or more of these entries merged together.
+ If it were a list of entries with 'name' fields, the code would be neater.
+
+ '''
+ def __init__(self, lorries_path):
+ self.path = lorries_path
+
+ if os.path.exists(lorries_path):
+ self.data = self.parse_all_lorries()
+ else:
+ os.makedirs(lorries_path)
+ self.data = {}
+
+ def all_lorry_files(self):
+ for dirpath, dirnames, filenames in os.walk(self.path):
+ for filename in filenames:
+ if filename.endswith('.lorry'):
+ yield os.path.join(dirpath, filename)
+
+ def parse_all_lorries(self):
+ lorry_set = {}
+ for lorry_file in self.all_lorry_files():
+ lorry = self.parse_lorry(lorry_file)
+
+ lorry_items = lorry.items()
+
+ for key, value in lorry_items:
+ if key in lorry_set:
+ raise Exception(
+ '%s: duplicates existing lorry %s' % (lorry_file, key))
+
+ lorry_set.update(lorry_items)
+
+ return lorry_set
+
+ def parse_lorry(self, lorry_file):
+ try:
+ with open(lorry_file, 'r') as f:
+ lorry = json.load(f)
+ return lorry
+ except ValueError as e:
+ raise cliapp.AppException(
+ "Error parsing %s: %s" % (lorry_file, e))
+
+ def get_lorry(self, name):
+ return {name: self.data[name]}
+
+ def find_lorry_for_package(self, kind, package_name):
+ key = 'x-products-%s' % kind
+ for name, lorry in self.data.iteritems():
+ products = lorry.get(key, [])
+ for entry in products:
+ if entry == package_name:
+ return {name: lorry}
+
+ return None
+
+ def _check_for_conflicts_in_standard_fields(self, existing, new):
+ '''Ensure that two lorries for the same project do actually match.'''
+ for field, value in existing.iteritems():
+ if field.startswith('x-'):
+ continue
+ if field == 'url':
+ # FIXME: need a much better way of detecting whether the URLs
+ # are equivalent ... right now HTTP vs. HTTPS will cause an
+ # error, for example!
+ matches = (value.rstrip('/') == new[field].rstrip('/'))
+ else:
+ matches = (value == new[field])
+ if not matches:
+ raise Exception(
+ 'Lorry %s conflicts with existing entry %s at field %s' %
+ (new, existing, field))
+
+ def _merge_products_fields(self, existing, new):
+ '''Merge the x-products- fields from new lorry into an existing one.'''
+ is_product_field = lambda x: x.startswith('x-products-')
+
+ existing_fields = [f for f in existing.iterkeys() if
+ is_product_field(f)]
+ new_fields = [f for f in new.iterkeys() if f not in existing_fields and
+ is_product_field(f)]
+
+ for field in existing_fields:
+ existing[field].extend(new[field])
+ existing[field] = list(set(existing[field]))
+
+ for field in new_fields:
+ existing[field] = new[field]
+
+ def add(self, filename, lorry_entry):
+ logging.debug('Adding %s to lorryset', filename)
+
+ filename = os.path.join(self.path, '%s.lorry' % filename)
+
+ assert len(lorry_entry) == 1
+
+ project_name = lorry_entry.keys()[0]
+ info = lorry_entry.values()[0]
+
+ if len(project_name) == 0:
+ raise cliapp.AppException(
+ 'Invalid lorry %s: %s' % (filename, lorry_entry))
+
+ if project_name in self.data:
+ stored_lorry = self.get_lorry(project_name)
+
+ self._check_for_conflicts_in_standard_fields(
+ stored_lorry[project_name], lorry_entry[project_name])
+ self._merge_products_fields(
+ stored_lorry[project_name], lorry_entry[project_name])
+ lorry_entry = stored_lorry
+ else:
+ self.data[project_name] = info
+
+ with morphlib.savefile.SaveFile(filename, 'w') as f:
+ json.dump(lorry_entry, f, indent=4)
+
+
+class MorphologySet(morphlib.morphset.MorphologySet):
+ def __init__(self, path):
+ super(MorphologySet, self).__init__()
+
+ self.path = path
+ self.loader = morphlib.morphloader.MorphologyLoader()
+
+ if os.path.exists(path):
+ self.load_all_morphologies()
+ else:
+ os.makedirs(path)
+
+ def load_all_morphologies(self):
+ logging.info('Loading all .morph files under %s', self.path)
+
+ class FakeGitDir(morphlib.gitdir.GitDirectory):
+ '''Ugh
+
+ This is here because the default constructor will search up the
+            directory hierarchy until it finds a '.git' directory, but that
+ may be totally the wrong place for our purpose: we don't have a
+ Git directory at all.
+
+ '''
+ def __init__(self, path):
+ self.dirname = path
+ self._config = {}
+
+ gitdir = FakeGitDir(self.path)
+ finder = morphlib.morphologyfinder.MorphologyFinder(gitdir)
+ for filename in (f for f in finder.list_morphologies()
+ if not gitdir.is_symlink(f)):
+ text = finder.read_morphology(filename)
+ morph = self.loader.load_from_string(text, filename=filename)
+ morph.repo_url = None # self.root_repository_url
+ morph.ref = None # self.system_branch_name
+ self.add_morphology(morph)
+
+ def get_morphology(self, repo_url, ref, filename):
+ return self._get_morphology(repo_url, ref, filename)
+
+ def save_morphology(self, filename, morphology):
+ self.add_morphology(morphology)
+ morphology_to_save = copy.copy(morphology)
+ self.loader.unset_defaults(morphology_to_save)
+ filename = os.path.join(self.path, filename)
+ self.loader.save_to_file(filename, morphology_to_save)
+
+
+class GitDirectory(morphlib.gitdir.GitDirectory):
+ def has_ref(self, ref):
+ try:
+ self._rev_parse(ref)
+ return True
+ except morphlib.gitdir.InvalidRefError:
+ return False
+
+
+class BaserockImportException(cliapp.AppException):
+ pass
+
+
+class Package(object):
+ '''A package in the processing queue.
+
+ In order to provide helpful errors, this item keeps track of what
+ packages depend on it, and hence of why it was added to the queue.
+
+ '''
+ def __init__(self, name, version):
+ self.name = name
+ self.version = version
+ self.required_by = []
+ self.morphology = None
+ self.is_build_dep = False
+ self.version_in_use = version
+
+ def __cmp__(self, other):
+ return cmp(self.name, other.name)
+
+ def __repr__(self):
+ return '<Package %s-%s>' % (self.name, self.version)
+
+ def __str__(self):
+ if len(self.required_by) > 0:
+ required_msg = ', '.join(self.required_by)
+ required_msg = ', required by: ' + required_msg
+ else:
+ required_msg = ''
+ return '%s-%s%s' % (self.name, self.version, required_msg)
+
+ def add_required_by(self, item):
+ self.required_by.append('%s-%s' % (item.name, item.version))
+
+ def match(self, name, version):
+ return (self.name==name and self.version==version)
+
+ def set_morphology(self, morphology):
+ self.morphology = morphology
+
+ def set_is_build_dep(self, is_build_dep):
+ self.is_build_dep = is_build_dep
+
+ def set_version_in_use(self, version_in_use):
+ self.version_in_use = version_in_use
+
+
+def find(iterable, match):
+ return next((x for x in iterable if match(x)), None)
+
+
+def run_extension(filename, args):
+ output = []
+ errors = []
+
+ ext_logger = logging.getLogger(filename)
+
+ def report_extension_stdout(line):
+ output.append(line)
+
+ def report_extension_stderr(line):
+ errors.append(line)
+ sys.stderr.write('%s\n' % line)
+
+ def report_extension_logger(line):
+ ext_logger.debug(line)
+
+ ext = morphlib.extensions.ExtensionSubprocess(
+ report_stdout=report_extension_stdout,
+ report_stderr=report_extension_stderr,
+ report_logger=report_extension_logger,
+ )
+
+ returncode = ext.run(os.path.abspath(filename), args, '.', os.environ)
+
+ if returncode == 0:
+ ext_logger.info('succeeded')
+ else:
+ for line in errors:
+ ext_logger.error(line)
+ message = '%s failed with code %s: %s' % (
+ filename, returncode, '\n'.join(errors))
+ raise BaserockImportException(message)
+
+ return '\n'.join(output)
+
+
+class BaserockImportApplication(cliapp.Application):
+ def add_settings(self):
+ self.settings.string(['lorries-dir'],
+ "location for Lorry files",
+ metavar="PATH",
+ default=os.path.abspath('./lorries'))
+ self.settings.string(['definitions-dir'],
+ "location for morphology files",
+ metavar="PATH",
+ default=os.path.abspath('./definitions'))
+ self.settings.string(['checkouts-dir'],
+ "location for Git checkouts",
+ metavar="PATH",
+ default=os.path.abspath('./checkouts'))
+
+ self.settings.boolean(['update-existing'],
+ "update all the checked-out Git trees and "
+ "generated definitions",
+ default=False)
+ self.settings.boolean(['use-master-if-no-tag'],
+ "if the correct tag for a version can't be "
+ "found, use 'master' instead of raising an "
+ "error",
+ default=False)
+
+ def setup(self):
+ self.add_subcommand('rubygems', self.import_rubygems,
+ arg_synopsis='GEM_NAME')
+
+ def setup_logging_formatter_for_file(self):
+ root_logger = logging.getLogger()
+ root_logger.name = 'main'
+
+ # You need recent cliapp for this to work, with commit "Split logging
+ # setup into further overrideable methods".
+ return logging.Formatter("%(name)s: %(levelname)s: %(message)s")
+
+ def process_args(self, args):
+ if len(args) == 0:
+            # cliapp's default when no arguments are given is to just say
+            # "ERROR: must give subcommand"; showing the help output instead
+            # is friendlier.
+ args = ['help']
+
+ super(BaserockImportApplication, self).process_args(args)
+
+ def status(self, msg, *args):
+ print msg % args
+ logging.info(msg % args)
+
+ def import_rubygems(self, args):
+ '''Import one or more RubyGems.'''
+ if len(args) != 1:
+ raise cliapp.AppException(
+ 'Please pass the name of a RubyGem on the commandline.')
+
+ self.import_package_and_all_dependencies('rubygems', args[0])
+
+ def process_dependency_list(self, current_item, deps, to_process,
+ processed, these_are_build_deps):
+ # All deps are added as nodes to the 'processed' graph. Runtime
+ # dependencies only need to appear in the stratum, but build
+ # dependencies have ordering constraints, so we add edges in
+ # the graph for build-dependencies too.
+
+ for dep_name, dep_version in deps.iteritems():
+ dep_package = find(
+ processed, lambda i: i.match(dep_name, dep_version))
+
+ if dep_package is None:
+ # Not yet processed
+ queue_item = find(
+ to_process, lambda i: i.match(dep_name, dep_version))
+ if queue_item is None:
+ queue_item = Package(dep_name, dep_version)
+ to_process.append(queue_item)
+ dep_package = queue_item
+
+ dep_package.add_required_by(current_item)
+
+ if these_are_build_deps or current_item.is_build_dep:
+ # A runtime dep of a build dep becomes a build dep
+ # itself.
+ dep_package.set_is_build_dep(True)
+ processed.add_edge(dep_package, current_item)
+
+ def get_dependencies_from_morphology(self, morphology, field_name):
+ # We need to validate this field because it doesn't go through the
+ # normal MorphologyFactory validation, being an extension.
+ value = morphology.get(field_name, {})
+ if not hasattr(value, 'iteritems'):
+ value_type = type(value).__name__
+ raise cliapp.AppException(
+ "Morphology for %s has invalid '%s': should be a dict, but "
+ "got a %s." % (morphology['name'], field_name, value_type))
+
+ return value
+
+ def import_package_and_all_dependencies(self, kind, goal_name,
+ goal_version='master'):
+ start_time = time.time()
+ start_displaytime = time.strftime('%x %X %Z', time.localtime())
+
+ self.status('%s: Import of %s %s started', start_displaytime, kind,
+ goal_name)
+
+ if not self.settings['update-existing']:
+ self.status('Not updating existing Git checkouts or existing definitions')
+
+ lorry_set = LorrySet(self.settings['lorries-dir'])
+ morph_set = MorphologySet(self.settings['definitions-dir'])
+
+ chunk_dir = os.path.join(morph_set.path, 'strata', goal_name)
+ if not os.path.exists(chunk_dir):
+ os.makedirs(chunk_dir)
+
+ to_process = [Package(goal_name, goal_version)]
+ processed = networkx.DiGraph()
+
+ errors = {}
+
+ while len(to_process) > 0:
+ current_item = to_process.pop()
+ name = current_item.name
+ version = current_item.version
+
+ try:
+ lorry = self.find_or_create_lorry_file(lorry_set, kind, name)
+
+ source_repo, url = self.fetch_or_update_source(lorry)
+
+ checked_out_version, ref = self.checkout_source_version(
+ source_repo, name, version)
+ current_item.set_version_in_use(checked_out_version)
+ chunk_morph = self.find_or_create_chunk_morph(
+ morph_set, goal_name, kind, name, checked_out_version,
+ source_repo, url, ref)
+
+ current_item.set_morphology(chunk_morph)
+
+ build_deps = self.get_dependencies_from_morphology(
+ chunk_morph, 'x-build-dependencies-%s' % kind)
+ runtime_deps = self.get_dependencies_from_morphology(
+ chunk_morph, 'x-runtime-dependencies-%s' % kind)
+ except BaserockImportException as e:
+ # Don't print the exception on stdout; the error messages will
+ # have gone to stderr already.
+ errors[current_item] = e
+ build_deps = runtime_deps = {}
+
+ processed.add_node(current_item)
+
+ self.process_dependency_list(
+ current_item, build_deps, to_process, processed, True)
+ self.process_dependency_list(
+ current_item, runtime_deps, to_process, processed, False)
+
+ if len(errors) > 0:
+ for package, exception in errors.iteritems():
+ self.status('\n%s: %s', package.name, exception)
+ self.status(
+ '\nErrors encountered, not generating a stratum morphology.')
+ self.status(
+ 'See the README files for guidance.')
+ else:
+ self.generate_stratum_morph_if_none_exists(processed, goal_name)
+
+ duration = time.time() - start_time
+ end_displaytime = time.strftime('%x %X %Z', time.localtime())
+
+ self.status('%s: Import of %s %s ended (took %i seconds)',
+ end_displaytime, kind, goal_name, duration)
+
+ def generate_lorry_for_package(self, kind, name):
+ tool = '%s.to_lorry' % kind
+ self.status('Calling %s to generate lorry for %s', tool, name)
+ lorry_text = run_extension(tool, [name])
+ lorry = json.loads(lorry_text)
+ return lorry
+
+ def find_or_create_lorry_file(self, lorry_set, kind, name):
+ # Note that the lorry file may already exist for 'name', but lorry
+ # files are named for project name rather than package name. In this
+ # case we will generate the lorry, and try to add it to the set, at
+ # which point LorrySet will notice the existing one and merge the two.
+ lorry = lorry_set.find_lorry_for_package(kind, name)
+
+ if lorry is None:
+ lorry = self.generate_lorry_for_package(kind, name)
+
+ if len(lorry) != 1:
+ raise Exception(
+ 'Expected generated lorry file with one entry.')
+
+ lorry_filename = lorry.keys()[0]
+
+ if lorry_filename == '':
+ raise cliapp.AppException(
+ 'Invalid lorry data for %s: %s' % (name, lorry))
+
+ lorry_set.add(lorry_filename, lorry)
+ else:
+ lorry_filename = lorry.keys()[0]
+ logging.info(
+ 'Found existing lorry file for %s: %s', name, lorry_filename)
+
+ return lorry
+
+ def fetch_or_update_source(self, lorry):
+ assert len(lorry) == 1
+ lorry_entry = lorry.values()[0]
+
+ url = lorry_entry['url']
+ reponame = os.path.basename(url.rstrip('/'))
+ repopath = os.path.join(self.settings['checkouts-dir'], reponame)
+
+ # FIXME: we should use Lorry here, so that we can import other VCSes.
+ # But for now, this hack is fine!
+ if os.path.exists(repopath):
+ if self.settings['update-existing']:
+ self.status('Updating repo %s', url)
+ cliapp.runcmd(['git', 'remote', 'update', 'origin'],
+ cwd=repopath)
+ else:
+ self.status('Cloning repo %s', url)
+ try:
+ cliapp.runcmd(['git', 'clone', '--quiet', url, repopath])
+ except cliapp.AppException as e:
+ raise BaserockImportException(e.msg.rstrip())
+
+ repo = GitDirectory(repopath)
+ if repo.dirname != repopath:
+ # Work around strange/unintentional behaviour in GitDirectory class
+ # when 'repopath' isn't a Git repo. If 'repopath' is contained
+ # within a Git repo then the GitDirectory will traverse up to the
+ # parent repo, which isn't what we want in this case.
+ logging.error(
+ 'Got git directory %s for %s!', repo.dirname, repopath)
+ raise cliapp.AppException(
+ '%s exists but is not the root of a Git repository' % repopath)
+ return repo, url
+
+ def checkout_source_version(self, source_repo, name, version):
+ # FIXME: we need to be a bit smarter than this. Right now we assume
+ # that 'version' is a valid Git ref.
+
+ possible_names = [
+ version,
+ 'v%s' % version,
+ '%s-%s' % (name, version)
+ ]
+
+ for tag_name in possible_names:
+ if source_repo.has_ref(tag_name):
+ source_repo.checkout(tag_name)
+ ref = tag_name
+ break
+ else:
+ if self.settings['use-master-if-no-tag']:
+ logging.warning(
+ "Couldn't find tag %s in repo %s. Using 'master'.",
+ tag_name, source_repo)
+ source_repo.checkout('master')
+ ref = version = 'master'
+ else:
+ raise BaserockImportException(
+ 'Could not find ref for %s version %s.' % (name, version))
+
+ return version, ref
+
+ def generate_chunk_morph_for_package(self, kind, source_repo, name,
+ filename):
+ tool = '%s.to_chunk' % kind
+ self.status('Calling %s to generate chunk morph for %s', tool, name)
+ text = run_extension(tool, [source_repo.dirname, name])
+ loader = morphlib.morphloader.MorphologyLoader()
+ return loader.load_from_string(text, filename)
+
+ def find_or_create_chunk_morph(self, morph_set, goal_name, kind, name,
+ version, source_repo, repo_url, named_ref):
+ morphology_filename = 'strata/%s/%s-%s.morph' % (
+ goal_name, name, version)
+ sha1 = source_repo.resolve_ref_to_commit(named_ref)
+
+ def generate_morphology():
+ morphology = self.generate_chunk_morph_for_package(
+ kind, source_repo, name, morphology_filename)
+ morph_set.save_morphology(morphology_filename, morphology)
+ return morphology
+
+ if self.settings['update-existing']:
+ morphology = generate_morphology()
+ else:
+ morphology = morph_set.get_morphology(
+ repo_url, sha1, morphology_filename)
+
+ if morphology is None:
+ # Existing chunk morphologies loaded from disk don't contain
+ # the repo and ref information. That's stored in the stratum
+ # morph. So the first time we touch a chunk morph we need to
+ # set this info.
+ logging.debug("Didn't find morphology for %s|%s|%s", repo_url,
+ sha1, morphology_filename)
+ morphology = morph_set.get_morphology(
+ None, None, morphology_filename)
+
+ if morphology is None:
+ logging.debug("Didn't find morphology for None|None|%s",
+ morphology_filename)
+ morphology = generate_morphology()
+
+ morphology.repo_url = repo_url
+ morphology.ref = sha1
+ morphology.named_ref = named_ref
+
+ return morphology
+
+ def generate_stratum_morph_if_none_exists(self, graph, goal_name):
+ filename = os.path.join(
+ self.settings['definitions-dir'], 'strata', '%s.morph' % goal_name)
+
+ if os.path.exists(filename) and not self.settings['update-existing']:
+ self.status(msg='Found stratum morph for %s at %s, not overwriting'
+ % (goal_name, filename))
+ return
+
+ self.status(msg='Generating stratum morph for %s' % goal_name)
+
+ order = reversed(sorted(graph.nodes()))
+ chunk_packages = networkx.topological_sort(graph, nbunch=order)
+ chunk_entries = []
+
+ for package in chunk_packages:
+ m = package.morphology
+ if m is None:
+ raise cliapp.AppException('No morphology for %s' % package)
+
+ def format_build_dep(name, version):
+ dep_package = find(graph, lambda p: p.match(name, version))
+ return '%s-%s' % (name, dep_package.version_in_use)
+
+ build_depends = [
+ format_build_dep(name, version) for name, version in
+ m['x-build-dependencies-rubygems'].iteritems()
+ ]
+
+ entry = {
+ 'name': m['name'],
+ 'repo': m.repo_url,
+ 'ref': m.ref,
+ 'unpetrify-ref': m.named_ref,
+ 'morph': m.filename,
+ 'build-depends': build_depends,
+ }
+ chunk_entries.append(entry)
+
+ stratum_name = goal_name
+ stratum = {
+ 'name': stratum_name,
+ 'kind': 'stratum',
+ 'description': 'Autogenerated by Baserock import tool',
+ 'build-depends': [
+ {'morph': 'strata/ruby.morph'}
+ ],
+ 'chunks': chunk_entries,
+ }
+
+ loader = morphlib.morphloader.MorphologyLoader()
+ morphology = loader.load_from_string(json.dumps(stratum),
+ filename=filename)
+
+ loader.unset_defaults(morphology)
+ loader.save_to_file(filename, morphology)
+
+
+app = BaserockImportApplication(progname='import')
+app.run()
diff --git a/import/rubygems.to_chunk b/import/rubygems.to_chunk
new file mode 100755
index 00000000..578ab7bd
--- /dev/null
+++ b/import/rubygems.to_chunk
@@ -0,0 +1,264 @@
+#!/usr/bin/env ruby
+#
+# Create a chunk morphology to integrate a RubyGem in Baserock
+#
+# Copyright (C) 2014 Codethink Limited
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program; if not, write to the Free Software Foundation, Inc.,
+# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+require 'bundler'
+require 'logger'
+require 'optparse'
+require 'yaml'
+
+# The file descriptor for logging is passed in from the main import process,
+# if set.
+# This global constant approach seems a little ugly, but it seems to be
+# recommended here:
+# <https://stackoverflow.com/questions/1681745/share-global-logger-among-module-classes>
+#
+# Integer() would raise if MORPH_LOG_FD is unset, so check for nil first.
+log_fd = ENV['MORPH_LOG_FD'] && Integer(ENV['MORPH_LOG_FD'])
+if log_fd
+ log_stream = IO.new(log_fd, 'w')
+ Log = Logger.new(log_stream)
+ Log.level = Logger::DEBUG
+ Log.formatter = proc { |severity, datetime, progname, msg| "#{msg}\n" }
+else
+ Log = Logger.new('/dev/null')
+end
+
+
+class << Bundler
+ def default_gemfile
+ # This is a hack to make things not crash when there's no Gemfile
+ Pathname.new('.')
+ end
+end
+
+def spec_is_from_current_source_tree(spec)
+ spec.source.instance_of? Bundler::Source::Path and
+ spec.source.path.fnmatch?('.')
+end
+
+class RubyGemChunkMorphologyGenerator
+ def initialize
+ local_data = YAML.load_file("rubygems.yaml")
+ @build_dependency_whitelist = local_data['build-dependency-whitelist']
+ end
+
+ def parse_options(arguments)
+ # No options so far ..
+ opts = OptionParser.new
+
+ opts.banner = "Usage: rubygems.to_chunk SOURCE_DIR GEM_NAME"
+ opts.separator ""
+ opts.separator "This tool reads the Gemfile and optionally the " +
+ "Gemfile.lock from a Ruby project "
+ opts.separator "source tree in SOURCE_DIR. It outputs a chunk " +
+ "morphology for GEM_NAME on stdout."
+ opts.separator ""
+ opts.separator "It is intended for use with the `baserock-import` tool."
+
+ parsed_arguments = opts.parse!(arguments)
+
+ if parsed_arguments.length != 2
+ STDERR.puts opts.help
+ exit 1
+ end
+
+ parsed_arguments
+ end
+
+ def error(message)
+ Log.error(message)
+ STDERR.puts(message)
+ end
+
+ def load_local_gemspecs()
+ # Look for .gemspec files in the source repo.
+ #
+ # If there is no .gemspec, but you set 'name' and 'version' then
+ # inside Bundler::Source::Path.load_spec_files this call will create a
+ # fake gemspec matching that name and version. That's probably not useful.
+
+ dir = '.'
+
+ source = Bundler::Source::Path.new({
+ 'path' => dir,
+ })
+
+ Log.info "Loaded #{source.specs.count} specs from source dir."
+ source.specs.each do |spec|
+ Log.debug " * #{spec.inspect} #{spec.dependencies.inspect}"
+ end
+
+ source
+ end
+
+ def get_spec_for_gem(specs, gem_name)
+ found = specs[gem_name].select {|s| Gem::Platform.match(s.platform)}
+ if found.empty?
+ raise Exception,
+ "No Gemspecs found matching '#{gem_name}'"
+ elsif found.length != 1
+ raise Exception,
+ "Unsure which Gem to use for #{gem_name}, got #{found}"
+ end
+ found[0]
+ end
+
+ def chunk_name_for_gemspec(spec)
+ # Chunk names are the Gem's "full name" (name + version number), so
+ # that we don't break in the rare but possible case that two different
+ # versions of the same Gem are required for something to work. It'd be
+ # nicer to only use the full_name if we detect such a conflict.
+ spec.full_name
+ end
+
+ def generate_chunk_morph_for_gem(spec)
+ description = 'Automatically generated by rubygems.to_chunk'
+
+ bin_dir = "\"$DESTDIR/$PREFIX/bin\""
+ gem_dir = "\"$DESTDIR/$(gem environment home)\""
+
+ # There's more splitting to be done, but putting the docs in the
+ # correct artifact is the single biggest win for enabling smaller
+ # system images.
+ #
+ # Adding this to Morph's default ruleset is painful, because:
+ # - Changing the default split rules triggers a rebuild of everything.
+ # - The whole split rule code needs reworking to prevent overlaps and to
+ # make it possible to extend rules without creating overlaps. It's
+ # otherwise impossible to reason about.
+
+ split_rules = [
+ {
+ 'artifact' => "#{spec.full_name}-doc",
+ 'include' => [
+ 'usr/lib/ruby/gems/\d[\w.]*/doc/.*'
+ ]
+ }
+ ]
+
+ # It'd be rather tricky to include these build instructions as a
+ # BuildSystem implementation in Morph. The problem is that there's no
+ # way for the default commands to know what .gemspec file they should
+ # be building. It doesn't help that the .gemspec may be in a subdirectory
+ # (as in Rails, for example).
+ #
+ # Note that `gem help build` says the following:
+ #
+ # The best way to build a gem is to use a Rakefile and the
+ # Gem::PackageTask which ships with RubyGems.
+ #
+ # It's often possible to run `rake gem`, but this may require Hoe,
+ # rake-compiler, Jeweler or other assistance tools to be present at Gem
+ # construction time. It seems that many Ruby projects that use these tools
+ # also maintain an up-to-date generated .gemspec file, which means that we
+ # can get away with using `gem build` just fine in many cases.
+ #
+ # Were we to use `setup.rb install` or `rake install`, programs that loaded
+ # with the 'rubygems' library would complain that required Gems were not
+ # installed. We must have the Gem metadata available, and `gem build; gem
+ # install` seems the easiest way to achieve that.
+
+ build_commands = [
+ "gem build #{spec.name}.gemspec",
+ ]
+
+ install_commands = [
+ "mkdir -p #{gem_dir}",
+ "gem install --install-dir #{gem_dir} --bindir #{bin_dir} " +
+ "--ignore-dependencies --local ./#{spec.full_name}.gem"
+ ]
+
+ {
+ 'name' => chunk_name_for_gemspec(spec),
+ 'kind' => 'chunk',
+ 'description' => description,
+ 'build-system' => 'manual',
+ 'products' => split_rules,
+ 'build-commands' => build_commands,
+ 'install-commands' => install_commands,
+ }
+ end
+
+  def build_deps_for_gem(spec)
+    spec.dependencies.select do |d|
+      d.type == :development && @build_dependency_whitelist.member?(d.name)
+    end
+  end
+
+ def runtime_deps_for_gem(spec)
+ spec.dependencies.select {|d| d.type == :runtime}
+ end
+
+ def write_morph(file, morph)
+ file.write(YAML.dump(morph))
+ end
+
+ def run
+ source_dir_name, gem_name = parse_options(ARGV)
+
+ Log.info("Creating chunk morph for #{gem_name} based on " +
+ "source code in #{source_dir_name}")
+
+ Dir.chdir(source_dir_name)
+
+ # Instead of reading the real Gemfile, invent one that simply includes the
+ # chosen .gemspec. If present, the Gemfile.lock will be honoured.
+ fake_gemfile = Bundler::Dsl.new
+ fake_gemfile.source('https://rubygems.org')
+ fake_gemfile.gemspec({:name => gem_name})
+
+ definition = fake_gemfile.to_definition('Gemfile.lock', true)
+ resolved_specs = definition.resolve_remotely!
+
+ spec = get_spec_for_gem(resolved_specs, gem_name)
+
+ if not spec_is_from_current_source_tree(spec)
+ error "Specified gem '#{spec.name}' doesn't live in the source in " +
+ "'#{source_dir_name}'"
+ Log.debug "SPEC: #{spec.inspect} #{spec.source}"
+ exit 1
+ end
+
+ morph = generate_chunk_morph_for_gem(spec)
+
+    # One might think that you could use the Bundler::Dependency.groups
+    # field to filter, but it doesn't seem to be useful. Instead we go back to
+    # the Gem::Specification of the target Gem and use the dependencies field
+    # there. We look up each dependency in the resolved spec set to find out
+    # which version of it Bundler has chosen.
+
+ def format_deps_for_morphology(specset, dep_list)
+ info = dep_list.collect do |dep|
+ spec = specset[dep][0]
+ [spec.name, spec.version.to_s]
+ end
+ Hash[info]
+ end
+
+ build_deps = format_deps_for_morphology(
+ resolved_specs, build_deps_for_gem(spec))
+ runtime_deps = format_deps_for_morphology(
+ resolved_specs, runtime_deps_for_gem(spec))
+
+ morph['x-build-dependencies-rubygems'] = build_deps
+ morph['x-runtime-dependencies-rubygems'] = runtime_deps
+
+ write_morph(STDOUT, morph)
+ end
+end
+
+RubyGemChunkMorphologyGenerator.new.run
diff --git a/import/rubygems.to_lorry b/import/rubygems.to_lorry
new file mode 100755
index 00000000..cd83e33b
--- /dev/null
+++ b/import/rubygems.to_lorry
@@ -0,0 +1,163 @@
+#!/usr/bin/python
+#
+# Create a Baserock .lorry file for a given RubyGem
+#
+# Copyright (C) 2014 Codethink Limited
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along
+# with this program; if not, write to the Free Software Foundation, Inc.,
+# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+
+
+import requests
+import requests_cache
+import yaml
+
+import logging
+import json
+import os
+import sys
+import urlparse
+
+from importer_base import ImportException, ImportExtension
+
+
+class GenerateLorryException(ImportException):
+ pass
+
+
+class RubyGemsWebServiceClient(object):
+ def __init__(self):
+ # Save hammering the rubygems.org API: 'requests' API calls are
+ # transparently cached in an SQLite database, instead.
+ requests_cache.install_cache('rubygems_api_cache')
+
+ def _request(self, url):
+ r = requests.get(url)
+ if r.ok:
+ return json.loads(r.text)
+ else:
+ raise GenerateLorryException(
+ 'Request to %s failed: %s' % (r.url, r.reason))
+
+ def get_gem_info(self, gem_name):
+ info = self._request(
+ 'http://rubygems.org/api/v1/gems/%s.json' % gem_name)
+
+ if info['name'] != gem_name:
+ # Sanity check
+            raise GenerateLorryException(
+                'Received info for Gem "%s", requested "%s"' % (
+                    info['name'], gem_name))
+
+ return info
+
+
+class RubyGemLorryGenerator(ImportExtension):
+ def __init__(self):
+ super(RubyGemLorryGenerator, self).__init__()
+
+ with open('rubygems.yaml', 'r') as f:
+ local_data = yaml.load(f.read())
+
+ self.known_source_uris = local_data['known-source-uris']
+
+ logging.debug(
+ "Loaded %i known source URIs from local metadata.", len(self.known_source_uris))
+
+ def process_args(self, args):
+ if len(args) != 1:
+ raise ImportException(
+ 'Please call me with the name of a RubyGem as an argument.\n')
+
+ gem_name = args[0]
+
+ lorry = self.generate_lorry_for_gem(gem_name)
+ self.write_lorry(sys.stdout, lorry)
+
+ def find_upstream_repo_for_gem(self, gem_name, gem_info):
+ source_code_uri = gem_info['source_code_uri']
+
+ if gem_name in self.known_source_uris:
+ logging.debug('Found %s in known-source-uris', gem_name)
+ known_uri = self.known_source_uris[gem_name]
+ if source_code_uri is not None and known_uri != source_code_uri:
+ sys.stderr.write(
+ '%s: Hardcoded source URI %s doesn\'t match spec URI %s\n' %
+ (gem_name, known_uri, source_code_uri))
+ return known_uri
+
+ if source_code_uri is not None and len(source_code_uri) > 0:
+ logging.debug('Got source_code_uri %s', source_code_uri)
+ if source_code_uri.endswith('/tree'):
+ source_code_uri = source_code_uri[:-len('/tree')]
+
+ return source_code_uri
+
+ homepage_uri = gem_info['homepage_uri']
+ if homepage_uri is not None and len(homepage_uri) > 0:
+            logging.debug('Got homepage_uri %s', homepage_uri)
+ netloc = urlparse.urlsplit(homepage_uri)[1]
+ if netloc == 'github.com':
+ return homepage_uri
+
+ # Further possible leads on locating source code.
+ # http://ruby-toolbox.com/projects/$gemname -> sometimes contains an
+ # upstream link, even if the gem info does not.
+ # https://github.com/search?q=$gemname -> often the first result is
+ # the correct one, but you can never know.
+
+ raise GenerateLorryException(
+ "Did not manage to find the upstream source URL for Gem '%s'. "
+ "Please manually create a .lorry file, or add the Gem to "
+ "known-source-uris in rubygems.yaml." % gem_name)
+
+ def project_name_from_repo(self, repo_url):
+ if repo_url.endswith('/tree/master'):
+ repo_url = repo_url[:-len('/tree/master')]
+ if repo_url.endswith('/'):
+ repo_url = repo_url[:-1]
+ if repo_url.endswith('.git'):
+ repo_url = repo_url[:-len('.git')]
+ return os.path.basename(repo_url)
+
+ def generate_lorry_for_gem(self, gem_name):
+ rubygems_client = RubyGemsWebServiceClient()
+
+ gem_info = rubygems_client.get_gem_info(gem_name)
+
+ gem_source_url = self.find_upstream_repo_for_gem(gem_name, gem_info)
+ logging.info('Got URL <%s> for %s', gem_source_url, gem_name)
+
+ project_name = self.project_name_from_repo(gem_source_url)
+
+ # One repo may produce multiple Gems. It's up to the caller to merge
+ # multiple .lorry files that get generated for the same repo.
+
+ lorry = {
+ project_name: {
+ 'type': 'git',
+ 'url': gem_source_url,
+ 'x-products-rubygems': [gem_name]
+ }
+ }
+
+ return lorry
+
+ def write_lorry(self, stream, lorry):
+ json.dump(lorry, stream, indent=4)
+ # Needed so the morphlib.extensions code will pick up the last line.
+ stream.write('\n')
+
+
+if __name__ == '__main__':
+ RubyGemLorryGenerator().run()
diff --git a/import/rubygems.yaml b/import/rubygems.yaml
new file mode 100644
index 00000000..d31a625a
--- /dev/null
+++ b/import/rubygems.yaml
@@ -0,0 +1,45 @@
+---
+
+# The :development dependency set is way too broad for our needs: for most Gems,
+# it includes test tools and development aids that aren't necessary for just
+# building the Gem. It's hard to even get a stratum if we include all these
+# tools because of the number of circular dependencies. Instead, only those
+# tools which are known to be required at Gem build time are listed as
+# build-dependencies, and any other :development dependencies are ignored.
+build-dependency-whitelist:
+ - hoe
+ # rake is bundled with Ruby, so it is not included in the whitelist.
+
+# The following Gems don't provide a source_code_uri in their Gem metadata.
+# Ideally they would do.
+known-source-uris:
+ ast: https://github.com/openSUSE/ast
+ brass: https://github.com/rubyworks/brass
+ coveralls: https://github.com/lemurheavy/coveralls-ruby
+ diff-lcs: https://github.com/halostatue/diff-lcs
+ erubis: https://github.com/kwatch/erubis
+ fog-brightbox: https://github.com/brightbox/fog-brightbox
+ highline: https://github.com/JEG2/highline
+ hoe: https://github.com/seattlerb/hoe
+ indexer: https://github.com/rubyworks/indexer
+ json: https://github.com/flori/json
+ method_source: https://github.com/banister/method_source
+ mixlib-authentication: https://github.com/opscode/mixlib-authentication
+ mixlib-cli: https://github.com/opscode/mixlib-cli
+ mixlib-log: https://github.com/opscode/mixlib-log
+ mixlib-shellout: http://github.com/opscode/mixlib-shellout
+ ohai: http://github.com/opscode/ohai
+ rack-cache: https://github.com/rtomayko/rack-cache
+ actionmailer: https://github.com/rails/rails
+ actionpack: https://github.com/rails/rails
+ actionview: https://github.com/rails/rails
+ activejob: https://github.com/rails/rails
+ activemodel: https://github.com/rails/rails
+ activerecord: https://github.com/rails/rails
+ activesupport: https://github.com/rails/rails
+ rails: https://github.com/rails/rails
+ railties: https://github.com/rails/rails
+ pg: https://github.com/ged/ruby-pg
+ sigar: https://github.com/hyperic/sigar
+ sprockets: https://github.com/sstephenson/sprockets
+ tins: https://github.com/flori/tins