| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Always report failure to host, but report failure to fabric only
outside of _check_if_nic_is_primary(), which is expected to fail if
the NIC is not primary.
Add two types of reportable errors for IMDS metadata:
- add ReportableErrorImdsUrlError() for url errors.
- add ReportableErrorImdsMetadataParsingException() for parsing errors.
Tweak ReportableError repr to be a bit friendlier.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Add host_only flag to _report_failure() to allow caller to only
report the failure to host. This is for cases where we don't want
_report_failure() to attempt DHCP or we expect that we may recover
from the reported error (there is no issue reporting multiple times
to host, whereas fabric reports will immediately fail the VM
provisioning).
- Add ReportableErrorDhcpLease() to report lease failures.
- Add ReportableErrorDhcpInterfaceNotFound() to report errors where the
DHCP interface hasn't been found yet.
- Add TestReportFailure class with new test coverage. Will migrate other
_report_failure() tests in the future as they currently depend on
TestAzureDataSource/CiTestCase.
Future work will add the interface name to supporting data, but as that
information is not available with iface=None, another PR will explicitly
add a call to net.find_fallback_nic() to specify it.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
| |
It was only used by Hyper-V which now has a filtering
mechanism that does not require the use of a denylist.
This exposed some issues with tests misspelling "hv_netvsc"
and using unmatched mac addresses. This fixes those to work
with the current filter that does not rely on the driver name.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
| |
Provide an option to suppress error logging from mount_cb, as some
errors are expected and handled appropriately by DataSources.
For example: failure to mount NTFS volumes on VMs that
do not have NTFS drivers.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add success reporting to the host via KVP.
- Move _report_failure_to_host() into kvp module.
- Tweak error description to use result=error instead of
PROVISIONING_ERROR: ...
- Use result=success for the successful ("ready") reports.
- report_x_via_kvp => report_x_to_host for consistency with fabric.
ReportableError.as_description() => as_encoded_report()
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Azure can report provisioning failures via the Wireserver health
endpoint. However, in the event of networking failures or Wireserver
issues, this report cannot be made: the VM hits an OS provisioning
timeout and a generic error is presented to the user.
Report the failure via KVP using the "PROVISIONING_REPORT" key so
that the host can relay the provisioning error report to the user
when the VM fails to provision.
The format used is subject to change and/or removal.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of a fixed number of retries, allow up to 5 minutes to fetch
metadata from IMDS. The current approach allows for up to 11 attempts
depending on the path. Given the timeout setting, this can vary from
~11 seconds up to ~32 seconds depending on whether or not read/connection
timeouts are encountered.
Delaying boot on the rare occasion that IMDS is delayed is better than
ignoring the metadata as it ensures the VM is configured as expected.
This is a very conservative timeout and may be reduced in the future.
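The switch from a fixed retry count to a time budget can be sketched as follows. This is a minimal illustration, not the actual cloud-init code: fetch_with_deadline(), its parameters, and the choice of OSError as the retryable error are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def fetch_with_deadline(
    fetch: Callable[[], dict],
    timeout_seconds: float = 300.0,
    retry_delay: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it succeeds or the time budget is spent.

    Hypothetical helper: the names and the OSError filter are
    assumptions for illustration only.
    """
    deadline = clock() + timeout_seconds
    last_error: Optional[Exception] = None
    while True:
        try:
            return fetch()
        except OSError as error:  # covers ConnectionError and TimeoutError
            last_error = error
        # Give up if another delay would carry us past the deadline.
        if clock() + retry_delay > deadline:
            raise TimeoutError("metadata fetch timed out") from last_error
        sleep(retry_delay)
```

Bounding by elapsed time rather than attempt count keeps the worst case predictable regardless of whether individual attempts fail fast or wait out a read timeout.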
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Move isc-dhclient code to dhcp.py
In support of the upcoming deprecation of
isc-dhcp-client, this code refactors current
dhcp code into classes in dhcp.py. The
primary user-visible change should be the
addition of the following log:
dhcp.py[DEBUG]: DHCP client selected: dhclient
This code lays groundwork to enable
alternate implementations to live side by
side in the codebase to be selected with
distro-defined priority fallback. Note that
maybe_perform_dhcp_discovery() now selects
which dhcp client to call, and then runs the
corresponding client's dhcp_discovery()
method. Currently only class IscDhclient is
implemented, however a yet-to-be-implemented
class Dhcpcd exists to test fallback behavior
and this will be implemented in part two of
this series.
Part of this refactor includes shifting
dhclient service management from hardcoded
calls to the distro-defined manage_service()
method in the *BSDs. Future work is required
in this area to support multiple clients via
select_dhcp_client().
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When provisioning failures occur on Azure, a generic description is
used in the report and ultimately returned to the user. To improve
the user experience, report details of the failure in a manner that is
parsable, readable and succinct. The current approach is to use csv
with a custom delimiter ("|") and quote character ("'"). This format
may change in the future.
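A csv encoding with "|" as the delimiter and "'" as the quote character might look like the sketch below. The function names and the key=value field contents are hypothetical; only the delimiter and quote character come from the description above.

```python
import csv
import io
from typing import List


def encode_report(fields: List[str]) -> str:
    """Join fields with "|", quoting with "'" only where needed."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter="|", quotechar="'", quoting=csv.QUOTE_MINIMAL
    )
    writer.writerow(fields)
    # Drop the line terminator the csv writer appends.
    return buffer.getvalue().rstrip("\r\n")


def decode_report(report: str) -> List[str]:
    """Split an encoded report back into its fields."""
    reader = csv.reader(io.StringIO(report), delimiter="|", quotechar="'")
    return next(reader)
```

With minimal quoting, a field is wrapped in quote characters only when it contains the delimiter or the quote character itself, so typical reports stay readable while unusual values still round-trip.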
Gracefully handle reportable errors thrown while crawling metadata and
treat other exceptions as ReportableErrorUnhandledException. Future
work will introduce more reportable errors to handle the expected
failure cases.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Add query_system_uuid() for getting system uuid from dmi in
normalized (lower-cased) form.
- Add byte_swap_system_uuid() to convert a system uuid for gen1
instances to the compute.vmId as presented by IMDS.
- Add convert_system_uuid_to_vm() to convert system uuid to vm
id depending on whether it is gen1 or gen2.
- Add is_vm_gen1() to determine if VM is Azure's gen1 by checking
for availability of EFI (used in gen2).
- Add query_vm_id() helper to get VM id without system uuid.
- Move ChassisAssetTag from Azure helpers into identity.
- Update DataSourceAzure._iid() to use this module.
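The gen1 byte swap can be sketched with the standard uuid module, assuming the swap is the usual little-endian reordering of the first three UUID fields. The function system_uuid_to_vm_id() and its is_gen1 parameter are hypothetical stand-ins; the real module determines the generation by checking for EFI rather than taking a flag.

```python
import uuid


def byte_swap_system_uuid(system_uuid: str) -> str:
    """Reverse the byte order of the first three UUID fields.

    Assumption: the gen1 conversion is the standard little-endian
    swap exposed by uuid.UUID.bytes_le.
    """
    return str(uuid.UUID(bytes=uuid.UUID(system_uuid).bytes_le))


def system_uuid_to_vm_id(system_uuid: str, is_gen1: bool) -> str:
    """Hypothetical helper: normalize, then swap only for gen1 VMs."""
    normalized = system_uuid.lower()
    return byte_swap_system_uuid(normalized) if is_gen1 else normalized
```

Applying the swap twice returns the original value, which makes the conversion easy to sanity-check.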
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Pull out remaining PPS handling bits from _poll_imds() and add two
explicit methods for the overloaded path:
- _wait_for_pps_running_reuse() for running PPS logic.
- _wait_for_pps_unknown_reuse() for unknown and recovery PPS logic.
For consistency:
- Rename _wait_for_all_nics_ready() -> _wait_for_pps_savable_reuse().
- Move reporting ready logic into _wait_for_pps_os_disk_shutdown().
Drop several impacted tests as coverage already exists in
TestProvisioning, and update the rest to handle the +/- 1 DHCP attempt
due to varying assumptions around PPS state and DHCP.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit d1ffbea556a06105 enabled skipping python datasource detection on
OpenStack when no other datasources (besides DataSourceNone) can be discovered.
This allowed one to override detection, which is a requirement for OpenStack
Ironic which does not advertise itself to cloud-init.
Since no further datasources can be detected at this stage in the code, this
pattern can be generalized to other datasources to facilitate troubleshooting
or providing a general workaround to runtime detection bugs.
Additionally, this pattern can be extended to kernel commandline datasource
definition. Since kernel commandline is highest priority of the
configurations, it makes sense to override python code datasource
detection as well.
Include an integration test on LXD for this behavior that configures kernel
commandline and reboots to verify that the specified datasource is forced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a networking check in _poll_imds() which will attempt DHCP
again if networking is not up for source PPS. With the previous
change to wait at least 20 minutes during provisioning for DHCP,
this additional round is not necessary.
Report failure if networking is not up for any mode of source PPS.
In practice, this is very unlikely as provisioning will typically
time out within the 20 minute window the VM is attempting DHCP and
the source PPS VM will be deleted.
This fixes an (unobserved) issue where Savable PPS does not have
networking prior to _wait_for_all_nics_ready().
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are effectively two regressions in the recent IMDS refactor:
1. The metadata check len(imds_md["interface"]) in
_check_if_nic_is_primary() is no longer correct as the refactor
switched URLs and did not update this call to account for the
fact that this metadata now lives under "network".
2. Network metadata was fetched with infinite=True and is now limited
to ten retries. This callback had the twist of only allowing up to
ten connection errors but otherwise would retry indefinitely.
For check_if_nic_is_primary():
- Drop the interface count check for _check_if_nic_is_primary(),
we don't need it anyways.
- Fix/update the unit tests mocks that allowed the tests to pass,
adding another test to verify max retries for http and connection
errors.
- Use 300 retries. We do not want to hit a case where we spin forever,
but this should be more than enough time for IMDS to respond in
the Savable PPS case (~5 minutes).
For IMDS:
- Consolidate IMDS retry handlers into a new ReadUrlRetryHandler class
that supports the options required for each variant of request.
- Minor tweaks to log and expand logging checks in unit tests.
- Move all unit tests to mocking via mock_requests_session_request
and replace mock_readurl fixture with wrapped_readurl to improve
consistency between tests.
Note that this change drops usage of `retry_on_url_exc`, which can
probably be removed altogether as it is no longer used AFAICT.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
| |
Usage was dropped in de7851b93c5a2d4658.
|
|
|
|
|
|
|
|
|
|
| |
Create new azure package for better organization and move
IMDS logic for fetching into it.
Future work will clean up the test_azure.py tests a little
further thanks to these changes, but wanted to minimize churn
here to make changes fairly visible.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Early attempts to fetch metadata on Azure may fail with connection
errors. While this class of errors is not ideal to retry on,
the impact is minimal given that:
1. retries are fairly limited (10)
2. Persistent connection errors would indicate that cloud-init is
using a non-primary NIC which is a rare case of failure that
will be addressed in the future.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Initialize md and cfg to the fallback used when no OVF is found
and IMDS is required.
- Rename metadata_source -> ovf_source and drop usage of "IMDS" as
a valid value.
- Set `self.seed` to "IMDS" when ovf_source is unset.
- Remove late check for metadata source. This is already done
by the earlier check where we'll fail with "No OVF or IMDS
available".
- Move "Found provisioning metadata" diagnostic up to where we
read OVF. Suggesting it was "IMDS" prior to querying IMDS
is misleading.
- Add warning when falling back to IMDS-only provisioning.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
| |
The same default description is used for all error cases.
Remove this parameter in favor of assuming the default in all
cases. Future work will allow for error reporting with a
customizable description using a different interface.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ordering of NICs provided by IMDS may not match the order enumerated
by kernel. As such, we do not have any guarantee that the nic we're
checking the driver for is the nic we think it is.
Instead of making any assumptions about how the nics are named, check
all interfaces by mac address. If there is an interface using
"hv_netvsc", match against that. If there is only one interface driver
that is not blacklisted, use that (in case it is not "hv_netvsc"), but
log a debug event. If there are multiple hits, don't match against any
of the names and report a warning.
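The selection rules above can be sketched as follows. The Nic tuple, the function name, and the denied_drivers argument are assumptions for illustration; in particular, picking the first "hv_netvsc" hit is a simplification, and the debug and warning logging described above is only noted in comments.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Nic(NamedTuple):
    """Hypothetical view of a kernel interface."""
    name: str
    mac: str
    driver: str


def find_nic_name_by_mac(
    nics: Iterable[Nic], mac: str, denied_drivers: Set[str]
) -> Optional[str]:
    """Pick the interface name to match for a given MAC address."""
    matches = [nic for nic in nics if nic.mac.lower() == mac.lower()]
    # Prefer an interface driven by hv_netvsc.
    preferred = [nic for nic in matches if nic.driver == "hv_netvsc"]
    if preferred:
        return preferred[0].name
    # Otherwise accept a single interface whose driver is not denied
    # (the commit logs a debug event in this case).
    allowed = [nic for nic in matches if nic.driver not in denied_drivers]
    if len(allowed) == 1:
        return allowed[0].name
    # Zero or multiple hits: match no name (the commit warns on multiple).
    return None
```

Matching by MAC address rather than by name sidesteps any mismatch between the order IMDS lists NICs and the order the kernel enumerates them.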
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
| |
Currently, get_instance_id() assumes that the instance ID is in the metadata.
If not found, it falls back to a hardcoded string "iid-datasource".
Override this behavior to query the instance id as needed.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
| |
A new attribute was added to DataSourceAzure[1].
Since the base class uses CloudInitPickleMixin,
we need to define this new attribute in _unpickle()
Add multiple tests to improve pickle coverage.
[1] https://github.com/canonical/cloud-init/pull/1523
|
|
|
|
|
|
|
|
| |
In the case cloudinit.temp_utils points to a fs mounted as noexec
and needs_exe=True, fall back to using
os.path.join(Distro.usr_lib_exec, "cloud-init/clouddir"), which
will be mounted with exec perms.
LP: #1962343
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Upon reporting ready for Savable PPS, the VM may be suspended before
we see the http request complete. When the VM resumes,
http_with_retries will keep retrying even though it sees
"Network is unreachable" errors due to the unplugged NIC (and
perhaps a new unconfigured one) or "Read timed out" raised.
- Do not retry when "Network is unreachable", this will not
resolve itself in any case.
- Ignore all url errors for Savable PPS. Worst case scenario is
we failed to report ready anyways (for whatever reason) and the
source PPS VM will soon be discarded.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some pre-provisioning scenarios require that the VM be started and
shut back down as part of preparing the VM for future use.
When the PPS type is PreprovisionedOSDisk, report ready and wait for
host to shut down the VM.
Provisioning will resume normally on next boot, so do not write
a reported ready marker file.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
| |
Use `hashed_passwd` instead of `passwd`.
The password is still set for the default (admin) user but isn't
immediately expired as a result of this change:
https://github.com/canonical/cloud-init/pull/1577
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In preparation for adding support for a new tag, move the
current DMI chassis asset tag into an enum. No change
in behavior should be present other than reporting.
- Create ChassisAssetTag enum for containing all Azure
DMI chassis asset tags and logic to query system for it.
- Add current DMI asset tag to enum as AZURE_CLOUD.
- Reporting: drop event frame and report valid asset tag.
- Update tests for platform viability to pytest.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Use ElementTree instead of minidom
* Use namespaces and case sensitive names
* Decouple parsing from usage in config/metadata dictionaries
* More clearly distinguish between NonAzureDataSource() and
BrokenAzureDataSource() exceptions. Only raise NonAzureDataSource()
exception if the ProvisioningSection in the windowsazure namespace
is not found. Any other parsing failures will result in
BrokenAzureDataSource() being raised.
* Streamline log messages
* Move logic into Azure helper module
There should be no effective change in behavior unless some bad XML
is in the wild and being ignored or failing silently.
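The distinction between the two exceptions can be sketched with ElementTree and an explicit namespace map. The exception classes here are local stand-ins, the function name is hypothetical, and the namespace URI is an assumption about what the windowsazure prefix maps to.

```python
import xml.etree.ElementTree as ET

# Assumed URI for the "windowsazure" namespace.
NAMESPACES = {"wa": "http://schemas.microsoft.com/windowsazure"}


class NonAzureDataSource(Exception):
    """Stand-in: the document is not an Azure ovf-env.xml."""


class BrokenAzureDataSource(Exception):
    """Stand-in: the document could not be parsed."""


def find_provisioning_section(xml_text: str) -> ET.Element:
    """Return the ProvisioningSection element or raise."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        raise BrokenAzureDataSource(f"invalid XML: {error}") from error
    # Namespaced, case-sensitive lookup.
    section = root.find("./wa:ProvisioningSection", NAMESPACES)
    if section is None:
        raise NonAzureDataSource("ProvisioningSection not found")
    return section
```

Because the lookup is namespace-qualified and case sensitive, an element such as provisioningsection, or one in a different namespace, is treated as absent.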
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Ensure cloud_dir setting is respected rather than hardcoding
"/var/lib/cloud"
- Modules affected: cmd.main, apport, devel.logs (collect-logs),
cc_snap, sources.DataSourceAzure, sources.DataSourceBigstep,
util:fetch_ssl_details.
- testing: Extend and port to pytest unit tests, add integration test.
LP: #1976564
|
|
|
|
|
| |
It is unused and unsupported in Azure's ovf-env.xml. Remove it.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
| |
- seedfrom is not used by Azure ovf-env.xml, remove it.
- azure_data is capturing arbitrary keys and we already
have a redacted ovf-env.xml if we need to inspect any
(unused) properties.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
| |
This property is not found in Azure's ovf-env.xml.
Remove relevant code merging it into datasource config
and unit tests.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
| |
Azure does not populate ovf-env.xml with UserData, just CustomData.
Update tests accordingly.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Replace parse_network_config() with _generate_network_config()
instance method and consolidate cache checks into network_config.
- Update _generate_network_config_from_imds_metadata() to take
just network metadata portion of instance metadata and rename to
generate_network_config_from_instance_network_metadata().
- Consolidate relevant unit tests and refactor to pytest.
- Update net-convert to use
generate_network_config_from_instance_network_metadata().
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Also refactor network context managers into net.ephemeral
Currently EC2 is the only IMDS to make use of this.
IPv6 requires a link local address on interfaces. A
link local address is sufficient for the EC2 IMDS,
so no dhcp6 assignment is required for early boot
IMDS queries.
The kernel assigns this address using RFC 4291 [1]
during link initialization, so all cloud-init needs
to do is ensure that link is up.
This means that even if dhcp4 fails, an ipv6-enabled
instance may still succeed at crawling metadata.
[1] https://datatracker.ietf.org/doc/html/rfc4291#section-2.5.6
|
|
|
|
|
|
| |
- Remove references and dead code to Xenial, Eoan, Python < 3.7
- cc_ubuntu_drivers: Use python3-debconf instead of shell script
- add integration test for ubuntu_drivers
- bump pycloudlib for OCI subnet/jammy fixes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For primary network config:
- Use `iSCSI` config if some `/run/net*` file exists, even if
`/run/initramfs/open-iscsi.interface` does not.
- If the instance is not an `iSCSI` one, then crawl the network
config from `IMDS` instead of falling back to "best guess".
- Remove unnecessary conditional use of dhcp.EphemeralDHCPv4
and use it always to crawl `IMDS`.
- Migrate tests to pytest.
- Extend unit test coverage.
- Add some types for mypy.
LP: #1967942
|
|
|
|
|
|
|
|
|
| |
- Add types to let mypy pass.
- Add mypy flags:
- detect unused ignores
- redundant casts
- Drop support of `ConfigParser` in Python 2
- Harden DataSourceLXD.network_config
- Convert old-style commented types to proper types.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we haven't reported ready for source PPS then we can treat
the recovery boot like any other. The metadata on the OVF
and IMDS will indicate the PPS type correctly as the state
hasn't changed.
If we have reported ready for source PPS, we continue to fall
into _poll_imds() by way of setting pps_type to UNKNOWN if the
REPORTED_READY_MARKER is present and will not attempt to report
ready again.
This fixes a potential issue when recovering on Savable PPS.
If a recovery boot occurs after the recovery marker is created,
and without reporting ready, the subsequent boot will assume
pps type UNKNOWN and attempt to report ready in _poll_imds()
using the Running PPS netlink operations.
Add unit test coverage for complete recovery scenario.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
| |
- move datasource helpers to dedicated directory
- drop unnecessary executable bit on shebangless python files
|
|
|
|
|
|
| |
Commands may fail in the process of setting up DHCP, e.g.:
udevadm settle, ip link set dev eth0 up, etc.
Report these failures and retry until timeout.
|
|
|
|
|
|
|
| |
While this was a previously intended change, the actual logic was
backwards. Try for 20 minutes during provisioning, only 5 minutes
otherwise.
Add test coverage to verify the timeout for provisioning scenarios.
|
|
|
|
|
|
| |
Currently DS Azure waits for all nics to be up and running during the
restore phase of save-restore VMs. This change will alter the behavior
so that it will only wait for primary nic. This new behavior is consistent
with non-preprovisioning and running types.
|
|
|
|
|
|
| |
This provides a minor readability improvement.
subp.subp(cmd)[0] -> subp.subp(cmd).stdout
subp.subp(cmd)[1] -> subp.subp(cmd).stderr
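Both access styles work on the same result when it is a named tuple. The sketch below uses a local stand-in built on subprocess, not cloud-init's subp module; SubpResult and run() as written here are assumptions for illustration.

```python
import subprocess
import sys
from typing import List, NamedTuple


class SubpResult(NamedTuple):
    """Stand-in result type with named stdout and stderr fields."""
    stdout: str
    stderr: str


def run(cmd: List[str]) -> SubpResult:
    """Run a command and return its captured output as a named tuple."""
    completed = subprocess.run(
        cmd, capture_output=True, text=True, check=True
    )
    return SubpResult(completed.stdout, completed.stderr)


result = run([sys.executable, "-c", "print('hello')"])
# Index access and attribute access refer to the same fields.
assert result[0] == result.stdout
assert result[1] == result.stderr
```

Attribute access documents intent at the call site, while existing index-based callers keep working.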
|
|
|
|
|
|
|
|
|
|
| |
Wait up to 10 seconds for link to come up before continuing. This
typically takes just a few seconds once the NIC is hotplugged.
If it takes longer than 10 seconds for whatever reason, dhclient
should eventually succeed on its next attempt after the link does
come online.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
| |
Just a minor refactoring to clean up the shim.
Update tests to use pytest parametrization.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With reporting ready now happening in local phase, we have access
to ephemeral DHCP lease options and no longer need to parse DHCP
lease files.
- Switch from tracking wireserver endpoint in its encoded form to the
IP string, parsing it only when read from lease options.
- Drop fallback_lease_file and dhcp_options parameters in favor of
processed endpoint string.
- Add some minor type information for mypy.
- Update various tests.
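Parsing an encoded endpoint into an IP string once, at the point it is read from lease options, might look like the sketch below. It assumes the encoded form is colon-separated hex octets; the function name is hypothetical and other encodings are not handled.

```python
import ipaddress


def decode_endpoint(encoded: str) -> str:
    """Convert colon-separated hex octets to a dotted-quad string.

    Assumption: the lease option value looks like "a8:3f:81:10".
    """
    packed = bytes(int(part, 16) for part in encoded.split(":"))
    # IPv4Address accepts exactly four packed bytes and raises otherwise.
    return str(ipaddress.IPv4Address(packed))
```

Storing the decoded string means downstream code never needs to know how the lease option was encoded.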
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
| |
With debug logging, tight loops may result in huge log file sizes, e.g.:
"Unable to find fallback nic"
1. Raise NoDHCPLeaseMissingDhclientError to caller if no dhclient found
instead of retrying DHCP, retrying will not fix a missing dhclient.
2. For other DHCP failures, retry after sleeping one second.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are two issues with IMDS retries:
1. IMDS_VER_WANT will never be attempted if retries=0, such as
when fetching network metadata with infinite=True.
2. get_imds_data_with_api_fallback() will attempt one request with
IMDS_VER_WANT. If the connection fails due to a timeout, connection
issue, or error code other than 400, an empty dictionary will be
returned without attempting the requested number of retries.
This PR:
- Updates get_imds_data_with_api_fallback() to invoke
get_metadata_from_imds() with the specified retries and infinite
parameters.
- Updates retry_on_url_exc to take a configurable set of HTTP error
codes and exception types to retry on.
- Add IMDS_RETRY_CODES set to retry with when fetching data from IMDS:
- 404 not found (yet)
- 410 gone / unavailable (yet)
- 429 rate-limited/throttled
- 500 server error
- Replace default callback with imds_readurl_exception_callback,
which configures retry_on_url_exc() with these error codes and
instances.
- Add new pytests for IMDS to eventually replace the unittest
equivalents and improve existing coverage.
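The retry decision described above can be sketched as a small predicate. The status codes come from the list above; the function name, its signature, and the exception types in RETRY_EXCEPTIONS are assumptions, since the commit does not name which exception types are retried.

```python
from typing import FrozenSet, Optional, Tuple, Type

# HTTP status codes listed in the commit message.
IMDS_RETRY_CODES: FrozenSet[int] = frozenset({404, 410, 429, 500})

# Assumed exception types, for illustration only.
RETRY_EXCEPTIONS: Tuple[Type[BaseException], ...] = (
    TimeoutError,
    ConnectionError,
)


def should_retry(
    error: BaseException,
    status_code: Optional[int] = None,
    retry_codes: FrozenSet[int] = IMDS_RETRY_CODES,
    retry_exceptions: Tuple[Type[BaseException], ...] = RETRY_EXCEPTIONS,
) -> bool:
    """Decide whether a failed request is worth retrying."""
    if status_code is not None:
        # An HTTP response was received: retry only on listed codes.
        return status_code in retry_codes
    # No response: retry only on the configured exception types.
    return isinstance(error, retry_exceptions)
```

Keeping the codes and exception types as parameters lets each caller tighten or loosen the policy without changing the predicate.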
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
|
|
|
|
|
|
|
|
| |
If the VM is rebooted during provisioning, the PPS type will be
determined to be UNKNOWN and will poll for reprovision data.
Given that we will never enter _wait_for_all_nics_ready() in any
other condition than a fresh source instance in Savable PPS, we can
safely remove the now-unused code paths.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|