Add gathering of resources from last migrated namespaces, gather PVs, StorageClasses (cluster-scoped), Routes, Service (just registry namespaces) #34

Closed
wants to merge 9 commits into from
2 changes: 1 addition & 1 deletion Dockerfile
@@ -7,7 +7,7 @@ FROM quay.io/openshift/origin-must-gather:4.5 as builder
FROM registry.access.redhat.com/ubi8-minimal:latest
RUN echo -ne "[centos-8-appstream]\nname = CentOS 8 (RPMs) - AppStream\nbaseurl = http://mirror.centos.org/centos-8/8/AppStream/x86_64/os/\nenabled = 1\ngpgcheck = 0" > /etc/yum.repos.d/centos.repo

RUN microdnf -y install rsync tar gzip graphviz findutils
RUN microdnf -y install rsync tar gzip graphviz findutils jq grep

COPY --from=gobuilder /opt/app-root/src/go/bin/pprof /usr/bin/pprof
COPY --from=builder /usr/bin/oc /usr/bin/oc
7 changes: 5 additions & 2 deletions README.md
@@ -14,15 +14,18 @@ The command above will create a local directory with a dump of the MTC state.

You will get a dump of:
- All namespaces where a MTC toolset is installed, including pod logs
- All velero.io and migration.openshift.io resources located in those namespaces
- All velero.io and migration.openshift.io resources located in MTC namespaces
- StorageClasses and PersistentVolume resources
- Route and Service resources from the `default` and `openshift-image-registry` namespaces
- All resources from namespaces migrated in the most recent migration attempt, with the exception of Secrets
- Prometheus metrics

**Essential-only gather**

Differences from full gather:
- Logs are only gathered from specified time window
- Skips collection of prometheus metrics, pprof. Removes duplicate logs from payload.
```
```sh
Contributor

Is "sh" intentional here? I don't think it will show up in the rendered README.

Author

@djwhatle djwhatle Mar 25, 2021

Yeah, so this bit just signals to the markdown renderer that this block of code should be rendered with syntax highlighting for shell scripts.

I believe this is the full list of supported syntax highlighting.

# Essential gather (available time windows: [1h, 6h, 24h, 72h, all])
oc adm must-gather --image=quay.io/konveyor/must-gather:latest -- /usr/bin/gather_24h_essential
```
33 changes: 33 additions & 0 deletions collection-scripts/gather
@@ -16,6 +16,12 @@ for localns in $(/usr/bin/oc get migrationcontrollers.migration.openshift.io --a
done
echo "Will collect debug info from migclusters [${clusters[@]}]"

# Find the latest migration, plan, and associated namespaces so we can gather additional info from those
latest_migration=$(oc -n ${localns} get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name')
Contributor

I am not familiar with jq. Does items[-1] fail when there are no migmigrations present?

Author

@djwhatle djwhatle Mar 25, 2021

Looks like it results in "null" getting set to the var and a 0 return code. Will try a full must-gather in this scenario and make sure it doesn't crash.

$ oc delete migmigration --all
migmigration.migration.openshift.io "8069a3d0-8c19-11eb-b0bf-5fcc05ad4e8c" deleted
migmigration.migration.openshift.io "b5a018a0-8b1e-11eb-ac4b-3b5c0d147d47" deleted

$ latest_migration=$(oc -n openshift-migration get migmigration -o json | jq -r '.items|=sort_by(.metadata.creationTimestamp)' | jq -r '.items[-1].metadata.name')

# Check return code
$  echo $?
0

# Check var output
$ echo $latest_migration
null

Author

@djwhatle djwhatle Mar 25, 2021

Ok, so basically the end-result is that all of the latest_* vars get set to null, and inside the loop that compares current cluster name to latest_cluster, we'll never enter the new block for gathering content of migrated namespaces.

Output from must-gather

[must-gather-n94dc] POD Error from server (NotFound): migmigrations.migration.openshift.io "null" not found
[must-gather-n94dc] POD jq: error (at <stdin>:191): Cannot iterate over null (null)
[must-gather-n94dc] POD [cluster=host][namespace=openshift-migration] Detected MTC installation
[must-gather-n94dc] POD [cluster=host] Not found within latest_migration_clusters=null
[must-gather-n94dc] POD null

So I think this is harmless and works as-is, however I'm open to input on whether we should squash the jq stderr and send it to /dev/null so that the error never shows up.

Contributor

@pranavgaikwad pranavgaikwad Mar 25, 2021

@djwhatle I was under the impression that if a script in must-gather returns a non-zero exit code, the temporary must-gather pod exits with a non-zero status. But maybe that is incorrect. I think it'd be OK to have stderr printed this way, since it doesn't really break the must-gather container.

Author

You're correct, it's just that the RC is always 0 even when jq doesn't find something.
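
The thread above notes that `jq -r` hands back the literal string `null` with a 0 return code when no MigMigration exists. A minimal sketch of an explicit guard, assuming the variable name from the diff; the `is_set_name` helper is hypothetical and not part of this PR:

```shell
# Hypothetical helper: succeeds only when the value looks like a usable name.
# Both the empty string and the literal "null" printed by `jq -r` are
# treated as "nothing found".
is_set_name() {
  [ -n "${1:-}" ] && [ "${1:-}" != "null" ]
}

# Example values standing in for the jq output captured by the gather script.
for latest_migration in "null" "" "8069a3d0-8c19-11eb-b0bf-5fcc05ad4e8c"; do
  if is_set_name "${latest_migration}"; then
    echo "would look up the plan for migration '${latest_migration}'"
  else
    echo "no migration found, skipping migrated-namespace gather"
  fi
done
```

With a guard like this, the follow-up `oc get migmigration null` lookup and its NotFound error could be skipped entirely rather than relying on the later cluster-name comparison never matching.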

latest_migration_plan=$(oc -n ${localns} get migmigration ${latest_migration} -o jsonpath={.spec.migPlanRef.name})
latest_migration_namespaces=$(oc -n ${localns} get migplan ${latest_migration_plan} -o json | jq -r '.spec.namespaces[]')
latest_migration_clusters=$(oc -n ${localns} get migplan ${latest_migration_plan} -o json | jq -r '.spec.destMigClusterRef.name, .spec.srcMigClusterRef.name')

# Iterate over all connected non-host OpenShift clusters
for cluster in ${clusters[@]}; do
unset KUBECONFIG
@@ -41,12 +47,29 @@ for localns in $(/usr/bin/oc get migrationcontrollers.migration.openshift.io --a
namespaces+=(${ns})
done

# Check if this cluster has a migrated namespace we want to oc adm inspect
for ns in ${latest_migration_namespaces}; do
echo "[cluster=${cluster}][namespace=${ns}][migration=${latest_migration}][plan=${latest_migration_plan}] Running oc adm inspect on namespace that is part of latest migration and plan"
mkdir -p must-gather/clusters/${cluster}/migrated-namespaces
/usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster}/migrated-namespaces ns/${ns} &
done

# Collect all resources in MTC namespaces with must-gather
for ns in ${namespaces[@]}; do
echo "[cluster=${cluster}][namespace=${ns}] Running oc adm inspect"
/usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} --all-namespaces ns/${ns} &
done

# Collect PV and StorageClass info for cluster
echo "[cluster=${cluster}] Running oc adm inspect storageclasses,persistentvolumes"
/usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} storageclasses,persistentvolumes &
Contributor

+1, any specific reason for not including PVCs here?

Author

@djwhatle djwhatle Mar 25, 2021

PVCs are namespace scoped resources, so they are gathered by the other invocations of oc adm inspect ns/<some-ns>


# Collect Route and Service info needed to troubleshoot direct image migration connection
for ns in default openshift-image-registry; do
echo "[cluster=${cluster}][namespace=${ns}] Running oc adm inspect route,service"
/usr/bin/oc adm inspect --dest-dir must-gather/clusters/${cluster} -n ${ns} route,service &
done

# Collect the migration and velero CRs
echo "[cluster=${cluster}] Gathering MTC and Velero CRs for namespaces [${namespaces[@]}]"
/usr/bin/gather_crs ${cluster} ${namespaces} &
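
The collectors above are each suffixed with `&`, so they run concurrently as background jobs; the part of the script that waits for them is not shown in this diff. A minimal sketch of the background-and-wait pattern, assuming a hypothetical `collect` function in place of the real `oc adm inspect` calls:

```shell
# Hypothetical stand-in for an `oc adm inspect` call: writes one marker file
# per namespace into the destination directory.
collect() {
  sleep 1
  echo "collected $1" > "$2/$1.txt"
}

dest_dir="$(mktemp -d)"

# Launch one background job per namespace, as the gather script does.
for ns in default openshift-image-registry; do
  collect "${ns}" "${dest_dir}" &
done

# Block until every background job has exited before post-processing
# (scrubbing, truncating, or archiving) the collected files.
wait

ls "${dest_dir}"
```

Without the `wait`, any later step that reads the destination directory could run before the collectors have finished writing.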
@@ -84,6 +107,16 @@ else
find /must-gather/clusters/*/namespaces/*/pods/ -name '*.log' -delete
fi

# Wipe secrets from migrated-namespaces data
echo "Scrubbing secrets collected from migrated namespaces"
find must-gather/clusters/*/migrated-namespaces/namespaces/*/*/ -name '*secrets.yaml*' -delete

# Shorten logs from migrated-namespaces to last 100 lines
echo "Shortening logs collected from migrated namespaces to last 100 lines"
for logfile in $(find must-gather/clusters/*/migrated-namespaces/namespaces/*/*/ -name '*.log'); do
tail -100 "${logfile}" > tmp.log && mv tmp.log "${logfile}"
Contributor

I am not sure I understand where these logs are coming from. Are these logs from the Rsync/Stunnel pods?

Author

@djwhatle djwhatle Mar 25, 2021

These are any Pod logs gathered from the migrated namespaces (i.e. the namespaces listed in migplan.spec.namespaces), so they include Pod logs for Rsync, Stunnel, and Stage Pods, but also for any other apps running in those namespaces.

The amount of logs that could be in these namespaces is potentially unbounded since this is where user apps run, so I thought it would make sense to only capture the last bits which would probably contain errors for Rsync or Stunnel or whatever. Do you think 100 lines is enough?

I also considered whether we should only keep the logs for Rsync, Stunnel, and Stage Pods, but since oc adm inspect grabs everything, I'd need to know the naming conventions (like a regex) for all of those Pod names so I could delete everything except those. I like the idea of only grabbing our stuff better, but it may be more brittle since we're still moving containers around (e.g. the current work to combine Rsync and Stunnel).

Contributor

@djwhatle Agreed. I am unsure about gathering user application logs though. Are we supposed to be doing that at all?

Author

@djwhatle djwhatle Mar 25, 2021

There's not a great reason for us to. I think you make a good point, I should refine this further so that it only gathers logs for Pods belonging to us. oc adm inspect was the single-command solution but we can do better with some specialized logic.

This would also make me more comfortable keeping the full logs, which would be more useful.

done
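
The truncation loop above iterates over `$(find ...)`, which splits on whitespace, and writes to a fixed `tmp.log` in the working directory. A sketch of a variant that tolerates unusual file names and uses a unique temporary file per log; the `truncate_logs` function name and its arguments are illustrative and not part of this PR:

```shell
# Keep only the last N lines of every *.log file under a directory.
# Usage: truncate_logs <root-dir> <lines-to-keep>
truncate_logs() {
  find "$1" -type f -name '*.log' -exec sh -c '
    keep="$1"; shift
    for logfile in "$@"; do
      tmp="$(mktemp)" || exit 1
      # Write the tail to a temp file, then copy it back in place so the
      # original file keeps its permissions.
      tail -n "${keep}" "${logfile}" > "${tmp}" && cat "${tmp}" > "${logfile}"
      rm -f "${tmp}"
    done
  ' sh "$2" {} +
}

# Demo on a throwaway directory containing a 250-line log whose path
# includes a space.
demo_dir="$(mktemp -d)"
mkdir -p "${demo_dir}/pods/my pod"
seq 1 250 > "${demo_dir}/pods/my pod/current.log"

truncate_logs "${demo_dir}" 100
wc -l < "${demo_dir}/pods/my pod/current.log"
```

Passing the file names through `find -exec ... {} +` avoids word splitting entirely, so the same function would work unchanged if the line limit discussed in the thread were raised or lowered.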

# Tar all must-gather artifacts for faster transmission
echo "Tarring must-gather artifacts..."
archive_path="/must-gather-archive"