Reactivate the integration tests for remote file access (GCS, S3) #210

Open · wants to merge 4 commits into base: dev
17 changes: 17 additions & 0 deletions .github/workflows/dev-docker.yml
@@ -5,6 +5,8 @@ env:
DEFAULT_IMAGE_INCREMENT: 0
DEFAULT_SERVER_REVISION: main
DEFAULT_PYTHON_VERSIONS: 3.8 3.9 3.10 3.11 3.12 3.13
DEFAULT_KHIOPS_GCS_DRIVER_REVISION: 0.0.10
DEFAULT_KHIOPS_S3_DRIVER_REVISION: 0.0.12
on:
pull_request:
paths: [packaging/docker/khiopspydev/Dockerfile.*, .github/workflows/dev-docker.yml]
@@ -34,6 +36,14 @@ on:
type: string
default: main
description: Khiops Server Revision
khiops-gcs-driver-revision:
type: string
default: 0.0.10
description: Driver version for Google Cloud Storage remote files
khiops-s3-driver-revision:
type: string
default: 0.0.12
description: Driver version for AWS-S3 remote files
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
@@ -55,6 +65,8 @@ jobs:
echo "KHIOPSDEV_OS_CODENAME=$(echo '${{ matrix.khiopsdev-os }}' | tr -d '0-9.')" >> "$GITHUB_ENV"
echo "SERVER_REVISION=${{ inputs.server-revision || env.DEFAULT_SERVER_REVISION }}" >> "$GITHUB_ENV"
echo "IMAGE_URL=ghcr.io/khiopsml/khiops-python/khiopspydev-${{ matrix.khiopsdev-os }}" >> "$GITHUB_ENV"
echo "KHIOPS_GCS_DRIVER_REVISION=${{ inputs.khiops-gcs-driver-revision || env.DEFAULT_KHIOPS_GCS_DRIVER_REVISION }}" >> "$GITHUB_ENV"
echo "KHIOPS_S3_DRIVER_REVISION=${{ inputs.khiops-s3-driver-revision || env.DEFAULT_KHIOPS_S3_DRIVER_REVISION }}" >> "$GITHUB_ENV"
- name: Checkout khiops-python sources
uses: actions/checkout@v4
- name: Set up Docker Buildx
@@ -81,13 +93,18 @@ jobs:
- name: Build image and push it to GitHub Container Registry
uses: docker/build-push-action@v5
with:
# Special hostname used by the integration tests for remote file access
# added using inputs because /etc/hosts is read-only for alternate builders (buildx via moby buildkit)
add-hosts: s3-bucket.localhost:127.0.0.1
context: ./packaging/docker/khiopspydev/
file: ./packaging/docker/khiopspydev/Dockerfile.${{ env.KHIOPSDEV_OS_CODENAME }}
build-args: |
"KHIOPS_REVISION=${{ env.KHIOPS_REVISION }}"
"KHIOPSDEV_OS=${{ matrix.khiopsdev-os }}"
"SERVER_REVISION=${{ env.SERVER_REVISION }}"
"PYTHON_VERSIONS=${{ inputs.python-versions || env.DEFAULT_PYTHON_VERSIONS }}"
"KHIOPS_GCS_DRIVER_REVISION=${{ env.KHIOPS_GCS_DRIVER_REVISION }}"
"KHIOPS_S3_DRIVER_REVISION=${{ env.KHIOPS_S3_DRIVER_REVISION }}"
tags: ${{ env.DOCKER_IMAGE_TAGS }}
# Push only on manual request
push: ${{ inputs.push || false }}
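
With these inputs in place, a manual image build can pin the driver versions; a minimal sketch with the GitHub CLI, assuming the inputs are exposed through a workflow_dispatch trigger:

# Hypothetical manual dispatch pinning the driver revisions
gh workflow run dev-docker.yml \
  -f khiops-gcs-driver-revision=0.0.10 \
  -f khiops-s3-driver-revision=0.0.12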
47 changes: 42 additions & 5 deletions .github/workflows/unit-tests.yml → .github/workflows/tests.yml
@@ -1,5 +1,5 @@
---
name: Unit Tests
name: Tests
env:
DEFAULT_SAMPLES_REVISION: 10.2.4
DEFAULT_KHIOPS_DESKTOP_REVISION: 10.2.4
@@ -43,7 +43,7 @@ jobs:
# because the `env` context is only accessible at the step level;
# hence, it is hard-coded
image: |-
ghcr.io/khiopsml/khiops-python/khiopspydev-ubuntu22.04:${{ inputs.image-tag || 'latest' }}
ghcr.io/khiopsml/khiops-python/khiopspydev-ubuntu22.04:${{ inputs.image-tag || '10.2.4.0.s3-gcs-remote-files' }}
credentials:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
@@ -112,7 +112,29 @@ jobs:
- name: Prepare Unit Tests Environment
if: github.ref != 'dev' && github.ref != 'main' && ! inputs.run-long-tests
run: echo "UNITTEST_ONLY_SHORT_TESTS=true" >> "$GITHUB_ENV"
- name: Run Unit Tests
- name: Prepare Integration Tests on remote files
env:
AWS_ENDPOINT_URL: http://localhost:4569
shell: bash
run: |
# Prepare AWS-S3 credentials and configuration
mkdir -p ${HOME}/.aws/
cat << EOF > ${HOME}/.aws/credentials
[default]
aws_access_key_id=KEY
aws_secret_access_key=SECRET
EOF
cat << EOF > ${HOME}/.aws/configuration
[default]
endpoint_url=${AWS_ENDPOINT_URL}
region=eu-north-1
EOF
echo "Generated AWS credentials..."
cat ${HOME}/.aws/credentials
echo "Generated AWS configuration..."
cat ${HOME}/.aws/configuration
/scripts/run_fake_remote_file_servers.sh . # launch the servers in the background
- name: Run Unit & Integration Tests
env:
KHIOPS_SAMPLES_DIR: ${{ github.workspace }}/khiops-samples
KHIOPS_DOCKER_RUNNER_URL: https://localhost:11000
Expand All @@ -124,6 +146,21 @@ jobs:
rmaps_base_oversubscribe: true
# Oversubscribe for MPI > 4.x
OMPI_MCA_rmaps_base_oversubscribe: true
# for the tests with GCS
GCS_BUCKET_NAME: gcs-bucket
GOOGLE_APPLICATION_CREDENTIALS: XXX
# we take advantage of the built-in `STORAGE_EMULATOR_HOST` env variable
# that every GCS client can read and lets us use a local fake file server
STORAGE_EMULATOR_HOST: http://localhost:4443
GCS_DRIVER_LOGLEVEL: info # set to debug for diagnosis
S3_DRIVER_LOGLEVEL: info # set to debug for diagnosis
# for the tests with S3
S3_BUCKET_NAME: s3-bucket
# ${HOME} is not expanded, so we have to put its value
AWS_SHARED_CREDENTIALS_FILE: /github/home/.aws/credentials
AWS_CONFIG_FILE: /github/home/.aws/configuration
Comment on lines +159 to +161 (Member):
I think that since you are using these variables, you are not obliged to store them at HOME.

Maybe give it a try with ${{ github.workspace }}/.aws/credentials and ${{ github.workspace }}/.aws/configuration. This would also test that we rely only on these variables and not on "standard" locations.

Note that you have to use the GITHUB_WORKSPACE env var in the credentials setup above.
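
A minimal sketch of the behaviour this suggestion relies on: boto3 resolves both files from the environment variables, so they can live anywhere (the paths below are hypothetical):

import os
import boto3

# boto3 reads both variables, so no "standard" ~/.aws location is required
os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "/tmp/aws/credentials"  # hypothetical path
os.environ["AWS_CONFIG_FILE"] = "/tmp/aws/configuration"  # hypothetical path

session = boto3.session.Session()
print(session.get_credentials().access_key)  # "KEY", as written by the setup step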

# common var for tests with GCS & S3
no_proxy: localhost
run: |
# This is needed so that the Git tag is parsed and the khiops-python
# version is retrieved
@@ -138,10 +175,10 @@ jobs:
$CONDA run --no-capture-output -n "$CONDA_ENV" coverage report -m
$CONDA run --no-capture-output -n "$CONDA_ENV" coverage xml -o "reports/$CONDA_ENV/py-coverage.xml"
done
- name: Display Unit Test Reports
- name: Display Test Reports
uses: dorny/test-reporter@v1
with:
name: Unit Tests ${{ matrix.python-version }}
name: Unit & Integration Tests ${{ matrix.python-version }}
Comment (Member):
I'd just call it Run Tests
path: >-
reports/py${{ matrix.python-version }}/TEST-tests.*.*.xml,
reports/py${{ matrix.python-version }}_conda/TEST-tests.*.*.xml
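
The env block above relies on STORAGE_EMULATOR_HOST to point GCS clients at the fake server; a minimal sketch of that mechanism with the google-cloud-storage Python client, assuming an anonymous client suffices for the emulator (bucket name taken from the workflow):

import os
from google.cloud import storage

# The client library reads STORAGE_EMULATOR_HOST and sends all API calls
# to the fake server instead of storage.googleapis.com
os.environ["STORAGE_EMULATOR_HOST"] = "http://localhost:4443"

client = storage.Client.create_anonymous_client()
bucket = client.bucket("gcs-bucket")
print([blob.name for blob in bucket.list_blobs()])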
39 changes: 18 additions & 21 deletions khiops/core/internals/filesystems.py
@@ -254,7 +254,7 @@ def copy_from_local(uri_or_path, local_path):
Raises
------
RuntimeError
If there was a problem when removing.
If there was a problem when copying.
"""
create_resource(uri_or_path).copy_from_local(local_path)

@@ -272,7 +272,7 @@ def copy_to_local(uri_or_path, local_path):
Raises
------
RuntimeError
If there was a problem when removing.
If there was a problem when copying.
"""
create_resource(uri_or_path).copy_to_local(local_path)

@@ -668,29 +668,26 @@ def remove(self):
)

def copy_from_local(self, local_path):
response = self.s3_client.Bucket(self.uri_info.netloc).upload_file(
local_path, self.uri_info.path[1:]
)
status_code = response["ResponseMetadata"]["HTTPStatusCode"]
copy_ok = 200 <= status_code <= 299
if not copy_ok:
raise RuntimeError(
f"S3 copy_from_local failed {self.uri} with code {status_code}: "
+ json.dumps(response)
# boto3.s3.transfer.upload_file is not expected to return an HTTP response
# but instead raises an S3UploadFailedError exception in case of error
try:
self.s3_client.Bucket(self.uri_info.netloc).upload_file(
local_path, self.uri_info.path[1:]
)
# normalize the raised exception
except Exception as e:
Comment (Member):
Either use exc or error instead of e. And everywhere below where this pattern is present.

Comment (Member):
Also I'd catch the expected exception, so when it is an unexpected one, we will know.

Comment (Collaborator @popescu-v, Dec 20, 2024):
I would restrict the "catch" here to S3UploadFailedError:

except S3UploadFailedError as exc:
    raise RuntimeError(...) from exc

Thus, whenever there is a different (unexpected) exception, it would not be handled and hence would be propagated to the caller code, so that we know about it (e.g. from inspecting the CI logs).
raise RuntimeError(f"S3 copy_from_local failed {self.uri}") from e

def copy_to_local(self, local_path):
response = self.s3_client.Bucket(self.uri_info.netloc).download_file(
local_path, self.uri_info.path[1:]
)
status_code = response["ResponseMetadata"]["HTTPStatusCode"]
copy_ok = 200 <= status_code <= 299
if not copy_ok:
raise RuntimeError(
f"S3 download failed {self.uri} with code {status_code}: "
+ json.dumps(response)
# boto3.s3.transfer.download_file is not expected to return an HTTP response
# but instead raises an S3UploadFailedError exception in case of error
Comment on lines +682 to +683 (Member):
I'd remove this comment because the fact that a previous version relied on an HTTP response isn't relevant anymore.
try:
self.s3_client.Bucket(self.uri_info.netloc).download_file(
self.uri_info.path[1:], local_path
)
return copy_ok
# normalize the raised exception
except Exception as e:
Comment (Member):
If there is a more precise exception, it should be caught.
raise RuntimeError(f"S3 download failed {self.uri}") from e

def list_dir(self):
# Add an extra slash to the path to treat it as a folder
28 changes: 17 additions & 11 deletions khiops/core/internals/runner.py
@@ -594,17 +594,23 @@ def _report_exit_status(

# Create the message reporting the errors and warnings
error_msg = ""
errors, fatal_errors, warning_messages = self._collect_errors(log_file_path)
if warning_messages:
error_msg += "Warnings in log:\n" + "".join(warning_messages)
if errors:
if error_msg:
error_msg += "\n"
error_msg += "Errors in log:\n" + "".join(errors)
if fatal_errors:
if error_msg:
error_msg += "\n"
error_msg += "Fatal errors in log:\n" + "".join(fatal_errors)
errors = fatal_errors = warning_messages = []
try:
errors, fatal_errors, warning_messages = self._collect_errors(log_file_path)
if warning_messages:
error_msg += "Warnings in log:\n" + "".join(warning_messages)
if errors:
if error_msg:
error_msg += "\n"
error_msg += "Errors in log:\n" + "".join(errors)
if fatal_errors:
if error_msg:
error_msg += "\n"
error_msg += "Fatal errors in log:\n" + "".join(fatal_errors)
# the log file can be missing
# (khiops did not get the chance to start)
except:
pass
Comment on lines +597 to +613 (Member):
If we are handling the specific error of the log file missing, then I'd rewrite this as

# If the log file exists: Collect the error and warning messages
if fs.exists(log_file_path):
    errors, fatal_errors, warning_messages = self._collect_errors(log_file_path)
    # the rest ...
# Otherwise warn that the log file is missing
else:
    warnings.warn(
        f"Log file not found after {tool_name} execution. "
        f"Path: {log_file_path}"
    )
    errors = fatal_errors = warning_messages = []
# Add stdout to the warning message if non empty
if stdout:
5 changes: 4 additions & 1 deletion packaging/conda/meta.yaml
@@ -30,8 +30,11 @@ requirements:
- pandas >=0.25.3
- scikit-learn >=0.22.2
run_constrained:
- boto3 >=1.17.39
# do not necessarily use the latest version
# to avoid undesired breaking changes
- boto3 >=1.17.39,<=1.35.69
- google-cloud-storage >=1.37.0
- pyopenssl>=24.0.0
Comment (Member):
Why do we need pyopenssl?

Comment (Collaborator @popescu-v, Dec 20, 2024):
Do we need to list pyopenssl to impose a specific lower bound on the version?
outputs:
- name: {{ metadata.get('name') }}
42 changes: 38 additions & 4 deletions packaging/docker/khiopspydev/Dockerfile.ubuntu
@@ -10,14 +10,14 @@ ARG KHIOPS_REVISION
RUN true \
# Install git (for khiops-python version calculation) and pip \
&& apt-get -y update \
&& apt-get -y --no-install-recommends install git python3-pip zip pandoc wget \
&& apt-get -y --no-install-recommends install git python3-pip zip pandoc wget ruby-dev \
# Get Linux distribution codename \
&& if [ -f /etc/os-release ]; then . /etc/os-release; fi \
# Obtain the Khiops native package \
&& KHIOPS_PKG_FILE=$KHIOPS_REVISION/khiops-core-openmpi_$KHIOPS_REVISION-1-$VERSION_CODENAME.amd64.deb \
&& wget -O KHIOPS_CORE.deb "https://github.com/KhiopsML/khiops/releases/download/${KHIOPS_PKG_FILE}" \
# Install the Khiops native package \
&& dpkg -i --force-all KHIOPS_CORE.deb \
# Install the Khiops native package : do not break immediately if dpkg fails because apt will install the missing dependencies \
Comment (Member):
I'd reword "do not break ..." -> "Make it always succeed. If dpkg fails it is due to missing dependencies, which will be installed by apt in the next line."

Make two comment lines if necessary.
&& (dpkg -i --force-all KHIOPS_CORE.deb || true) \
&& apt-get -f -y install \
&& rm -f KHIOPS_CORE.deb \
# Set python to python3 \
@@ -54,9 +54,43 @@ RUN true \

RUN mkdir -p /scripts
COPY ./run_service.sh /scripts/run_service.sh
RUN chmod +x /scripts/run_service.sh && \
COPY ./run_fake_remote_file_servers.sh /scripts/run_fake_remote_file_servers.sh
RUN chmod +x /scripts/run_service.sh /scripts/run_fake_remote_file_servers.sh && \
useradd -rm -d /home/ubuntu -s /bin/bash -g root -u 1000 ubuntu

# new gcs & s3 drivers (written in C++)
ARG KHIOPS_GCS_DRIVER_REVISION
ARG KHIOPS_S3_DRIVER_REVISION
RUN true \
&& wget -O khiops-gcs.deb https://github.com/KhiopsML/khiopsdriver-gcs/releases/download/${KHIOPS_GCS_DRIVER_REVISION}/khiops-gcs_${KHIOPS_GCS_DRIVER_REVISION}.deb \
&& wget -O khiops-s3.deb https://github.com/KhiopsML/khiopsdriver-s3/releases/download/${KHIOPS_S3_DRIVER_REVISION}/khiops-s3_${KHIOPS_S3_DRIVER_REVISION}.deb \
&& (dpkg -i --force-all khiops-gcs.deb khiops-s3.deb || true) \
&& apt-get -f -y install \
&& rm -f khiops-gcs.deb khiops-s3.deb \
&& true

FROM ghcr.io/khiopsml/khiops-server:${SERVER_REVISION} AS server
FROM fsouza/fake-gcs-server:1.50 AS gcs-server

FROM khiopsdev AS base
COPY --from=server /service /usr/bin/service

# GCS fake file server (only in the ubuntu container)
COPY --from=gcs-server /bin/fake-gcs-server /bin/fake-gcs-server

# Port on which gcs-server is listening
EXPOSE 4443

# S3 fake file server (only in the ubuntu container)
# Do not use the latest fakes3 version, because starting from 1.3 a licence is required.
# If fakes3 is no longer compatible, think about switching to an alternative, fully compatible server
# (https://github.com/jamhall/s3rver v3.7.1, for example, is not yet fully compatible).
RUN gem install fakes3:1.2.1 sorted_set
# Avoid resolving a fake s3-bucket.localhost hostname
# Alternate builders (buildx via moby buildkit) mount /etc/hosts read-only, the following command will fail
# echo "127.0.0.1 s3-bucket.localhost" >> /etc/hosts
# You will have to add the `add-hosts` input instead (https://github.com/docker/build-push-action/#inputs)
Comment (Collaborator):
Why can't we do this directly here? This Dockerfile is only useful in the CI context anyway.

# Port on which fakes3 is listening
EXPOSE 4569
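
For reference, a hedged sketch of an equivalent local build; the --add-host flag stands in for the read-only /etc/hosts discussed above, and the remaining build args (KHIOPS_REVISION, KHIOPSDEV_OS, SERVER_REVISION, PYTHON_VERSIONS) are omitted for brevity:

docker buildx build \
  --add-host s3-bucket.localhost:127.0.0.1 \
  --build-arg KHIOPS_GCS_DRIVER_REVISION=0.0.10 \
  --build-arg KHIOPS_S3_DRIVER_REVISION=0.0.12 \
  -f packaging/docker/khiopspydev/Dockerfile.ubuntu \
  packaging/docker/khiopspydev/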

25 changes: 25 additions & 0 deletions packaging/docker/khiopspydev/run_fake_remote_file_servers.sh
@@ -0,0 +1,25 @@
#!/bin/bash

ROOT_FOLDER=${1:-.} # defaults to current folder

# File server for GCS (runs in background)
# WARNING: there are 3 major features activated by the options...
# -data: exposes pre-provisioned files (feature not currently used); the direct child folder will be the bucket name
# -filesystem-root: lets us upload and read new files remotely at the same location as the source
# -public-host: must expose localhost (https://github.com/fsouza/fake-gcs-server/issues/201)
echo "Launching fake-gcs-server in background..."
nohup /bin/fake-gcs-server \
-data "${ROOT_FOLDER}"/tests/resources/remote-access \
-filesystem-root "${ROOT_FOLDER}"/tests/resources/remote-access \
-scheme http \
-public-host localhost > /dev/null < /dev/null 2>&1 & # needs to redirect all the 3 fds to free the TTY

# File server for S3 (runs in background)
# WARNING:
# -r: exposes pre-provisioned files (feature not currently used); the direct child folder will be the bucket name
# these files were uploaded once, because fake-s3 creates metadata
echo "Launching fakes3 in background..."
PORT_NUMBER=${AWS_ENDPOINT_URL##*:}
nohup /usr/local/bin/fakes3 \
-r "${ROOT_FOLDER}"/tests/resources/remote-access \
-p "${PORT_NUMBER}" > /dev/null < /dev/null 2>&1 & # needs to redirect all the 3 fds to free the TTY
1 change: 1 addition & 0 deletions run_fake_remote_file_servers.sh
7 changes: 6 additions & 1 deletion setup.py
@@ -49,7 +49,12 @@
],
cmdclass=versioneer.get_cmdclass(),
extras_require={
"s3": ["boto3>=1.17.39"],
"s3": [
# do not necessarily use the latest version
# to avoid undesired breaking changes
"boto3>=1.17.39,<=1.35.69"
],
"gcs": ["google-cloud-storage>=1.37.0"],
"pyopenssl": ["pyopenssl>=24.0.0"],
Comment (Member):
Why do we need pyopenssl?

Comment (Collaborator):
Is it in order to impose a minimal version of the pyopenssl package (a dependency of the remote API packages)?

Comment (Member):
Then the dependency should impose its restriction on this package. Unless there are serious problems, we should remove this.

Comment (Collaborator):
> Then the dependency should impose its restriction on this package. Unless there are serious problems, we should remove this.

@tramora: Is there any specific version pinning issue with this package?
},
)
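
With these extras defined, the optional drivers' Python dependencies can be pulled in at install time; a hedged example, assuming the khiops distribution name:

pip install "khiops[s3,gcs]"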