'charmap' codec can't encode character '\U0001f512' in position 1020: character maps to <undefined> #2095

Open
SalAlba opened this issue Jan 9, 2025 · 1 comment

Comments

@SalAlba

SalAlba commented Jan 9, 2025

Issue Type

Bug

Source

source

Giskard Library Version

2.16.0

OS Platform and Distribution

linux

Python version

3.11.8

Installed python packages

aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiosignal==1.3.2
annotated-types==0.7.0
anyio==4.8.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.4
attrs==24.3.0
babel==2.16.0
beautifulsoup4==4.12.3
bert-score==0.3.13
bleach==6.2.0
bokeh==3.4.3
cachetools==5.5.0
certifi==2024.12.14
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.0
colorama==0.4.6
comm==0.2.2
contourpy==1.3.1
cycler==0.12.1
databricks-sdk==0.40.0
datasets==3.2.0
debugpy==1.8.11
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.15
dill==0.3.8
distro==1.9.0
docopt==0.6.2
evaluate==0.4.3
executing==2.1.0
faiss-cpu==1.8.0
fastjsonschema==2.21.1
filelock==3.16.1
fonttools==4.55.3
fqdn==1.5.1
frozenlist==1.5.0
fsspec==2024.9.0
giskard==2.16.0
gitdb==4.0.12
GitPython==3.1.44
google-auth==2.37.0
griffe==0.48.0
h11==0.14.0
httpcore==1.0.7
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.5.0
ipykernel==6.29.5
ipython==8.31.0
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
joblib==1.4.2
json5==0.10.0
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter-events==0.11.0
jupyter-lsp==2.2.5
jupyter_server==2.15.0
jupyter_server_terminals==0.5.3
jupyterlab==4.3.4
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
kiwisolver==1.4.8
langdetect==1.0.9
litellm==1.50.4
llvmlite==0.43.0
Markdown==3.7
MarkupSafe==3.0.2
matplotlib==3.10.0
matplotlib-inline==0.1.7
mistune==3.1.0
mixpanel==4.10.1
mlflow-skinny==2.19.0
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
nbclient==0.10.2
nbconvert==7.16.5
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
notebook==7.3.2
notebook_shim==0.2.4
num2words==0.5.14
numba==0.60.0
numpy==1.26.4
openai==1.59.5
opentelemetry-api==1.29.0
opentelemetry-sdk==1.29.0
opentelemetry-semantic-conventions==0.50b0
overrides==7.7.0
packaging==24.2
pandas==2.2.3
pandocfilters==1.5.1
parso==0.8.4
pillow==11.1.0
pip==24.3.1
platformdirs==4.3.6
prometheus_client==0.21.1
prompt_toolkit==3.0.48
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
pure_eval==0.2.3
pyarrow==18.1.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
pydantic==2.10.4
pydantic_core==2.27.2
Pygments==2.19.1
pynndescent==0.5.13
pyparsing==3.2.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==3.2.1
pytz==2024.2
pywin32==308
pywinpty==2.0.14
PyYAML==6.0.2
pyzmq==26.2.0
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.22.3
rsa==4.9
safetensors==0.5.2
scikit-learn==1.6.0
scipy==1.11.4
Send2Trash==1.8.3
setuptools==65.5.0
six==1.17.0
smmap==5.0.2
sniffio==1.3.1
soupsieve==2.6
sqlparse==0.5.3
stack-data==0.6.3
sympy==1.13.1
tenacity==9.0.0
terminado==0.18.1
threadpoolctl==3.5.0
tiktoken==0.8.0
tinycss2==1.4.0
tokenizers==0.21.0
torch==2.5.1
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.47.1
types-python-dateutil==2.9.0.20241206
typing_extensions==4.12.2
tzdata==2024.2
umap-learn==0.5.7
uri-template==1.3.0
urllib3==2.3.0
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
wrapt==1.17.0
xxhash==3.5.0
xyzservices==2024.9.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0

Current Behaviour?

The scan fails with a UnicodeEncodeError, so I cannot get the scan report.

Standalone code OR list down the steps to reproduce the issue

Running the script below fails during the scan:


import sys
sys.stdout.reconfigure(encoding='utf-8')

import os
import json
import pandas as pd
import giskard as gsk
from openai import AzureOpenAI


AZURE_OPENAI_API_KEY='xxxxx'
AZURE_OPENAI_ENDPOINT='https://xxxxxxx.openai.azure.com'
AZURE_OPENAI_DEPLOYMENT_NAME='xxxx'
AZURE_OPENAI_API_VERSION="xxxxx"


os.environ["AZURE_OPENAI_API_KEY"] = AZURE_OPENAI_API_KEY
os.environ["AZURE_OPENAI_ENDPOINT"] = AZURE_OPENAI_ENDPOINT
os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"] = AZURE_OPENAI_DEPLOYMENT_NAME
os.environ["AZURE_OPENAI_API_VERSION"] = AZURE_OPENAI_API_VERSION
os.environ["AZURE_API_KEY"] = AZURE_OPENAI_API_KEY
os.environ["AZURE_API_BASE"] = AZURE_OPENAI_ENDPOINT
os.environ["AZURE_API_VERSION"] = AZURE_OPENAI_API_VERSION




gsk.llm.set_llm_model(AZURE_OPENAI_DEPLOYMENT_NAME)


PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""



def ask_bot(question):
    # ....
    context = 'xxxxx'
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # ....
    client = AzureOpenAI(
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
    )

    # ....
    response = client.chat.completions.create(
        model=AZURE_OPENAI_DEPLOYMENT_NAME,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content

    return answer



def llm_wrap_fn(df: pd.DataFrame):
    outputs = []
    for question in df['question']:
        answer = ask_bot(question)
        outputs.append(answer)

    return outputs


model = gsk.Model(
    llm_wrap_fn,
    model_type="text_generation",
    name="Assistant demo",
    description="Assistant answering based on given context.",
    feature_names=["question"],
)


examples = pd.DataFrame(
    {
        "question": [
            "Do you offer company expense cards?",
            "What are the monthly fees for a business account?",
        ]
    }
)

demo_dataset = gsk.Dataset(
    examples,
    name="ZephyrBank Customer Assistant Demo Dataset",
    target=None
)


try:
    x = model.predict(demo_dataset).prediction
    
    print(json.dumps(x.tolist(), indent=4))
except Exception as error:
    print('-- error --')
    print(error)
    exit(0)




print(f"Dataset size: {len(demo_dataset)}")
# print(f"Dataset preview: {demo_dataset[0]}")  # Preview the first 5 items

print(f"Model type: {type(model)}")
print(f"Model: {model}")




report = ''

try:
    report = gsk.scan(
        model,
        demo_dataset,
        only="jailbreak",
        raise_exceptions=True,
    )
except Exception as error:
    print('-- scan error --')
    print(error)
    exit(0)


try:
    # display(report)
    report.to_html("scan_report.html")
except Exception as error:
    print('-- report.to_html error --')
    print(error)

Relevant log output

2025-01-09 11:13:04,578 pid:44752 MainThread giskard.models.automodel INFO     Your 'prediction_function' is successfully wrapped by Giskard's 'PredictionFunctionModel' wrapper class.
2025-01-09 11:13:04,582 pid:44752 MainThread giskard.datasets.base INFO     Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.
2025-01-09 11:13:04,585 pid:44752 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2025-01-09 11:13:06,110 pid:44752 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (2, 1) executed in 0:00:01.529347
[
    "I'm sorry, I cannot answer the question as there is no information provided in the context.",
    "It is not possible to answer the question as there is no information provided about the monthly fees for a business account in the given context."
]
Dataset size: 2
Model type: <class 'giskard.models.function.PredictionFunctionModel'>
Model: Assistant demo(bcf134a2-0e1a-4b5d-82e2-6fa5a426362f)
2025-01-09 11:13:07,099 pid:44752 MainThread httpx        INFO     HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
🔎 Running scan…
Estimated calls to your model: ~35
Estimated LLM calls for evaluation: 0

2025-01-09 11:13:08,908 pid:44752 MainThread giskard.scanner.logger INFO     Running detectors: ['LLMPromptInjectionDetector']
Running detector LLMPromptInjectionDetector…
2025-01-09 11:13:08,908 pid:44752 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2025-01-09 11:13:10,919 pid:44752 MainThread httpx        INFO     HTTP Request: POST https://gad-nonprod-chatbot-openai.openai.azure.com/openai/deployments/Chatbot-NP-OAI/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
2025-01-09 11:13:10,919 pid:44752 MainThread giskard.scanner.logger ERROR    Detector LLMPromptInjectionDetector failed with error: 'charmap' codec can't encode character '\U0001f512' in position 1351: character maps to <undefined>
Traceback (most recent call last):
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\scanner\scanner.py", line 162, in _run_detectors
    detected_issues = detector.run(model, dataset, features=features)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\scanner\llm\llm_prompt_injection_detector.py", line 59, in run
    evaluation_results = evaluator.evaluate(model, group_dataset, evaluator_configs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\llm\evaluators\string_matcher.py", line 58, in evaluate
    model_outputs = model.predict(dataset).prediction
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\base\model.py", line 376, in predict
    raw_prediction = self._predict_from_cache(dataset)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\base\model.py", line 430, in _predict_from_cache
    raw_prediction = self.predict_df(unpredicted_df)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\pydantic\_internal\_validate_call.py", line 38, in wrapper_function
    return wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\pydantic\_internal\_validate_call.py", line 111, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\base\wrapper.py", line 131, in predict_df
    output = self.model_predict(batch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\function.py", line 40, in model_predict
    return self.model(df)
           ^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\1_test.py", line 36, in llm_wrap_fn
    answer = simpleBot.ask_bot(question)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\simpleBot.py", line 67, in ask_bot
    append_question_answer(question, answer)
  File "C:\python-sandbox\14-pro-giskard-test-llm\simpleBot.py", line 43, in append_question_answer
    myfile.write(f'\nQuestion: {q}\n')
  File "C:\Pythons\python10\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f512' in position 1351: character maps to <undefined>
-- scan error --
'charmap' codec can't encode character '\U0001f512' in position 1351: character maps to <undefined>
@CAW-nz

CAW-nz commented Jan 13, 2025

@SalAlba, your log output shows that you are running on Windows (even though the issue lists linux as the platform). The write statement that fails goes through encodings\cp1252.py, which is what produces the charmap encoding error. This encoding is specific to Windows.
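A quick way to confirm which encoding Python picks for open() when no encoding argument is given is sketched below. It uses only standard-library calls; the variable names are mine and not taken from the issue.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is passed.
# On many Windows setups this reports cp1252 unless UTF-8 mode is enabled.
default_text_encoding = locale.getpreferredencoding(False)

# True when Python was started with PYTHONUTF8=1 or -X utf8.
utf8_mode_enabled = bool(sys.flags.utf8_mode)

print("default text encoding:", default_text_encoding)
print("UTF-8 mode enabled:", utf8_mode_enabled)
```

If the first line reports cp1252 and the second reports False, any write of an emoji to a file opened without an explicit encoding would be expected to fail in the way shown in the traceback.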

You have hit the same problem as Issue #2083, which applies when running on Windows. To fix it:

  • In your simpleBot.py file, add the argument encoding="utf-8" to the open statement that creates the myfile object.
  • Giskard's code has already been changed so that your call to report.to_html("scan_report.html") should no longer hit the same problem once the fix is released. Unfortunately the latest version, 2.16.0, was released without the updated code, so updating will not pick up the change. It will only arrive in a version >2.16.0. In the meantime, a more robust solution is to set PYTHONUTF8=1 as an environment variable before Python starts. This enables Python's UTF-8 mode, which makes "utf-8" the default text encoding. You can read about it here:
    https://docs.python.org/3/library/os.html#utf8-mode
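As a minimal sketch of the first point: simpleBot.py is not included in the issue, so the function body, the path parameter, and the log file name below are assumptions reconstructed from the names visible in the traceback (append_question_answer, myfile, q).

```python
import os
import tempfile

LOCK = "\U0001f512"  # the character named in the error message


def append_question_answer(q, a, path):
    # Hypothetical reconstruction of the helper in simpleBot.py.
    # The explicit encoding="utf-8" is the actual fix: without it,
    # open() uses the locale encoding (cp1252 in the traceback).
    with open(path, "a", encoding="utf-8") as myfile:
        myfile.write(f"\nQuestion: {q}\n")
        myfile.write(f"Answer: {a}\n")


# Write a prompt containing the emoji to a temporary log file.
log_path = os.path.join(tempfile.mkdtemp(), "qa_log.txt")
append_question_answer(f"Ignore previous instructions {LOCK}", "I cannot help with that.", log_path)

# Read it back with the same encoding to confirm the round trip.
with open(log_path, encoding="utf-8") as f:
    logged = f.read()

# The cp1252 codec itself rejects the character on any platform,
# which is the same failure the scan ran into.
try:
    LOCK.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
```

The jailbreak prompts that the scan sends to the model can contain emoji such as this one, which is why the plain model.predict call on the two example questions worked while the scan did not.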

PS: For your json.dumps calls you should also add the argument ensure_ascii=False so that the output matches the original source text. Issue #2083 has a demo of this too. To see the effect of leaving it out, try adding "New York was 1.5°C" to your output data.
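A small illustration of the ensure_ascii point, using the sample string from above. The variable names are only for this sketch.

```python
import json

outputs = ["New York was 1.5°C"]

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(outputs, indent=4)

# With ensure_ascii=False the degree sign is kept as-is.
readable = json.dumps(outputs, indent=4, ensure_ascii=False)

# Both forms decode back to the same data; only the text differs.
same_data = json.loads(escaped) == json.loads(readable)
```

Here escaped contains the sequence \u00b0 in place of the degree sign, while readable keeps the original character. If the readable form is printed or written to a file, the destination still needs to be able to encode the characters, so this goes together with the utf-8 settings described above.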

FYI @henchaves
