
New UI for Observe #22

GALLLASMILAN commented Jan 2, 2025

Observe UI

Current state

We currently use MLflow as the UI tool for traces. This approach has several limitations, which I describe below.
The motivation is to adopt software that eliminates these limitations.


MLflow limitations

A few things limit us right now.

  • We have to follow the external route format of the MLflow API = when we upload traces to MLflow, we have to use the MLflow API, which does not accept the OpenTelemetry format directly, so we use custom logic to parse it. The API is still experimental, which adds more complexity to keeping Observe up to date with MLflow. (An illustrative sketch of this remapping follows the list.)
  • MLflow is not scalable = you can deploy more pods, but it has no effect on performance, and when we use COS and a database to save data, we hit timeout problems.
  • The MLflow UI is not user friendly = you cannot easily find a specific trace by id, and the UI lacks smart sorting and filtering. The tabs are also confusing, so users cannot easily find the traces page.
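Illustrative only: the kind of per-span remapping hinted at in the first bullet above. The OTLP field names are real OpenTelemetry JSON keys, but the output shape is hypothetical, since MLflow's experimental trace API defines its own request body.

def otel_span_to_mlflow(span: dict) -> dict:
    # Target field names are hypothetical; MLflow expects its own payload,
    # not the raw OTLP span.
    return {
        "name": span["name"],
        "start_time_ns": span["startTimeUnixNano"],
        "end_time_ns": span["endTimeUnixNano"],
        "attributes": {attr["key"]: attr["value"] for attr in span.get("attributes", [])},
    }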

UI solutions

The evaluation system is a good business opportunity

bee-ui included ⛔

This solution would create a dependency between the framework and bee-ui, and it is needlessly complicated.

New UI app ✅

This is the better solution from my point of view. We can create a very simple app that will be part of bee-stack and bee-agent-framework-starter.
The application will have only one dependency: the bee-observer (API server).

!! The scope is important !!

Features

Trace list

A paginated table of traces with basic information about each one (a hypothetical API query sketch follows the list).

  • The table will be sortable.
  • A search input will filter traces by id.
  • There will be an option to show only errored traces.
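A hypothetical sketch of how the new UI could query bee-observer for this list; the base URL, route, and parameter names are assumptions, not a settled API.

import requests

OBSERVE_URL = "http://localhost:4002"  # hypothetical bee-observer base URL

resp = requests.get(
    f"{OBSERVE_URL}/v1/traces",
    params={
        "limit": 50,                    # page size
        "offset": 0,                    # pagination offset
        "sort": "startTime:desc",       # sortable table
        "search": "trace-id-fragment",  # filter by trace id
        "onlyErrors": "true",           # show only errored traces
    },
)
traces = resp.json()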

Trace detail

The detail page shows the dependency tree plus selected data that are important to us for quick debugging.
The selected data for the trace execution:

  • token count
  • execution time

The selected data for each iteration (a sketch of this shape follows the list):

  • raw prompt
  • token count
  • execution time
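A small sketch of the shape this picked data could take on the trace detail page; the dataclasses and field names are illustrative, not a defined schema.

from dataclasses import dataclass, field

@dataclass
class IterationDetail:
    raw_prompt: str           # the raw prompt sent in this iteration
    token_count: int
    execution_time_ms: float

@dataclass
class TraceDetail:
    trace_id: str
    token_count: int          # total tokens for the whole trace execution
    execution_time_ms: float
    iterations: list[IterationDetail] = field(default_factory=list)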

Features v2

Features V3 (Evaluation)

A simple version of evaluation, without datasets and runs.

When Observe accepts a trace and saves it to the database, it enqueues a BullMQ evaluation job.
The Python service accepts the traceId from BullMQ and calls inference to get the evaluation metrics. The service then saves the evaluation metrics to the trace entity (see the worker sketch after the implementation list below).

Implementation:

  • [UI] The evaluation metrics pages => only the simple judge type without the expected answer. (Only a static list; it could be hardcoded for the first evaluation version.)
  • [Observe] = Add a PATCH /v1/traces/${traceId} route that will accept the list of evaluation metrics. TODO: specify the format
  • [Observe] = Create a queue for the evaluation job. This job will be triggered automatically when a trace is created.
  • [Python-eval] = Update the evaluation service to work with BullMQ. @jezekra1 TODO: what inference will we use (try to avoid the bee-api dependency)
  • [Infrastructure] = Add Python-eval to bee-agent-framework-starter and bee-stack
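A minimal sketch of the Python-eval worker described above, assuming the bullmq Python package and a local Redis. The PATCH route comes from the implementation list; the queue name, base URL, payload shape, and run_evaluation helper are hypothetical (the metric format is still a TODO).

import asyncio
import requests
from bullmq import Worker

OBSERVE_URL = "http://localhost:4002"  # hypothetical bee-observer base URL

def run_evaluation(trace_id: str) -> list[dict]:
    # Hypothetical helper: call the chosen inference backend and compute
    # judge-style metrics for the given trace.
    return [{"name": "helpfulness", "value": 0.87}]

async def process(job, job_token):
    # Observe enqueues the job with the traceId when a trace is created.
    trace_id = job.data["traceId"]
    metrics = run_evaluation(trace_id)
    # Write the metrics back to the trace entity; the payload format is still TODO.
    requests.patch(f"{OBSERVE_URL}/v1/traces/{trace_id}", json={"metrics": metrics})

async def main():
    worker = Worker("evaluation", process, {"connection": "redis://localhost:6379"})
    # Keep the worker process alive; shutdown handling omitted for brevity.
    while True:
        await asyncio.sleep(1)

if __name__ == "__main__":
    asyncio.run(main())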

Evaluation (part 2)

  • Datasets in Observe
  • Run entity in Observe
  • Compare functionality

Inspiration and opportunities

Based on this analysis, I summarize what the right way forward would be for us.

  • A well-arranged UI for traces with the important information picked out. = All the tools I analyzed have a partly messy UI for tracing.
  • A good evaluation UI = only Langfuse has good UI integration. There is a big opportunity for us.
  • A good open-source solution = when you want to try tools like AgentOps and Langtrace, you are navigated to the hosted app. That is great for first use, but when you develop locally, an easily runnable Docker image is a big advantage. Our goal could be to simplify the Docker image so that it runs in one command without other dependencies.

Let's simplify Observe so that it can also run without Redis and MongoDB.

  • 👎 A more universal solution - we should stay connected to our framework and not build a universal observability tool. When we support only the defined data format, we can work with the data more efficiently and visualize it in a well-arranged way for the user.

GALLLASMILAN commented Jan 13, 2025

Crew AI Observability

AgentOps.ai

crewAI - agentops-observability => AgentOps.ai => (repo), default UI data, Node SDK. Configured via the AGENTOPS_API_KEY env variable, and the dependency pip install 'crewai[agentops]' must be installed. Then:

import agentops
agentops.init()

cons:

  • DOES NOT HAVE EVALUATION FN
  • the UI is not intuitive
  • The tree detail on the trace detail page (session record) is a joke.

pros:

  • The trace detail page header with picked information is cool.

Langtrace

Agent Monitoring with Langtrace => Langtrace, (repo), Evaluation docs page = but the UI is very naive.

from langtrace_python_sdk import langtrace
langtrace.init(api_key='<LANGTRACE_API_KEY>')

cons:

  • The evaluation and observability are not implemented in the native crewAI stack but in the external Langtrace tool.
  • 👎 The user cannot run the evaluation in the UI directly. They are redirected to the evaluation page with instructions on how to use the command line to create an evaluation.
  • The docs page does not have valid instructions.
  • The trace page is not clean. The default page contains only irrelevant information and the table is very messy. The user needs to click several times to see anything useful.
  • Bugs in UI (e.g. datasets)

pros:

OpenLIT

  • Agent Monitoring with OpenLIT => OpenLIT, evaluation docs = it looks like a good tool for tracing in general, but not for an LLM solution like ours. The trace (request) detail is very simple. The product core is not very useful. It has functions like prompt management and secret management, but they add no extra value for the base telemetry use cases. It looks like an open-source tool made from a custom in-house tool 😄

They have an auto-evaluation task. But it is not implemented yet.

portkey

Does not have evaluation; it provides only some predefined guardrails.

Some base guardrails are implemented in the UI (see docs), but for custom ones they only provide a webhook solution.

This is the only library that is configured in the LLM provider class (a hedged sketch of what that looks like follows).
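To illustrate being set in the LLM provider class: a sketch of pointing an OpenAI-compatible client at the Portkey gateway. The gateway URL and header names are assumptions based on Portkey's docs, not something verified in this setup.

from openai import OpenAI

# Assumed Portkey gateway endpoint and headers; substitute real keys.
client = OpenAI(
    api_key="<PROVIDER_API_KEY>",
    base_url="https://api.portkey.ai/v1",
    default_headers={
        "x-portkey-api-key": "<PORTKEY_API_KEY>",
        "x-portkey-provider": "openai",
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)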

------------------------ bonus -------------------------------------

Langfuse

template => config (join traces and templates)

cons:

  • The default trace table contains a lot of empty columns. They should simplify the trace list and show only the important columns by default.
  • I cannot filter traces by user prompt.

pros:

  • very nice trace detail
  • session for trace grouping

@jezekra1

I created the evaluation-observe integration proposal as a separate issue:

https://github.com/i-am-bee/internal/issues/90
