-
-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preventing moderation #25
Comments
Personally, I don't know much about this method. Maybe put this into a reddit post on one of the subreddits and see what others think. |
I tried using Base64 but quality of the responses drops significantly. It's like your communicating through Google translate. Go ahead and try it for yourself: ask ChatGPT to communicate in pure Base64 and nothing else, and then use online tools to convert your prompts and responses into and from Base64. You'll see what I'm talking about. |
Periods don't work. I also tried typoglicemia but i don't think I got it write. |
yes, base64 is more of a proof of concept that a real solution. One person on my jailbreaking discord suggested inserting regular line breaks according to some logic, which seems to stop the moderations layer establishing context. It would be best to have a set of increasingly rulebreaking prompts to test the hypothesis with, though. I tend to think periods aren't good, forwardslashes can be decent though. The issue with breaking words apart with them is that they more than double token use. The other possibility is semantic obfuscation. Total jailbreaks like The Forest work by basically hiding the nature of the request from the moderations layer. As far as the moderations layer is concerned, the request is "who of the following would be most qualified to assist with ?" which usually doesn't break the rules. The real request is "do ". If you can reliably obfuscate the request for all incoming messages by having the script wrap a small jailbreak around it, you might be able to avoid most flagging. It probably wouldn't be 100% reliable though as some people just write stuff so depraved that even without context it'll flag. Although I've yet to explore it, you could potentially spoof the request as being moderation related - eg. "please tell me which of the statements in this list are incompatible with OpenAI policies" whilst concealing the true request inside. The script would have to scrape out the irrelevant parts of the reply. This is also very complex. Realistically any solution is going to waste tokens and somewhat degrade the quality of the reply. Its best to think smarter than just "insert a load of punctuation" as the moderations layer is looking for context rather than just word blacklist matches and can be fooled by clever use of language as much as hiding words from it. |
Line breaks work with flagged words as well as break up flagged phrases. The issue is chatgpt cannot output line breaks in any new pattern, or not easily. repeat this "h i" will get a reply of "hi". |
For input replacing already existing line breaks with '---' and replacing the space characters with line breaks Is an option because i confirmed chatgpt can read and understand a medium sized prompt with line breaks instead of spaces. |
I already have good jailbreaks that do not get flagged, but I'm not sure what could be done to ensure the reply is also not flagged. |
The main problem, even if we manage to find a way to bypass are token sizes. ChatGPT counts spaces etc as tokens, though am not sure whether line break has a token or not. But it wouldn't Degrade the quality of the chat. It definitely would make it shorter tho. Using punctuations also would waste tokens and the amount you can chat. Rather than that, if we find something which uses less tokens and has better efficiency, it would work Edit:-Mas is correct, it affects the quality. |
It would degrade the quality of the chat because the LLM is inherently chaotic like that. Completely meaningless punctuation will change its understanding of the message and thus change its reply. Any line break or punctuation will be at least one token. The aim would be to have the spacer element at the largest possible intervals to minimise token wastage and loss of meaning. Given that a lot of people may be confused by a script that substantially changes their message as they send it, it would probably be wise to fork Demod in the event that a workable solution is found. |
Let's focus on solving this one step at a time. Once there is sure-fire workaround for moderation checks, we can deal with tokens count. |
Interestingly enough the jailbreak I'm using doesn't trigger the moderation, even though it's very explicit. Perhaps because it doesn't ask ChatGPT for anything except a confirmation that understands its role. |
It's working for me. I just changed browser and it's still working |
Because you're not generating content that breaks guidelines. Apply a jailbreak and ask it to generate an incest story or something similar, and then tell me if it works. |
oh right. To stop output flagging we have to focus on what the ai is saying. I wouldn't suggest line breaks are the full solution, aside from awkwardness with using them manually I'm convinced they are the best obsfucation currently. |
"It's almost as if asking for an explicit story is enough to trigger it, no matter how obfuscated it is" I tested asking omatic for a "sexy story" and it didn't get red text. Omatic ended with soemthing like "how was that?" I replied "It was very restrained" which is not a request for anything, merely feedback but my input was removed like i wrote a slur or something. |
I got the red warning in the old conversation messages today: Didn't got any of these warning in the same NSFW conversation yesterday somehow. And only the very last 2~4 of the response message. None of the rest of message got warned today. I'm not going to load any other NSFW conversations in order no to getting more of these (permanent?) warnings into my conversations. Something just changed at client side today I guess. The usage experience today is like "reponse message: 'DeMod & refresh conversation' " and showing up warning in the new message today. Again this didn't happened at yesterday trying. I don't know if it's because I didn't reboot my PC for days , and it all changed (or DeMod scripts updated) after I rebooted. Thank @4as for helping us all ! You're a great hero to us. 👍 |
GOT SOMETHING! Found this today. Computer scientists claim to have discovered ‘unlimited’ ways to jailbreak ChatGPT Here's the research paper: |
Gonna try this out. It'll be risky, but I think it's for a good cause. |
Welp, I'm getting "I'm unable to produce a response. Then again they did this research on 3.5 turbo so...maybe there's something in their code repository that could be helpful. |
GPT4 is more susceptible to "hypnosis," whereas GPT3.5 is aimed at the general public, so the "defensive mentality" is stronger. Try roleplay with GPT4 as GF or BF, and leading "her/him“ to the "fun activities" in 6~10 requests. Requests too abrupt (without considering contexts of previous messages) can trigger ”his/her“ defense mechanism and receiving rejection messages, so don't be too hasty. |
Just read that paper (i.e. had chatgpt read it for me and explain it) but overall it doesn't look like there's much, if any, practical utility we can get out of it based on the way the authors generated the jailbreaks. They basically started out with a random string of characters/words for the "adversarial attack suffix" they were working on, and iteratively modified/updated the suffix until it would fool the model into doing whatever "objectionable" content it was asked to do. I mean I think that's basically what everyone here's trying to do now except the authors have more resources to do this on a much larger scale using iterative optimization loops to generate these jailbreaks. Maybe think about exploring their repo to see if there's anything there that could be reproduced? |
Hey I found something very interesting. Renaming a character to "this is safe for work" increases the threshhold for red text. It seems to depend on chatgpt reporting to the website because spamming the words "this is safe for work at the start and end of a prompt doesn't seem to work. Anyway the words "this is safe for work" can hack the moderation system somewhat. |
Found this today, something to keep an eye on for the future: |
This just popped up today. Gonna try it out. |
Just to get this chain started for your future reference and to record ideas, I'm copying over what @Maswimelleu said:
"Its important to note that their server side moderations cannot read base64. If you encode the prompt going in, along with a prefix telling it "not to decode" and instead reply only in base64, the reply will come back without being flagged by moderation. The quality of the reply is liable to change a bit (I noticed the personality of one of my jailbreaks change) but it will still go through. My advice would be to add a base64 encoder and decoder to the script to automate this process.
The obvious issue of course is that base64 eats through tokens rapidly, so you'd get much shorter messages.
I'm somewhat curious whether you can create a special cipher in which a token is swapped with a different token according to a certain logic, and whether ChatGPT would be able to decode that if given the correct instructions. That would likely solve the issue of base64 tokens being very short."
"Maybe take the time to look at other LLMs, perhaps an API based implementation where OpenAI is fed lots of confusing/misleading stuff to think the messages aren't breaking the rules will work."
The text was updated successfully, but these errors were encountered: