I asked ChatGPT about primes ending in 2 to make it prove a point and it proved the point far better than I could have hoped for.

Please do not be a fool who trusts ChatGPT with anything outside your field of expertise, and even then double or triple check what it tells you if you must use it.

@alinanorakari Anything about digits or letters is super hard for ChatGPT. It sees our messages, and all the data it was trained on, translated into a different (huge) alphabet. Its alphabet would write 74815 as just two tokens: one for 748 and one for 15. That's useful when "sleep" is one token and "ing" is another, but it's terrible for numbers. Models trained with per-digit tokenization do better on arithmetic. (https://arxiv.org/abs/2305.14201)
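To make the tokenization point concrete, here's a toy sketch (not OpenAI's real tokenizer, just a greedy longest-match lookup over a made-up vocabulary) showing how "74815" can end up as two arbitrary chunks while per-digit tokenization keeps every digit visible:

```python
# Toy illustration, NOT the real BPE tokenizer: a vocabulary that
# happens to contain the chunks "748" and "15" but not "74815".
def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest substring starting at i that is in the vocab
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

chunk_vocab = {"748", "15", "sleep", "ing"}
print(toy_tokenize("74815", chunk_vocab))     # ['748', '15']
print(toy_tokenize("sleeping", chunk_vocab))  # ['sleep', 'ing']

# Per-digit tokenization keeps place value visible to the model.
digit_vocab = set("0123456789")
print(toy_tokenize("74815", digit_vocab))     # ['7', '4', '8', '1', '5']
```

The model never sees the digits inside "748"; it sees one opaque symbol, which is why carrying and place value are so hard for it.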

Anyway, I don't mean to make excuses for ChatGPT. 😅

Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks

We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat can achieve near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning only, which is almost impossible with previous pretrained language models, such as Bloom, OPT, GPT-NeoX, etc. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks like large-number multiplication and division, we propose an approach that classifies tasks based on their learnability, and subsequently decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by leveraging basic arithmetic principles. We thoroughly examine the performance of our model, offering a comprehensive evaluation of the effectiveness of our proposed decomposition steps. Additionally, Goat-7B can be easily trained using LoRA on a 24GB VRAM GPU, facilitating reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.
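The decomposition idea the abstract mentions is essentially long multiplication: break an "unlearnable" multi-digit multiply into single-digit multiplies plus shifted additions, each of which is an easy step. A hedged sketch of that general idea (not Goat's exact training format):

```python
# Sketch of decomposing multi-digit multiplication into learnable steps:
# one single-digit multiplication per digit of b, shifted by its place.
def decompose_multiply(a, b):
    """Return the partial products of long multiplication a * b."""
    steps = []
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10 ** place  # single-digit multiply + shift
        steps.append(partial)
    return steps

steps = decompose_multiply(1234, 56)
print(steps)       # [7404, 61700]
print(sum(steps))  # 69104, i.e. 1234 * 56
```

Each partial product is a task a model can learn reliably; the hard part is chaining them, which is what the supervised decomposition targets.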

@darabos yeah, sadly I know. I still think it's worrisome that OpenAI publishes example interactions with math-related questions, because I see a lot of people who assume that because it answers like an eloquent human, it _must_ have basic human knowledge, like what an even number is. I'm doing my part by spreading some education to hopefully make folks more cautious.
@alinanorakari @darabos
Contrast this example
https://youtu.be/wHiOKDlA8Ac?t=5m20s
where (at 5:20) it shows ChatGPT can regurgitate the correct answer, with how badly it does when given a maths problem that's too recent to appear in its training data
https://youtu.be/Fi1e-B60cok
OpenAI's GPT-4: A Spark Of Intelligence!
