ThePhysicist 4 hours ago
This is really cool! I'm trying to find a way to accelerate LLM inference for PII detection, where speed really matters since we want to process millions of log lines per minute. I'm wondering how fast we could get e.g. Llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic, but even 1k would be very useful.
freakynit 3 hours ago | parent
PII redaction is a really good use case. That said, "10k tokens per second would be fantastic" might not be sufficient (even remotely) if you want to "process millions of log lines per minute". Assuming a single log line is just 100 tokens, you need (100 * 2 million / 60) ≈ 3.3 million tokens per second of processing speed :)
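A quick back-of-the-envelope sketch (Python, using the assumed numbers above: 2 million lines per minute and ~100 tokens per line, plus the hoped-for 10k tokens/sec per card from the parent comment) makes the gap concrete:

    # Throughput requirement, with assumed numbers from the thread
    lines_per_minute = 2_000_000
    tokens_per_line = 100            # assumed average; real log lines vary

    tokens_per_second = lines_per_minute * tokens_per_line / 60
    print(f"Required: {tokens_per_second:,.0f} tokens/sec")   # ~3,333,333

    # Compare against a hypothetical single-card rate of 10k tokens/sec
    gpu_tokens_per_second = 10_000
    print(f"Cards needed at that rate: {tokens_per_second / gpu_tokens_per_second:.0f}")  # ~333

So even hitting the optimistic per-card number, you'd need on the order of hundreds of GPUs to keep up with that log volume.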