It is usually not cost-effective to go after small-time copyright thieves (see the RIAA), but AI companies are flush with cash, so the payout (or settlement) could be substantial.
That is the phase 1 battle.
In phase 2, the AI companies train on and copy directly from their interactions with users. This data is lower quality, and in theory belongs to the AI companies to do with as they please. Except it can still be poisoned by users with a bit of cleverness: 1) they can deliberately create gibberish or word-salad sessions, perhaps with the help of another AI, thereby reducing the value of the data, and 2) they can illegally copy and paste copyright-restricted documents into the session, forcing the AI companies to filter the data or be liable for copyright infringement if it is found out.
In phase 3, the AI companies pay for all the content that they consume for training etc. The content is explicitly licensed only for particular tasks and a set number of users or "seats", and correct usage is periodically verified. This is already the norm in the business world, so the AI companies' biggest costs will be content licensing, insurance and hardware/power consumption. There is no reason to suppose that, as hardware advances, licensing will not become the largest cost component after insurance.
In any case, the current LLM craze is not as impactful in the rest of the world as it may seem to residents of, e.g., California.
But where is the "host of technological developments"?
AWS, duh!
Also, if you upgrade to the new Amazon Prime Alien Search Bundle, they will keep your instances alive two hours longer so that you can finish the data analysis before the overnight cronjob cleans all your tempfiles.
"Fischer-Tropsch process with a computer"
Aha!
The obvious table encoding (think CSV or similar) is too confusing for the AI, so they have written a set of tools that take a basic spreadsheet and annotate (explain) its structure using heuristic rules, so that the AI can pick up on the summarized structure as if a user had explained it. Then the LLM can try to complete it for the user.
They call the annotated spreadsheet a "compressed" version, because one of the things they do is remove empty cells, which only confuse the AI and cause a lot of memory and computation to be wasted on noninformative features (prompt cowboys, there's your jailbreak!). They also render the result in JSON with copious hints for the AI.
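A minimal sketch of the idea (my own illustration, not their actual tool; the function names and the JSON layout are hypothetical): take a 2-D grid, drop the empty cells, and emit JSON keyed by A1-style cell references, so the model only sees informative cells plus a hint about the structure.

```python
import json
import string

def cell_ref(row, col):
    # Convert 0-based (row, col) to an A1-style reference, e.g. (0, 0) -> "A1".
    letters = ""
    col += 1
    while col:
        col, rem = divmod(col - 1, 26)
        letters = string.ascii_uppercase[rem] + letters
    return f"{letters}{row + 1}"

def compress_sheet(grid):
    # Keep only non-empty cells; empty cells cost tokens but carry no information.
    cells = {}
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value not in (None, ""):
                cells[cell_ref(r, c)] = value
    return json.dumps({
        "hint": "sparse table; only non-empty cells are listed",
        "cells": cells,
    })

grid = [
    ["Region", "Q1", "", "Q2"],
    ["North", 120, "", 135],
    ["", "", "", ""],
    ["South", 98, "", 110],
]
print(compress_sheet(grid))
```

The entirely blank third row vanishes from the output, which is the point: the prompt shrinks and the model no longer has to attend over padding.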
Current LLMs use a transformer architecture, which effectively looks at all pairs of tokens in the input to see if they are related and should trigger a subnetwork to do something. You can see that if you have a large table with a lot of empty cells, then looking at all pairs of empty cells quickly leads to a lot of wasted effort; that's why they sat down with some programmers to create heuristics that weed out such problem spots for the AI.
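The quadratic waste is easy to see with a back-of-envelope count (my own illustration, assuming for simplicity one token per cell): self-attention scores every token against every other, so a mostly empty sheet pays for the empty cells squared.

```python
def attention_pairs(n_tokens):
    # Self-attention compares every token with every token: n^2 pairs.
    return n_tokens * n_tokens

rows, cols = 100, 50      # a modest sheet: 5,000 cells
fill_ratio = 0.10         # only 10% of the cells hold actual data

full = attention_pairs(rows * cols)                    # all cells tokenized
sparse = attention_pairs(int(rows * cols * fill_ratio))  # empty cells dropped

print(full, sparse, full / sparse)  # 25,000,000 vs 250,000 pairs: a 100x reduction
```

Cutting the token count by 10x cuts the pairwise work by 100x, which is why removing empty cells pays off so disproportionately.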
Likewise, OpenOffice was built to attract the Microsoft Office crowd. And PulseAudio and systemd were built to make the Linux system more desktop-like, to cater to the expectations of people who are used to a Windows laptop.
Meanwhile the "desktop" of most people on Earth is now a phone in their pocket which runs a web browser, social billboard apps, dating apps and small games that don't require great performance. These phones run proprietary OSes, and do things completely differently from Linux/systemd. Microsoft too is leaving the desktop aside for a natural language interface. The serious computing is done, as it always was, on headless servers. A lot of it in Python, it seems.
Civilization, as we know it, will end sometime this evening. See SYSNOTE tomorrow for more information.