The fuss about AI

With the jury still out on whether artificial intelligence (AI) has been a net gain or loss for humanity, the technology's rapid development and adoption have been accompanied by a host of debates.

In the EU, for instance, companies are running into regulatory roadblocks over their non-compliance with data privacy regulations in the region.

Meta announced back in June that it would be incorporating European users’ social media posts into its AI training data for the Llama AI model. With Llama being a multimodal AI, this would include everything from text and images to video and audio.

Soon after, the Irish Data Protection Commission (DPC) requested that the social media giant refrain from utilising the region’s social media content to train large language models, prompting Meta to delay its AI launch in the EU, a move welcomed by other European regulators.

Users in the UK and EU have the “right to object”, which effectively opts them out of having their data used by Meta in training its AI model. However, the process has been described as tedious and awkward.

Meanwhile, those in other regions without data protection regulations as stringent as the GDPR do not have this option, according to a report from online tech publication Mashable.

X (formerly Twitter) made a similar move to utilise user posts in training its Grok AI model at the end of July, prompting the DPC and the UK’s Information Commissioner’s Office to question the platform over the data harvesting.

The Elon Musk-owned platform opted users in by default to allow AI training on their data, which goes against the UK’s General Data Protection Regulation (UK GDPR), according to a report from The Guardian.

Under the UK GDPR, companies are not allowed to implement consent by default, which is what X’s settings page does.

Mired in controversy

Reports on the contents of training data fed into AI models have also been extensive, covering allegations of stolen and illicit content.

Back in December last year, researchers from the Stanford Internet Observatory found that the LAION-5B dataset contained 1,679 illegal images consisting of child sexual abuse material (CSAM) scraped from social media posts and popular adult websites. This dataset was used to train the popular AI image generator Stable Diffusion.

Where the content in AI training datasets comes from has also come under scrutiny. Nonprofit EleutherAI compiled a massive dataset named “The Pile”, which was reported in July to contain the captions of over 170,000 YouTube videos.

These captions were taken without permission and allegedly used by major companies like Anthropic, Nvidia, and Salesforce.

Initial reports had listed Apple as having used this data in AI training, but the company has since denied such claims.

Content from influencers such as MrBeast, PewDiePie, Marques Brownlee and Jacksepticeye was included in the dataset, alongside content from various talk shows and news outlets such as The Wall Street Journal, NPR, and the BBC.

Brownlee’s videos were also allegedly used without consent as training data for an AI video generation tool developed by Runway, according to reports in late July. Reports say that 1,709 of his videos were used for AI training.

Claims of outright copyright violation have also been the subject of legal battles in the AI space, with a coalition of music companies including Universal Music Group, Sony Music, and Warner Records alleging that AI music generation companies Udio and Suno infringed on their copyright.

The lawsuit claims that the AI firms trained their music-generation models using copyrighted materials owned by the music labels, and seeks US$150,000 (RM674,805) in damages per song used in training.

The New York Times launched a similar lawsuit over journalistic content against ChatGPT-maker OpenAI and its backer Microsoft last December; the case is ongoing.

The New York Times alleges that its copyrighted content was unlawfully used to develop artificial intelligence products, which it says “threatens the Times’ ability to provide that service”.

Both OpenAI and Microsoft have shot back that their products do not serve as a substitute for the reporting offered by the publication, and have asked the court to dismiss the lawsuit.

Other news providers such as the Associated Press, Axel Springer, FT Group, News Corp, and Vox Media have reached licensing deals with OpenAI, allowing their content to be used in the training of the company’s large language models (LLMs).

More direct claims of plagiarism came from magazines Forbes and Wired in June, with accusations that AI search and chatbot startup Perplexity had been scraping content from the magazines’ respective websites.

Forbes claimed that Perplexity had utilised AI to generate an article, podcast, and YouTube video based on a story that the publication put up on its website. This was done without permission from, or credit to, Forbes for the original report.

The publication further alleged that it had found other plagiarised stories republished by Perplexity, with uncited information sourced from Bloomberg and CNBC.

Wired, on the other hand, claimed that Perplexity had been ignoring “robots.txt”, a file used under the Robots Exclusion Protocol to tell web scrapers and crawlers which pages they should not access, even though the AI company said it honoured the file.
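
For readers unfamiliar with how the file works, the short Python sketch below shows the kind of check a well-behaved crawler is expected to make before requesting a page. It is illustrative only: the robots.txt contents, the crawler names “ExampleBot” and “OtherBot”, and the example.com addresses are all made up for this example and do not come from Wired or Perplexity. Compliance is voluntary, so the file only works if the crawler chooses to perform a check like this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler from the whole site
# and keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this crawler to request the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The named crawler is barred from every page.
print(may_fetch(parser, "ExampleBot", "https://example.com/story"))        # False
# Other crawlers fall under the wildcard rules.
print(may_fetch(parser, "OtherBot", "https://example.com/story"))          # True
print(may_fetch(parser, "OtherBot", "https://example.com/private/draft"))  # False
```

Nothing in the protocol stops a crawler from skipping this step, which is why a publisher can only detect non-compliance after the fact, for example by inspecting its own traffic.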

In its report, the tech magazine said that after it prompted Perplexity about an article published on its site, it monitored network traffic to the website and linked the resulting requests to bots associated with Perplexity.

It further drew a comparison to OpenAI’s ChatGPT and Anthropic’s Claude, which offered a hypothesis about the story in question, but explicitly stated that they did not have access to it.

On July 30, the startup reached an agreement with several outlets including Time, Fortune, and WordPress on a revenue-sharing programme that would pay the publications for articles cited in AI-generated responses.
