Dropsitenews published a list of websites Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also, Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

  • lazynooblet
    English
    47
    11 days ago

    My instance gets pillaged once a day for 20 minutes by what I think is a scraper for an LLM.

    The scraper grabs every post and profile page, and the load on the server triggers alerts, but the site stays usable.

    I haven’t been able to put a stop to it as the requests come from 1500+ IP addresses, with different user agents.

    • Phoenixz
      23
      11 days ago

      Yeah, they’re scraping alright and it’s all purposefully done in such a way that you can’t stop it, you can’t control it.

      AI companies are criminal as far as I am concerned.

    • @gazby@lemmy.zip
      7
      11 days ago

      Run your access logs through something that will report the ASN for the client IPs. GoAccess would be my recommendation. It requires a GeoIP database, which you can get from MaxMind by signing up for a free API key or download directly from P3TERX/GeoLite.mmdb on GitHub. We have identified a number of bot networks this way. Happy to help further if you’d like a hand 👍
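
For anyone who wants to see the idea behind the ASN report before setting up GoAccess, here is a rough Python sketch of grouping access-log requests by network. It is not how GoAccess works internally. It assumes the log is in common/combined format with the client IP as the first field, and it swaps the real GeoIP database for a tiny hard-coded table. `EXAMPLE_ASN_TABLE`, `lookup_asn`, and `requests_per_asn` are hypothetical names made up for this example, and the prefixes and AS numbers come from ranges reserved for documentation, not from real networks.

```python
import ipaddress
import re
from collections import Counter

# Hypothetical stand-in for a GeoIP ASN database: (network, ASN label) pairs.
# Prefixes and AS numbers are from ranges reserved for documentation.
EXAMPLE_ASN_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "AS64500 ExampleCloud"),
    (ipaddress.ip_network("198.51.100.0/24"), "AS64501 ExampleHosting"),
]

# Assumption: the client IP is the first whitespace-delimited field,
# as in the common/combined access log format.
LOG_IP = re.compile(r"^(\S+) ")


def lookup_asn(ip, table=EXAMPLE_ASN_TABLE):
    """Return the ASN label for an IP string, or "unknown" if no prefix matches."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    for network, label in table:
        if addr in network:
            return label
    return "unknown"


def requests_per_asn(log_lines, lookup=lookup_asn):
    """Count requests per ASN label, skipping lines without a parsable IP."""
    counts = Counter()
    for line in log_lines:
        match = LOG_IP.match(line)
        if not match:
            continue
        try:
            counts[lookup(match.group(1))] += 1
        except ValueError:
            continue  # first field was not an IP address
    return counts


# Made-up log lines for illustration only.
sample = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET /post/1 HTTP/1.1" 200 512 "-" "UA-1"',
    '192.0.2.77 - - [01/Jan/2025:00:00:01 +0000] "GET /u/alice HTTP/1.1" 200 734 "-" "UA-2"',
    '198.51.100.5 - - [01/Jan/2025:00:00:02 +0000] "GET /post/2 HTTP/1.1" 200 498 "-" "UA-3"',
    '203.0.113.9 - - [01/Jan/2025:00:00:03 +0000] "GET /post/3 HTTP/1.1" 200 611 "-" "UA-4"',
]

for asn, hits in requests_per_asn(sample).most_common():
    print(asn, hits)
```

With a real database you would pass your own `lookup` function that queries it. A scraper spread across 1500+ addresses and many user agents can still show up as a handful of networks once requests are counted this way.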