
Commons talk:Database reports/Self-categorized categories

From Wikimedia Commons, the free media repository

Very useful but only includes direct self-categorization


@MZMcBride: This report is very useful. It is also needed for getting the deepcategory search operator to work properly, which enables a wall-of-images view among many other use cases. However, if I'm not mistaken, it currently only shows directly self-categorized categories, not categories that contain themselves via one of their subcategories. Could this please be changed? Maybe one could scan only the first few category levels at first and gradually increase the depth later on. Prototyperspective (talk) 10:38, 10 October 2024 (UTC)

Walking category trees isn't particularly enjoyable in my experience, but it's probably doable. The exact issue you're mentioning, where a category contains a subcategory that contains the parent category, is why it often becomes a pain. Any script or tool that tries to walk the tree has to be very careful about not ending up in a loop. If you can provide some concrete examples, someone can probably take a look at generating a report. There may also be existing tools that can find these cases. --MZMcBride (talk) 18:10, 10 October 2024 (UTC)
I corrected all such cases once I noticed them, so I don't have an example of a still-existing case, but one could easily produce one for testing purposes. One example is the self-categorization corrected recently here. I only meant subcategories of subcategories containing the category, and I think one may not have to walk the category tree but could implement a scan with a query; maybe such a query is not much different from the one used here. However, if a script is developed instead, it would be easier to avoid getting stuck in a loop by storing all the scanned categories in some array and checking whether each category title is already in that array, going through the branches one layer at a time. If there already is an existing tool or script, please let me know or link it at the page. Prototyperspective (talk) 18:40, 10 October 2024 (UTC)
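For illustration, the layer-by-layer scan with a list of already-scanned categories could look roughly like the sketch below. It is not an existing tool: `find_cycle_through`, `get_subcategories` and `max_depth` are hypothetical names, and the subcategory lookup is passed in as a plain function so that a database query or API call could be substituted.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def find_cycle_through(
    start: str,
    get_subcategories: Callable[[str], Iterable[str]],
    max_depth: int = 3,
) -> Optional[List[str]]:
    """Return a path start -> ... -> start found within max_depth levels, else None."""
    seen = {start}                      # every category already scanned or queued
    queue = deque([(start, [start])])   # (category, path from start to it)
    while queue:
        category, path = queue.popleft()
        if len(path) > max_depth:
            continue                    # do not look deeper than max_depth levels
        for sub in get_subcategories(category):
            if sub == start:
                return path + [start]   # the start category contains itself indirectly
            if sub not in seen:         # skipping seen categories prevents endless loops
                seen.add(sub)
                queue.append((sub, path + [sub]))
    return None


# Toy data standing in for real category links (hypothetical titles).
toy_links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["D"], "E": ["F"], "F": []}
lookup = lambda title: toy_links.get(title, [])
print(find_cycle_through("A", lookup))
print(find_cycle_through("E", lookup))
```

Because the queue is processed first-in, first-out, the scan goes one layer at a time, so raising `max_depth` later matches the idea of starting with a few levels and gradually increasing.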
On enwiki SD0001 runs a script to populate w:User:SD0001/Category cycles. They could probably be convinced to do the same thing here. * Pppery * it has begun... 19:44, 15 October 2024 (UTC)
That's exactly what's missing here, thanks a lot for this info! The link doesn't work – it's User:SDZeroBot/Category cycles. @SD0001: Could you adjust the script for Commons and have it create a report here? It would solve this problem and may also make deepcategory work more often until phab:T376440 gets solved. Prototyperspective (talk) 22:26, 15 October 2024 (UTC)
Sorry. * Pppery * it has begun... 04:25, 16 October 2024 (UTC)
Under what root page should I put it? Commons:Database reports/Category cycles? SD0001 (talk) 17:50, 17 October 2024 (UTC)
Oh boy, Commons has more than 16.7 million categories while enwiki has "only" 2.4 million. No wonder my script is hiccuping. Let me see if I can get it to work. SD0001 (talk) 18:09, 17 October 2024 (UTC)
Yes, that page would be good – alternatively, something like Commons:Report self-categorized categories. Maybe there is now some issue with the API, so that the script needs some sleep time between requests or fewer parallel queries. Prototyperspective (talk) 21:20, 17 October 2024 (UTC)
✓ Done There you are: Commons:Database reports/Category cycles. Due to the sheer volume of data involved, even the initial step of fetching the list of all subcategory associations from the database was failing both on Toolforge bastions and on the Toolforge Kubernetes cluster. I then spun up a 16-core, 32 GB RAM instance on Cloud VPS, and there it all worked out! SD0001 (talk) 20:48, 20 October 2024 (UTC)
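As a rough illustration of what can be done once the full list of subcategory associations is in memory, the sketch below runs a depth-first search over (parent, child) pairs and records a cycle whenever it meets a category that is still on the current path. This is not SD0001's script, whose internals are not described in this thread; `find_cycles` and the toy edge list are hypothetical. The search is iterative rather than recursive because deep category chains would otherwise exceed Python's recursion limit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_cycles(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Return one cycle per back edge found by an iterative depth-first search."""
    children: Dict[str, List[str]] = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        nodes.update((parent, child))

    color = dict.fromkeys(nodes, WHITE)
    cycles: List[List[str]] = []
    for root in sorted(nodes):
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        path = [root]                 # categories on the current DFS path
        position = {root: 0}          # index of each path member within `path`
        stack = [iter(children.get(root, ()))]
        while stack:
            for child in stack[-1]:
                if color[child] == GRAY:
                    # child is an ancestor on the current path: a cycle
                    cycles.append(path[position[child]:] + [child])
                elif color[child] == WHITE:
                    color[child] = GRAY
                    position[child] = len(path)
                    path.append(child)
                    stack.append(iter(children.get(child, ())))
                    break             # descend into the child first
            else:
                # all children handled: leave this category
                done = path.pop()
                del position[done]
                color[done] = BLACK
                stack.pop()
    return cycles


toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "D"), ("E", "F")]
print(find_cycles(toy_edges))
```

This reports some cycles rather than every possible one. A full report would more likely group categories into strongly connected components, which is a different technique from the one sketched here.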
Amazing, thank you! It's rare to see a technical issue actually get solved, and especially so quickly. I made a new Village Pump post about it to explain the report and so that people hear about it and can find it more easily. It's not good that there are so many cycles; I think the shorter ones should be prioritized.
Not sure what to do about the thousands of longer ones; maybe at some point some AI bot can fix these.
Do you know if it's possible to make it rescan the cats of a particular page and then update just that page? It doesn't seem like it would be a good idea, or feasible, to run the whole thing frequently. It would recheck each category linked after the * on the page (e.g. page 3) and then write the remaining categories to the page again. Prototyperspective (talk) 21:45, 20 October 2024 (UTC)
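A minimal sketch of such a per-page recheck, assuming each cycle on a report page has already been parsed into a list of category titles that ends where it starts (the actual page format and how to fetch current subcategories are not covered here; `cycle_still_exists`, `remaining_cycles` and `get_subcategories` are hypothetical names):

```python
from typing import Callable, Iterable, List


def cycle_still_exists(
    cycle: List[str],
    get_subcategories: Callable[[str], Iterable[str]],
) -> bool:
    """True if every parent -> child link of the listed cycle is still present."""
    return all(
        child in set(get_subcategories(parent))
        for parent, child in zip(cycle, cycle[1:])
    )


def remaining_cycles(
    cycles: List[List[str]],
    get_subcategories: Callable[[str], Iterable[str]],
) -> List[List[str]]:
    """Keep only the cycles from one report page that have not been fixed yet."""
    return [cycle for cycle in cycles if cycle_still_exists(cycle, get_subcategories)]


# Toy state after someone removed the C -> A link (hypothetical titles).
current_links = {"A": ["B"], "B": ["C"], "C": [], "D": ["D"]}
page_cycles = [["A", "B", "C", "A"], ["D", "D"]]
print(remaining_cycles(page_cycles, lambda title: current_links.get(title, [])))
```

Only the categories named on that one page are looked up, so the cost depends on the page size rather than on the total number of categories. It would not notice new cycles created since the last full run.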