
How did you go about selecting a sample for manual tagging, and how did you decide what tags to end up using?


The samples were selected randomly at the beginning. Tags were thought up on the go, and we tried to generalize them into broader categories. After the initial tagging, we counted the samples in each category and grouped similar underrepresented tags together. We also added more samples for the smaller categories by filtering for samples that matched specific keywords and then hand-picking from the matches.
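A rough sketch of that workflow, in case it helps anyone trying something similar. This is not our actual code: the function names, the example titles, the tag names, the merge map, and the `min_count` threshold are all made up for illustration.

```python
import random
from collections import Counter


def sample_for_tagging(titles, n, seed=0):
    """Pick a random subset of titles to tag by hand."""
    rng = random.Random(seed)
    return rng.sample(titles, min(n, len(titles)))


def merge_rare_tags(labels, merge_map, min_count):
    """Remap tags that have fewer than min_count samples to a broader tag.

    labels maps title -> tag; merge_map maps a rare tag -> broader tag.
    Rare tags without an entry in merge_map are left unchanged.
    """
    counts = Counter(labels.values())
    return {
        title: merge_map.get(tag, tag) if counts[tag] < min_count else tag
        for title, tag in labels.items()
    }


def keyword_candidates(titles, keywords, already_tagged):
    """Find untagged titles containing any keyword, for later hand-picking."""
    lowered = [k.lower() for k in keywords]
    return [
        t for t in titles
        if t not in already_tagged and any(k in t.lower() for k in lowered)
    ]


# Hypothetical hand-tagged samples (title -> tag).
labels = {
    "Show HN: A Go web framework": "web",
    "Show HN: Tiny React clone": "web",
    "Show HN: DNA alignment tool": "genetics",
    "Show HN: Protein folding viewer": "biology",
}

# Hypothetical grouping of similar underrepresented tags.
merge_map = {"genetics": "science", "biology": "science"}
merged = merge_rare_tags(labels, merge_map, min_count=2)

# Look for more candidates for the small category in a larger pool.
pool = list(labels) + ["Show HN: Genome browser in Rust", "Show HN: Todo app"]
candidates = keyword_candidates(pool, ["genome", "dna"], already_tagged=labels)

picked = sample_for_tagging(pool, 3)
```

The keyword filter only narrows the pool; the final selection for the smaller categories is still done by hand.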

We initially tried training the classifier only on GitHub-based samples, using the user-given tags from there. Although we grouped the tag base into a reasonable number of distinct categories, the way GitHub users tag their projects turned out to be too inconsistent and often unrelated to the titles, so manual tagging looked like the better option for getting decent results fast enough.

If you have any more specific questions, feel free to email me at arturs@finch.io


For those of us in industries at the fringes of HN (but still of interest to it), could you add an 'Other' category? :-) We make scientific/genetic software, and I'm not sure where we would have fit in your sidebar.

Getting to the front page via 'Show HN' was very helpful to us. It'd be nice (for others too) to be able to both replicate that success, and soften the blow when you get a grand total of 2 upvotes.



