Due to the complexity of the data from Yahoo! Japan Q&A, I'd like to discuss and decide the final data format, so that it yields higher-quality responses as training data. Also, given the large number of items (possibly over a million Q&A pairs), repeated scraping is not a good use of our resources, so I hope we can get the one-shot scrape done right.
Example
"I ended up reading my wife's diary. It was hidden behind the books on our shared bookshelf. I found it by chance, wondered what it was, opened it, and it turned out to be a diary. In it..." - Yahoo! Chiebukuro (Yahoo!知恵袋)
This is a popular Q&A with quite a few answers. Also, we only scrape questions that have answers.
- ID: str - UUID
- category: list[str] - can be multiple
- createdDate: timestamp
- questionText: str
- questionLike: int - not sure if it's relevant to training, but let's scrape it anyway
- answersContent: list[str] - all of the answer texts, stored in a list
- answerNiceCount: list[int] - nice count for each answer; list index aligns with answersContent
- additionalText: list[list[str]] - some answers have additional comments, but they may be too complicated to include in training data, so we only save them for potential future use. Example:
[["add text 1", "add text 2", ...], [..., ...]]
- selectedAnswer: int - index of the selected (best) answer within answersContent
Note: answersContent needs to be shuffled (with selectedAnswer remapped accordingly) so that index 0 is not always the selected answer, to avoid bias.
- expertTag: list[bool] - Yahoo puts this expert tag on verified expert accounts. If the tag exists, we should save it; it may let us give expert answers higher weight in training. I don't fully understand the benefit yet, but let's keep this field for now and discuss how to handle it during training. List index aligns with answersContent.
- categoryMasterTag: list[bool] - non-verified experts who answer many questions or receive many likes get this tag on their account. I don't see it being super relevant to training, but let's save it first. List index aligns with answersContent.
- thankComment: str - the OP's reply to the best answer. Not really sure if it's relevant to training, but let's scrape it first; maybe it can be the basis of a polite Japanese AI 😆
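To make the schema and the shuffling requirement concrete, here is a minimal sketch of one record and of a shuffle that keeps all per-answer lists index-aligned while remapping selectedAnswer. The field names come from the list above; the `QARecord` dataclass and `shuffle_answers` helper are my own illustration, not an agreed implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class QARecord:
    """One Yahoo! Chiebukuro Q&A record, following the schema above."""
    id: str
    category: list            # can be multiple categories
    createdDate: int          # unix timestamp
    questionText: str
    questionLike: int
    answersContent: list      # all answer texts
    answerNiceCount: list     # index-aligned with answersContent
    additionalText: list      # list[list[str]], one inner list per answer
    selectedAnswer: int       # index into answersContent
    expertTag: list           # index-aligned with answersContent
    categoryMasterTag: list   # index-aligned with answersContent
    thankComment: str

def shuffle_answers(record: QARecord, rng: random.Random) -> None:
    """Apply one permutation to every per-answer list so indices stay
    aligned, and remap selectedAnswer to the best answer's new position."""
    perm = list(range(len(record.answersContent)))
    rng.shuffle(perm)
    record.answersContent = [record.answersContent[i] for i in perm]
    record.answerNiceCount = [record.answerNiceCount[i] for i in perm]
    record.additionalText = [record.additionalText[i] for i in perm]
    record.expertTag = [record.expertTag[i] for i in perm]
    record.categoryMasterTag = [record.categoryMasterTag[i] for i in perm]
    # The old selected index now lives wherever the permutation put it.
    record.selectedAnswer = perm.index(record.selectedAnswer)
```

The key design point is using a single permutation for all five per-answer lists; shuffling each list independently would break the index alignment the schema relies on.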
Note from Hiroki:
I'm doing some research on scraping Yahoo Japan 知恵袋 (Chiebukuro), which we thought would be a good data source for the StackLLaMA model. Before scraping, I'm looking into Yahoo Japan's policies.
I want to confirm that web scraping is allowed before we retrieve anything, since I want to be transparent about what data we used if we open-source it.
Scraping is prohibited on some services, like Yahoo Finance:
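As a mechanical first check (separate from reading the terms of service, which is the authoritative source for what Yahoo allows), we can test pages against the site's robots.txt. This sketch uses Python's standard-library `urllib.robotparser`; the example URLs and rules are placeholders, not Yahoo's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

def can_scrape(page_url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text allows user_agent to
    fetch page_url. robots.txt is advisory only: the service's terms
    of service must also permit scraping."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, page_url)
```

In practice we would fetch the live robots.txt (e.g. with `RobotFileParser.set_url` plus `read()`) for each Yahoo service before crawling it, and honor any crawl-delay directives as well.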