Because the data from Yahoo.jp Q&A is complex, I’d like to discuss and decide on the final data format so that it yields higher-quality responses as training data. Also, given the large volume of content (possibly millions of Q&A pairs), repeated scraping is not a good use of our resources. So I hope we can get the scrape right in one shot.

Example

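嫁の日記を見てしまいました。共同の本棚の本の奥に隠してありました。たまたま見つけ、何かと思い中を読んだら日記でした。 - そこに... - Yahoo!知恵袋 (English: "I ended up reading my wife's diary. It was hidden behind the books on our shared bookshelf. I found it by chance, wondered what it was, and when I read it, it turned out to be a diary. - In it... - Yahoo! Chiebukuro")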

This is one of the popular Q&A threads, and it has quite a few responses. Also, we only scrape questions that have answers.

Note: the position of selectedAnswer within answersContent needs to be randomized so that the selected answer is not always at index 0, which would introduce bias.
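
A minimal sketch of that randomization, assuming each scraped record is a dict in which answersContent is a list of answer texts and selectedAnswer holds the text of the selected answer (which also appears in the list). The final format is still to be decided, so the record shape, the function name shuffle_answers, and the selectedIndex field are all hypothetical and only here for illustration.

```python
import random
from typing import Any, Dict, Optional


def shuffle_answers(record: Dict[str, Any],
                    rng: Optional[random.Random] = None) -> Dict[str, Any]:
    """Return a copy of `record` with answersContent in random order.

    Assumption: record["selectedAnswer"] is the text of the selected answer
    and that same text is one of the entries in record["answersContent"].
    """
    if rng is None:
        rng = random.Random()

    # Copy the list so the original record is left untouched.
    answers = list(record["answersContent"])
    rng.shuffle(answers)

    shuffled = dict(record)
    shuffled["answersContent"] = answers
    # Hypothetical helper field: where the selected answer ended up.
    shuffled["selectedIndex"] = answers.index(record["selectedAnswer"])
    return shuffled


if __name__ == "__main__":
    example = {
        "question": "example question",
        "selectedAnswer": "answer A",
        "answersContent": ["answer A", "answer B", "answer C"],
    }
    print(shuffle_answers(example))
```

Passing a seeded random.Random makes the shuffle reproducible, which could help if the dataset ever has to be regenerated from the same raw scrape.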

Note from Hiroki:

I'm doing some research on web scraping Yahoo Japan 知恵袋 (Chiebukuro), which we thought would be a good data source for the StackLLaMa model. Before scraping, I am looking into Yahoo Japan's policies. I want to make sure web scraping is allowed before we retrieve anything, since I want to be transparent about what kind of data we used if we were to open-source it. Scraping is against the rules on some services, such as Yahoo Finance: