Which kind of model is better for keyword-set classification? #14

Closed
opened 2025-11-02 00:01:23 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @guotong1988 on GitHub (Dec 26, 2019).

There is a similar task called text classification.

But I want to find a kind of model whose input is a set of keywords, where the keyword set is not drawn from a sentence.

For example:

```
input ["apple", "pear", "watermelon"] --> target class "fruit"
input ["tomato", "potato"] --> target class "vegetable"
```

Another example:

```
input ["apple", "Peking", "in summer"]  -->  target class "Chinese fruit"
input ["tomato", "New York", "in winter"]  -->  target class "American vegetable"
input ["apple", "Peking", "in winter"]  -->  target class "Chinese fruit"
input ["tomato", "Peking", "in winter"]  -->  target class "Chinese vegetable"
```

Thank you.

Author
Owner

@GokuMohandas commented on GitHub (Jan 3, 2020):

Hey @guotong1988, you'll first want to gather enough data for the types of entities (fruit, vegetable, etc.) that you care about. You can use an off-the-shelf set of embeddings (e.g. GloVe) because these are common tokens, and the embeddings for entities in the same class will already be clustered, since those embeddings were learned from large, generic corpora.
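A minimal sketch of this idea, assuming a toy embedding table standing in for real pretrained GloVe vectors (in practice you'd load the actual embedding file): represent a keyword set as the mean of its word vectors (order-invariant), then classify by the nearest class centroid built from a few labeled sets.

```python
import numpy as np

# Toy 2-D stand-in for pretrained GloVe vectors (hypothetical values;
# in practice, load real embeddings and use their full dimensionality).
EMBEDDINGS = {
    "apple":      np.array([0.90, 0.10]),
    "pear":       np.array([0.80, 0.20]),
    "watermelon": np.array([0.85, 0.15]),
    "tomato":     np.array([0.20, 0.90]),
    "potato":     np.array([0.10, 0.80]),
}

def embed_keyword_set(keywords):
    """Order-invariant representation: mean of the keyword embeddings."""
    vecs = [EMBEDDINGS[k] for k in keywords if k in EMBEDDINGS]
    return np.mean(vecs, axis=0)

# Class centroids built from a few labeled keyword sets.
CENTROIDS = {
    "fruit":     embed_keyword_set(["apple", "pear"]),
    "vegetable": embed_keyword_set(["tomato", "potato"]),
}

def classify(keywords):
    """Assign the class whose centroid is closest in embedding space."""
    x = embed_keyword_set(keywords)
    return min(CENTROIDS, key=lambda c: np.linalg.norm(x - CENTROIDS[c]))

print(classify(["watermelon", "pear"]))  # -> "fruit"
print(classify(["potato"]))              # -> "vegetable"
```

With real embeddings and more training data, you'd typically replace the nearest-centroid step with a trained classifier (e.g. logistic regression) on the mean-pooled vectors, but the pooling idea is the same.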

In the second example, where you have labels like "Chinese fruit", you'll want to treat this as a multi-label classification problem (e.g. the output is [0, 1, 1, 0] instead of one unique class like [0, 1, 0, 0]). Alternatively, you could just add more classes like "fruit" and "Chinese fruit", but your model will start confusing classes because there is a lot of overlap. You could also train two separate models to predict "fruit" and then "Chinese" from the set of keywords, but this assumes every example carries both kinds of labels.
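To make the multi-label encoding concrete, here is a small sketch (the label vocabulary and threshold are illustrative assumptions, not from the original thread): each compound target like "Chinese fruit" becomes a binary vector over independent labels, and at prediction time each label gets its own sigmoid probability that is thresholded independently.

```python
import numpy as np

# Hypothetical flat label vocabulary covering both attribute types.
LABELS = ["fruit", "vegetable", "chinese", "american"]

def to_multilabel(names):
    """Encode e.g. {"chinese", "fruit"} as the binary vector [1, 0, 1, 0]."""
    return np.array([1 if label in names else 0 for label in LABELS])

def decode(probs, threshold=0.5):
    """Independent per-label sigmoid outputs; keep every label above threshold."""
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

# "Chinese fruit" as a multi-label target:
print(to_multilabel(["chinese", "fruit"]))   # -> [1 0 1 0]

# Decoding a model's per-label probabilities back into label names:
print(decode([0.9, 0.1, 0.8, 0.2]))          # -> ['fruit', 'chinese']
```

In a real model you'd pair this encoding with a sigmoid output layer and a binary cross-entropy loss per label, rather than a softmax over the compound classes.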

Hope that helps.

Reference: github-starred/Made-With-ML#14