Dataset is heavily imbalanced

Dataset and categories are heavily imbalanced (#explainer).
Possible solutions (just randomly picked websites):

  1. https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html
  2. https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
  3. https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

# | Category | Cases | Cases [%]
49 fruity 892 20.6673
47 floral 632 14.6432
109 woody 588 13.6237
56 herbal 564 13.0677
55 green 556 12.8823
48 fresh 504 11.6775
97 sweet 451 10.4495
87 resinous 370 8.5728
95 spicy 302 6.9972
12 balsamic 270 6.2558
90 rose 258 5.9778
41 earthy 234 5.4217
43 ethereal 216 5.0046
29 citrus 213 4.9351
76 oily 181 4.1937
70 mint 172 3.9852
101 tropicalfruit 172 3.9852
44 fatty 171 3.9620
74 nut 171 3.9620
22 camphor 168 3.8925
96 sulfuric 157 3.6376
14 berry 153 3.5449
106 waxy 148 3.4291
72 musk 141 3.2669
103 vegetable 135 3.1279
11 apple 132 3.0584
19 burnt 130 3.0120
66 meat 130 3.0120
81 phenolic 123 2.8499
84 powdery 121 2.8035
23 caramellic 120 2.7804
26 chemical 115 2.6645
73 musty 114 2.6413
40 dry 111 2.5718
64 lily 111 2.5718
2 aldehydic 109 2.5255
9 animalic 106 2.4560
85 pungent 101 2.3401
102 vanilla 101 2.3401
63 lemon 97 2.2475
61 leaf 95 2.2011
3 alliaceous 94 2.1779
57 honey 85 1.9694
104 violetflower 82 1.8999
39 dairy 80 1.8536
54 grass 80 1.8536
6 ambery 75 1.7377
21 cacao 75 1.7377
59 jasmin 74 1.7146
94 sour 73 1.6914
89 roasted 72 1.6682
30 clean 71 1.6450
77 orange 70 1.6219
69 metallic 68 1.5755
46 fermented 65 1.5060
4 almond 64 1.4829
33 coffee 63 1.4597
37 cooling 63 1.4597
67 medicinal 60 1.3902
100 tobacco 59 1.3670
75 odorless 57 1.3207
79 pear 57 1.3207
65 liquor 55 1.2743
25 cheese 54 1.2512
35 coniferous 52 1.2048
68 melon 52 1.2048
36 cooked 51 1.1816
20 butter 50 1.1585
15 blackcurrant 49 1.1353
62 leather 49 1.1353
108 wine 49 1.1353
28 cinnamon 48 1.1121
13 banana 47 1.0890
99 terpenic 47 1.0890
10 anisic 46 1.0658
71 mushroom 46 1.0658
32 coconut 45 1.0426
53 grapefruit 45 1.0426
58 hyacinth 45 1.0426
86 rancid 44 1.0195
50 geranium 43 0.9963
80 pepper 42 0.9731
42 ester 41 0.9500
52 grape 41 0.9500
17 body 38 0.8804
51 gourmand 38 0.8804
93 smoky 38 0.8804
107 whiteflower 37 0.8573
60 lactonic 36 0.8341
83 plum 34 0.7878
98 syrup 34 0.7878
24 cedar 33 0.7646
27 cherry 33 0.7646
31 clove 32 0.7414
105 watery 32 0.7414
91 seafood 31 0.7183
92 sharp 25 0.5792
1 alcoholic 22 0.5097
5 ambergris 22 0.5097
88 ripe 20 0.4634
38 cucumber 19 0.4402
82 plastic 18 0.4171
18 bread 17 0.3939
34 cognac 12 0.2780
78 overripe 10 0.2317
7 ambrette 8 0.1854
8 ammoniac 8 0.1854
45 fennel 7 0.1622
16 blueberry 6 0.1390
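
To put numbers on the imbalance, here is a minimal sketch that recomputes the percentage column and derives a per-label weight from a few rows of the table. The total of 4316 samples is an assumption: it is not stated in the post, only inferred from 892 cases corresponding to 20.6673 %. The helper names `share_percent` and `pos_weight` are hypothetical, and the negatives-to-positives ratio is just one common weighting choice, not something proposed in this thread.

```python
# Counts copied from a few rows of the table above.
counts = {"fruity": 892, "floral": 632, "woody": 588, "cognac": 12, "blueberry": 6}

# Assumption: total number of samples, inferred from 892 cases = 20.6673 %.
N_SAMPLES = 4316


def share_percent(count, n_samples=N_SAMPLES):
    """Percentage of samples that carry this label."""
    return 100.0 * count / n_samples


def pos_weight(count, n_samples=N_SAMPLES):
    """Ratio of negatives to positives, usable as a per-label loss weight."""
    return (n_samples - count) / count


for label, count in counts.items():
    print(f"{label:10s} {count:4d} {share_percent(count):8.4f} %  "
          f"pos_weight={pos_weight(count):8.2f}")

# Ratio between the most and the least frequent label in this subset.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"most frequent / least frequent: {imbalance_ratio:.1f}")
```

Because a sample can carry several labels, the percentages add up to more than 100, so weights like these are computed per label rather than across labels.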

Dear @Tobi, yes, the data are imbalanced.

  • Some of the terms, called Families in perfumery, such as Fruity or Floral, are vague. They describe a mixed source of several fruits or flowers (rose, jasmine, etc.) that is difficult to identify precisely.
  • The non-family terms, like rose, are very precise and characteristic of the odor. A unique molecule (including a unique chirality in 3D) is described by multiple terms because, when you smell it, you perceive several “Notes” that can be linked to it.
    You can think of the description as an olfactory signature of the molecule.
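
As an illustration of that multi-term signature, here is a minimal sketch that turns a list of notes into one 0/1 vector per molecule (a multi-hot encoding). The molecule names and their annotations are made up for the example and do not come from the dataset; `encode` is a hypothetical helper.

```python
# Hypothetical annotations: molecule names and notes are made up.
annotations = {
    "molecule_a": ["fruity", "apple", "green"],
    "molecule_b": ["floral", "rose"],
    "molecule_c": ["floral", "rose", "green"],
}

# Fixed, sorted vocabulary so every signature uses the same column order.
labels = sorted({note for notes in annotations.values() for note in notes})


def encode(notes, vocabulary=labels):
    """Return a 0/1 vector with a 1 for every note present."""
    present = set(notes)
    return [1 if label in present else 0 for label in vocabulary]


signatures = {name: encode(notes) for name, notes in annotations.items()}

print(labels)
for name, vector in signatures.items():
    print(name, vector)
```

In this form a family term such as floral and a precise term such as rose are simply two columns that can both be set for the same molecule.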

Best regards,

Guillaume
