Data is the fuel for AI. Without quality data, AI cannot learn. This topic explores data types, how data is collected and cleaned, the critical split between training and testing data, and the dangers of biased data.
Structured data is organised in rows and columns (spreadsheets, databases) and is easy to process. Unstructured data has no fixed format: images, videos, text documents, social media posts. Semi-structured data (JSON, XML) sits in between, with some organisation but no rigid schema.

Data collection methods: surveys, sensors (IoT), web scraping, APIs, public datasets (Kaggle, data.gov.in).

Data quality matters: garbage in, garbage out (GIGO). An AI trained on bad data gives bad predictions.
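The relationship between semi-structured and structured data can be sketched in a few lines of Python: a JSON string (semi-structured) is parsed and flattened into a fixed-column row, the form a spreadsheet or database table would store. The record and field names here are invented for illustration.

```python
import json

# A hypothetical semi-structured record, e.g. returned by an API (illustrative data)
raw = '{"name": "Asha", "age": 14, "scores": [82, 91]}'

record = json.loads(raw)  # parse JSON text into a Python dict

# Flatten into a structured row with fixed columns: name, age, average score
row = (record["name"], record["age"], sum(record["scores"]) / len(record["scores"]))
print(row)  # ('Asha', 14, 86.5)
```

The nested `scores` list is what makes the original record semi-structured; flattening it (here, to an average) is a common step before loading data into a table.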
Data cleaning: remove duplicates, fill or remove missing values, fix errors, standardise formats.

Data visualisation: bar charts (compare categories), line charts (trends over time), pie charts (proportions), scatter plots (relationships), histograms (distribution).

Training data is the portion the AI learns from (~70-80%); testing data is held-out, unseen data used to evaluate it (~20-30%). Why split? A model that has merely memorised its training data (overfitting) will score perfectly when evaluated on that same data; only unseen test data reveals whether it has truly learned general patterns.

Data bias: if the data isn't representative, the AI will be unfair. Example: a speech recognition system trained mostly on adult voices may fail for children.
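The cleaning and splitting steps above can be sketched with the standard library alone. The dataset is invented for illustration; real pipelines would typically use pandas and scikit-learn, but the logic is the same: deduplicate, fill missing values, shuffle, then hold out a test portion.

```python
import random

# Hypothetical raw dataset with one duplicate row and one missing value (None)
raw = [
    {"hours": 2, "score": 50},
    {"hours": 2, "score": 50},    # duplicate
    {"hours": 4, "score": None},  # missing value
    {"hours": 6, "score": 80},
    {"hours": 8, "score": 90},
    {"hours": 10, "score": 95},
]

# Cleaning step 1: remove exact duplicates
seen, cleaned = set(), []
for r in raw:
    key = (r["hours"], r["score"])
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(r))

# Cleaning step 2: fill missing scores with the mean of the known scores
known = [r["score"] for r in cleaned if r["score"] is not None]
mean = sum(known) / len(known)
for r in cleaned:
    if r["score"] is None:
        r["score"] = mean

# Splitting: shuffle, then hold out ~20% as unseen test data
random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(cleaned)
cut = int(len(cleaned) * 0.8)
train, test = cleaned[:cut], cleaned[cut:]
print(len(train), len(test))  # 4 1
```

Shuffling before splitting matters: if the data is ordered (say, by date), slicing without a shuffle would give the model a training set that differs systematically from the test set.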
Overfitting occurs when an AI model learns the training data too well — including its noise and random patterns — instead of learning general rules. It is like memorising the answers to one specific question paper rather than understanding the subject. Signs: high accuracy on training data but poor performance on new/test data. Causes: too little training data, or a model that is too complex. Solutions: more data, a simpler model, cross-validation, regularisation. This is why the training/testing split is essential.
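The "memorising vs understanding" contrast above can be made concrete with a toy sketch (illustrative, standard library only): one model memorises the exact training pairs while another learns the underlying rule. Both score perfectly on the training data; only the test data exposes the difference.

```python
# Toy task: classify whether a number is even (invented example data)
train = [(2, True), (3, False), (4, True), (7, False)]
test = [(10, True), (11, False), (12, True)]

# "Overfitted" model: a lookup table that memorises exact training pairs
lookup = dict(train)
def memoriser(x):
    return lookup.get(x, False)  # blind guess on anything it has not seen

# General model: has learned the actual rule
def rule(x):
    return x % 2 == 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memoriser, train), accuracy(memoriser, test))  # 1.0 on train, poor on test
print(accuracy(rule, train), accuracy(rule, test))            # 1.0 on both
```

A large gap between training accuracy and test accuracy, as the memoriser shows here, is exactly the warning sign of overfitting described above.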