fordem

Members
  • Posts

    2
Everything posted by fordem

  1. First of all, this post assumes you know Shannon's entropy and how to apply it. It is an old topic that is taught in college even if you are not an engineer. Shannon's entropy, much like Boltzmann's entropy, was introduced to describe how organized a system is: high entropy means a low level of organization, and low entropy means a high level of organization. That leads to a useful idea: a system that is extremely organized is easy to predict, and one that is extremely disorganized is hard to predict. With this concept in mind, I created a password validator built on it.

         #include <cmath>
         #include <string>

         // Member function of the validator class.
         double calculateBitwiseEntropy(const std::string& input) const {
             if (input.empty()) return 0.0;  // avoid dividing by zero below

             const double totalBits = input.length() * 8.0;
             int ones = 0, zeros = 0;
             for (unsigned char c : input) {
                 for (int i = 0; i < 8; ++i) {
                     ((c >> i) & 1) ? ++ones : ++zeros;  // count the one and zero bits
                 }
             }

             const double p1 = ones / totalBits;   // proportion of one bits
             const double p0 = zeros / totalBits;  // proportion of zero bits

             // Shannon entropy per bit: H = -p1*log2(p1) - p0*log2(p0)
             double entropy = 0.0;
             if (p1 > 0.0) entropy -= p1 * std::log2(p1);
             if (p0 > 0.0) entropy -= p0 * std::log2(p0);

             return entropy * totalBits;  // scale to the whole input
         }

     The first step is calculating the bitwise entropy of the password. Basically we count the 0s and 1s, turn the counts into proportions, and then apply the formula. One important detail: although this is a way to estimate randomness, it is not a magic wand. For example, the bit string 11110000 has maximum entropy under this measure (equal numbers of 0s and 1s), but it is not really random, so this method must be used sparingly and with care.

     Then comes the validation process, where I make sure the business rules are not forgotten: minimum password length, lower- and upper-case characters, digits, and symbols.

         #include <cctype>
         #include <cstddef>
         #include <string>
         #include <nlohmann/json.hpp>

         // Member function of the same validator class.
         bool validatePassword(const std::string& password, nlohmann::json& reason) const {
             constexpr std::size_t MIN_LENGTH = 8;
             constexpr double MIN_ENTROPY_RATIO = 0.5;

             if (password.length() < MIN_LENGTH) {
                 reason["msg"] = "Password must be at least 8 characters.";
                 reason["entropy"] = 0;
                 reason["eratio"] = 0;
                 return false;
             }

             // Business rules: all four character classes must be present.
             bool hasLower = false, hasUpper = false, hasDigit = false, hasSymbol = false;
             for (unsigned char c : password) {
                 hasLower  |= std::islower(c) != 0;
                 hasUpper  |= std::isupper(c) != 0;
                 hasDigit  |= std::isdigit(c) != 0;
                 hasSymbol |= std::isalnum(c) == 0;
             }
             if (!(hasLower && hasUpper && hasDigit && hasSymbol)) {
                 reason["msg"] = "Password must include lowercase, uppercase, digit, and symbol.";
                 reason["entropy"] = 0;
                 reason["eratio"] = 0;
                 return false;
             }

             // Score: entropy per bit, a value between 0 and 1.
             const double entropy = calculateBitwiseEntropy(password);
             const double ratio = entropy / (password.length() * 8.0);
             reason["entropy"] = entropy;
             reason["eratio"] = ratio;

             if (ratio < MIN_ENTROPY_RATIO) {
                 reason["msg"] = "Password score too low. Increase randomness.";
                 return false;
             }
             reason["msg"] = "Password score high enough.";
             return true;
         }

     This set of restrictions lets you do two things. First, and most obviously, it enforces a password policy. Second, the entropy gives you an easy way to score the password without using any modules, packages, and things like that. Just one thing: this implementation may differ a bit in JavaScript. OK guys, have a nice one and hang tight.
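To make the 11110000 caveat concrete, here is a minimal sketch that computes only the per-bit ratio (the "eratio" value). It is not part of the original validator: the free function `bitwiseEntropyRatio` and the helper `entropyExamples` are hypothetical names introduced for illustration, and the calculation simply follows the one described in the post.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical free-function variant of the post's calculation.
// Returns the Shannon entropy per bit, a value between 0 and 1.
double bitwiseEntropyRatio(const std::string& input) {
    if (input.empty()) return 0.0;

    const double totalBits = input.size() * 8.0;
    int ones = 0;
    for (unsigned char c : input) {
        for (int i = 0; i < 8; ++i) {
            ones += (c >> i) & 1;  // add 1 for every set bit
        }
    }

    const double p1 = ones / totalBits;  // proportion of one bits
    const double p0 = 1.0 - p1;          // proportion of zero bits
    double h = 0.0;
    if (p1 > 0.0) h -= p1 * std::log2(p1);
    if (p0 > 0.0) h -= p0 * std::log2(p0);
    return h;
}

// Hypothetical helper with a few worked cases.
void entropyExamples() {
    // The single byte 0xF0 is the bit string 11110000: four ones, four zeros.
    assert(std::abs(bitwiseEntropyRatio(std::string("\xF0")) - 1.0) < 1e-12);

    // 'U' is 0x55 = 01010101, so a run of identical characters also
    // reaches the maximum ratio although it is trivially predictable.
    assert(std::abs(bitwiseEntropyRatio("UUUUUUUU") - 1.0) < 1e-12);

    // '@' is 0x40 = 01000000, one set bit per byte (p1 = 1/8).
    // Even this very regular input stays above the 0.5 threshold.
    assert(bitwiseEntropyRatio("@@@@@@@@") > 0.5);
}
```

A note on the design choice: every nonzero 7-bit ASCII byte has between one and seven bits set, so the proportion of one bits stays between 1/8 and 7/8, where the per-bit entropy is roughly 0.54 or more. With MIN_ENTROPY_RATIO = 0.5, the ratio check will therefore rarely, if ever, reject an ASCII password that already passed the length and character-class rules. In practice those rules do most of the work, and the entropy value serves mainly as a score.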
  2. Multi-language design serves several purposes, inclusion and reach being the most important ones. In this small tutorial I'm going to explain what I learned from ChatGPT about UTF-8 validation, which lets us accept multibyte UTF-8 characters: once the bytes are verified as well-formed, we consider the input valid. Other validation methods can then be applied to catch errors that are specific to your business rules. From here on everything came from ChatGPT; I'm not a liar and I don't like to take credit for something I didn't do. So here we go.

     Let's go step by step and rigorously understand how UTF-8 byte sequences work, and why the hex constants (like 0xC0, 0xE0, etc.) are used in UTF-8 validation.

     🧠 What is UTF-8?

     UTF-8 is a variable-length encoding for Unicode characters using 1 to 4 bytes per character. Each byte starts with a specific bit pattern that tells you what kind of byte it is:

         Byte type        Format (binary)   Range (hex)   Meaning
         1-byte (ASCII)   0xxxxxxx          0x00–0x7F     Standard ASCII (U+0000 to U+007F)
         2-byte lead      110xxxxx          0xC0–0xDF     Start of 2-byte sequence
         3-byte lead      1110xxxx          0xE0–0xEF     Start of 3-byte sequence
         4-byte lead      11110xxx          0xF0–0xF7     Start of 4-byte sequence
         Continuation     10xxxxxx          0x80–0xBF     Must follow a multi-byte lead

     ⚠️ Every byte after the lead must match 10xxxxxx, i.e., it must be in the range 0x80–0xBF.

     📦 How the Validator Works

     Let's walk through what each part of the validator does:

     🔹 Case 1: 1-byte (ASCII)

         if (bytes[i] <= 0x7F)

     If the byte is in 0x00–0x7F, it's a valid ASCII character: move to the next byte.

     🔹 Case 2: 2-byte sequence

         else if ((bytes[i] & 0xE0) == 0xC0)

     ⚙️ Breakdown:
       • bytes[i] & 0xE0 keeps the top 3 bits (mask 11100000)
       • We're testing for 110xxxxx, the lead byte of a 2-byte character
       • 0xC0 = 11000000
       • So if (bytes[i] & 0xE0) == 0xC0, we know it's a 2-byte lead ✅

     Then, after confirming that bytes[i + 1] exists, check:

         (bytes[i + 1] & 0xC0) == 0x80

     This ensures that the next byte is a continuation byte (10xxxxxx), which must be true for valid UTF-8.

     🔹 Case 3: 3-byte sequence

         else if ((bytes[i] & 0xF0) == 0xE0)

     ⚙️ Breakdown:
       • Mask the top 4 bits: 11110000
       • 0xE0 = 11100000 → matches 1110xxxx, the lead for a 3-byte sequence

     We check that the next 2 bytes are valid continuation bytes:

         (bytes[i+1] & 0xC0) == 0x80 &&
         (bytes[i+2] & 0xC0) == 0x80

     🔹 Case 4: 4-byte sequence

         else if ((bytes[i] & 0xF8) == 0xF0)

     ⚙️ Breakdown:
       • Mask the top 5 bits: 11111000
       • 0xF0 = 11110000 → matches 11110xxx, the lead for a 4-byte UTF-8 character

     Check the next 3 continuation bytes:

         (bytes[i+1] & 0xC0) == 0x80 &&
         (bytes[i+2] & 0xC0) == 0x80 &&
         (bytes[i+3] & 0xC0) == 0x80

     ❌ Default: Invalid Byte

         else { return false; }

     If the byte doesn't match any valid UTF-8 lead format (for example, a continuation byte appearing where a lead is expected) → the sequence is invalid.

     🧪 Example: How It Validates "日本" (U+65E5 U+672C)

     Encoded in UTF-8: E6 97 A5 E6 9C AC

         E6 → 11100110 = 3-byte lead
         97 → 10010111 = continuation
         A5 → 10100101 = continuation
         ✅ Pass

     Then:

         E6 → 3-byte lead
         9C → continuation
         AC → continuation
         ✅ Pass

     ✅ Final Takeaway

     The hex masks like 0xE0, 0xF0, 0xC0, etc., are used to extract the bit patterns that define UTF-8 byte types. The bitwise & tests identify which type of byte it is by focusing on the leading bits that encode structure.

     Let me know if you want a version of the validator that logs which byte failed or shows the character boundaries.
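The post walks through the individual checks but never shows them assembled, so here is a minimal sketch of how the four cases might fit together. The function name `isStructurallyValidUtf8`, the lambda `isContinuation`, and the helper `utf8Examples` are hypothetical names chosen for illustration. The bounds checks are an addition, so that a truncated sequence at the end of the input is rejected instead of being read out of range.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical name. Checks only the lead/continuation structure
// described in the four cases above.
bool isStructurallyValidUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();

    // True when position k exists and holds a 10xxxxxx byte.
    auto isContinuation = [&](std::size_t k) {
        return k < n && (static_cast<unsigned char>(bytes[k]) & 0xC0) == 0x80;
    };

    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b <= 0x7F) {                      // Case 1: ASCII
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {      // Case 2: 110xxxxx
            if (!isContinuation(i + 1)) return false;
            i += 2;
        } else if ((b & 0xF0) == 0xE0) {      // Case 3: 1110xxxx
            if (!isContinuation(i + 1) || !isContinuation(i + 2)) return false;
            i += 3;
        } else if ((b & 0xF8) == 0xF0) {      // Case 4: 11110xxx
            if (!isContinuation(i + 1) || !isContinuation(i + 2) ||
                !isContinuation(i + 3)) return false;
            i += 4;
        } else {
            return false;  // stray continuation byte or 11111xxx
        }
    }
    return true;
}

// Hypothetical helper with a few worked cases.
void utf8Examples() {
    assert(isStructurallyValidUtf8("hello"));
    assert(isStructurallyValidUtf8("\xE6\x97\xA5\xE6\x9C\xAC"));  // the post's example
    assert(!isStructurallyValidUtf8("\xE6\x97"));   // truncated 3-byte sequence
    assert(!isStructurallyValidUtf8("\x80"));       // continuation without a lead
}
```

These checks are structural only. A stricter validator would also reject overlong encodings (for example the lead bytes 0xC0 and 0xC1), the surrogate range U+D800 to U+DFFF, and code points above U+10FFFF (lead bytes 0xF5 to 0xF7), all of which pass the bit-pattern tests shown here.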