Lexical Analysis
Lexical Analysis MCQs
Q1. The lexical analyzer breaks the input source code into tokens. Which of the following is not a valid token class?
A. Identifier
B. Literal
C. Macro
D. Keyword
Answer: C
Solution: Macros are handled during preprocessing, not lexical analysis. The lexer recognizes identifiers, keywords, and literals.
Q2. Which of the following pairs is correctly matched?
A. Lexical analysis → Syntax tree generation
B. Parsing → Token generation
C. Lexical analysis → Pattern recognition using regular expressions
D. Syntax analysis → Symbol table creation
Answer: C
Solution: The lexer identifies tokens using patterns defined by regular expressions. Syntax tree generation happens in parsing, not lexical analysis.
Q3. The input string abc12$ is scanned by a lexer with the rules:
ID → [a-zA-Z][a-zA-Z0-9]*
NUM → [0-9]+
What is the longest prefix that matches a token?
A. abc12
B. abc
C. abc12$
D. abc1
Answer: A
Solution: abc12 matches [a-zA-Z][a-zA-Z0-9]*, forming a valid identifier; $ matches neither rule, so the longest matching prefix ends before it.
Q4. A lexical analyzer uses the following tokens:
digit → [0-9]
number → digit+
For the input 234x, how many tokens will be produced?
A. 1
B. 2
C. 3
D. 0
Answer: B
Solution: 234 → one token (number), x → another token (possibly identifier). Total = 2 tokens.
Q5. Which data structure is commonly used in lexical analysis for recognizing keywords efficiently?
A. Stack
B. Hash table
C. Queue
D. Linked list
Answer: B
Solution: Hash tables allow O(1) lookup of reserved keywords during token recognition.
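To make this concrete, here is a minimal Python sketch of keyword lookup during token classification; the keyword set and the classify helper are illustrative, not taken from any particular compiler.
```python
# Minimal sketch: classify an identifier-shaped lexeme as KEYWORD or IDENTIFIER.
# The reserved-word set below is hypothetical.
KEYWORDS = {"if", "else", "while", "for", "return"}  # hash-based set: O(1) average lookup

def classify(lexeme: str) -> str:
    """Return the token class for an identifier-shaped lexeme."""
    return "KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER"

print(classify("while"))   # KEYWORD
print(classify("while1"))  # IDENTIFIER
```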
Q6. If a lexical analyzer reads 200 characters and produces 180 tokens, the average lexeme length is approximately:
A. 0.9
B. 1.1
C. 1.8
D. 2.0
Answer: B
Solution: Average lexeme length = total characters / total tokens = 200 / 180 ≈ 1.11, i.e., approximately 1.1.
Q7. Which phase of the compiler handles error recovery for illegal characters like @ in variable names?
A. Syntax analysis
B. Lexical analysis
C. Semantic analysis
D. Intermediate code generation
Answer: B
Solution: Illegal tokens or invalid characters are caught during lexical analysis, which performs token validation.
Q8. Consider the input "for1 = 2". If the lexer recognizes keywords before identifiers, how will "for1" be classified?
A. Keyword
B. Identifier
C. Constant
D. Syntax error
Answer: B
Solution: for is a keyword, but for1 includes an extra digit, making it a valid identifier — not a keyword.
Q9. Which of the following tools automatically generates a lexical analyzer from regular expression specifications?
A. YACC
B. LEX
C. GCC
D. CPP
Answer: B
Solution: LEX (or Flex) takes regular expressions and produces a lexer in C. YACC is for parsing.
Q10. The end-of-file (EOF) is treated by the lexical analyzer as:
A. A valid token
B. An invalid lexeme
C. A delimiter indicating end of input
D. A syntax error
Answer: C
Solution: EOF signals the lexer that the input stream has ended. It is not a token but a control marker.
- Which principle ensures the lexical analyzer always picks the longest possible lexeme when multiple rules match?
A. Leftmost-longest rule
B. Maximal-munch (longest-match) rule
C. Earliest-match rule
D. Greedy-backtracking rule
Answer: B
Solution: Maximal-munch (longest-match) requires choosing the longest token that matches at the current input position.
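A minimal maximal-munch loop, sketched in Python with an assumed rule set (the NUMBER, ID, and WS rule names are illustrative): at each position it tries every rule anchored there, keeps the longest match, and breaks ties by rule order.
```python
import re

# Hypothetical rule set; order matters only when match lengths tie.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("ID",     re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("WS",     re.compile(r"[ \t]+")),
]

def tokenize(text: str):
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (length, name, lexeme) of the longest match so far
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), name, m.group())
        if best is None:
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        length, name, lexeme = best
        if name != "WS":                # discard whitespace tokens
            tokens.append((name, lexeme))
        pos += length
    return tokens

print(tokenize("abc12 234x"))  # [('ID', 'abc12'), ('NUMBER', '234'), ('ID', 'x')]
```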
- POSIX regular expression priority for matches is generally:
A. Shortest match first then rule order
B. Leftmost-longest; if equal lengths choose earlier rule
C. Rightmost-shortest; then later rule
D. Random if same length
Answer: B
Solution: POSIX uses leftmost-longest; when lengths tie, the rule earlier in specification wins.
- In lex, when two rules match the same longest lexeme, which decides which action is taken?
A. Rule with longer regex
B. Rule written earlier in file
C. Rule with lexically smaller name
D. Alphabetical order of regex
Answer: B
Solution: Lex executes the action of the first rule listed that matches the longest lexeme.
- Which data structure is commonly used to implement subset construction (NFA→DFA)?
A. Stack of states
B. Queue for unmarked DFA states
C. Priority queue by state id
D. Hash table only
Answer: B
Solution: Subset construction uses a worklist/queue to process unmarked DFA states produced from NFA state sets.
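A compact subset-construction sketch in Python, assuming the NFA is given as dictionaries of symbol transitions and single-step ε-moves; all names here are illustrative.
```python
from collections import deque

def epsilon_closure(states, eps):
    """eps: dict state -> set of states reachable by one ε-move."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, delta, eps, alphabet):
    """delta: dict (state, symbol) -> set of NFA states."""
    start_set = epsilon_closure({start}, eps)
    dfa_trans = {}
    worklist = deque([start_set])      # queue of unmarked DFA states
    seen = {start_set}
    while worklist:
        current = worklist.popleft()   # mark (process) one DFA state
        for a in alphabet:
            move = set()
            for s in current:
                move |= delta.get((s, a), set())
            target = epsilon_closure(move, eps)
            dfa_trans[(current, a)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    # A DFA state is accepting if it contains an accepting NFA state (not shown here).
    return start_set, dfa_trans, seen
```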
- The two-buffer (sentinel) scheme in lex aims to:
A. Avoid copying tokens to the symbol table
B. Allow efficient forward and retract scanning without expensive checks
C. Remove need for DFAs
D. Convert NFA to DFA at runtime
Answer: B
Solution: Two buffers with sentinels let the scanner read ahead and retract cheaply without per-character bounds checking.
- For the regex (a|ab), which string demonstrates ambiguity if a naive leftmost-shortest strategy is used?
A. a
B. ab
C. b
D. "" (the empty string)
Answer: B
Solution: (a|ab) with leftmost-shortest would choose a for input ab, leaving b unmatched; maximal-munch chooses ab.
- Which approach resolves token conflicts using semantic checks after recognizing a lexeme?
A. Syntax analysis
B. Semantic actions in lexer
C. DFA minimization
D. Regular grammar rewriting
Answer: B
Solution: Lexers can attach semantic actions (C-code or similar) to confirm or reclassify tokens after matching.
- The primary disadvantage of using backtracking-based regex engines in lexing is:
A. They produce smaller DFAs
B. Potential exponential worst-case running time
C. They cannot handle character classes
D. They always require more memory than DFAs
Answer: B
Solution: Backtracking can cause exponential time on crafted inputs; DFAs avoid backtracking.
- Which is not a token attribute stored in a typical compiler’s token record?
A. Token type (kind)
B. Lexeme string or pointer
C. Pointer to machine instruction
D. Source line number
Answer: C
Solution: The lexer records token type, lexeme, and position. Machine-code pointers belong to later phases.
- To distinguish == and = tokens lexically, a scanner must:
A. Use longest-match to prefer ==
B. Use lookbehind of one character
C. Treat them at parser level only
D. Use semantic predicates
Answer: A
Solution: Longest-match ensures == is matched when present, otherwise = matches.
- When converting regex to NFA using Thompson’s construction, how many new states are added for a concatenation of two NFAs?
A. 0
B. 1
C. 2
D. Depends on alphabet size
Answer: A
Solution: Concatenation links the first NFA's accepting state to the start state of the second (either by merging them or via an ε-transition); no new states are added.
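A small sketch (the fragment layout is an assumption, not the textbook data structure) showing that concatenating two Thompson-style fragments adds no states: the first fragment's accepting state merely gains an ε-edge to the second fragment's start state.
```python
# Hypothetical NFA fragment: (start, accept, transitions), where
# transitions maps (state, symbol_or_None) -> set of states, with None meaning ε.

def concat(frag1, frag2):
    """Concatenate two fragments; the state count is the sum of the inputs, nothing new."""
    start1, accept1, trans1 = frag1
    start2, accept2, trans2 = frag2
    trans = {**trans1, **trans2}       # fragments use disjoint state names by construction
    # Link accept state of frag1 to start state of frag2 with an ε-transition.
    trans.setdefault((accept1, None), set()).add(start2)
    return (start1, accept2, trans)
```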
- Which lexing scenario requires more than one-symbol lookahead?
A. Distinguishing <= vs <
B. Distinguishing a vs ab when both are tokens
C. Distinguishing /* comment start vs / divide operator if /* is not present
D. Distinguishing whitespace vs newline
Answer: B
Solution: a and ab both start with a; to decide which yields the longest match, the scanner must look ahead to see whether b follows.
- Which of the following is true about minimal DFA for a regular language?
A. Always larger than any NFA for the same language
B. Unique (up to renaming states) and smallest DFA accepting the language
C. Always has 2^n states for n-characters alphabet
D. Depends on lex rule order
Answer: B
Solution: Minimal DFA for a regular language is unique up to isomorphism and is smallest among DFAs.
- Which of the following is a correct reason to use perfect hashing for keyword lookup in a lexer?
A. It always uses less memory than any other table
B. O(1) worst-case lookup with no clashes for fixed keyword set
C. Automatically updates when new keywords are added at runtime
D. Eliminates need for token types
Answer: B
Solution: Perfect hash functions give O(1) lookup with no collisions for a known fixed set of keywords.
- In the context of lexers, what is a “start condition”?
A. A compiler optimization phase
B. A lexical state controlling which rules are active
C. A parser’s first production
D. A runtime assertion
Answer: B
Solution: Start conditions (states) enable rules selectively—for example, different rules inside comment or string literal modes.
- Which property does an NFA have that a DFA does not (in structure)?
A. Multiple start states
B. ε-transitions and multiple transitions for same input from a state
C. Deterministic single transition per symbol
D. No accepting states
Answer: B
Solution: NFAs can have ε-moves and multiple transitions on same input; DFAs cannot.
- Which algorithm transforms an NFA with ε-transitions into an equivalent NFA without ε-transitions?
A. Subset construction
B. Epsilon-closure elimination (ε-elimination)
C. Hopcroft minimization
D. KMP string search
Answer: B
Solution: ε-elimination computes ε-closures and then adds direct transitions (and accepting status) for each state's closure, removing the ε-moves.
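A brief ε-elimination sketch under an assumed dictionary-based NFA encoding (illustrative only): each state inherits the symbol transitions and accepting status of everything in its ε-closure.
```python
def eliminate_epsilon(states, delta, eps, accepting):
    """Return (new_delta, new_accepting) for an ε-free NFA equivalent to the input.

    delta: dict (state, symbol) -> set of states (symbol transitions only)
    eps:   dict state -> set of states (single ε-moves)
    """
    def closure(s):
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in eps.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    new_delta, new_accepting = {}, set()
    for s in states:
        cl = closure(s)
        if cl & accepting:                      # s accepts if its closure reaches an accepting state
            new_accepting.add(s)
        for t in cl:
            for (u, a), targets in delta.items():
                if u == t:
                    new_delta.setdefault((s, a), set()).update(targets)
    return new_delta, new_accepting
```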
- A lexical rule COMMENT -> "/*" .* "*/" is problematic in lex because:
A. .* is greedy and may cross unmatched */ boundaries, causing excessive scanning
B. lex cannot handle /*
C. Comments are parsed in parser only
D. .* adds non-determinism that breaks the DFA
Answer: A
Solution: .* can be greedy and match across intervening */ in naive implementations; lexers typically use non-greedy patterns or states.
- When lex generates a DFA table, what is a common memory optimization?
A. Using adjacency matrices per state with every ASCII column filled
B. Using transition compression and default transitions
C. Storing transitions as full 2D arrays always
D. Duplicating transitions for each rule
Answer: B
Solution: Table compression (default transitions, packed tables) saves memory compared to full 2D arrays.
- If a lexeme is too long (exceeds buffer), the lexical analyzer should:
A. Truncate silently and continue
B. Signal a lexical error (lexeme too long)
C. Convert it to multiple tokens without notice
D. Jump to parser to resolve
Answer: B
Solution: Lexers should report an error when lexeme exceeds allowed buffer limits—silent truncation is unsafe.
- Which of the following correctly identifies a token classification conflict that requires lexer redesign?
A. Keyword if and identifier ifvar recognized properly
B. Overlapping tokens <= and < handled by longest match
C. Two rules match different token types for an identical lexeme (e.g., 0 as INT and FLOAT)
D. Whitespace removal
Answer: C
Solution: If two rules accept the same lexeme but assign different token types, a resolution strategy or redesign is needed.
- In interactive or incremental editors, incremental lexing must be:
A. Stateless and single-pass only
B. Able to re-lex only affected regions efficiently
C. Always restart from file beginning on edits
D. Done at the parser stage
Answer: B
Solution: Incremental lexing re-analyzes only changed regions to be efficient for interactive use.
- Which regex operator has the highest precedence?
A. Union |
B. Concatenation
C. Kleene *
D. Alternation | and concatenation have the same precedence
Answer: C
Solution: Kleene star has the highest precedence, then concatenation, then alternation.
- For lexing floating constants, the minimal DFA must recognize 123., .456, and 123.456. A robust approach is to:
A. Use separate rules with start conditions for each form
B. Reject 123. as invalid
C. Use a single regex allowing optional integer or fractional parts
D. Let the parser combine 123 and . tokens
Answer: C
Solution: A single regex (with optional parts) or combined rules ensure all valid forms are recognized consistently.
- Which of these is NOT true about DFAs used in lexers?
A. They consume input deterministically
B. They require backtracking to match longest token
C. They can be minimized for efficiency
D. They can be implemented as transition tables
Answer: B
Solution: DFAs do not backtrack; they deterministically process input to longest matches by design.
- Which technique helps lexers process Unicode input efficiently?
A. Treat each code point as single 32-bit char and build large tables
B. Use UTF-8 aware scanning and character class ranges, not per-codepoint tables
C. Reject non-ASCII input
D. Convert to UTF-16 and use full two-byte tables
Answer: B
Solution: UTF-8-aware scanners and range-based classes avoid huge tables and handle variable-length encoding.
- The lexing phase records line and column numbers for tokens. Which situation complicates accurate column tracking?
A. ASCII only files
B. Tabs with varying width and mixed line endings
C. No whitespace in program
D. Single-line files only
Answer: B
Solution: Tabs and mixed CRLF vs LF require special handling to compute correct column positions.
- Using a hash table for symbol (identifier) interning during lexing provides which advantage?
A. Faster keyword recognition than perfect hashing
B. Avoids copying the string by storing a unique pointer per identifier
C. Removes need for a parser
D. Guarantees alphabetical ordering of symbols
Answer: B
Solution: Interning stores one copy of each lexeme and returns a pointer to it, saving memory and enabling fast identity checks.
- Which is the main difference between lexical analysis and preprocessing (e.g., C preprocessor)?
A. Lexical analysis changes the meaning of code via macros
B. Preprocessing transforms source before lexing (macro expansion, includes)
C. There is no difference; both are same phase
D. Preprocessing only happens after parsing
Answer: B
Solution: Preprocessing expands macros and includes, producing source that lexing then tokenizes.
- Suppose you have token rules for 0|[1-9][0-9]* (integers) and 0[xX][0-9a-fA-F]+ (hex). The input 0x1 should be:
A. Two tokens 0 and x1
B. A single HEX_INTEGER token
C. Invalid token sequence
D. Two tokens 0x and 1
Answer: B
Solution: Longest-match with proper regex ensures 0x1 matches hex rule as a single token.
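A quick way to see the longest-match behaviour, sketched with Python regexes mirroring the two rules in the question (rule names are assumptions):
```python
import re

RULES = [
    ("HEX_INTEGER", re.compile(r"0[xX][0-9a-fA-F]+")),
    ("INTEGER",     re.compile(r"0|[1-9][0-9]*")),
]

def longest_match(text, pos=0):
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(longest_match("0x1"))  # ('HEX_INTEGER', '0x1'); the hex rule wins over INTEGER's '0'
```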
- Which statement about greedy vs non-greedy lexing is correct?
A. Lexers always use non-greedy matching by default
B. Greedy matching (maximal-munch) is typical; non-greedy may be used for specific constructs
C. Non-greedy is faster than greedy always
D. Greedy and non-greedy are identical for lexers
Answer: B
Solution: Lexers typically use maximal-munch; non-greedy may be implemented when needed (e.g., inside strings).
- When lexing nested comments (e.g., /* ... /* ... */ ... */), what is required?
A. Regular expressions suffice
B. A stack or nested-comment depth counter in lexer state
C. Parser must handle nesting
D. Impossible in any lexer
Answer: B
Solution: Nested comments are not regular (nesting depth is unbounded); a depth counter or stack in the lexer state handles them.
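A minimal depth-counter sketch for /* ... */ nesting; the function name and error handling are illustrative assumptions.
```python
def skip_nested_comment(text: str, pos: int) -> int:
    """Assume text[pos:pos+2] == '/*'. Return the index just past the matching '*/'.

    Raises ValueError on an unterminated comment (EOF before depth returns to 0).
    """
    depth = 0
    i = pos
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated nested comment")

src = "/* outer /* inner */ still outer */ rest"
print(src[skip_nested_comment(src, 0):])  # ' rest'
```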
- Which of the following is best for recognizing identifiers in case-insensitive languages?
A. Convert lexeme to canonical case (e.g., lowercase) before lookup
B. Store both uppercase and lowercase variants in keyword table
C. Compare case-sensitively but accept both
D. Use parser to enforce case-insensitivity
Answer: A
Solution: Normalizing (e.g., to lowercase) before keyword lookup simplifies matching.
- JFlex or Flex generate scanners in which language by default?
A. Java for JFlex, C for Flex (lex)
B. Python for both
C. C++ for both
D. JavaScript for both
Answer: A
Solution: JFlex emits Java scanners; Flex (lex) typically emits C code.
- Which of the following helps detect unterminated string literals at lexing phase?
A. Parser
B. Lexical analyzer with start condition and end-of-file check
C. Semantic analyzer
D. Code generator
Answer: B
Solution: Lexer tracks string start and searches for closing delimiter; EOF without closing triggers a lexical error.
- For the regex a*b*, which language does it describe?
A. All strings with equal numbers of a’s then b’s
B. Any number of a’s followed by any number of b’s (including empty string)
C. Strings with a and b interleaved arbitrarily
D. Exactly one a followed by one b
Answer: B
Solution: a* then b* allows zero or more a’s followed by zero or more b’s; empty string accepted.
- Which method reduces DFA state count by merging equivalent states?
A. Subset construction
B. Hopcroft or Moore minimization algorithms
C. Thompson construction
D. Grep algorithm
Answer: B
Solution: Hopcroft and Moore algorithms minimize DFA by merging indistinguishable states.
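For illustration, a compact partition-refinement sketch in the spirit of Moore's algorithm (not the faster Hopcroft variant); the DFA encoding is an assumption and the automaton is assumed complete.
```python
def minimize(states, alphabet, delta, accepting):
    """delta: dict (state, symbol) -> state (complete DFA). Returns a map state -> block id."""
    # Initial partition: accepting vs non-accepting states.
    block = {s: (0 if s in accepting else 1) for s in states}
    changed = True
    while changed:
        changed = False
        # Signature of a state: its current block plus the blocks reached on every symbol.
        sig = {s: (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet) for s in states}
        new_ids, new_block = {}, {}
        for s in states:
            new_block[s] = new_ids.setdefault(sig[s], len(new_ids))
        if len(set(new_block.values())) != len(set(block.values())):
            changed = True                 # a block was split; refine again
        block = new_block
    return block  # states sharing a block id are equivalent and may be merged
```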
- What's the effect of an ambiguous token specification, like if as both keyword and identifier, if implemented incorrectly?
A. Always parse as identifier
B. May cause if to be tokenized as an identifier instead of a keyword depending on rule order
C. No effect; parser decides
D. System crash
Answer: B
Solution: If rules treat if as identifier before keyword or vice versa, ordering/resolution must be explicit to ensure correct classification.
- In a lex specification, the special rule .* placed at the end is often used to:
A. Give precedence to other rules and catch any remaining characters as errors or default tokens
B. Speed up matching of other rules
C. Ensure comments are captured first
D. Build parking lot for tokens
Answer: A
Solution: A catch-all rule acts as fallback to catch unrecognized characters or default behavior after other rules.
- A scanner uses a deterministic finite automaton table with 500 states and 128 ASCII columns. Full table size would be 64,000 entries. A compression technique to reduce memory is:
A. Expand to 256 columns for Unicode
B. Use row/column packing and default transitions to store only explicit transitions
C. Convert DFA back to NFA
D. Store as full array anyway
Answer: B
Solution: Table compression stores only non-default transitions or uses row/transition packing to reduce memory.
- Which of the following is true about lexical error reporting?
A. Lexers should always abort compilation on first error
B. Good lexers provide line/column location and a helpful message and try to resume safely
C. Lexical errors are ignored by convention
D. Parser corrects lexical errors automatically
Answer: B
Solution: Informative messages with location and recovery strategies are best for user experience.
- How does the yyleng variable in lex/Flex typically behave?
A. Holds the current line number
B. Stores the length of the current matched lexeme
C. Contains lookahead buffer pointer
D. Tracks number of tokens emitted
Answer: B
Solution: yyleng is commonly used to indicate the length of the lexeme just matched.
- Which statement about lexical scanning of multi-character operators like >>= is correct?
A. They should be split at parser level always
B. Lexer needs longest-match to handle >>= vs >> and =
C. They cannot be represented by regular expressions
D. Only the parser knows operator length
Answer: B
Solution: Longest-match lets lexer prefer >>= if present rather than splitting into >> and =.
- For a lexer implementing lookahead k, what is the complexity cost for large k?
A. Linearly decreases memory usage
B. Time and memory grow exponentially in worst-case depending on implementation
C. No impact
D. Always constant time
Answer: B
Solution: Large fixed lookahead increases state expansions and complexity; naive implementations can blow up.
- In a multi-threaded compiler, a reentrant lexer must:
A. Use global variables for lex state only
B. Avoid global mutable state and keep scanner state per-thread or per-instance
C. Not be used because lexers cannot be reentrant
D. Always be single-threaded
Answer: B
Solution: Reentrancy requires per-instance state and no shared mutable globals to be thread-safe.
- Consider tokens A -> ab|a and B -> a. For input ab, which token(s) are chosen under maximal-munch and lex file ordering with A before B?
A. A (ab)
B. B (a), then b is an error
C. A cannot match
D. B only
Answer: A
Solution: Maximal-munch picks ab for A; since A is first and longest, it matches.
- Which of the following is an advantage of using finite automata for scanning instead of ad-hoc code?
A. Easier formal reasoning, guaranteed linear-time scanning, and automatic generator tools
B. Always uses less memory than ad-hoc code
C. Can parse context-free languages too
D. Eliminates need for parser
Answer: A
Solution: Finite automata give formal guarantees, linear-time scanning, and can be generated automatically from regexes.
- Which regex describes strings over {0,1} with an even number of 0s?
A. (1*01*01*)*1*
B. (0*1*)*
C. (01)*
D. .*0.*0.*
Answer: A
Solution: (1*01*01*)*1* describes even number of 0s interspersed with any number of 1s.
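A brute-force sanity check of option A with Python's re module over all short binary strings (purely illustrative):
```python
import re
from itertools import product

pattern = re.compile(r"(1*01*01*)*1*")

for n in range(0, 6):
    for bits in product("01", repeat=n):
        s = "".join(bits)
        expected = s.count("0") % 2 == 0
        assert (pattern.fullmatch(s) is not None) == expected
print("option A accepts exactly the strings with an even number of 0s (up to length 5)")
```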
- In lex-generated scanners, yywrap() returning non-zero tells flex that:
A. No more input; scanner should stop
B. Use alternative start condition
C. Continue scanning from same file
D. Restart with backup buffer
Answer: A
Solution: When yywrap() returns non-zero, flex treats it as EOF and stops scanning (or user-defined behavior).
- Why are character classes (e.g., [a-zA-Z]) converted to ranges or sets in DFA construction?
A. To make NFA huge
B. To expand into explicit transitions per character or compressed class transitions for table generation
C. They are ignored in DFA
D. To parse them in the parser
Answer: B
Solution: Classes reduce regex size but expand to transitions for characters in the class (often compressed as ranges).
- If a language allows nested multi-line strings where quotes alternate types, how should a lexer handle this?
A. Single regex suffices always
B. Use lexer state machine and possibly nesting counter to track depth and closing delimiters
C. Let parser handle nested strings
D. Disallow nested strings
Answer: B
Solution: Nested constructs require lexer states and counters; regex alone may be insufficient.
- Which is a correct description of the anchors ^ and $ in regexes for lexers?
A. They match beginning and end of input or line depending on mode
B. Always invalid in lexers
C. Represent lookbehind and lookahead respectively
D. Equivalent to \A and \Z in all contexts
Answer: A
Solution: Anchors match start/end of line or input depending on engine and flags (multiline vs singleline).
- Which of the following best explains why \b (word boundary) can be tricky in lexers?
A. It’s the same in all languages
B. Word boundaries depend on Unicode categories and surrounding characters, complicating implementation
C. \b is always ASCII-only
D. It’s only meaningful at parser stage
Answer: B
Solution: Word boundaries involve character classification which can be complex for Unicode-aware lexers.
- A lexical analyzer that outputs tokens annotated with semantic values uses which structure to return values?
A. Global variables only
B. Token structure/record containing token type and attribute (e.g., integer value, identifier pointer)
C. Print to stdout for parser to read
D. No need to return attributes
Answer: B
Solution: Token records include type and attributes (e.g., numeric value) passed to parser.
- Suppose a scanner uses a fixed-size token buffer of 1024 bytes. Which tactic avoids buffer overflow for long string literals?
A. Reject all strings longer than 1024 silently
B. Dynamically reallocate or stream the lexeme into a resizable buffer and report errors as needed
C. Truncate at 1024 and proceed without notice
D. Save only first 100 chars
Answer: B
Solution: Resizable buffers with checks are correct; silent truncation is unsafe.
- Which of the following best explains the role of yytext in lex/flex?
A. It contains the entire input file
B. It points to the current matched lexeme (null-terminated)
C. It is a reserved keyword for parsing
D. It’s the symbol table pointer
Answer: B
Solution: yytext points to the matched text for the current token.
- Which lexical technique is used to implement lexical macros or parameterized tokens?
A. DFA minimization
B. Semantic actions and lexer code that constructs tokens dynamically
C. Regex alone without actions
D. Parser-level substitution
Answer: B
Solution: Semantic actions allow constructing tokens dynamically for macros or parameterized tokens.
- Which is true when converting a large set of regex rules into a single DFA?
A. The DFA size is sum of regex lengths only
B. The DFA state count can be exponential in the size of the combined NFA in worst case
C. DFA is always smaller than any NFA
D. Combined DFAs never share states
Answer: B
Solution: Subset construction can yield exponential numbers of DFA states in worst-case regex combinations.
- A scanner should classify the numeric lexeme 00123 as what in languages that disallow leading zeros (except zero itself)?
A. Integer token with value 123 regardless
B. Lexical error or separate token depending on language rule (likely error)
C. Octal by default always
D. Float token
Answer: B
Solution: Rules differ; if the language disallows leading zeros, the lexer should flag an error or follow the language spec (some languages treat a leading zero as octal).
- Which commonly-used algorithm finds the longest common prefix among strings and can help in trie-based keyword match?
A. KMP algorithm
B. Trie traversal or radix tree compression
C. Dijkstra’s algorithm
D. Hopcroft minimization
Answer: B
Solution: Trie/radix trees exploit common prefixes for efficient keyword matching.
- When constructing a DFA table, what does a “dead state” represent?
A. Start of input
B. A non-accepting sink state from which no accepting state is reachable
C. An accepting final state
D. A temporary NFA state
Answer: B
Solution: Dead (sink) state catches characters that cannot lead to acceptance; all transitions return to it.
- Which of the following is true about scanning with lexical modes for embedded languages (e.g., HTML with embedded JS)?
A. Single flat lex can handle everything easily
B. Use start conditions to switch to different mode rules when entering embedded language and switch back on exit
C. Parser must parse embedded language only
D. Embedded languages cannot be lexed
Answer: B
Solution: Start conditions (modes) allow different rule sets for embedded languages.
- Which is an advantage of table-driven DFA scanners over hand-coded switch statements?
A. Tables are always faster in practice
B. Tables can be auto-generated and can use compression; hand-coded may be optimized but is error-prone
C. Switch statements cannot implement DFAs
D. Tables remove need for transitions
Answer: B
Solution: Table-driven scanners are auto-generated and maintainable; hand-coded may be faster but complex.
- Which one-liner regex would accept identifiers that start with a letter followed by letters or digits?
A. [0-9][a-zA-Z]*
B. [a-zA-Z][a-zA-Z0-9]*
C. [a-zA-Z0-9]+
D. \d+
Answer: B
Solution: [a-zA-Z][a-zA-Z0-9]* enforces initial letter and subsequent alphanumerics.
- Which of the following is true about lexical analysis speed?
A. Lexical analysis is usually the slowest phase in compilers
B. Well-designed lexical analyzers run in linear time O(n) in input length and are fast in practice
C. DFAs make lexing superlinear always
D. Speed depends only on parser
Answer: B
Solution: DFA-based scanners process input in linear time and are typically not the compilation bottleneck.
- In a scanner, what is the typical role of the “semantic value” or yylval in YACC/Bison integration?
A. Holds parser configuration
B. Holds attribute values (like integer value, identifier pointer) passed from lexer to parser
C. Tracks buffer size
D. Prevents errors
Answer: B
Solution: yylval conveys token attributes (semantic values) from lexer to parser.
- Which action is advised when a lexer sees an invalid multibyte UTF-8 sequence?
A. Treat as ASCII and continue
B. Emit a lexical error and try best-effort recovery or replace with replacement character U+FFFD
C. Crash the compiler
D. Silently drop bytes
Answer: B
Solution: Best practice is to report and recover (replace with U+FFFD) rather than crash or silently ignore.
- Which regex matches a C-style identifier, including underscore as the first character?
A. [a-zA-Z][a-zA-Z0-9_]*
B. [a-zA-Z_][a-zA-Z0-9_]*
C. [0-9_]+
D. [_0-9][a-z]*
Answer: B
Solution: C identifiers allow underscores as first character: [a-zA-Z_][a-zA-Z0-9_]*.
- The lexing of // single-line comments requires the lexer to:
A. Skip characters until newline or EOF and then resume normal scanning
B. Switch parser mode to comment parsing
C. Start a nested comment counter
D. Convert to block comments
Answer: A
Solution: Single-line comments consume until newline/EOF; lexer then resumes.
- Which technique is recommended to make scanners more maintainable when rules are many and complex?
A. Put everything in one huge regex line
B. Modularize rules using macros, start conditions, and clear actions—use generator tools (Flex/JFlex)
C. Hard-code character pointers in C only
D. Avoid generator tools always
Answer: B
Solution: Modularization, macros, states, and generators keep lexers maintainable.
- Which scenario requires the lexer and parser to cooperate (lexer provides richer info to parser)?
A. When lexical classification depends on grammar context (context-sensitive lexing)
B. When lexing only ASCII
C. When tokens are independent of grammar
D. Lexers never cooperate with parsers
Answer: A
Solution: Some languages require context-sensitive classification (e.g., distinguishing typedef-defined names), so lexer and parser may communicate.
- Which of the following is a property of a DFA transition function δ?
A. δ maps (state, input) to a set of states
B. δ maps (state, input) to a single state
C. δ maps states to regex
D. δ is only defined for start state
Answer: B
Solution: DFA transition function returns exactly one next state for each (state, input) pair.
- Why might lexers prefer integer token codes instead of strings for token types?
A. Integers are faster for switch/case in parser and smaller in tables; strings are heavier to compare
B. Strings are illegal in C
C. Parser accepts only integers
D. No difference
Answer: A
Solution: Integer codes are compact and permit fast dispatch in parser; attributes can hold strings if needed.
- Which is correct about using bitsets for character classes in DFA?
A. Not useful at all
B. Bitsets allow quick membership checks and set operations during NFA→DFA conversion
C. Bitsets force Unicode rejection
D. Bitsets are slower than looping over ranges
Answer: B
Solution: Bitsets are efficient for representing character classes and performing set ops during conversion.
- Which of the following lexing situations is best handled by longest-match plus rule order?
A. Distinguishing <= from < and = token types
B. Context-sensitive tokenization based on grammar
C. Unicode normalization
D. Nested parentheses counting
Answer: A
Solution: Longest-match + rule order resolves overlapping operator tokens like <= vs <.
- Which of these is a correct trade-off when splitting lexing into stages (pre-lex → lex → post-lex)?
A. More phases reduce clarity always
B. Multiple stages allow complex preprocessing like macro expansion but complicate implementation and error reporting
C. Stages make lexing real-time impossible
D. Only single-stage works
Answer: B
Solution: Multi-stage pipelines allow flexibility (e.g., macro expansion) but make tracing and errors more complex.
- Which of the following best describes “scannerless parsing” approach?
A. Lexer runs before parser
B. No separate lexical phase; grammar includes lexical patterns and parser handles everything
C. Lexerless code cannot be parsed
D. Scannerless parsing only for binary files
Answer: B
Solution: Scannerless parsing embeds lexical patterns into parser grammar, eliminating separate lexer.
- Which of these helps detect keyword vs identifier efficiently if keywords are few?
A. Linear scan of keyword list per identifier
B. Hash table mapping strings to keyword tokens
C. Always treat as identifier and then check in parser
D. Use regex priority only
Answer: B
Solution: Hashing keywords yields O(1) lookup and is efficient even with few keywords.
- For a language where whitespace is significant (e.g., Python indentation), the lexer must:
A. Discard all whitespace always
B. Generate INDENT/DEDENT tokens based on changes in leading whitespace levels
C. Leave whitespace to parser only
D. Replace whitespace with semicolons
Answer: B
Solution: Languages like Python require lexer to emit INDENT/DEDENT tokens reflecting block structure due to leading whitespace.
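A simplified sketch of INDENT/DEDENT emission (spaces only, no tabs, no blank-line or error handling); the names are illustrative and this is not CPython's actual tokenizer.
```python
def indentation_tokens(lines):
    """Yield ('INDENT',) / ('DEDENT',) / ('LINE', text) tokens from already-split source lines."""
    stack = [0]                       # stack of currently open indentation widths
    for line in lines:
        stripped = line.lstrip(" ")
        width = len(line) - len(stripped)
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT",)
        while width < stack[-1]:      # a real lexer would also check for inconsistent dedents
            stack.pop()
            yield ("DEDENT",)
        yield ("LINE", stripped)
    while len(stack) > 1:             # close any blocks still open at EOF
        stack.pop()
        yield ("DEDENT",)

src = ["if x:", "    y = 1", "    z = 2", "print(y)"]
print(list(indentation_tokens(src)))
```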
- Which is true about the lexing of raw string literals that allow arbitrary contents, including " characters, until a terminator?
A. A single regex is always enough
B. Start conditions with explicit terminator matching are safer to implement
C. Parser should handle raw strings only
D. They cannot be implemented
Answer: B
Solution: Start conditions and specific terminator matching are robust for raw strings; single regex may be fragile.
- A lexical DFA has 1000 states. Minimization via Hopcroft yields 200 states. Which is likely true?
A. Initial DFA was non-deterministic
B. Many states were equivalent and merged; minimization reduced redundancy
C. Minimization fails to preserve language
D. Minimization doubled states usually
Answer: B
Solution: Minimization merges equivalent states, reducing the DFA to a minimal equivalent form.
- When lexing numeric literals with exponent parts (e.g., 1.2e-10), the lexer must:
A. Always split into tokens 1.2, e, -, 10
B. Recognize the entire float literal (including sign of exponent) as one token using regex or DFA
C. Let parser combine pieces
D. Reject such literals as invalid
Answer: B
Solution: Float regex should include optional exponent sign to recognize the whole literal as one token.
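An illustrative regex (not any specific language's grammar) that captures the whole literal, optional signed exponent included, as one lexeme:
```python
import re

# The plain-integer alternative is included only so forms like 3E+8 match here;
# a real lexer distinguishes INT vs FLOAT with separate rules and longest match.
FLOAT = re.compile(r"(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")

for text in ["1.2e-10", "123.", ".456", "3E+8"]:
    m = FLOAT.match(text)
    print(text, "->", m.group())   # each literal matches as a single lexeme
```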
- Which error is a lexical analyzer least likely to detect?
A. Unterminated comment
B. Illegal character outside alphabet
C. Type mismatch between variables
D. Unterminated string literal
Answer: C
Solution: Type mismatches are semantic errors handled by later phases (semantic analysis), not by lexers.
- If a DFA is implemented with vectors of transitions per state, but most transitions point to the same default, which improvement is best?
A. Keep vector as is
B. Use default transition optimization to store only exceptions per state
C. Expand to map per character
D. Convert to NFA
Answer: B
Solution: Default transition optimization stores only non-default transitions, saving memory.
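A tiny sketch of the idea with an assumed table layout: keep one default target per state and store only the exceptional transitions.
```python
# Uncompressed row for one state: most characters go to the dead state 0.
full_row = {chr(c): 0 for c in range(128)}
full_row.update({"a": 7, "b": 7, "_": 7})      # only three exceptions

# Compressed representation: (default_target, {exceptions}).
default, exceptions = 0, {"a": 7, "b": 7, "_": 7}

def next_state(ch):
    return exceptions.get(ch, default)

print(next_state("a"), next_state("%"))  # 7 0
```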
- Which of the following helps lexers detect tokens spanning multiple lines while tracking correct line numbers?
A. Ignore newline counts
B. Update line/column counters when newlines are consumed inside multi-line tokens (e.g., block comments or strings)
C. Reset line count every token
D. Send tokens to parser without positions
Answer: B
Solution: Lexer must increment line count for every newline inside tokens to keep accurate positions.
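A small position-tracking helper, sketched in Python with illustrative names and an assumed tab width, that counts newlines inside a multi-line lexeme:
```python
def advance_position(line, col, lexeme, tab_width=4):
    """Return the 1-based (line, col) position just after consuming lexeme."""
    for ch in lexeme:
        if ch == "\n":
            line += 1
            col = 1
        elif ch == "\t":
            col += tab_width - (col - 1) % tab_width   # snap to the next tab stop
        else:
            col += 1
    return line, col

print(advance_position(3, 5, "/* a\n multi-line\n comment */"))  # (5, 12)
```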
- Which is a common technique to prevent ReDoS (regular expression denial of service) in web-facing lexers?
A. Allow arbitrary user-supplied regexes unchecked
B. Use DFA-based engines or limit backtracking depth and timeouts for regex evaluation
C. Use backtracking regex with no limits
D. Ignore user input
Answer: B
Solution: DFA engines avoid catastrophic backtracking; if backtracking allowed, enforce timeouts/limits.
- The main purpose of combining multiple regex rules into one big NFA before subset construction is:
A. To confuse compiler writers
B. To allow a single DFA to choose between all token rules and implement longest-match and rule priority mechanisms
C. To ensure the parser receives raw characters
D. To avoid generating token types
Answer: B
Solution: Combining rules produces a single DFA that can handle all tokens and resolve conflicts per longest-match and rule order.
- Which of the following best describes “semantic predicates” in lexer generators like ANTLR?
A. They are tokens only
B. Boolean conditions evaluated by user code to decide whether a rule applies at runtime
C. A DFA minimization hint
D. A type of regex operator
Answer: B
Solution: Semantic predicates let user code at lex time accept/reject matches based on context or state.
- A lexer recognizes the integer token INT with regex [0-9]+. The parser expects INT with its value modulo 2^16. Where should conversion/truncation happen?
A. In lexer semantic action to compute numeric value and store in token attribute
B. Parser must substring lexeme manually
C. Code generator will later fix it
D. Lexer should ignore numeric values
Answer: A
Solution: Lexer semantic action converts lexeme to numeric attribute and can apply truncation/modulo before passing to parser.
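A sketch of the lexer-side semantic action (the Token record and function name are assumptions), applying the modulo before the token reaches the parser:
```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str
    lexeme: str
    value: int = 0

def make_int_token(lexeme: str) -> Token:
    """Semantic action for INT: convert the lexeme and truncate modulo 2**16."""
    return Token("INT", lexeme, int(lexeme) % (2 ** 16))

print(make_int_token("70000"))  # Token(kind='INT', lexeme='70000', value=4464)
```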
- When writing an MCQ-worthy tricky lexing question yourself, which of the following ensures best practice?
A. Use ambiguous wording and multiple valid answers intentionally
B. Ensure only one correct option, state assumptions clearly, and vary numeric values and edge-case inputs for trickiness
C. Copy questions from standard sources verbatim
D. Mix parsing and lexing intentionally to confuse
Answer: B
Solution: Good MCQs are unambiguous, state assumptions, and include subtle edge cases while having a single correct answer.