Updated AS5 draft.

Originally committed to SVN as r1401.
Rodrigo Braz Monteiro 2007-07-10 03:29:54 +00:00
parent ddda631b69
commit 51bf4fce32
2 changed files with 32 additions and 2 deletions

Binary file not shown.


@@ -54,7 +54,7 @@ That is, it must be a plain-text file.
\end{itemize}
The character set of a subtitle file can be autodetermined by its Byte-Order Mark or by
-the value of the first four bytes. See below.
+the value of the first two bytes. See below.
\subsection{File Structure}
The file is divided in \emph{sections}, which are uniquely identified by a string inside
@@ -62,11 +62,33 @@ square brackets, in a line of its own. From that point on, every next line is co
to be part of the last found section until another section is found. There is no end-of-section
termination mark; they always end at the start of the next one or at the end of the file.
Each section is divided into lines, each line representing one command or definition. Empty
lines \emph{MUST} be ignored. It is recommended that programs generating AS5 files insert
a blank line at the end of each section to increase readability. There \emph{MUST} always
be a blank line at the end of the file (as every line is required to end in a line break).
Each line in a section takes the general form of \textit{Type: data1,data2,...,dataN}. An
unknown \textit{Type} \emph{MUST} be ignored by a parser. It is recommended that subtitle
editing programs keep such ignored lines in the file after re-saving it.
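The line rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `parse_line` and the `known_types` parameter are hypothetical, the data is split naively on every comma, and whitespace directly after the colon is dropped. The draft does not say how commas inside a data field are handled, so a real parser may need a different split.

```python
from typing import List, Optional, Set, Tuple


def parse_line(line: str, known_types: Set[str]) -> Optional[Tuple[str, List[str]]]:
    """Parse one 'Type: data1,data2,...,dataN' line (hypothetical helper).

    Returns None for lines a parser has to skip: empty lines and lines
    whose Type is not recognised.
    """
    stripped = line.strip()
    if not stripped:
        return None  # empty lines MUST be ignored

    line_type, separator, data = stripped.partition(":")
    if not separator or line_type not in known_types:
        return None  # an unknown Type MUST be ignored by a parser

    # Assumption: fields are separated by plain commas, with no escaping.
    return line_type, data.lstrip().split(",")
```

An editor that follows the recommendation to keep ignored lines would store the raw text of every line for which this function returns `None` and write it back unchanged on save.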
There are two required sections, \emph{[AS5]} and \emph{[Data]}, the equivalents of
\emph{[Script Info]} and \emph{[Events]} in previous formats. If either of those sections is
missing, the file is deemed invalid and \emph{MUST} be refused by the parser. Any other section
can be omitted from the file, and need not be implemented by all parsers. However, any unknown
section \emph{MUST} be preserved in the file by a subtitle editing program when it re-saves a
file with sections that it does not recognize. It can, however, be removed at the user's discretion.
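A minimal sketch of section splitting and the required-section check, assuming the file has already been decoded to text and any Byte-Order Mark removed. The names `split_sections` and `REQUIRED_SECTIONS` are hypothetical, and silently skipping lines that appear before the first section header is an assumption of this example, not something the draft states here.

```python
from typing import Dict, List

# The two sections the draft declares mandatory.
REQUIRED_SECTIONS = ("AS5", "Data")


def split_sections(text: str) -> Dict[str, List[str]]:
    """Group the non-empty lines of a decoded file by section name."""
    sections: Dict[str, List[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # empty lines are ignored
        if line.startswith("[") and line.endswith("]"):
            # A bracketed string on a line of its own opens a section.
            current = line[1:-1]
            sections.setdefault(current, [])
        elif current is not None:
            # Every other line belongs to the last section found.
            sections[current].append(line)

    missing = [name for name in REQUIRED_SECTIONS if name not in sections]
    if missing:
        raise ValueError("invalid AS5 file, missing sections: " + ", ".join(missing))
    return sections
```

Sections the program does not recognise stay in the returned dictionary, so they remain available to be written back when the file is re-saved.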
Finally, there is a special type of undefined section, \emph{[Private:PROGNAME]}, which
\emph{MUST} be \emph{ENTIRELY} preserved by other programs when re-saving the file. This is used to
store program-specific data; for example, Aegisub would create a section called
\emph{[Private:Aegisub]} to store its own data. Such a section is identified
by the fact that its name starts with \emph{"`[Private:"'}.
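The preservation rules can be illustrated with a small re-save routine. This is a sketch, not a reference implementation: `is_private_section` and `serialize_sections` are hypothetical names, sections are assumed to be held as a mapping from section name (without brackets) to raw lines, and `\n` is assumed as the line break.

```python
from typing import Dict, List

PRIVATE_PREFIX = "Private:"


def is_private_section(name: str) -> bool:
    """True for names such as 'Private:Aegisub' (brackets already removed)."""
    return name.startswith(PRIVATE_PREFIX)


def serialize_sections(sections: Dict[str, List[str]]) -> str:
    """Write every section back, recognised or not, with its lines untouched."""
    out: List[str] = []
    for name, lines in sections.items():
        out.append("[" + name + "]")
        out.extend(lines)  # unknown and private sections are kept verbatim
        out.append("")     # recommended blank line at the end of each section
    # The final join plus line break leaves a blank line at the end of the file.
    return "\n".join(out) + "\n"
```

An editor could use `is_private_section` to exclude private sections from any operation that lets the user drop unrecognised sections.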
\subsubsection{[AS5]}
This must be the first section in every AS5 file. If the very first line of the file is not
[AS5], the file \emph{MUST} be rejected by the parser as invalid. Note, however, that the first
line is allowed to contain a Byte-Order Mark (BOM), which is the character U+FEFF encoded in
-the encoding used for the rest of the script. The first four bytes will therefore be:
+the encoding used for the rest of the script\cite{Unicode BOM}. The first four bytes will therefore be:
\begin{itemize}
\item 0xEF 0xBB 0xBF 0x5B - UTF-8 (with BOM)
@@ -77,6 +99,11 @@ the encoding used for the rest of the script. The first four bytes will therefor
\item 0x00 0x5B 0x00 0x41 - UTF-16 BE (without BOM)
\end{itemize}
It is possible, therefore, to determine the encoding of the file by checking its first two bytes.
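The two-byte check can be sketched as below. Only the UTF-8 (with BOM) and UTF-16 BE (without BOM) byte sequences are visible in this excerpt; the remaining cases are assumptions derived from the standard BOM constants in Python's `codecs` module and from encoding the leading characters `[A` of `[AS5]`. The function name `detect_as5_encoding` is hypothetical, and other encodings (for example UTF-32) are not handled.

```python
import codecs


def detect_as5_encoding(head: bytes) -> str:
    """Guess a Python codec name from the first two bytes of an AS5 file."""
    first_two = head[:2]
    if first_two == codecs.BOM_UTF8[:2]:        # 0xEF 0xBB
        return "utf-8-sig"                      # this codec strips the BOM
    if first_two in (codecs.BOM_UTF16_LE,       # 0xFF 0xFE
                     codecs.BOM_UTF16_BE):      # 0xFE 0xFF
        return "utf-16"                         # decoder reads and strips the BOM
    if first_two == b"\x00[":                   # 0x00 0x5B
        return "utf-16-be"
    if first_two == b"[\x00":                   # 0x5B 0x00
        return "utf-16-le"
    if first_two == b"[A":                      # 0x5B 0x41
        return "utf-8"
    raise ValueError("not an AS5 file or unsupported encoding")
```

A caller would typically read the first few bytes of the file, pass them to this function, and then decode the whole file with the returned codec name.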
This section \emph{MUST} declare the following properties:
\addcontentsline{toc}{section}{References}
\begin{thebibliography}{1}
@@ -108,6 +135,9 @@ the encoding used for the rest of the script. The first four bytes will therefor
\bibitem{UTF-16} The Internet Society, RFC 2781, "`UTF-16, an encoding of ISO 10646"'. Website, 2000.\\
\url{http://tools.ietf.org/html/rfc2781}
\bibitem{Unicode BOM} Unicode, Inc, The Unicode Standard, Chapter 13. PDF, 1991-2000.\\
\url{http://www.unicode.org/unicode/uni2book/ch13.pdf}
\end{thebibliography}
\end{document}