PEP: XXX
Title: Generalised String Coercion
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 02-Aug-2005
Post-History:
Python-Version: 2.5


Abstract

    This PEP proposes the introduction of a new built-in function,
    text(), that provides a way of generating a string representation
    of an object.  This function would make it easier to write library
    code that processes string data without forcing the use of a
    particular string type.


Rationale

    Python has had a Unicode string type for some time now but use of
    it is not yet widespread.  There is a large amount of Python code
    that assumes that string data is represented as str instances.
    The long term plan for Python is to phase out the str type and use
    unicode for all string data.  Clearly, a smooth migration path
    must be provided.

    Existing libraries, written for str instances, need to be upgraded
    so that they can operate in an all-unicode string world.  We can't
    change to an all-unicode world until all essential libraries are
    capable of it.  Upgrading the libraries in one shot does not seem
    feasible.  A more realistic strategy is to make the libraries
    capable of operating in an all-unicode world one at a time, while
    preserving their current all-str world behavior.

    To achieve this, it is desirable to be able to write code that
    works with either string type but does not unnecessarily create
    unicode instances.  Creating a unicode instance should only be
    necessary if one of the inputs is already a unicode instance.
    Code that receives only str inputs should produce str results.
    Let us label such code as str-stable.
    
    Sometimes it is simple to write str-stable code.  For example, the
    following function just works:

        def appendx(s):
            return s + 'x'

    That's not too surprising since the unicode type is designed to
    make the task easier.  The principle is that when str and unicode
    objects meet, the result is a unicode object.  One notable
    difficulty arises when code requires a string representation of an
    object, an operation traditionally accomplished by using the str()
    built-in function.
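
    The str-stable property of appendx() can be checked directly.  The
    sketch below is written for the Python 2 string model that this PEP
    addresses.  The shim at the top is an assumption added for
    illustration, not part of the proposal; it only lets the example
    load on Python 3, where str is already the Unicode type and the
    distinction disappears.

```python
try:
    unicode
except NameError:
    # Assumption: running on Python 3, where str is the Unicode type.
    unicode = str


def appendx(s):
    # str + str gives str; unicode + str gives unicode (Python 2).
    return s + 'x'


plain = appendx('abc')     # str input, str result
wide = appendx(u'abc')     # unicode input, unicode result
```

    A function is str-stable if, like appendx(), it produces a unicode
    result only when at least one input was already unicode.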
    
    Replacing a str() call with a unicode() call is undesirable since
    it unnecessarily creates unicode instances (i.e. it is not
    str-stable).  Also, it could raise a UnicodeDecodeError if passed
    a str that contains non-ASCII characters.  Using a string format
    almost accomplishes the goal but not quite.  Consider the
    following code:

        def text(obj):
            return '%s' % obj

    It behaves as desired unless 'obj' is not a basestring instance
    and needs to return a Unicode representation of itself.  Defining
    a __unicode__ method does not help since it will only be called if
    the format string (the left-hand operand) is a unicode instance.
    Using a unicode instance for the format string does not work
    because the function is no longer str-stable.


Specification

    A Python implementation of the text() built-in follows:

        def text(s):
            """Return a nice string representation of the object.  The
            return value is a basestring instance.
            """
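            if isinstance(s, basestring):
                # str and unicode instances are returned unchanged.
                return s
            r = s.__str__()
            # __str__ may return either str or unicode, but nothing else.
            if not isinstance(r, basestring):
                raise TypeError('__str__ returned non-string')
            return r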
            
    Note that it is currently possible, although not very useful, to
    write __str__ methods that return unicode instances.
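
    The behavior of text() can be illustrated with a short sketch.  The
    function body follows the specification above, checking the value
    returned by __str__.  The classes Point and Broken are hypothetical
    examples, and the basestring shim is an assumption that only lets
    the sketch run on Python 3 as well as Python 2.

```python
try:
    basestring
except NameError:
    # Assumption: running on Python 3, which has a single text type.
    basestring = str


def text(s):
    """Return a nice string representation of the object."""
    if isinstance(s, basestring):
        return s
    r = s.__str__()
    if not isinstance(r, basestring):
        raise TypeError('__str__ returned non-string')
    return r


class Point(object):
    # Hypothetical example class, not part of the proposal.
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        return '(%d, %d)' % (self.x, self.y)


class Broken(object):
    # Hypothetical class whose __str__ does not return a string.
    def __str__(self):
        return 42


s = 'abc'
u = u'abc'
same_str = text(s) is s          # str passes through unchanged
same_unicode = text(u) is u      # unicode passes through unchanged
point_text = text(Point(1, 2))   # falls back to __str__

try:
    text(Broken())
    rejected = False
except TypeError:
    rejected = True
```

    String arguments are returned as the same object, so text() never
    converts between str and unicode on its own; any unicode result
    comes either from the argument or from its __str__ method.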

    The %s format specifier for str objects would be changed to call
    text() on the argument.  Currently it calls str() unless the
    argument is a unicode instance (in which case the object is
    substituted as is).

    The following function would be added to the C API and would be the
    equivalent of the text() function:

        PyObject *PyObject_Text(PyObject *o);

    A reference implementation is available on SourceForge [1] as a
    patch.

                
Backwards Compatibility

    The change to the %s format specifier would result in some %
    operations returning a unicode instance rather than raising a
    UnicodeDecodeError exception.  It seems unlikely that the change
    would break currently working code.


Alternative Solutions

    Rather than adding the text() built-in, if PEP 246 were
    implemented then adapt(s, basestring) could be equivalent to
    text(s).  The advantage would be one less built-in function.  The
    problem is that PEP 246 is not implemented.

    Fredrik Lundh has suggested [2] that perhaps a new slot should be
    added (e.g. __text__), that could return any kind of string that's
    compatible with Python's text model.  That seems like an
    attractive idea but many details would still need to be worked
    out.

    Instead of providing the text() built-in, the %s format specifier
    could be changed and a string format could be used instead of
    calling text().  However, it seems like the operation is important
    enough to justify a built-in.  People would probably become upset
    if unicode() were removed and they were told to use a unicode
    format operation instead.

    Instead of providing the text() built-in, the basestring type
    could be changed to provide the same functionality.  That would
    possibly be confusing behavior for an abstract base type.

    Some people have suggested [3] that an easier migration path would
    be to change the default encoding to be UTF-8.  Code that is not
    Unicode safe would then encode Unicode strings as UTF-8 and
    operate on them as str instances, rather than raising a
    UnicodeDecodeError exception.  Other code would assume that str
    instances were encoded using UTF-8 and decode them if necessary.
    While that solution may work for some applications, it seems
    unsuitable as a general solution.  For example, some applications
    get string data from many different sources and assuming that all
    str instances were encoded using UTF-8 could easily introduce
    subtle bugs.
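
    The kind of subtle bug meant here can be sketched as follows.  The
    byte strings are made-up sample data standing in for input from a
    Latin-1 source; nothing in the sketch is specific to any particular
    application.

```python
# Byte strings as they might arrive from a Latin-1 source.
latin1_cafe = u'caf\xe9'.encode('latin-1')     # 4 bytes, last one is 0xE9
latin1_pair = u'\xc3\xa9'.encode('latin-1')    # 2 characters, bytes C3 A9

# Case 1: the bytes are not valid UTF-8, so the wrong assumption is at
# least reported as an error.
try:
    latin1_cafe.decode('utf-8')
    loud_failure = False
except UnicodeDecodeError:
    loud_failure = True

# Case 2: the bytes happen to be valid UTF-8, so decoding succeeds but
# yields different text than the sender intended.
misread = latin1_pair.decode('utf-8')
intended = latin1_pair.decode('latin-1')
silently_wrong = (misread != intended)
```

    The second case is the troublesome one: no exception is raised, and
    the two-character string sent by the source is read as a single,
    different character.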


References

    [1] http://www.python.org/sf/1159501
    [2] http://mail.python.org/pipermail/python-dev/2004-September/048755.html
    [3] http://blog.ianbicking.org/illusive-setdefaultencoding.html


Acknowledgements

    The following people assisted in the writing of this PEP: David
    Binger, Marc-Andre Lemburg.


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: